I have a pair of Java/Tomcat web applications running on a third-party (customer) server. Recently, those applications stopped being able to list or download objects from Amazon S3.

This is a "nothing changed" situation: I got a bug report out of the blue on what were stable systems. Our other users, who host the software on their own Windows networks, don't have this issue, and neither do the instances we host on Amazon EC2. I was able to identify the date it stopped working, but Customer IT also says "nothing changed". I do see Sophos software running on the machine in question, but I'm not sure whether that's the issue, and it appears to have been installed a while before this started.

To reiterate, I have two (2) applications running on this server that interact with S3, and they both started failing at exactly the same time. Of note, the two applications communicate with each other via SQS messages: one app posts to SQS (this works) and the other polls SQS (this also works).

To debug this, I have attempted the following:

  • Install AWS CLI on problem server and attempt to list-objects. This worked.
  • Point my own development environment (outside customer network) at the problem server's DB (available via VPN) to verify properties/config setup. This worked.
  • Hardcode the references to S3 resources and redeploy, to verify the issue isn't app initialization/failure to resolve config. This still fails, but logs are outputting the correct bucket and key, so config/setup does not appear to be the issue.
  • Put explicit log statements all around the failing methods to isolate the exact line that fails. This is a call to AmazonS3's listObjects(String, String) method.
  • Checked Sophos McsAgent.log and McsClient.log to see if anything obviously related to my applications was popping up.
  • Tried to run a unit test within the application's code base on the problem server that also invokes the listObjects() method. This worked.

In the live/running failure case, I do not get an exception thrown by the listObjects method. It simply appears to execute indefinitely, even after I set the browser timeouts used to reproduce this to a fairly long value (9000000 ms).
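One way to turn a silent hang like this into something visible in the logs is to run the blocking call on a worker thread with a hard deadline and, on timeout, record where that thread is stuck. The following is a minimal sketch using only the JDK; `HangProbe` and `callWithTimeout` are hypothetical names, and the `Callable` in `main` is a stand-in for the real `listObjects(bucket, prefix)` call, not the actual application code.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class HangProbe {

    /**
     * Runs the call on a daemon worker thread and waits at most timeoutMillis.
     * On timeout, throws a TimeoutException whose message contains the worker's
     * current stack trace, i.e. the frame the call is blocked in.
     */
    static <T> T callWithTimeout(Callable<T> call, long timeoutMillis)
            throws InterruptedException, ExecutionException, TimeoutException {
        final Thread[] worker = new Thread[1];
        ExecutorService pool = Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable, "hang-probe");
            thread.setDaemon(true); // do not keep the JVM alive if the call never returns
            worker[0] = thread;
            return thread;
        });
        try {
            Future<T> future = pool.submit(call);
            try {
                return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                StringBuilder stack = new StringBuilder();
                for (StackTraceElement frame : worker[0].getStackTrace()) {
                    stack.append("  at ").append(frame).append('\n');
                }
                future.cancel(true); // interrupt the worker; blocking I/O may ignore this
                throw new TimeoutException(
                        "call still running after " + timeoutMillis + " ms; worker stack:\n" + stack);
            }
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the real call, e.g. () -> s3.listObjects(bucket, prefix)
        Callable<String> slowCall = () -> {
            Thread.sleep(60_000);
            return "finished";
        };
        try {
            System.out.println(callWithTimeout(slowCall, 2_000));
        } catch (TimeoutException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If the stack trace in the message ends in a socket read or connect frame, that would suggest the request is waiting on the network rather than on application logic. The AWS SDK also has its own client-side timeout settings, which are not shown here.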

At this point I am not sure what the next debugging step would be, but I believe the evidence strongly points to an issue related to Tomcat making this request from within their four walls.

  • It sounds like Wireshark time, to me. Is there a TCP connection being established to S3? Is TLS being negotiated? Is there any network traffic to S3? Did they open outbound firewall holes to S3 but now the IP address range of the regional endpoint has changed? – Michael - sqlbot Jun 15 '19 at 02:07
  • One thing to check, first... Does the bucket log the requests? (Unlikely but possible, depending on *exactly when* the request hangs.) – Michael - sqlbot Jun 15 '19 at 02:08
  • @Michael-sqlbot Michael, I am going to look into logging the requests in the bucket. With respect to Wireshark, I think if the request isn't made directly from within Tomcat, it won't reproduce the issue? Since a unit test over this same method works, it appears to be isolated to the process running the code (e.g. Tomcat vs. the command line) – Robert Gannon Jun 17 '19 at 18:57
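The questions raised in the comments (does the name resolve, is a TCP connection established, is TLS negotiated) can also be asked from inside a JVM, which matters here because the failure only shows up in the Tomcat process. Below is a minimal sketch using only the JDK. `EndpointProbe` is a hypothetical helper, and `s3.amazonaws.com` is an assumed endpoint; the regional endpoint the application actually uses may differ. To be meaningful, `probe` would have to be invoked from code running inside Tomcat, not from the command line.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

final class EndpointProbe {

    /** Resolves host, then tries a TCP connect and a TLS handshake against each address. */
    static List<String> probe(String host, int port, int timeoutMillis) {
        List<String> report = new ArrayList<>();
        InetAddress[] addresses;
        try {
            addresses = InetAddress.getAllByName(host);
        } catch (UnknownHostException e) {
            report.add("dns lookup failed: " + e);
            return report;
        }
        for (InetAddress address : addresses) {
            report.add("resolved " + host + " -> " + address.getHostAddress());
            try (Socket socket = new Socket()) {
                // Bounded connect: a silently dropped packet shows up as a timeout, not a hang
                socket.connect(new InetSocketAddress(address, port), timeoutMillis);
                report.add("  tcp connect ok");
                SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
                try (SSLSocket tls = (SSLSocket) factory.createSocket(socket, host, port, true)) {
                    tls.setSoTimeout(timeoutMillis); // bound the handshake reads as well
                    tls.startHandshake();
                    report.add("  tls handshake ok: " + tls.getSession().getProtocol());
                }
            } catch (IOException e) {
                report.add("  failed: " + e);
            }
        }
        return report;
    }

    public static void main(String[] args) {
        // Assumed endpoint; pass the real one as the first argument
        String host = args.length > 0 ? args[0] : "s3.amazonaws.com";
        probe(host, 443, 5_000).forEach(System.out::println);
    }
}
```

Comparing the report from inside Tomcat with the report from a command-line run on the same server would show at which stage, if any, the two processes start to behave differently.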

1 Answer

After a call with the Network Admins, in which I stepped them through the process above, they suggested it was related to either blocked traffic or cached DNS records. When I stepped through the repro process while they monitored the firewall's logs, it turned out there was web filtering in DNS. They disabled two rules and voila, this worked again. I cannot fathom why listObjects() coming from Tomcat failed while listObjects() coming from a unit test succeeded, but with no input from me, as they made changes, this functionality "magically" started working again.
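The Tomcat-versus-unit-test difference was never explained. One way to look for it would be to log the network-related settings each JVM is running with and compare the two outputs. The sketch below is only that: `JvmNetworkSettings` is a hypothetical helper, the property names are standard JDK ones, and `s3.amazonaws.com` is an assumed endpoint. Nothing here is confirmed as the cause, and whether the AWS SDK honours each of these settings depends on the SDK version and client configuration.

```java
import java.net.ProxySelector;
import java.net.URI;
import java.security.Security;
import java.util.LinkedHashMap;
import java.util.Map;

final class JvmNetworkSettings {

    /** Collects settings that can differ between a Tomcat JVM and a command-line JVM. */
    static Map<String, String> snapshot(URI endpoint) {
        Map<String, String> settings = new LinkedHashMap<>();
        String[] systemProperties = {
            "user.name",                  // service account vs. interactive user
            "java.version",
            "http.proxyHost",
            "https.proxyHost",
            "http.nonProxyHosts",
            "java.net.useSystemProxies",
            "java.net.preferIPv4Stack",
            "javax.net.ssl.trustStore"
        };
        for (String key : systemProperties) {
            settings.put(key, String.valueOf(System.getProperty(key)));
        }
        // JVM-level DNS cache lifetime (a security property, not a system property)
        settings.put("networkaddress.cache.ttl",
                String.valueOf(Security.getProperty("networkaddress.cache.ttl")));
        ProxySelector selector = ProxySelector.getDefault();
        settings.put("proxies for " + endpoint,
                selector == null ? "no ProxySelector" : String.valueOf(selector.select(endpoint)));
        return settings;
    }

    public static void main(String[] args) {
        // Assumed endpoint, used only to ask the default ProxySelector which proxy it would pick
        snapshot(URI.create("https://s3.amazonaws.com"))
                .forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Running `snapshot` once from a servlet or startup hook inside Tomcat and once from the unit test, then diffing the two outputs, would at least rule these settings in or out.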