I've lost a couple days to this problem and hope it sparks a thought from someone.

I am integrating several systems using PowerShell scripts. One of the two services I am connecting to (hosted JIRA) can be accessed just fine from my local system, but the script fails when run from one of my VMs. I found, by chance, that if I open or refresh a browser on the server at an HTTPS URL for that host, the script can then access the API over HTTPS for about 20-30 seconds.

I receive a timeout error when I remote into the server and try this from a PowerShell console. I then verified that the same behavior occurs with cUrl (verbose output included below). Refreshing a browser on that domain then allows both to access HTTPS URLs for a short period of time. It appears to be timing out on the initial connection, before SSL negotiation.

Representative PoSH Command:

Invoke-RestMethod -Method Get -Uri "https://MYDOMAIN.atlassian.net/rest/api/2/issue/PLPT-1?fields=key,id,status" -Headers @{"Authorization" = "Basic "+ [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes('USERNAME:PASSWORD'))}

Representative cUrl command:

curl.exe "https://MYDOMAIN.atlassian.net/rest/api/2/issue/PLPT-1?fields=key,id,status" -u "USERNAME:PASSWORD" -v -X GET

I've done a lot of digging on this and I'm pretty stumped. I did try using Wireshark to dig deeper, but it's been years since I used a packet sniffer and I'm rusty and having to learn the UI.

Troubleshooting:

Here are the questions/answers I could think of while trying to isolate the problem:

  • Is it powershell?
    • Using cUrl also times out
  • Is it all HTTPS?
    • https://google.com/ works fine without timeout
    • https://localhost/... works fine without timeout
  • Is it a system that has accessed JIRA via browser ever?
    • I verified my home desktop could connect via PoSH despite never having accessed JIRA
  • Is it Host, DC, or OS?
    • This is a 2008 R2 VM in Azure, I verified the PoSH and cUrl commands work fine in a 2nd Azure VM running 2008 R2
  • Firewall, Antivirus?
    • Disabled Antivirus and Firewall, cUrl + PoSH still timeout
  • User agent?
    • Including a user agent didn't make a difference on problem system or working systems
  • What does Fiddler say?
    • Fiddler w/ SSL decryption caused gateway errors to occur instead of timeouts, I haven't dug deeper
  • Maybe it's a network issue for Atlassian? Intermittent connectivity?
    • I've been consistently getting errors from my server and it's been consistently working from everywhere else I have tried
    • I made 10 calls in a row on the server and 10 locally: all 10 local calls succeeded and all 10 server calls timed out. After doing the browser refresh trick on the server, I got 10 successful responses in a row.
  • What does it look like in Wireshark?
    • With cUrl: Wireshark shows the initial TCP call go out, but it isn't ACKed, so you then see two TCP Retransmission attempts
    • With cUrl after browser priming: Wireshark shows the first TCP call is ACKed and then everything works as expected
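
A minimal sketch of how the TCP connect could be timed separately from the SSL/TLS handshake using curl alone, as a cross-check on the Wireshark observation. It assumes a POSIX shell and a curl build that supports --write-out; the helper name connect_timings and the 10-second limit are illustrative choices, not something taken from the tests above.

```shell
#!/bin/sh
# connect_timings is a hypothetical helper; the 10-second limit is an assumed value.
connect_timings() {
  # time_connect:    seconds until the TCP connect to the host completed
  # time_appconnect: seconds until the SSL/TLS handshake completed
  curl --silent --output /dev/null --connect-timeout 10 \
       --write-out 'tcp_connect=%{time_connect}s tls_done=%{time_appconnect}s\n' \
       "$@"
}

# Example with the placeholder host from the question:
# connect_timings "https://MYDOMAIN.atlassian.net/"
```

If the stall really is in the initial connection, curl should give up at the connect timeout with a non-zero exit code and never reach the handshake.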

For a short time I thought I had gotten cUrl working consistently. I was using -3 -4 to force SSLv3 and IPv4 addresses, and it appeared to work without my having to prime the connection with a web browser. Unfortunately, after a reboot this no longer works.

Methods I have tried on the server:

  • cUrl, cUrl with -3 -4
  • PoSH: Invoke-RestMethod, Invoke-WebRequest, WebClient, WebRequest/WebResponse, setting default SSL to SSL3 via ServicePointManager, setting proxy and proxy credentials via system defaults in case there is one (not to my knowledge)
  • IE: works
  • Chrome: works

cUrl Output

Here is some sample output from cUrl. I already have a browser open to https://MYDOMAIN.atlassian.net (it's sitting on the login screen), but I've left it sitting for a while so the connection would be stale.

cUrl output before refreshing the browser:

* Hostname was NOT found in DNS cache
*   Trying 165.254.226.145...
* connect to 165.254.226.145 port 443 failed: Timed out
* Failed to connect to MYDOMAIN.atlassian.net port 443: Timed out
* Closing connection 0

cUrl output when I run right after refreshing the browser:

* Hostname was NOT found in DNS cache
*   Trying 165.254.226.145...
* Connected to MYDOMAIN.atlassian.net (165.254.226.145) port 443 (#0)
* successfully set certificate verify locations:
*   CAfile: C:\Users\Administrator\AppData\Local\Apps\cURL\bin\curl-ca-bundle.crt
  CApath: none
* SSLv3, TLS handshake, Client hello (1):
* SSLv3, TLS handshake, Server hello (2):
* SSLv3, TLS handshake, CERT (11):
* SSLv3, TLS handshake, Server key exchange (12):
... rest of handshake and HTML for a 401 error page because I didn't force pre-authentication ...

Updated

I added Wireshark results to questions above.

I've now also found that if I run the cUrl command, cancel it before it times out, and immediately run it again, it is successful. If I let the cUrl command time out and then immediately run it again, it times out again.

If I run the PoSH command and cancel it before it times out and immediately run it again, I can actually run it 5+ times in a row successfully.

This is definitely something networking related. I'm going to see whether re-running the command eventually reaches a point where it times out again, or whether cancelling the first call somehow lets me keep making subsequent calls for as long as I want (which may be possible; I think PoSH is taking advantage of keep-alive once the initial connection is formed).

Tarwn
  • Have you tried --sslv3 alone without -4? – Brian Adkins Sep 28 '14 at 16:18
  • I tried various combinations of -3 and -4 as well as -2 (SSLv2) and -1 (TLS 1.x). The -4 was to see if I was having DNS resolution issues and crossing over to IPv6 despite the IP in curl's output; jiggling the SSL/TLS versions was to see if I am somehow getting blocked by something analyzing traffic. – Tarwn Sep 28 '14 at 17:20
  • Based on your troubleshooting so far, it smells like a networking issue (maybe stale NAT, or something to do with the VM routing through the host). Are you able to traceroute at least past your network to the Atlassian IP address when it doesn't work via curl? I'm not sure how to view NAT tables for VMs on the host, but that might be another place to start looking as well. Please post the results. – Andy Shinn Sep 28 '14 at 17:31
  • Wireshark is showing the initial connection failing; it tries two TCP retransmissions and those fail as well. Networking was what led me to try it in the browser once, which is how I found that doing so could somehow prime the connection. What I just figured out (and will edit in above) is that cancelling the first cUrl or PoSH command and immediately re-executing it somehow works. Running two in a row (and letting the first time out naturally) leads to the second one timing out as well. – Tarwn Sep 28 '14 at 17:48

2 Answers


My temporary "solution" is to use a short timeout on the initial calls and immediately retry if they fail. The timeout is short enough that on this server it fails and then retries again fast enough to start communicating successfully (just like when I ran it manually, cancelled it, then ran again).

So far it looks like having one timeout and retry is good enough to keep the connection working for the rest of the automation script to run problem-free.

This is a workaround; I'm still looking for the root cause and a better answer.
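
The timeout-and-retry idea can be sketched roughly as follows with curl in a POSIX shell. This is an illustration under stated assumptions, not the script from this answer: fetch_with_retry, RETRY_ATTEMPTS and RETRY_CONNECT_TIMEOUT are hypothetical names, and the defaults (3 tries, 5-second connect timeout) are guesses that would need tuning on the affected server.

```shell
#!/bin/sh
# Sketch of "short timeout, then retry immediately".
# All names and default values here are assumptions for illustration.
fetch_with_retry() {
  fwr_attempts=${RETRY_ATTEMPTS:-3}        # total tries, including the first
  fwr_timeout=${RETRY_CONNECT_TIMEOUT:-5}  # seconds allowed for connecting
  fwr_try=1
  while [ "$fwr_try" -le "$fwr_attempts" ]; do
    # --connect-timeout limits only the connection phase, so a stalled
    # connect is abandoned quickly and the next attempt starts right away.
    if curl --silent --show-error --connect-timeout "$fwr_timeout" "$@"; then
      return 0
    fi
    echo "attempt $fwr_try of $fwr_attempts failed" >&2
    fwr_try=$((fwr_try + 1))
  done
  return 1
}

# Example with the placeholder host and credentials from the question:
# fetch_with_retry -u "USERNAME:PASSWORD" \
#   "https://MYDOMAIN.atlassian.net/rest/api/2/issue/PLPT-1?fields=key,id,status"
```

There is deliberately no sleep between attempts, since the observed behavior is that an immediate second attempt succeeds while one made after a full timeout does not.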

Tarwn

We saw very similar symptoms (the same curl verbose output when failing versus passing), though in our case the failures were intermittent and involved only curl from the command line. Adding this option to curl appears to resolve the problem:

--connect-timeout 30
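
For context, a rough sketch of that option in a full command, wrapped so the exit code shows whether the connect phase failed or timed out. It assumes a POSIX shell; curl_connect_check is a hypothetical helper name and the URL is the question's placeholder. Exit codes 7 and 28 are curl's documented codes for "failed to connect" and "operation timed out".

```shell
#!/bin/sh
# curl_connect_check is a hypothetical wrapper around the option above.
curl_connect_check() {
  ccc_rc=0
  curl --silent --show-error --output /dev/null --connect-timeout 30 "$@" || ccc_rc=$?
  case $ccc_rc in
    0)  echo "ok" ;;
    7)  echo "failed to connect" ;;   # curl exit code 7
    28) echo "timed out" ;;           # curl exit code 28
    *)  echo "curl exit code $ccc_rc" ;;
  esac
  return "$ccc_rc"
}

# Example with the placeholder host from the question:
# curl_connect_check "https://MYDOMAIN.atlassian.net/"
```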
user69072