Different curl errors happen occasionally


I have a webserver running CentOS 7 which makes curl requests to other resources. At a rate of 5-10 requests per second, everything works fine, except that I get different curl errors every 2-10 minutes. I think it started happening over time as the number of requests grew, which makes me think it has something to do with the network, but I'm a total newbie at this. How do I find out what causes these errors, and what can I do about it?

Network: CURL error 56: TCP connection reset by peer
Network: CURL error 7: Failed to connect to ip: Network is unreachable
Network: CURL error 18: transfer closed with 1473 bytes remaining to read

Louen Leoncoeur

Posted 2019-07-24T14:53:33.393

Reputation: 13

Answers


More than likely, what causes these errors could be generally classified as "SNAFU"... Situation Normal, All Effed Up.

The internet is a vast network of interconnected computers and networking appliances. Those other machines, which you have no control over, don't always do what they should. They suffer power failures. They have hardware failures. They get hit by cosmic radiation. Stuff happens.

The networking technologies that underpin the internet are designed with this in mind. The reason the internet works at all is an enormous level of redundancy. If an attempt to connect to a destination via one route fails... the last "hop" in that chain that worked will remember the failure and try a different "next hop" for future communication. It's actually a lot more complicated than this... but you get the gist.

Most web applications will retry failed connections specifically to take advantage of this redundancy. Not all of them, however. The simpler the application, the more likely it is to simply fail. This is especially true of terminal applications that follow the *nix principle of small, single-job tools: retrying is another tool's job. curl is one such application. As per the curl manpage:

--retry

If a transient error is returned when curl tries to perform a transfer, it will retry this number of times before giving up. Setting the number to 0 makes curl do no retries (which is the default). Transient error means either: a timeout, an FTP 4xx response code or an HTTP 408 or 5xx response code.

I'm not sure exactly what your use case is for using curl to retrieve resources, but if you are retrieving resources in an automated way, you definitely need to configure it with the --retry flag with a value of 3-5, because errors like the ones you showed are perfectly normal... and need to be accounted for.
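As a minimal sketch of what that might look like (the URL and output path here are placeholders, not anything from your setup):

# Retry transient failures up to 5 times, waiting 2 seconds between tries.
# --max-time caps each attempt so a hung connection can't stall the job.
curl --retry 5 --retry-delay 2 --max-time 30 \
     -sS -o /tmp/resource.json "https://example.com/resource"

Note that --retry is a feature of the curl command-line tool. If your application talks to libcurl through a language binding instead, there is no built-in retry; the retry loop has to live in your own code.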

2. Why Is The Reliability Worse For Your Production Server Than Your Local Computer?

In a perfect world a production server would always have a more reliable connection to internet-based resources than any home or office internet connection. Since that's not the case here, you are right to be interested in the cause. However, it still doesn't necessarily mean you should be worried, because, again, this isn't necessarily an issue caused by your server.

Understand that your local computer and your server almost certainly don't share the same route to the resources in question. For example, if I perform a traceroute from my local home server to, say, superuser.com, I get this:

user@home ~ $ sudo traceroute -I superuser.com
traceroute to superuser.com (151.101.1.69), 30 hops max, 60 byte packets
 1  rtr.scrapyard.local (10.5.0.1)
 2  96.120.58.37 (96.120.58.37)
 3  po94-sr22.dothan.al.pancity.comcast.net (68.85.202.165)
 4  162.151.221.209 (162.151.221.209)
 5  be-3666-cr02.56marietta.ga.ibone.comcast.net (68.86.90.209)
 6  * * *
 7  50.242.151.138 (50.242.151.138)
 8  151.101.1.69 (151.101.1.69)

But if I do the same command from one of my production servers I get this:

user@production ~ $ sudo traceroute -I superuser.com
traceroute to superuser.com (151.101.1.69), 30 hops max, 60 byte packets
 1  * * *
 2  ae-20-202.gw-distp-a.slr.lxa.us.oneandone.net (74.208.138.130)
 3  ae-10-0.bb-a.ga.mkc.us.oneandone.net (74.208.1.237)
 4  kanc-b1-link.telia.net (80.239.196.109)
 5  dls-b22-link.telia.net (62.115.125.159)
 6  fastly-ic-340339-dls-b22.c.telia.net (62.115.166.191)
 7  151.101.1.69 (151.101.1.69)

The only hop those two routes have in common is the destination. Every other machine they pass through is different. So if, say, dls-b22-link.telia.net was misbehaving, it would affect my server's attempts to communicate with superuser.com... but not my home computer's attempts to do the same.

Unfortunately, if there was a problem with dls-b22-link.telia.net there wouldn't be much I could do about it. And given the intermittent nature of the problem it wouldn't be particularly easy to determine that dls-b22-link.telia.net was the source of the problem to begin with.

So...

2b. Is It Really A Problem?

The first thing you should do is confirm that this is causing an actual problem that simply retrying the failed connections won't fix, meaning that your production server is being impaired in doing its job in some way. I assume you had a goal in mind when you set this up. Is that goal still being accomplished in such a way that you needn't take action? That's the key question.

Going back to what I said before, intermittent issues like this are simply part of the internet. In a perfect world they wouldn't happen but we don't live in a perfect world... which is why redundancy is a foundational principle in all the technologies the internet is built on. It's why retrying after these kinds of connection failures is standard operating procedure. And why you shouldn't worry too much about such failures unless they actively impair your server.

2c. Is It Under Your Control?

You need to narrow down the potential source of the problem. To do that, simply repeat the tests you have already done (counting the number of failures in a given time frame), but this time have the server request resources from somewhere radically different. I would suggest setting up a simple webserver on your home computer with a couple of files similar to what you have been working with, and using curl on your server to grab those.
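A minimal sketch of such a test, assuming you can forward a port on your home connection (HOME_IP, the port, and the file name are all placeholders):

# On the home computer: serve a directory containing a sample file.
# (On a box with only Python 2: python -m SimpleHTTPServer 8080)
python3 -m http.server 8080

# On the production server: repeat the request and count failures.
fails=0
for i in $(seq 1 10000); do
    curl -sS --max-time 30 -o /dev/null "http://HOME_IP:8080/sample.bin" \
        || { fails=$((fails + 1)); echo "$(date '+%F %T') request $i failed"; }
    sleep 0.2   # roughly 5 requests per second, matching production load
done
echo "total failures: $fails / 10000"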

If the server experiences no failures doing this, then the problem is very unlikely to be with your server or your server's hosting provider. And your existing tests have already eliminated your local network and ISP, as well as wherever the resources themselves are hosted, as potential sources of the problem. That leaves the nodes between your hosting provider and the hosting provider of the resources, which falls squarely under "things you have no control over."

If the server does experience issues during the above test then, because you have already eliminated your local network/ISP as the problem, you can be nearly certain the problem is either with your server or the server's hosting provider. This means it's under your control to fix. It also means you have more troubleshooting to do.

2d. What Next?

If the problem isn't with your server, your server's hosting provider, or the resources you are querying... then the cause itself is not under your control. Your best bet, in that case, is to relocate the server (contact your hosting provider and see what options they can offer you). The hope is that by doing so you will no longer need to use the route that has the faulty node on it. It's quite the ordeal though, and not guaranteed to work. It could even lead to new problems. That's why this definitely needs to be a serious issue before you take such a step.

On the other hand, if you have narrowed the issue down to either your server or your server's hosting provider then you can probably get it fixed. If you have a managed hosting agreement then call your hosting provider and have them fix it. If you don't have a managed hosting agreement then you need to eliminate your server's configuration as a potential culprit. And that, unfortunately, is where I get off the train. We are reaching the limits of my expertise.

Generally, for an intermittent issue to be caused by your server, it likely has something to do with network buffering or is a result of some kind of automation. Some informed guesses (a few quick checks are sketched after the list):

  • Have you taken any steps to harden your server against malicious probing and attacks?
  • Have you messed with your /etc/sysctl.conf or the files in /etc/sysctl.d/?
  • Have you set up any kind of stateful packet inspection or intrusion detection software (iptables/netfilter-based firewalls, Snort, etc.)?
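If any of those ring a bell, a few quick, read-only checks can hint at whether the server itself is dropping connections. This is a sketch, assuming a stock CentOS 7 kernel (the conntrack files only exist while the nf_conntrack module is loaded):

# A full connection-tracking table silently drops new connections.
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Which kernel network settings have been overridden?
grep -r . /etc/sysctl.conf /etc/sysctl.d/ 2>/dev/null

# A large pile of TIME_WAIT sockets at 5-10 requests per second can
# point at ephemeral-port exhaustion.
ss -s
cat /proc/sys/net/ipv4/ip_local_port_range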

Regardless, if you are at the point where you are troubleshooting the server itself, my advice would be to take the information you have collected and ask a new question on ServerFault. The people there have a lot more experience with server issues than people here on SuperUser and are more likely to know what to try next.

3. Regarding The Apparent Consistency Of Errors

Now, as to why you are getting the same specific error over and over and over? It's hard to say. Assuming it really is happening like clockwork every 5 minutes... it could still be anything. These devices have clocks and timers in them for a wide variety of purposes. Something one of them is set up to do every five minutes could be causing this tiny hiccup.
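One cheap way to test the clockwork theory is to log every failure with a timestamp and curl's exit code, then look for a pattern in the timestamps. A sketch, with a placeholder URL:

# Poll one of the problem resources and record only the failures.
# Right after the ||, $? still holds curl's exit code (7, 18, 56, ...).
URL="https://example.com/resource"
while true; do
    curl -sS --max-time 30 -o /dev/null "$URL" \
        || echo "$(date '+%F %T') curl_exit=$?" >> /tmp/curl_failures.log
    sleep 1
done

If the failure timestamps cluster on five-minute boundaries, that points at some timed job (a cron task, a firewall reload, a log rotation) somewhere along the path.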

It is possible it's a problem with your server. Or it's a problem with your hosting provider. Or it's a problem with your hosting provider's ISP. Or it's a problem with your home/office ISP. Or anywhere in between. If it's not your server, and it probably isn't based on what you've told me, then the bottom line is you can't do much about it... except make sure you are setup to retry failed connections. All modern web browsers, for example, retry several times before giving up on retrieving a resource from a web server.

EDITS

  1. Added second and third section in response to a comment requesting further clarification
  2. Rewrote second section to account for corrections.

Cliff Armstrong

Posted 2019-07-24T14:53:33.393

Reputation: 1,813

Thanks a lot for the quick and detailed response. It sounds very reasonable, but my own test confuses me: "no response from server" errors happen 50-100 times per 100k requests on the production server, and one is guaranteed to happen every five minutes. But when I ran a local test with the same single request once per second, 10k requests in total, not a single error happened. Why could it work so differently? – Louen Leoncoeur – 2019-07-24T20:30:14.790

I've updated my answer with additional clarification. Please let me know if I've misunderstood anything. – Cliff Armstrong – 2019-07-25T05:12:58.690

Huge thanks once again! Actually, I meant testing 10k requests from my local computer to the same resources over the internet, just like on the production server. From my local computer, everything works without a single error, which makes me think it may have something to do with my prod server. Therefore I want to make sure the server is working fine, meaning no connection limits, low network speed, code problems or something else I don't know of, but I don't really know where to look or what to check. – Louen Leoncoeur – 2019-07-25T10:57:12.550

In that case, this is something you probably want to investigate further. In a perfect world, production servers would always have more reliable connections to internet resources than home/office computers to the same resources. Obviously that's not the case here. It still may not be, and I think likely isn't, your server. It's probably somewhere in between the server and the resources. I'll correct my answer and provide some suggestions on troubleshooting methods. – Cliff Armstrong – 2019-07-25T19:25:30.487