
I have a web service with a couple of "machine" clients connected to it that send data via HTTP POST once in a while. The nature of the service is that I can't allow it to lose a single incoming request. However, once in a while I get errors from the database backend or similar, and the POST is lost. (Delivering the response, however, is not that critical.)

Are there any best-practice documents or architecture descriptions on how to handle such errors? I am thinking in terms of queuing the incoming request and retrying it later, or perhaps forwarding it to another web server in the web farm.

I am currently running in AWS with an RDS/MySQL database backend behind an IIS 7.5 web application. Everything is load balanced and running in Multi-AZ mode. My idea is to put any troublesome request into SQS and process that queue regularly, but I guess a lot of thought has already gone into this area, and there are probably pitfalls I will hit if I roll my own.

  • Is your server sending back a 4xx series error in response to a failed post? If so, it's the responsibility of the client to deal with it. If not, you should fix that post-haste. – mfinni Oct 26 '14 at 22:27
  • I know that 4xx should trigger that behaviour in a client, but if I work with third-party clients that are not under my control, I have to mitigate on my end to the best possible extent. (Besides, the devices are remotely installed equipment without over-the-air software upgrades.) – DavKa Oct 27 '14 at 07:21
  • @mfinni `4xx` codes are used to indicate client side problems or that the client must change the request in order to get success. For server side problems you are supposed to use `5xx` codes. – kasperd Oct 27 '14 at 09:10
  • @kasperd - thanks, wasn't sure if there was a different code for that. Either way, the server is responsible for sending errors and the client should be responsible for responding appropriately. – mfinni Oct 27 '14 at 14:21

2 Answers


First and foremost, if you absolutely can't tolerate even small amounts of downtime (such as when a failover occurs), then you should implement retry logic in your client application.

If the response to these requests isn't time sensitive (e.g. it's a log, and it doesn't matter if the log isn't delivered immediately so long as it's recorded), then I'd definitely consider using a queue-based architecture.
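As a rough sketch of what "queue-based" means here: the web tier only enqueues and acknowledges, and a separate worker writes to the database, leaving the message queued if the write fails. The names `accept_request` and `drain` are hypothetical, and the in-memory `queue.Queue` is only a stand-in for a durable queue such as SQS.

```python
import queue
from typing import Callable


def accept_request(body: str, pending: "queue.Queue[str]") -> int:
    """Web tier: enqueue the request body and acknowledge. No database work here."""
    pending.put(body)
    return 202  # Accepted: the work will be done later


def drain(pending: "queue.Queue[str]", store: Callable[[str], None]) -> int:
    """Worker: write queued bodies to the database.

    Stops at the first failure and puts that body back, so nothing is lost.
    Returns the number of bodies written.
    """
    written = 0
    while True:
        try:
            body = pending.get_nowait()
        except queue.Empty:
            return written
        try:
            store(body)
        except Exception:
            # Database trouble: requeue (at the back) and try again on the next run.
            pending.put(body)
            return written
        written += 1
```

Note that requeueing changes the order of messages, so this only suits requests that do not depend on ordering.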

SQS is the obvious choice for queues on AWS, but keep in mind:

  • While it is distributed and highly available, individual nodes do fail from time to time. You'll still need retry logic in your client if you happen to get a dud SQS node.
  • SQS only guarantees "at least once" delivery, so you may receive a message more than once. In my experience this is rare and presumably most often occurs when a node fails.
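Because of the "at least once" point, the consumer has to tolerate duplicates. A minimal sketch, assuming each message is a dict with `MessageId` and `Body` keys (the shape SQS clients typically return) and that `seen_ids` would in practice be a durable store such as a unique key in the database rather than an in-memory set; `process_at_least_once` is a hypothetical name:

```python
from typing import Callable, Iterable, Set


def process_at_least_once(
    messages: Iterable[dict],
    handler: Callable[[str], None],
    seen_ids: Set[str],
) -> int:
    """Run handler once per distinct message ID, skipping redeliveries.

    Returns the number of messages actually handled.
    """
    handled = 0
    for message in messages:
        message_id = message["MessageId"]
        if message_id in seen_ids:
            # Duplicate delivery of something already processed: ignore it.
            continue
        handler(message["Body"])
        # Record the ID only after the handler succeeded, so a failure
        # leads to a retry rather than a silently dropped message.
        seen_ids.add(message_id)
        handled += 1
    return handled
```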

Also make sure your infrastructure is replicated across availability zones, and preferably also regions. Your client could for example try SQS in another region when submitting to the primary region fails.
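The regional fallback could look roughly like this. It is only a sketch: `submit_with_failover` is a hypothetical helper, and each entry in `senders` is assumed to be a function that submits the payload to the queue in one region and raises on failure.

```python
from typing import Callable, Sequence


def submit_with_failover(
    payload: str,
    senders: Sequence[Callable[[str], None]],
) -> int:
    """Try each sender in order; return the index of the first that succeeds."""
    last_error = None
    for index, send in enumerate(senders):
        try:
            send(payload)
            return index
        except Exception as error:
            # This region failed; remember why and move on to the next one.
            last_error = error
    raise RuntimeError("all queue endpoints failed") from last_error
```

If the primary times out after actually accepting the message, the fallback produces a duplicate, which is one more reason the consumer must handle duplicates.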


Implementing a web service that never fails to process even a single HTTP request is very hard. And it is probably not worth the effort. Even if you manage to get the service to handle every single POST request and send a successful reply, there are other problems the client could experience:

  • Some middlebox between client and server tracks the connection and drops its state.
  • A short period of high packet loss causes the TCP stack on the client to time out the connection.
  • The connection times out at the application level on the client side.

All of those have to be handled by the client in exactly the same way as a 5xx error code, which is as follows:

  • Make no assumption about whether the request was processed or not. If the request has not been designed to be idempotent, the client has to perform a somewhat complicated recovery to identify whether the request needs to be resubmitted.
  • The client must retry using exponential backoff to prevent the service from melting down under high load.
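The two rules above can be sketched as follows. This is an illustration under assumptions, not a definitive client: `post_with_retry` is a hypothetical name, `send` is assumed to perform the HTTP POST and return the status code (or raise on a connection failure or timeout), and the server is assumed to deduplicate on the request ID so that resubmitting is safe. Random jitter is added on top of the exponential backoff so that many clients do not retry in lockstep.

```python
import random
import time
import uuid
from typing import Callable


def post_with_retry(
    send: Callable[[str, bytes], int],
    body: bytes,
    max_attempts: int = 6,
    base_delay: float = 0.5,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send body until a non-5xx status comes back, backing off between tries."""
    # One ID for all attempts, so the server can recognise a resubmission.
    request_id = str(uuid.uuid4())
    for attempt in range(max_attempts):
        try:
            status = send(request_id, body)
            if status < 500:
                return status
        except (ConnectionError, TimeoutError):
            # Outcome unknown: the request may or may not have been processed.
            pass
        if attempt < max_attempts - 1:
            # The ceiling doubles on each attempt: 0.5 s, 1 s, 2 s, ... up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    raise RuntimeError("request %s failed after %d attempts" % (request_id, max_attempts))
```

Timeouts and 5xx responses go through the same retry path, since in both cases the client cannot know whether the request was processed.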