How to play traffic against a shadow network?

Question

Sorry if this is a newb question...

I've heard stories of Netflix and Twitter being able to duplicate web traffic amongst two separate infrastructures: one is the authoritative/trusted one that goes back to the user; and the other is a 'shadow' or test infrastructure that thinks it is returning to the user but doesn't. The point is to test the secondary infrastructure at real-life load and timing.

I'm pretty sure there's a word to describe this, but 'bridge' doesn't seem to be the right one, nor does 'replay'.

Can anyone help me with what this technique is called and/or what tools can be used to accomplish this?

I guess that I should add that I've heard about techniques that are effectively 'replaying logs', but that's really difficult to get at real speeds/distributions.

And, we're not trying to verify 'correctness' of the output, but just make sure that we don't see errors/stacktraces/etc in the new infrastructure.

The obvious way to do this (using a switch with a mirror port to duplicate inbound traffic) seems like it would cause problems when those "shadow" servers try to reply. Now you've got me interested in the unobvious way. — DerfK, Jul 12 '12 at 23:42
@DerfK: Replaying simple layer 2 or 3 captures would be problematic if you're not going to write code to simulate the TCP/IP stack of the remote client. Capturing up at layer 7 is more the way to go unless you want to write a lot of code. — Evan Anderson, Jul 13 '12 at 00:03
I don't think it is hard to implement it at packet-level. Please refer to tcpcopy (https://github.com/wangbin579/tcpcopy) — , Mar 19 '13 at 06:08

score 7 · Answer 1 · answered Jul 13 '12 at 00:03

I'd call it "load testing via session replaying", personally. I don't know of any simple catch-all term for this kind of testing technique.

The basic strategy that I've seen employed for this kind of load testing is to ingest log files from the production system and replay them on a test system.

You can use tools like JMeter or Apache Bench to replay requests from log files. If you're looking at replaying very complex client / server interactions (with specific timing details based on the original log stream) in hopes of really exercising the innards of your application (looking for race conditions, timing-related bugs, etc) you might look at writing application-specific testing tools that simulate clients at scale.
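As a rough illustration of log-driven replay that keeps the original timing, here is a minimal Python sketch. It assumes access logs in Common Log Format, replays only GET requests, and sends them sequentially; the names `parse_line`, `build_schedule`, `replay`, and the `speedup` parameter are made up for this example and are not part of JMeter or Apache Bench.

```python
import re
import time
import urllib.error
import urllib.request
from datetime import datetime

# Assumption: logs are in Common Log Format, e.g.
# 1.2.3.4 - - [12/Jul/2012:23:42:01 +0000] "GET /a HTTP/1.1" 200 512
LOG_RE = re.compile(r'\[(?P<ts>[^\]]+)\] "(?P<method>[A-Z]+) (?P<path>\S+) [^"]*"')


def parse_line(line):
    """Return (timestamp, method, path), or None if the line does not match."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
    return ts, m.group("method"), m.group("path")


def build_schedule(lines):
    """Turn log lines into (seconds_after_first_request, path) pairs for GETs."""
    entries = [e for e in map(parse_line, lines) if e and e[1] == "GET"]
    entries.sort(key=lambda e: e[0])
    if not entries:
        return []
    start = entries[0][0]
    return [((ts - start).total_seconds(), path) for ts, _, path in entries]


def replay(schedule, base_url, speedup=1.0):
    """Send each request at its original offset (divided by speedup)."""
    began = time.monotonic()
    statuses = []
    for delay, path in schedule:
        wait = delay / speedup - (time.monotonic() - began)
        if wait > 0:
            time.sleep(wait)
        try:
            with urllib.request.urlopen(base_url + path, timeout=10) as resp:
                statuses.append(resp.status)
        except urllib.error.HTTPError as err:
            statuses.append(err.code)   # server answered with an error status
        except OSError:
            statuses.append(None)       # connection failure or timeout
    return statuses
```

Because this sends one request at a time, it cannot reproduce the concurrency of real traffic, and access logs usually carry only one-second timestamps. That is part of why log replay struggles to match real speeds and distributions.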

You're not going to be able to simply capture boatloads of raw network traffic and "replay" it with any TCP or IP-based protocol. TCP sequence numbers aren't going to match the original captured traffic and it's not going to work. IP-layer captures are going to be problematic because your simulated clients will need to answer for the captured sender's IP address. You'd be better off capturing traffic closer to layer 7 and using that to replay sessions because, otherwise, you're looking at writing a TCP simulator, too. (I could imagine using something like tshark to bust out the layer 7 data and timing from a TCP stream and replaying that, for example.)

Simply replaying network traffic simulates load but doesn't necessarily capture defects. Your simulated client would need to receive responses from the test server and parse them for correctness if you wanted the load test to verify that the application is responding properly. Since your application is going to generate dynamic response data it's unlikely that your simulated client can simply compare the test server's response to the logged response from the production server. This is where you're going to get into writing a test harness specific to your application and its output.
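If, as in the question, the goal is only to spot errors and stack traces rather than to verify correctness, the harness can be much simpler. A minimal Python sketch follows; the marker strings and the helper names `looks_like_failure` and `summarize` are illustrative assumptions, and a real application would need its own list of markers.

```python
# Assumption: these substrings are typical of leaked errors; adjust per application.
ERROR_MARKERS = (
    "Traceback (most recent call last)",  # Python stack trace
    "Exception in thread",                # Java stack trace
    "Internal Server Error",
)


def looks_like_failure(status, body):
    """Flag a response as failed on a missing status, a 5xx status, or an error marker."""
    if status is None or status >= 500:
        return True
    return any(marker in body for marker in ERROR_MARKERS)


def summarize(results):
    """results is a list of (path, status, body) tuples collected from the test server."""
    failures = [
        (path, status)
        for path, status, body in results
        if looks_like_failure(status, body)
    ]
    return {"total": len(results), "failed": len(failures), "failures": failures}
```

This deliberately ignores whether a 200 response contains the right data; it only answers "did the new infrastructure blow up?".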

score 1 · Answer 2 · answered Jul 13 '12 at 01:07

You use a service like BrowserMob which simulates a lot of people accessing your website at once. These services don't replay logged traffic, because then you'd be missing the client side of the conversation. E.g., your servers would be trying to send packets to computers on the Internet that aren't expecting to receive them. But what these companies do is study the logs (generally at an application-level, not packet-level) and use that information to figure out which pages people are clicking on, how often, and in what sequence. This data is used to write scripts/macros which BrowserMob then repeats.

ApacheBench, as mentioned by another user, isn't really used much these days. It was more helpful 10 years ago when you just needed to figure out how quickly a static HTML document or JPEG can be served up under a heavy load. It's not a whole lot different than a bunch of people clicking reload, reload, reload over and over again on their web browser. You need something a bit smarter when testing a web app that has a more complex workflow.

score 1 · Answer 3 · answered Jul 13 '12 at 01:52

I don't think you could do this at a network layer, though you could possibly get a specialized kernel for a hardware load balancer to handle the second server. Basically web traffic (TCP) requires an acknowledgement of the data that is sent/received. So if a user sends a packet to your network, it would get duplicated to both your prod network and your shadow network. The servers in each network reply, and the prod server's packet is forwarded back to the user's machine, which shoots back an acknowledgement, and they merrily carry on their conversation. However if you drop your shadow server's packet, it won't see an acknowledgement. So, it will try resending it, and at the same time slow down its transmission rate on that connection (TCP's retransmission backoff and congestion control). It will keep retrying to send it until it times out, and the session is torn down. Honestly, you wouldn't even be able to complete a handshake to establish a connection in the first place.

About the closest you could come to this would be forwarding the original synchronization (SYN) packet to your shadow server and then setting the default gateway for those boxes to some nonexistent location. Then anytime a user tried to set up a connection they'd get a real server on your prod network, and at the very least you'd send a SYN packet to the shadow network. Darn, now you have me wondering how you could make this work too :)

score 1 · Answer 4 · answered Jun 16 '13 at 17:53

I was able to ask @adrianco about this at a Netflix meetup.

The answer was that they wrote their own tool, which is basically a ServletFilter (sorry, Java-specific terminology) that recreates the current request and does an asynchronous fire-and-forget invocation on a target server.

The benefits are:

'Real World' traffic patterns against your test ("dark") infrastructure
No need to record and then replay

The drawbacks:

Gotta have the threads/CPU cycles to spare on your production boxes
Latency on your test infrastructure could back up and affect your production boxes
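To make the idea concrete, here is a rough Python analogue written as WSGI middleware. This is not Netflix's tool (which is not shown here) and not a ServletFilter; `ShadowMiddleware`, `shadow_base_url`, `send`, and `max_pending` are names invented for this sketch. It copies the method, path, query string, and body only (headers are omitted for brevity) and uses a bounded queue so that a slow shadow server causes copies to be dropped rather than backing up production.

```python
import io
import queue
import threading
import urllib.request


class ShadowMiddleware:
    """Hypothetical WSGI middleware: serve the request normally, mirror a copy."""

    def __init__(self, app, shadow_base_url, send=None, max_pending=100):
        self.app = app
        self.shadow_base_url = shadow_base_url.rstrip("/")
        self.send = send or self._http_send
        self.pending = queue.Queue(maxsize=max_pending)
        self.dropped = 0
        threading.Thread(target=self._worker, daemon=True).start()

    def __call__(self, environ, start_response):
        length = int(environ.get("CONTENT_LENGTH") or 0)
        body = environ["wsgi.input"].read(length) if length else b""
        environ["wsgi.input"] = io.BytesIO(body)  # let the real app re-read the body
        url = self.shadow_base_url + environ.get("PATH_INFO", "/")
        if environ.get("QUERY_STRING"):
            url += "?" + environ["QUERY_STRING"]
        try:
            self.pending.put_nowait((environ["REQUEST_METHOD"], url, body))
        except queue.Full:
            self.dropped += 1  # shadow is backed up: shed the copy, never block prod
        return self.app(environ, start_response)

    def _worker(self):
        while True:
            method, url, body = self.pending.get()
            try:
                self.send(method, url, body)
            except Exception:
                pass  # fire-and-forget: shadow failures never reach the user
            finally:
                self.pending.task_done()

    @staticmethod
    def _http_send(method, url, body):
        req = urllib.request.Request(url, data=body or None, method=method)
        with urllib.request.urlopen(req, timeout=2) as resp:
            resp.read()  # response is discarded
```

The single worker thread and the queue limit are the knobs that trade shadow coverage against the two drawbacks listed above: more workers cost production CPU, and a larger queue lets shadow latency consume more production memory before copies are shed.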
