We've set up a system that sends a message to an SQS queue when there's an outage on a dependency. To simulate an outage and test the system, I blocked the outbound port to a database in the security group, but found that the EC2 instance could still retrieve data from the database over JDBC connections long after the port was blocked (more than 10 minutes, less than two hours).

What's going on? Security group changes are supposed to take effect immediately, so I assume this has to do with live connections not being shut down?

Is there a better way to simulate an outage given that we don't want to actually shut down the database?

Hazel Troost

1 Answer

Security group rule changes do take effect almost immediately.

However, what the rules allow is the establishment of connections. Security groups are stateful: once a connection is up, the network remembers the connection's tuple (protocol, source/dest address, source/dest port), and the connection is allowed to continue because it has already been created.

By contrast, network ACLs are stateless. Blocking the connections with the network ACL should have the effect you're looking for, though perhaps not precisely the same, because the database can fail in multiple ways that may manifest themselves differently.
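To make the stateful/stateless difference concrete, here is a toy model in Python. It is only an illustration of the behavior described above, not how AWS implements either feature; the class names, addresses, and port 3306 are made up for the example.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    """The tuple that identifies one connection."""
    protocol: str
    src: str
    src_port: int
    dst: str
    dst_port: int


class StatefulFilter:
    """Toy stand-in for a security group: rules gate NEW connections only."""

    def __init__(self, allowed_ports):
        self.allowed_ports = set(allowed_ports)
        self.tracked = set()  # flows that were allowed when they started

    def permits(self, flow):
        if flow in self.tracked:
            return True  # already established, so it keeps working
        if flow.dst_port in self.allowed_ports:
            self.tracked.add(flow)
            return True
        return False


class StatelessFilter:
    """Toy stand-in for a network ACL: every packet is checked against the current rules."""

    def __init__(self, allowed_ports):
        self.allowed_ports = set(allowed_ports)

    def permits(self, flow):
        return flow.dst_port in self.allowed_ports


# A pooled JDBC connection that was opened before the rule change.
jdbc = Flow("tcp", "10.0.1.10", 49152, "10.0.2.20", 3306)
sg, nacl = StatefulFilter({3306}), StatelessFilter({3306})
sg.permits(jdbc)
nacl.permits(jdbc)

# "Block the port" in both filters.
sg.allowed_ports.clear()
nacl.allowed_ports.clear()

print(sg.permits(jdbc))                          # True: existing flow survives
print(nacl.permits(jdbc))                        # False: dropped immediately
print(sg.permits(jdbc._replace(src_port=49153)))  # False: new connection is refused entry
```

This matches what you observed: connections that were already open (for example, in a connection pool) keep working, while a brand-new connection would be blocked.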

When network ACLs deny traffic (or security groups deny new connections), the effect is a timeout, because the denied packets are silently dropped, with no message sent in the reverse direction to indicate that there's a black hole in the network.

Real-world failures, on the other hand, might instead result in network errors like "destination host unreachable", "connection refused", or "connection reset by peer". Each of these should tend to fail faster than a timeout, and there isn't a way to simulate them inside the VPC infrastructure.
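If you want to see which of these failure modes your test actually produces, a small probe run from the instance can distinguish them. This is a minimal sketch using only Python's standard `socket` module; the function name `probe` and the labels it returns are my own choices for illustration.

```python
import socket


def probe(host, port, timeout=3.0):
    """Try to open a NEW TCP connection and report how the attempt ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        # No reply at all: what dropped packets look like from the client side.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered with a reset: nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # Host/network unreachable, DNS failure, and so on.
        return f"error: {exc.strerror or exc}"
```

Calling `probe(db_host, db_port)` (with your own database host and port) while the block is in place should report a timeout if the packets are being dropped. Because it opens a fresh connection each time, it also tells you whether new connections are really being blocked.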

But simulating a failure with timeouts should be a very worthwhile test, and Network ACLs should facilitate that.
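As a rough sketch of how the network ACL approach could be scripted, the following uses boto3's `create_network_acl_entry` and `delete_network_acl_entry` calls. The ACL ID, CIDR, port, and rule number in the usage note are placeholders, and the helper names are hypothetical. It assumes the database sits in a different subnet from the instance, since network ACLs only filter traffic crossing the subnet boundary.

```python
def deny_egress_entry(network_acl_id, db_cidr, db_port, rule_number=50):
    """Build the arguments for a NACL entry that drops outbound TCP to the database.

    rule_number must be lower than the rule that currently allows the traffic,
    because NACL entries are evaluated in ascending order and the first match wins.
    """
    return {
        "NetworkAclId": network_acl_id,
        "RuleNumber": rule_number,
        "Protocol": "6",          # 6 = TCP
        "RuleAction": "deny",
        "Egress": True,           # outbound from the instance's subnet
        "CidrBlock": db_cidr,
        "PortRange": {"From": db_port, "To": db_port},
    }


def start_outage(entry):
    """Add the deny entry (needs boto3 and AWS credentials)."""
    import boto3
    boto3.client("ec2").create_network_acl_entry(**entry)


def end_outage(entry):
    """Remove the deny entry again to restore connectivity."""
    import boto3
    boto3.client("ec2").delete_network_acl_entry(
        NetworkAclId=entry["NetworkAclId"],
        RuleNumber=entry["RuleNumber"],
        Egress=entry["Egress"],
    )
```

For example, `entry = deny_egress_entry("acl-0123456789abcdef0", "10.0.2.20/32", 3306)` followed by `start_outage(entry)` and, after the test, `end_outage(entry)`. Keep in mind that a network ACL applies to the whole subnet, so a narrow CIDR such as a `/32` for the database limits the effect to that one destination.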

Note, of course, that if you are still able to establish new connections with the security group supposedly blocking the traffic, then your security group behavior is not what you believe it to be.

Michael - sqlbot