Your constraint appears to be coming from the network limits on the instance type, not EBS itself.
It takes some reading between the lines, but the EBS Optimized Instances documentation tells an interesting story -- your numbers are actually better than the estimated IOPS these instance types claim to support.
EBS Optimized instances have two network paths, one of them dedicated to EBS connectivity, instead of a single network path shared by all IP traffic in and out of the instance. Although the documentation is not explicit about this, the speeds appear to be the same whether the instance is EBS Optimized or not; the difference is that on an optimized instance, EBS traffic doesn't have to share the pipe with everything else. Total bandwidth to the instance is doubled, with half allocated to EBS and half to everything else.
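If you want to check whether a given instance actually has the dedicated EBS path, the flag is exposed through the AWS CLI (the instance ID here is just a placeholder):

# aws ec2 describe-instance-attribute --instance-id i-0123456789abcdef0 --attribute ebsOptimized
{
    "InstanceId": "i-0123456789abcdef0",
    "EbsOptimized": {
        "Value": false
    }
}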
You mentioned using an r3.large instance, and that's not shown in the table... but if we extrapolate backwards from the r3.xlarge, the numbers there are pretty small.
As noted in the docs, the IOPS estimates are “a rounded approximation based on a 100% read-only workload,” and since the connections at the listed speeds are full duplex, the numbers could be higher with a mix of reads and writes.
type        network mbits/s    mbytes/s    estimated peak IOPS
r4.large    400                50          3,000
r4.xlarge   800                100         6,000
r3.large    250                31.25       2,000 (ratio-based speculation)
r3.xlarge   500                62.5        4,000
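The IOPS column appears to be simple arithmetic on the bandwidth, assuming the 16 KiB I/O size AWS uses for these approximations. For the r4.large row, for example:

# echo $(( 400 / 8 * 1000000 / 16384 ))
3051

...which rounds to the 3,000 shown in the table.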
Testing one of my r3.large instances by scanning the first 512 MiB of a 500 GiB gp2 volume seems to confirm this network speed. The machine is not EBS Optimized and was not handling any meaningful workload at the time of the test. The result is consistent with my previous observations on the r3.large: my design assumption has been, for some time, that these machines have only about 0.25 Gbit/s of connectivity, but the test seemed worth repeating. This is, of course, a Linux system, but the underlying principles should all hold.
# sync; echo 1 > /proc/sys/vm/drop_caches; dd if=/dev/xvdh bs=1M count=512 | pv -a > /dev/null
512+0 records in
512+0 records out
536870912 bytes (537 MB) copied, 14.4457 s, 37.2 MB/s
[35.4MB/s]
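If you wanted to poke at the full-duplex point from the docs, a rough sketch would be to run a read stream and a write stream at the same time and compare the rates -- something like this, with the write going to a scratch file on a second volume (the path is hypothetical, and oflag=direct keeps the page cache from absorbing the write):

# sync; echo 1 > /proc/sys/vm/drop_caches
# dd if=/dev/xvdh bs=1M count=512 | pv -a > /dev/null &
# dd if=/dev/zero of=/mnt/scratch/ddtest bs=1M count=512 oflag=direct

dd reports its own rate when it finishes, so pv isn't needed on the write side. If both streams hold close to the rate above at the same time, that would be the full-duplex behavior the docs describe.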
That looks very much like a ~250 megabit/sec network connection, which, when you need storage throughput, is not a lot of bandwidth. Counterintuitively, if your workload is an appropriate fit for the t2 CPU credit model, you'll actually get better performance from a t2 than you'll get from an r3.