
A part of the platform I am building requires a large data table (starting at tens of millions of records, scaling to hundreds of millions within a year or two, maybe reaching billions at some point). The structure of the data table is: int, int, float, datetime, datetime. This data table will accept data from a single source (import script) in batches of up to ten million rows. I have full control over the import script. Various applications will be pulling data via web API and probably a custom TCP server. I am expecting requests for individual rows in bursts of up to 50,000 per second. At first this sounds like a good application of a key-value design, but many of the requests will take the form:

select float where int1 = A and datetime1 < B and datetime2 < C order by datetime1 desc, datetime2 desc limit 0,1

The basic idea is that I'm getting the datapoint for a given series that has the latest datetime pair below a user-defined threshold. I will probably be able to do some logic in the application layer to pull an entire series at a time, but much of that ordering will still fall to the database layer.
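For concreteness, here is a rough sketch of the schema and the kind of covering index I would expect to need so that lookup can be answered from the index alone (table, column, and parameter names are placeholders, not the real ones):

CREATE TABLE SeriesData (                 -- placeholder table name
    SeriesId    int      NOT NULL,        -- the "int1 = A" in the query above
    OtherId     int      NOT NULL,
    Value       float    NOT NULL,        -- the float being returned
    EventTime1  datetime NOT NULL,        -- first datetime ("< B")
    EventTime2  datetime NOT NULL         -- second datetime ("< C")
);

-- Covering index so the lookup never has to touch the base table.
CREATE INDEX IX_SeriesData_Lookup
    ON SeriesData (SeriesId, EventTime1 DESC, EventTime2 DESC)
    INCLUDE (Value);

-- The sample query in T-SQL form, with example parameter values:
DECLARE @A int, @B datetime, @C datetime;
SET @A = 42;
SET @B = '2010-11-01';
SET @C = '2010-11-01';

SELECT TOP 1 Value
FROM SeriesData
WHERE SeriesId = @A AND EventTime1 < @B AND EventTime2 < @C
ORDER BY EventTime1 DESC, EventTime2 DESC;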

I'm currently running a prototype on SQL Server 2005 and it's very responsive at up to 1,000 requests per second with 10 million records. I am concerned about scaling to hundreds of millions of rows at 50,000 requests per second.

What do you think? Is MySQL the tool for the job because it's more lightweight than SQL Server? Should I look into NoSQL solutions (can any even handle the sample query)? Any other ideas are welcome.

Thanks!

theserge

  • Any idea on budget and dataset size? – Chopper3 Nov 10 '10 at 16:40
  • Have you considered approaching the design from a different angle, one where returning 50k rows/sec isn't a necessity. – Chris S Nov 11 '10 at 01:33
  • Hardware budget is $30-70k depending on performance. Dataset will start at 10mil rows and grow to 100mil within a few years. Will not get to a billion during the first 5-7 years. Each row is ~30 bytes, so 1-3 gb for the first few years – Serge Aluker Nov 11 '10 at 04:02
  • Chris S, users will be making ad-hoc queries (rarely the same rows) to the database and 50k requests at a time is the peak I'd like to design for. Did you have any specific thoughts on how to avoid this? – Serge Aluker Nov 11 '10 at 04:08

1 Answer


Is MySQL the tool for the job because it's more lightweight than SQL Server?

No.

Should I look into NoSQL solutions (can any even handle the sample query)?

In general, no - the primitive query capabilities there make them unsuitable for this.

Look at typical TickerPlant solutions. Expect to pay 50k USD upward. This is a VERY special requirement. Expect to use a LOT of servers.

TomTom
  • Couldn't agree more, you could do this off one 'server' but it would need memory-mapping and I'd seriously be looking at an IBM Z-series if my business relied on this kind of requirement. – Chopper3 Nov 10 '10 at 16:47
  • Thanks, TomTom. Can you give some examples of TickerPlant solutions? I looked at SpryWare, Exegy, and Mantara but they offer much more than just the database layer. What would you suggest on the database side if I were to start from scratch? Master-slave sql server configuration? Why not MySQL? – Serge Aluker Nov 10 '10 at 17:14
  • Both suck for the amount of queries you have - they are generic databases and simply not optimized for stuff like time series data, which is what you have. Ticker plant solutions and their optimized databases have a MUCH better chance to cope with the type of query you have here. You simply are so high in requirements that, brutally speaking, any general solution won't work. Tick type databases are made for that type of query - ONLY. – TomTom Nov 10 '10 at 22:26
  • Check: kdb+ (kx.com), onetick.com – TomTom Nov 10 '10 at 22:27
  • TomTom, thanks. I've read up on kdb and other ticker plants, and it seems that they are all designed around a column database structure. Is this what gives the performance boost? Would you look at any of the open source options like Infobright and InfiniDB? Looking at benchmarks of those and other column-oriented databases, it looks like they are geared towards single queries that span across many rows like aggregations as opposed to many simple queries that access 1 row each. Thoughts? – Serge Aluker Nov 11 '10 at 04:12
  • Yes. Make benchmarks yourself ;) not meant snippy - but you really need to test that out. The concept here really is column-based and fixed-length, both of which can give a LOT of optimizations which, incidentally, a normal db cannot match. – TomTom Nov 11 '10 at 06:27
  • I benchmarked InnoDB (MySQL 5.1) vs ICE 3.4 vs MSSQL 2005 on a 30mil row table using the query above. All tests were on similarly built 8 GB, 8-core Xeon machines. ICE was worst at 20 qps! So much for column databases. MSSQL got 12k qps. InnoDB got to 15k qps. The entire index fit into RAM in MSSQL and InnoDB (roughly 900 MB) and the whole table fit into RAM on ICE at 100 MB. Anything else I should try? – Serge Aluker Nov 16 '10 at 00:33
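For reference, a minimal single-connection driver for this kind of lookup benchmark on the SQL Server side might look like the sketch below. It assumes the placeholder SeriesData schema sketched in the question, uses SQL Server 2005 syntax (no inline DECLARE initializers), and would need to be run from many connections in parallel, with the per-connection rates summed, to approximate a 50k/sec load:

DECLARE @i int, @iterations int, @start datetime, @v float;
SET @i = 0;
SET @iterations = 100000;
SET @start = GETDATE();

WHILE @i < @iterations
BEGIN
    -- Vary the series id so the loop doesn't hammer a single hot row.
    SELECT TOP 1 @v = Value
    FROM SeriesData
    WHERE SeriesId = @i % 1000
      AND EventTime1 < GETDATE()
      AND EventTime2 < GETDATE()
    ORDER BY EventTime1 DESC, EventTime2 DESC;

    SET @i = @i + 1;
END;

-- Rough queries/sec for this one connection:
SELECT @iterations * 1000.0 / NULLIF(DATEDIFF(ms, @start, GETDATE()), 0) AS qps;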