8

I am working on a project where my job is to build a database system to manage about 60,000,000,000 data entries.

The project background: I have to do real-time storage of a large number of messages read from about 30,000 RFID readers every second. Assuming every RFID reader generates 6,000 messages per day, I have to insert 180,000,000 entries into the database daily.
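As a quick sanity check on these numbers (a sketch that only restates the figures given above):

```python
# Sanity-check of the stated ingest rates, using only the numbers
# from the question: 30,000 readers, 6,000 messages per reader per day.
readers = 30_000
msgs_per_reader_per_day = 6_000
seconds_per_day = 86_400

rows_per_day = readers * msgs_per_reader_per_day
avg_rows_per_second = rows_per_day / seconds_per_day
rows_per_year = rows_per_day * 365

print(rows_per_day)                 # 180000000  -> 180 million rows/day
print(round(avg_rows_per_second))   # 2083       -> ~2,083 rows/s on average
print(rows_per_year)                # 65700000000 -> ~65.7 billion rows/year
```

The yearly total is consistent with the ~60 billion entries mentioned at the top.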

A typical data entry looks like "time_stamp, Reader_ID, Tag_ID, other_msg_content".

There will be queries (SELECT) based on time range, Reader_ID, and Tag_ID. The queries won't be very complicated.
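To make the schema concrete, here is a minimal sketch of the table and the indexes those three query patterns would need. It uses SQLite so the demo is self-contained, but the DDL is analogous in MySQL/InnoDB. The column types are assumptions, not from the original post.

```python
import sqlite3

# Sketch of the readings table plus one index per query pattern
# (time range, Reader_ID, Tag_ID). Types are assumed for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE readings (
    time_stamp INTEGER NOT NULL,   -- Unix epoch seconds (assumed)
    reader_id  INTEGER NOT NULL,
    tag_id     INTEGER NOT NULL,
    other_msg_content TEXT
);
-- Composite indexes matching the three stated query patterns:
CREATE INDEX idx_time   ON readings (time_stamp);
CREATE INDEX idx_reader ON readings (reader_id, time_stamp);
CREATE INDEX idx_tag    ON readings (tag_id, time_stamp);
""")

conn.execute("INSERT INTO readings VALUES (1314230400, 17, 42, 'msg')")
rows = conn.execute(
    "SELECT tag_id FROM readings "
    "WHERE reader_id = 17 AND time_stamp BETWEEN 1314230000 AND 1314240000"
).fetchall()
print(rows)  # [(42,)]
```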

Now I am designing the database system, and I plan to use MySQL. My (perhaps dumb) questions are:

  1. Is it wise to use MySQL, or should I resort to Oracle (which is expensive) or HBase?

  2. If I must use MySQL, any ideas on how to build the cluster?

  3. If I insert the messages into a single table, it will soon become very long. I would like to use sharding techniques to split one long table into many short tables.

    3.a. I want to know the proper length for a MySQL InnoDB table, i.e., after how many data entries have been inserted should I start sharding?

    3.b. Is there any good sharding proxy solution out there? I know of Spock Proxy and some others; I need recommendations.

  4. Do I have to use MySQL Cluster, or can I just use MySQL master servers and sharded slaves, using replication to achieve high availability?

  5. Assuming I have to handle 20 TB of data in MySQL (for 1 year), I plan to use 20 nodes (cheap PC servers) and store 1 TB of data per node. Is this possible? Any comments are welcome.
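To illustrate what the sharding in question 3 could look like on the application side, here is a minimal sketch assuming hash sharding on Tag_ID across the 20 nodes from question 5, plus daily table rotation on time_stamp so no single table grows unbounded. The function names are illustrative, not from Spock Proxy or any other product.

```python
import datetime

N_SHARDS = 20  # one shard per node, per question 5 (an assumption)

def shard_for(tag_id: int) -> int:
    """Pick one of the N_SHARDS nodes by hashing the tag ID."""
    return tag_id % N_SHARDS

def table_for(ts: int) -> str:
    """Rotate to a new table each day, keyed by the row's timestamp."""
    day = datetime.datetime.fromtimestamp(
        ts, tz=datetime.timezone.utc
    ).strftime("%Y%m%d")
    return f"readings_{day}"

print(shard_for(123456))      # 16 -> this tag's rows live on node 16
print(table_for(1314230400))  # readings_20110825
```

With this scheme, a Tag_ID query touches exactly one node, while a pure time-range query must fan out to all shards; whether that trade-off is acceptable depends on the query mix.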

Many thanks.

  • [RRDtool](http://oss.oetiker.ch/rrdtool/) or [Vhayu](http://www.vhayu.com/) might be worth investigating, but it sounds like many of the infrastructures detailed on [High Scalability](http://highscalability.com/) are also worth a look. – Steve-o Aug 25 '11 at 08:19
  • How do you intend to have high availability when you say your MySQL data will take 20 TB and then you say each node will "split" this data into 1 TB pieces? Or maybe your MySQL database will occupy 1 TB and you'll replicate it to 20 nodes? You've got me confused. – Bruno Flávio Aug 25 '11 at 08:50
  • Hi Qsp, thanks for asking. I meant: firstly, use 20 nodes to handle the 20 TB of data; secondly, use 20 more nodes as replica nodes to achieve HA. My major concern is how to store and manage 20 TB of data using MySQL; HA is the second thing to worry about. –  Aug 25 '11 at 09:00
  • First, read gbn's answer. This is like deciding what engine to use before you know if you are building a car or a plane. Having said that, take a look at [OpenTSDB](http://opentsdb.net/), as it seems to fit the storage requirements of what you want; no idea how you want to query it, though, so I can't help you there. –  Aug 25 '11 at 17:19
  • RFID output is unstructured data. How can you store or retrieve it using an RDBMS? Use HBase instead. –  Jul 05 '13 at 03:51

1 Answer

11

Thoughts:

  • If you are asking these questions on a public forum, hire experts to do it for you
  • Consider Postgres and SQL Server, which will also scale to this volume
  • Do you need ACID? No = consider NoSQL
  • Design and hardware matter more than the platform
  • Don't virtualise or cut other hardware corners
  • What is your RPO/RTO?
  • Maintenance window? i.e., are you really 24/7/365, a.k.a. 30k rows per second all the time?
  • Archiving?
  • Do you need older (say 6 months) online?
  • Budget?
  • Realistic testing required to validate architecture and design for stated load
  • 20 TB is probably too small
  • 6k per RFID per day but 30k per second? There are 86.4k seconds per day, so only 1 in 14 RFIDs writes in any given second: what about potential peak loads of 420k+ rows per second?
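The rate arithmetic in that last bullet can be checked directly; this sketch uses only the figures stated in the question:

```python
# Per-reader write rate and worst-case burst, from the question's figures.
readers = 30_000
msgs_per_reader_per_day = 6_000
seconds_per_day = 86_400

# Each reader writes one message roughly every 14 seconds...
seconds_between_writes = seconds_per_day / msgs_per_reader_per_day
print(round(seconds_between_writes))  # 14

# ...so on average only ~1 in 14 readers writes in any given second,
# but if all 30k readers' 14-second cycles ever aligned into one second,
# the burst would be 14x the 30k figure:
print(readers * 14)  # 420000 rows/s worst-case burst
```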

Finally

  • This isn't a database question but an architecture question
  • You're asking the wrong questions, too soon for this requirement
gbn
  • Hi gbn, thank you very much for your insightful comments. I will follow up on your questions soon. –  Aug 25 '11 at 10:30