
First, for security and other reasons I won't be able to make use of S3 or a similar hosted solution.

I have a storage server with a 1 TB hard disk and a MySQL server running on it. Data is continuously being added to the database by our computers at about a GB per hour, so in about a month I will run out of storage.

I want to be able to add new hard disks and connect other systems to the network and link their storage. E.g., if I link one more 1 TB system, I want the available storage for the MySQL database to be 2 TB, i.e., distributed over the two systems.

A load-balancing option would also be great, i.e., the MySQL servers on both systems should be able to access the database.

How can I achieve this? (Open source solutions preferred.)

BMC
  • Does it have to be MySQL? Or can it be an SQL database designed for this kind of application? – David Schwartz Nov 25 '13 at 08:55
  • What is this, some kind of data warehouse? ... hmm, a database growing this fast on a 1 TB hard disk (is this only one SATA drive, with no RAID?)... I think that very soon the amount of storage will be your second worry, just behind speed and performance... please write some more details about this system (tech info) – B14D3 Nov 25 '13 at 09:11
  • @B14D3 Agree. This really smells like someone not thinking 2 steps ahead. Stuff like reliability and speed will come back and kill that once he hits 2-3 such nodes. – TomTom Nov 25 '13 at 09:36
  • We work on analyzing images from the internet to identify what are the most common images uploaded, what is the most common resolution etc. – BMC Nov 25 '13 at 10:51
  • By "contentiously begin" do you mean "continuously being"? – TomOnTime Nov 25 '13 at 11:15

4 Answers


Start a new MySQL server every time the current one is close to full. Rewrite the client software to access the right MySQL server depending on the date of the information it needs.

Of course, you need to have data that can be partitioned by datestamp. Queries that need to span servers will have to query each one and combine the results. Joins will be difficult. However, considering that you want infinite storage, you have to compromise elsewhere. You can't have infinite storage and still use MySQL.
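
A minimal sketch of what that client-side routing can look like, assuming each MySQL server owns a known date range. The host names, credentials, and the events table are made up for illustration, and PyMySQL is just one example of a client library:

    # Sketch only: route each row to the shard whose date range covers it.
    # Hosts, credentials, and the `events` table are illustrative assumptions.
    import datetime
    import pymysql

    # Each shard covers a half-open date range [start, end).
    SHARDS = [
        {"host": "mysql-2013",    "start": datetime.date(2013, 1, 1), "end": datetime.date(2014, 1, 1)},
        {"host": "mysql-2014-q1", "start": datetime.date(2014, 1, 1), "end": datetime.date(2014, 4, 1)},
    ]

    def shard_for(day):
        """Return the shard whose date range contains `day`."""
        for shard in SHARDS:
            if shard["start"] <= day < shard["end"]:
                return shard
        raise LookupError("no shard covers %s" % day)

    def connect(shard):
        return pymysql.connect(host=shard["host"], user="app",
                               password="secret", database="images")

    def insert_row(day, url, resolution):
        """A write goes to the single shard that owns this date."""
        conn = connect(shard_for(day))
        with conn.cursor() as cur:
            cur.execute("INSERT INTO events (day, url, resolution) VALUES (%s, %s, %s)",
                        (day, url, resolution))
        conn.commit()
        conn.close()

    def query_all(sql, params=()):
        """Queries that span dates fan out to every shard and combine the rows."""
        rows = []
        for shard in SHARDS:
            conn = connect(shard)
            with conn.cursor() as cur:
                cur.execute(sql, params)
                rows.extend(cur.fetchall())
            conn.close()
        return rows

Joins and aggregates across shards still have to be finished in the application; that is the compromise mentioned above.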

This works great for any database that is storing logs or other archive data that accumulates but doesn't change. Such data is also easy to partition by datestamp.

This is the scheme that Twitter used initially. They had one MySQL server to archive old tweets; when it filled up, they started a new one. Searches for "everything user X ever tweeted" sent the query to each server, starting with the newest and ending with the server that held the archive from when the account was created. All the old servers were set up with read-only replicas, as many as were needed to handle the query load. The system could therefore scale in both directions: scale up (moving to the next server for more space) and scale out (adding more replicas for more load).

However, what you will eventually find is that a relational database is a terrible choice for storing logs or other archive data that accumulates but doesn't change. Inserting many rows at a time involves locking that slows down the process and is wasteful if you can guarantee that all the data is "write once".

Twitter eventually moved to other storage technology, and you will find that you want to do the same. You will want to select a system that is built to grow indefinitely by adding more machines. The system then tracks which machines hold which data, and even though you send your queries to a master node, it does the right thing to find the results. Such systems include MongoDB, HBase, CouchDB, and, I think, Riak.
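
For a flavour of what that looks like from the client side, here is a sketch against MongoDB with PyMongo; the router host, database, collection, and document fields are assumptions, not a recommendation of a specific schema:

    # Sketch only: host, database, collection, and fields are illustrative.
    from pymongo import MongoClient

    client = MongoClient("mongodb://query-router:27017")
    collection = client.images.metadata

    # Write-once archive data maps naturally onto plain document inserts; in a
    # sharded cluster the router places each document on the shard that owns it.
    collection.insert_one({
        "day": "2013-11-25",
        "url": "http://example.com/some-image.jpg",
        "resolution": "1920x1080",
    })

    # Reads go to the same entry point; the cluster fans out to shards as needed.
    most_common = collection.aggregate([
        {"$group": {"_id": "$resolution", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
        {"$limit": 10},
    ])
    for doc in most_common:
        print(doc)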

If your data cannot be partitioned easily, this answer won't help you. In that case you'll need to look at adding more and more storage to the existing system. Adding a lot of disk to a SAN and connecting that to the machine is one solution.

TomOnTime

I'll take a stab here and assume that you aren't really interested in combining storage attached to several different physical machines, as your question indicates, but rather just want to be able to grow the storage solution on a single host as your storage needs grow.

If that is the case, I suggest that you take a very close look at ZFS. It is designed specifically to be able to deal with situations like this (among other things), and it is a general-purpose file system.

There is a Linux implementation, which unfortunately is still known to experience hiccups under certain workloads. If you prefer rock-solid stability, you can host the files on, e.g., a FreeBSD host and share them over NFS or SMB, or even just run the database on a FreeBSD system. You don't specify the OS, but your mention of MySQL and your preference for open source solutions both point toward *nix. The main caveat is that you really want a 64-bit system with lots of RAM for ZFS to be happy, but that is less of a concern today than it has been historically.

On ZFS, you work with what are known as zpools, which are roughly analogous to what you might otherwise think of as file systems. Each zpool is made up of one or more vdevs, each of which is in turn made up of one or more physical (or logical) devices. Within a zpool, you create what ZFS terminology calls file systems (separately mountable hierarchies).

When you add physical devices to a new or existing vdev, the pool automatically makes any additional storage capacity available and starts using it (if any is gained; for example, adding a mirror device to a vdev gains you redundancy but no extra space). Adding devices is a completely transparent, online operation, so if the storage hardware itself is hot-pluggable, it is possible to build a storage solution that has zero downtime during capacity upgrades.

user

You could consider using LVM, but that implies restructuring your file system, which can be a delicate operation. There is a nice explanation here: https://wiki.ubuntu.com/Lvm

Kiwy

If you literally get gigabytes per hour in a non-stop, never-ending flood, then design choices are limited. You might consider queueing up all new data on one machine that feeds it to the MySQL database. This way you can take the MySQL database down for maintenance: to add disks, connect to new SANs, and so on.

The queue machine will buy you as much maintenance time as it can hold data for, but remember that when it reconnects to the MySQL server it will need to catch up. For example, you might find that if the queue machine is used to store 4 hours of backlog, it takes 8 hours to empty that backlog into the MySQL server; it is now doing twice as many INSERTs as usual.
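
A rough sketch of such a queue machine, spooling batches to local files and replaying them whenever MySQL is reachable; the paths, credentials, and table are assumptions:

    # Sketch only: spool incoming rows to local files and replay them into MySQL
    # when the server is reachable. Paths, credentials, and the table are assumptions.
    import glob
    import json
    import os
    import time

    import pymysql

    SPOOL_DIR = "/var/spool/mysql-queue"

    def spool(rows):
        """Called by the producers: append a batch of rows to the local spool."""
        path = os.path.join(SPOOL_DIR, "%d.json" % time.time_ns())
        with open(path, "w") as fh:
            json.dump(rows, fh)

    def drain():
        """Replay spooled batches into MySQL, oldest first; keep files on failure."""
        for path in sorted(glob.glob(os.path.join(SPOOL_DIR, "*.json"))):
            with open(path) as fh:
                rows = json.load(fh)
            try:
                conn = pymysql.connect(host="mysql-primary", user="app",
                                       password="secret", database="images")
                with conn.cursor() as cur:
                    cur.executemany(
                        "INSERT INTO events (day, url, resolution) VALUES (%s, %s, %s)",
                        rows)
                conn.commit()
                conn.close()
            except pymysql.MySQLError:
                break          # MySQL is down (e.g. for maintenance); try again later
            os.remove(path)    # only forget a batch once it is safely inserted

    while True:
        drain()
        time.sleep(10)

After maintenance, the drain loop simply keeps running until the spool directory is empty again, which is the catch-up period described above.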

Tip: If you build such a queue machine, you will find it helpful to set up a monitoring dashboard that records how long batches wait before they are pushed to the MySQL server. Statistics on the wait time will help you manage the system. For example, if you graph the trailing 7-day data, the 90th percentile value is a good indicator of overall health. When that value is high, alert: something is wrong. You might also graph the 90th percentile for each week of data; this will let you see whether you are doing better or worse over time.
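
If each batch records how long it waited before being inserted, the dashboard statistic itself is tiny; a sketch using the standard library (the sample numbers and alert threshold are made up):

    # Sketch only: wait times (seconds) would come from the queue machine's logs.
    import statistics

    wait_seconds = [12, 35, 40, 41, 60, 75, 90, 120, 300, 900]  # example data

    p90 = statistics.quantiles(wait_seconds, n=10)[-1]  # 90th percentile
    print("90th percentile wait: %.0f seconds" % p90)

    if p90 > 3600:  # alert threshold is an assumption; tune it to your backlog budget
        print("ALERT: the queue is falling more than an hour behind")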

TomOnTime