
Which database should we choose? Any comparisons?

  • Existing: PostgreSQL
  • Issues
    • Not easily horizontally scalable; needs sharding, etc.
    • Clustering does not solve the data growth problem
  • Looking for: Any database that is easily horizontally scalable
    • Cassandra (Twitter uses that?)
    • MongoDB (rapidly gaining popularity)
    • Voldemort
    • Other?
  • Why?
    • Data growing with snowball effect
    • Existing PostgreSQL periodically locks tables for vacuum tasks, etc.
    • Archiving data is currently tedious
    • Periodic human interaction is required for the existing archive, vacuum, etc. processes
    • Need a 'set it. forget it. just add another server when data grows more.' type of solution
schwert

4 Answers


First question: Why are you on a relational database to begin with if you don't need ACID properties? It sounds like you're doing some kind of non-transactional work, so an RDBMS with transactions is probably too heavy for your environment.

Second question: What kind of data are you storing? It sounds like you need a column-store database, perhaps for some kind of data-warehouse project.

Third question: If you're stuck with PostgreSQL (which is a fine database as-is), is it the current version? Older pre-8.x versions are notoriously slow, but a lot of work has gone into improvements since then, and some of the issues you mention, like autovacuum, are now easily addressed with "set-and-forget" settings.

*  Data growing with snowball effect

Some additional info on this would be nice. Why is it snowballing? Can you normalize it to reduce storage?

* Existing PostgreSQL periodically locks tables for vacuum tasks, etc.

If this is an issue, I can tell already that you're running an older version. Newer versions have per-table controls for this and you can even turn it off entirely.
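
For illustration, a minimal sketch of those per-table controls, assuming PostgreSQL 8.4 or later; the table name and thresholds are examples only, and plain psql works just as well as issuing the statements through a driver:

```python
# Sketch: per-table autovacuum tuning (PostgreSQL 8.4+); "events" and
# the thresholds below are hypothetical examples.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")  # placeholder DSN
cur = conn.cursor()

# Vacuum this table more aggressively so it never falls far behind ...
cur.execute("""
    ALTER TABLE events SET (
        autovacuum_vacuum_scale_factor  = 0.05,
        autovacuum_analyze_scale_factor = 0.05
    )
""")

# ... or switch autovacuum off for this table entirely (rarely advisable).
# cur.execute("ALTER TABLE events SET (autovacuum_enabled = false)")

conn.commit()
cur.close()
conn.close()
```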

* Archiving data is currently tedious

Hard to make any kind of judgement here because there isn't much to work with. What media is the archive being dumped to? How much sustained I/O is involved? What time frame are you operating under? How much data? Does it need to be a "hot" dump or can it be "cold"?

* Periodic human interaction is required for the existing archive, vacuum, etc. processes

I'm trying to see how "normal" usage would require manual intervention, because it shouldn't. Vacuum is automatic now and (as mentioned before) can be set to not occur at all, and most backups are scripted (and when you can script, you can schedule). So how is this occurring?
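
To illustrate the "script it, then schedule it" point, here is a minimal backup sketch; the paths, user, and database name are placeholders:

```python
#!/usr/bin/env python
# Minimal sketch of a scriptable backup, suitable for a cron entry like:
#   0 2 * * * /usr/local/bin/pg_backup.py
# The paths, user, and database name are placeholders.
import datetime
import subprocess

stamp = datetime.date.today().isoformat()
outfile = "/var/backups/mydb-%s.dump" % stamp

# pg_dump -Fc writes a compressed, custom-format dump that pg_restore
# can later restore selectively.
subprocess.check_call([
    "pg_dump",
    "-Fc",             # custom (compressed) format
    "-U", "postgres",  # placeholder user
    "-f", outfile,
    "mydb",            # placeholder database name
])
```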

* Need a 'set it. forget it. just add another server when data grows more.' type of solution

You're talking about a clustered-server arrangement.

It sounds like the following to me:

  1. You're on an RDBMS and the transactional nature of it is not suited for your application.
  2. Your application appears to want a mostly-read style of database. It also doesn't sound like you need it to have transactional integrity.
  3. The volume of data you're handling is most likely not normalized, nor has any attempt been made to normalize it.
  4. You're doing waaaaaay too much by hand and need more automation.
  5. You like the idea of a clustered solution, possibly "cloud style" computing.

Other than that, there's not enough info here to figure out what a good fit would be.

Avery Payne

You might consider looking into HBase and HyperTable as well; but again, as Avery Payne mentioned, you don't give us any information about your current application, just your database platform.

Some things to keep in mind:

Joins are done manually on the non-SQL platforms. They won't do things like foreign keys, aggregates, etc. All of that is manual.
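
To make that concrete, here is a rough sketch of what an application-side join looks like; the data layout is hypothetical, and in a real key-value store each lookup would be a network round trip:

```python
# Sketch of an application-side join between two non-SQL "tables",
# modeled here as lists of dicts. The store maintains no foreign keys
# or join indexes for you; you build and walk them yourself.
users = [
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob"},
]
orders = [
    {"user_id": 1, "total": 30},
    {"user_id": 1, "total": 12},
    {"user_id": 2, "total": 7},
]

# Build the lookup index by hand.
users_by_id = dict((u["id"], u) for u in users)

# The "join": iterate one side, look up the other.
for order in orders:
    user = users_by_id[order["user_id"]]
    print(user["name"], order["total"])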

Existing applications are not necessarily easy to port. Depending on what it'll cost you to port, it may be more cost-effective for you to scale your PostgreSQL server vertically (rather than horizontally).

You don't get ACID, and you have to manually manage concurrency. Depending on your application, this may be a problem. You also can't enforce global conservation rules in the traditional way, again due to lack of atomicity.
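
One common workaround for the concurrency point is optimistic locking with a version field, sketched here against an in-memory dict standing in for a document store; a real store would have to execute the compare-and-set step atomically for you (e.g. a conditional update keyed on the version):

```python
# Sketch of optimistic locking with a version field -- the kind of
# concurrency control you must implement yourself without transactions.
# The "store" is an in-memory dict standing in for a document database.
store = {"account:1": {"balance": 100, "version": 0}}

def withdraw(key, amount):
    while True:
        doc = dict(store[key])      # read the current document
        if doc["balance"] < amount:
            return False            # would violate the invariant
        expected = doc["version"]
        doc["balance"] -= amount
        doc["version"] += 1
        # Compare-and-set: apply the write only if no one else changed
        # the document since we read it; otherwise loop and retry.
        if store[key]["version"] == expected:
            store[key] = doc
            return True

withdraw("account:1", 30)
print(store["account:1"])           # {'balance': 70, 'version': 1}
```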

blueadept

Cassandra is the best option when you know you need to scale.

I'd recommend some of the Case Studies articles from http://wiki.apache.org/cassandra/ArticlesAndPresentations

jbellis

Here is what you can do to solve some of your issues:

  • Existing PostgreSQL periodically locks tables for vacuum tasks, etc.

The table is not locked; it just performs slowly. PostgreSQL does this to prevent transaction ID wraparound. You can reduce the frequency by writing rows in batches and committing once per batch. You could also use a queue (like RabbitMQ) for intermediate writes: application -> queue -> db. This will also increase your write performance considerably.
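
A rough sketch of the batching idea, with a hypothetical `events` table and a stubbed-in row source standing in for the queue consumer:

```python
# Sketch: commit rows in batches instead of one transaction per row.
# The events table, its columns, and incoming_rows() are hypothetical.
import psycopg2

def incoming_rows():
    # Stand-in for the real source, e.g. messages consumed from RabbitMQ.
    for i in range(2500):
        yield ("2010-06-01 00:00:00", "payload-%d" % i)

conn = psycopg2.connect("dbname=mydb user=postgres")  # placeholder DSN
cur = conn.cursor()

batch = []
for row in incoming_rows():
    batch.append(row)
    if len(batch) >= 1000:
        cur.executemany("INSERT INTO events (ts, payload) VALUES (%s, %s)", batch)
        conn.commit()  # one commit per 1000 rows, not one per row
        batch = []

if batch:  # flush whatever is left over
    cur.executemany("INSERT INTO events (ts, payload) VALUES (%s, %s)", batch)
    conn.commit()

cur.close()
conn.close()
```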

  • Archiving data is currently tedious

If your data is very large, on the order of several TB, I would suggest moving to the cloud, because dumping is no longer practical. Use AWS or Google Cloud, and use snapshots. EBS snapshots, for example, are very fast, can be copied across regions, and cover the need for backups.

If by archiving you mean deleting data and moving it to an "archive", then use date-rotated tables or tablespaces. There are some implementations of this online.
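
A rough sketch of that rotation, with hypothetical table names; the archive table could be placed in its own tablespace:

```python
# Sketch: move one month of rows into a date-named archive table.
# All table/column names are hypothetical; CREATE TABLE IF NOT EXISTS
# requires PostgreSQL 9.1+.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=postgres")  # placeholder DSN
cur = conn.cursor()

# Create the archive table with the same structure as the live table.
# Optionally append "TABLESPACE archive_disk" to put it on slower storage.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events_2010_05
        (LIKE events INCLUDING ALL)
""")

# Copy last month's rows, then delete them from the live table; both
# statements commit together, so the move is all-or-nothing.
cur.execute("""
    INSERT INTO events_2010_05
    SELECT * FROM events
    WHERE ts >= '2010-05-01' AND ts < '2010-06-01'
""")
cur.execute("DELETE FROM events WHERE ts >= '2010-05-01' AND ts < '2010-06-01'")
conn.commit()

cur.close()
conn.close()
```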

sivann