0

I know this question is too vague, so I would like to add some key numbers to give insights about what the scenario is

Size of each document size - 360KB
Total documents - 1.5 million
Document created/day - 2k
read intensive - YES
Availability requirement - HIGH  

With these requirements in mind, here is what I believe should be the architecture, but not too sure, please share your experiences and point me to right direction.

2 Linux Boxes (Ubuntu 11 each on a different rack setup for availability)
64-bit Mongo Database 
1 master (for read/write) and 1 slave (read-only with replication ON)
Sharding not needed at this point in time
techraf
  • 4,163
  • 8
  • 27
  • 44
learner
  • 163
  • 2
  • 9
  • What are you looking for? This isn't even really a question. – MDMarra Jul 02 '12 at 20:14
  • I need an idea about if the proposed solution looks good as per the requirement or there are some standard places I should read in order to design this – learner Jul 02 '12 at 20:44

3 Answers3

2

You're starting out with at least 500GB of data, and growing at a rate of ~700MB per day. You may want to consider sharding from the get go (perhaps just a single shard) so you can keep the per-server data manageable. We've (MongoHQ) found that 500GB is a good upper limit for a single server/replica set setup. Sharding would require that you run at least one mongos and 3 config servers in addition to the replica set, and do the research to pick a good shard key.

That said, you need to figure out how big your working set is and make sure you have enough RAM to hold it. The working set is defined as "the portion of documents + indexes you access over a given amount of time", our typical rule of thumb is about 1GB of memory per 10GB of storage with slow-ish disks. This is highly, highly dependent on your data and access patterns though. SSDs become useful when you have a pathological working set and keeping it all in memory would be expensive. Run mongostat against a simulation load and look at the "faults" column to get an idea of how often the DB is going to disk.

A simple replica set is a good start. If you are doing reads from the secondary, though, you really should have a 3 member setup rather than just two (you'll need an arbiter for two anyway). People get themselves in trouble when they load up two servers with reads, one dies, and their app overwhelms the one remaining server. Having 3 smaller servers is much more desirable than 2 larger servers.

Secondary reads can cause you app problems, too. You need to make sure your app can handle any replication lag you might encounter. You probably won't run into this right away, but it will happen if you ever take a secondary offline for maintenance and you read from it before it has time to catch up.

MrKurt
  • 436
  • 2
  • 4
1

This is a pretty vague question, so I'll give a somewhat vague answer. Almost any of these are a topic in their own right, so feel free to use this to create and ask more specific questions if something is not clear.

  1. Read intensive - make sure all docs plus indexes fit in RAM
  2. If that can't be done, get SSDs to minimize the hit taken when faulting to disk
  3. High Availability - RAID1 or RAID10 is your friend, back up your data in other ways than replication id you can
  4. Don't use master/slave, use replica sets - the master/slave code is deprecated
  5. Ubuntu 11.04 will be fine, as long as you install from the 10gen repo and not the Ubuntu ones
  6. Make sure you understand what eventual consistency is and what it means for your app when doing slave/secondary reads (also look at write preferences in your chosen driver).

Hopefully that helps you as a starting point.

Adam C
  • 5,132
  • 2
  • 28
  • 49
0

You really need to read the MongoDB Documentation. https://docs.mongodb.com/manual/administration/

Off the top of my head, you're already off to a bad start with your assumptions.

Replica sets are a minimum of 3-node clusters. Also, don't be fooled by the assumption that the secondary nodes can be built out with lesser hardware; cluster read-only Secondaries often need to work harder than write-only primaries because they are both being queried from, and receiving and processing the updates from the Primaries.

gWaldo
  • 11,887
  • 8
  • 41
  • 68