How to scale out OpenStreetMap data efficiently

Question

For over a year now, I'm running an in-house PostGIS server filled with OSM data, used for both Mapnik-based tile generation and Nominatim-based geocoding, updated with day replicates. This works pretty well.

However, as usage is growing exponentially, I would like to achieve better reliability and performance by adding additional PostgreSQL servers. And I'm kind of lost.

Since PostgreSQL doesn't seem to handle replication by itself, I would think about using a piede of middleware like PgPool-II to keep the servers in sync. But I'm afraid it would be nothing but necessary for this usage : very high read-to-write ratio, where all writes are done at the same exact time every day.

My questions are simple : What would you do to keep these servers in sync? And, what is done for this at the OpenStreetMap Foundation, MapQuest, Mapbox or CloudMade?

Thanks.

score 13 · Accepted Answer · answered Apr 04 '12 at 10:24

13

Firstly, Postgres 9 and later do have builtin replication support and that is what we (the OpenStreetMap operations team) expect to look at using now that we are (as of about 90 minutes ago) running on Postgres 9.1 for the main database.

To answer your question as to what OSM do, well to date we don't do any kind of replication - there is one large Postgres database acting as the master database behind the web site and holding the raw data. There are then secondary Postgres+Postgis databases for rendering and for geocoding, both of which are populated using the planet dumps and diffs for updating.

answered Apr 04 '12 at 10:24

TomH

1,290
7
9

Didn't know about Postgres built-in replication. This will very probably do the job. Thanks! – Pierre Apr 04 '12 at 11:56
1

Also checkout REPMGR ( http://repmgr.org/ ) which uses Postgres's streaming replication capabilities and extends functionality to allow to fallback a slave to a master automatically. – Shyam Sundar C S Apr 04 '12 at 19:13

score 4 · Answer 2 · answered Apr 04 '12 at 10:06

I can see two options here :

Database pooling - replication http://wiki.postgresql.org/wiki/Replication,_Clustering,_and_Connection_Pooling , I would PG cluster as it is synchronous and permits load balancing . You can also go with master - slaves architecture but without connection pooling and asynchronous , it all depends on the architecture and the replication speed that you want to achieve .More infor about pgpoll-2 here -small FAQ(http://www.dalibo.org/_media/pgpool.pdf)
WebOptimisation - for Mapquest and OSF as far as I know they use web caching at server level+ connection pooling. That means that it wouldn't request from Postgres /Postgis so many requests as i caches some data . 3.Postgis optimizations - preaching query's at DB level , indexing and database normalization . I would go for "PostGis in action" (http://manning.com/obe/) to learn more about postgis data types and adapt your server to them

Thanks. The first option is definitely preferred. The second is partly done, using a CDN for delivery and local Nginx / Memcached cache on the front servers. — Pierre, Apr 04 '12 at 15:25

How to scale out OpenStreetMap data efficiently

2 Answers2