The topic interests me because of Wikipedia's size. It may be easy to set up a few cron jobs to update the sitemaps periodically on a small site, but what about a big one? So:
How does Wikipedia generate its Sitemap?
It's dynamically generated by a PHP script. For big sites it's probably better to check for changes and only regenerate if something changed -- or regenerate it only every X minutes/hours/days. It depends on the infrastructure.
The information needed is all in the database, so it's not such a hard task.
And here is the proof: http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/maintenance/generateSitemap.php?view=log / http://www.mediawiki.org/wiki/Manual:GenerateSitemap.php
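For illustration, here is a minimal sketch of the check-for-changes approach described above -- not the real generateSitemap.php, just the idea, assuming a hypothetical pages table with title and touched columns and a plain PDO connection:

```php
<?php
// Sketch only: rebuild the sitemap when the newest edit in the database is
// newer than the existing file, otherwise do nothing. Table and column names
// (pages, title, touched) and the DSN are hypothetical.
$pdo = new PDO('mysql:host=localhost;dbname=wiki', 'user', 'password');

$sitemapFile = __DIR__ . '/sitemap.xml';
$lastChange  = (int) $pdo->query('SELECT UNIX_TIMESTAMP(MAX(touched)) FROM pages')->fetchColumn();

// Nothing changed since the last run: keep the existing file.
if (file_exists($sitemapFile) && filemtime($sitemapFile) >= $lastChange) {
    exit;
}

$xml = new XMLWriter();
$xml->openUri($sitemapFile);
$xml->startDocument('1.0', 'UTF-8');
$xml->startElement('urlset');
$xml->writeAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');

foreach ($pdo->query('SELECT title, touched FROM pages') as $row) {
    $xml->startElement('url');
    $xml->writeElement('loc', 'https://example.org/wiki/' . rawurlencode($row['title']));
    $xml->writeElement('lastmod', date('c', strtotime($row['touched'])));
    $xml->endElement(); // </url>
}

$xml->endElement(); // </urlset>
$xml->endDocument();
$xml->flush();
```

Run from cron every few minutes, this stays cheap: it exits immediately unless something was edited since the last build.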
Edit: Ah, and this could also be interesting for this topic:
I was faced with the task of creating a sitemap for our web site a while back. Although it's not the size of Wikipedia, it's still around a hundred thousand pages, and about 5% of them are changed, added or removed daily.
As putting all the page references in a single file would make it too large, I had to divide them into sections. The sitemap index points to an aspx page that takes a query string for one of 17 different sections. Depending on the query string, the page returns an XML document referencing several thousand pages, based on which objects exist in the database.
So the sitemap is not created periodically; instead, it's created on the fly when someone requests it. As we already have a system for caching database searches, this is of course also used to fetch data for the sitemap.
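For illustration, here is the same idea sketched in PHP (the original is an aspx handler); the section count, table and column names are made up, and the database-cache layer mentioned above is left out:

```php
<?php
// Sketch of the same idea in PHP (the original is an aspx handler).
// /sitemap.php            -> sitemap index pointing at the sections
// /sitemap.php?section=N  -> one section, built on the fly from the database
// Section count, table and column names are hypothetical.
const SECTIONS = 17;

header('Content-Type: application/xml; charset=UTF-8');
$pdo = new PDO('mysql:host=localhost;dbname=site', 'user', 'password');

if (!isset($_GET['section'])) {
    // Sitemap index: one <sitemap> entry per section.
    echo '<?xml version="1.0" encoding="UTF-8"?>';
    echo '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">';
    for ($i = 1; $i <= SECTIONS; $i++) {
        echo '<sitemap><loc>https://example.org/sitemap.php?section=' . $i . '</loc></sitemap>';
    }
    echo '</sitemapindex>';
    exit;
}

// One section: a few thousand URLs, picked by partitioning the objects on id.
$section = max(1, min(SECTIONS, (int) $_GET['section']));
$stmt = $pdo->prepare('SELECT slug, updated_at FROM objects WHERE MOD(id, :n) = :s - 1');
$stmt->execute([':n' => SECTIONS, ':s' => $section]);

echo '<?xml version="1.0" encoding="UTF-8"?>';
echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">';
foreach ($stmt as $row) {
    echo '<url><loc>https://example.org/' . rawurlencode($row['slug']) . '</loc>'
       . '<lastmod>' . date('c', strtotime($row['updated_at'])) . '</lastmod></url>';
}
echo '</urlset>';
```

Search engines fetch the index first and then each section URL; partitioning on MOD(id, 17) is just one arbitrary way to split the objects into sections.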
Although the sitemap generation code is in MediaWiki core master and would certainly be the option chosen to produce a sitemap, I don't see any evidence that Wikipedia actually has it turned on. The robots.txt file does not point to any sitemaps.
Further, any maintenance script run on Wikimedia projects is controlled by Puppet, and there is no instance of generateSitemap.php in the Puppet repository. Finally, there is no sitemap in the dumps for any Wikimedia wiki either, while there are "abstracts for Yahoo".
In any case, Wikipedia runs Squid caches in front of its app servers, so it can control how often its sitemap is updated by adjusting the expiry time for the page.
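For example, a dynamically generated sitemap script can simply declare how long the response may be cached, and the proxy then serves the stored copy until it expires; the one-day TTL below is an arbitrary value:

```php
<?php
// Sketch: let the front-end cache (Squid, Varnish, a CDN) decide how often
// the sitemap is regenerated by giving the response an expiry time.
// The one-day TTL is an arbitrary example value.
$maxAge = 86400; // seconds

header('Content-Type: application/xml; charset=UTF-8');
header('Cache-Control: public, max-age=' . $maxAge . ', s-maxage=' . $maxAge);
header('Expires: ' . gmdate('D, d M Y H:i:s', time() + $maxAge) . ' GMT');

// ... emit the sitemap XML here, as in the sketches above ...
```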
Moreover, whatever Wikipedia does for indexing is not a good model for your wiki, because Google has special contacts/deals/handling of Wikipedia; see a recent example.
I'm not positive, but I think they use the Google Sitemap extension for MediaWiki. This is supported by the Wikipedia page on Sitemaps.