Download/update webpages listed in XML sitemap


I'm searching for a FLOSS tool that downloads all pages (and embedded resources, e.g. images) linked in an XML sitemap (built according to http://www.sitemaps.org/).

The tool should "crawl" the sitemap regularly and look for new and deleted URLs and changes in the lastmod element. So whenever a page gets added/deleted/updated, the tool should apply the changes.

Some sitemaps list sub-sitemaps via a sitemapindex element (with sitemap/loc children). The tool should understand this, load all linked sub-sitemaps and look for URLs in there as well.


I know there are tools that let me extract all URLs from the sitemap so that I could feed them to wget or similar tools (see for example: Extract Links from a sitemap(xml)). But this wouldn't notify me about updates to pages. Tracking the webpages themselves for changes doesn't work either, because "secondary" content on the pages changes daily, while lastmod only gets updated when relevant content changes.
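To make the expected behaviour concrete, here is a rough sketch (Python 3, standard library only, shelling out to a local wget; the sitemap URL, the state-file name and the wget options are placeholders of mine, not an existing tool) of roughly what such a tool would have to do:

    import json
    import subprocess
    import urllib.request
    import xml.etree.ElementTree as ET

    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    STATE_FILE = "sitemap_state.json"  # remembers the last seen lastmod per URL

    def fetch_xml(url):
        with urllib.request.urlopen(url) as resp:
            return ET.fromstring(resp.read())

    def collect_urls(sitemap_url):
        """Return {page_url: lastmod}, recursing into sitemapindex files."""
        root = fetch_xml(sitemap_url)
        pages = {}
        if root.tag.endswith("sitemapindex"):
            for loc in root.findall("sm:sitemap/sm:loc", NS):
                pages.update(collect_urls(loc.text.strip()))
        else:  # a plain urlset
            for url in root.findall("sm:url", NS):
                loc = url.find("sm:loc", NS).text.strip()
                pages[loc] = url.findtext("sm:lastmod", default="", namespaces=NS)
        return pages

    def main():
        try:
            with open(STATE_FILE) as f:
                old = json.load(f)
        except FileNotFoundError:
            old = {}

        new = collect_urls("https://example.com/sitemap.xml")  # placeholder URL

        for loc, lastmod in new.items():
            if loc not in old or old[loc] != lastmod:
                # -p also pulls embedded resources (images, CSS, ...)
                subprocess.run(["wget", "-p", "-P", "mirror", loc], check=False)
        # URLs that dropped out of the sitemap could be deleted locally here.

        with open(STATE_FILE, "w") as f:
            json.dump(new, f)

    if __name__ == "__main__":
        main()

Something along these lines, run regularly, would cover the add/update/delete cases, but I'd rather use (and help improve) an existing FLOSS tool than maintain this myself.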

unor

Posted 2012-10-13T15:04:17.393


Question was closed 2014-12-31T16:31:03.433

Answers


Have you tried scripting this with wget and cron? Look at wget's --spider flag. It looks to be all you need, apart from cron to run it occasionally.
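For example, a minimal sketch (assuming the URLs have already been extracted from the sitemap into a urls.txt file, one per line, and that wget is on PATH; file names and options here are illustrative only):

    import subprocess

    def url_still_exists(url):
        """Use wget --spider to check a URL without downloading it."""
        result = subprocess.run(["wget", "--spider", "-q", url], capture_output=True)
        return result.returncode == 0  # 0 means the URL is reachable

    def refresh(url):
        """Re-fetch a page plus its requisites; -N skips files the server reports as unchanged."""
        subprocess.run(["wget", "-N", "-p", "-P", "mirror", url], check=False)

    with open("urls.txt") as f:
        for line in f:
            url = line.strip()
            if url and url_still_exists(url):
                refresh(url)

A crontab entry along the lines of 0 3 * * * python3 check-sitemap.py would then run it nightly.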

dotancohen

Posted 2012-10-13T15:04:17.393
