Wget mirror should treat xml as html

I want to make a mirror of a site that has a dynamic sitemap in XML form.

Naturally, I want that sitemap downloaded and processed as if it were an HTML file.

I tried the -F (--force-html) flag with this file, but it didn't work: wget reported that it found no URLs in the file.

My current assumption is that this cannot work (because wget is not built for XML), but I wanted to ask to make sure I'm not overlooking something.

The content of the XML looks like this:

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="http://MY_SITE/wp-content/plugins/google-sitemap-generator/sitemap.xsl"?><!-- sitemap-generator-url="http://www.arnebrachhold.de" sitemap-generator-version="4.0.8" -->
<!-- generated-on="June 11, 2017 6:05 pm" -->
<sitemapindex xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> <sitemap>
        <loc>http://MY_SITE/sitemap-misc.xml</loc>
        <lastmod>2017-05-31T20:49:06+00:00</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://MY_SITE/sitemap-pt-post-2017-04.xml</loc>
        <lastmod>2017-04-12T16:27:52+00:00</lastmod>
    </sitemap>
    <sitemap>
        <loc>http://MY_SITE/sitemap-pt-post-2017-02.xml</loc>
        <lastmod>2017-02-10T17:50:14+00:00</lastmod>
    </sitemap>
[...]
</sitemapindex>

Each sub-sitemap then looks like this:

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="http://MY_SITE/wp-content/plugins/google-sitemap-generator/sitemap.xsl"?><!-- sitemap-generator-url="http://www.arnebrachhold.de" sitemap-generator-version="4.0.8" -->
<!-- generated-on="June 11, 2017 6:07 pm" -->
<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> <url>
        <loc>http://MY_SITE/32017-SOME_CONTENT/</loc>
        <lastmod>2017-04-12T16:27:52+00:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
    </url>
    <url>
        <loc>http://MY_SITE/32017-SOME_OTHER_CONTENT/</loc>
        <lastmod>2017-04-12T16:24:25+00:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
    </url>
</urlset>

Angelo Fuchs

Posted 2017-06-09T18:29:58.537

Reputation: 502

Answers

The problem is that wget -r follows links in HTML but cannot follow links in XML. You can work around this by retrieving the sitemap first, extracting all the URLs from it, and then retrieving them with a second wget, e.g.:

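# Fetch the sitemap to stdout, extract one URL per line, and feed the list to a second wget.
# grep -E replaces egrep, which current GNU grep flags as obsolescent.
wget --quiet http://example.com/sitemap.xml --output-document - \
    | grep -Eo "http://example\.com[^<]+" \
    | wget -i - --wait 0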

The key here is the -i option, which the wget manual describes as follows:

-i file

--input-file=file

Read URLs from a local or external file. If - is specified as file, URLs are read from the standard input. (Use ./- to read from a file literally named -.) If this function is used, no URLs need be present on the command line. If there are URLs both on the command line and in an input file, those on the command lines will be the first ones to be retrieved. If --force-html is not specified, then file should consist of a series of URLs, one per line.

We supply this "file" on standard input after using grep to reduce the XML to the required form, i.e. one URL per line.
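Note that the sitemap in the question is an index whose <loc> entries point to further XML sitemaps, so a single extraction pass would only download those sub-sitemaps. The following is a minimal Python sketch of the same idea that descends through index files first; it is an illustration under assumptions, not part of wget. The function names (parse_sitemap, collect_page_urls, http_fetch) are hypothetical, and the namespace is the one shown in the xmlns attribute of the question's XML.

```python
import sys
import xml.etree.ElementTree as ET
from typing import Callable, List, Tuple
from urllib.request import urlopen

# Namespace taken from the xmlns attribute of the sitemap in the question.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_bytes: bytes) -> Tuple[bool, List[str]]:
    """Return (is_index, locs) for one sitemap document.

    is_index is True when the root element is <sitemapindex>;
    locs holds the text of every <loc> element in document order.
    """
    root = ET.fromstring(xml_bytes)
    locs = [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]
    return root.tag == SITEMAP_NS + "sitemapindex", locs


def collect_page_urls(url: str, fetch: Callable[[str], bytes]) -> List[str]:
    """Follow sitemap indexes recursively and return the page URLs they list."""
    is_index, locs = parse_sitemap(fetch(url))
    if not is_index:
        return locs
    pages: List[str] = []
    for sub_sitemap in locs:
        pages.extend(collect_page_urls(sub_sitemap, fetch))
    return pages


def http_fetch(url: str) -> bytes:
    """Download a URL and return the raw response body."""
    with urlopen(url) as response:
        return response.read()


# Print one page URL per line when a sitemap URL is given as the first argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    for page in collect_page_urls(sys.argv[1], http_fetch):
        print(page)
```

Saved under a hypothetical name such as sitemap_urls.py, the output can be piped into wget exactly as above: python3 sitemap_urls.py http://example.com/sitemap.xml | wget -i - --wait 0. Passing the fetch function as a parameter keeps the parsing logic testable without network access.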

Esa Jokinen

Posted 2017-06-09T18:29:58.537

Reputation: 615

If the site displays the sitemap as HTML but returns it to you as XML, you are probably missing an .xsl or .xslt (Extensible Stylesheet Language Transformations) file. That file defines how the XML is actually displayed; in this case, probably as HTML. If you download it and apply it to the XML, it will probably produce what you're looking for. Alternatively, you can learn XSLT and write your own.

Pak

Posted 2017-06-09T18:29:58.537

Reputation: 151

No, there is no HTML display. It is an XML format that lets Google index your pages faster. I'll edit an example into my question. – Angelo Fuchs – 2017-06-11T18:05:06.297