Extract links from a sitemap (XML)

5

Let's say I have a sitemap.xml file with this data:

<url>
<loc>http://domain.com/pag1</loc>
<lastmod>2012-08-25</lastmod>
<changefreq>weekly</changefreq>
<priority>0.9</priority>
</url>
<url>
<loc>http://domain.com/pag2</loc>
<lastmod>2012-08-25</lastmod>
<changefreq>weekly</changefreq>
<priority>0.9</priority>
</url>
<url>
<loc>http://domain.com/pag3</loc>
<lastmod>2012-08-25</lastmod>
<changefreq>weekly</changefreq>
<priority>0.9</priority>
</url>

I want to extract all the locations from it (data between <loc> and </loc>).

The sample output would be:

http://domain.com/pag1
http://domain.com/pag2
http://domain.com/pag3

How can I do this?

Akshat Mittal

Posted 2012-08-27T11:11:58.893

Reputation: 2 195

What OS are you using? – bobmagoo – 2012-08-27T11:35:50.347

Windows 7 Ultimate X64 / Windows 8 Pro X64 or Ubuntu 12.04 Linux. – Akshat Mittal – 2012-08-27T13:13:12.773

Nice setup. Using Terminal on the Ubuntu box, my answer below will get you what you need.

– bobmagoo – 2012-08-27T13:22:39.727

You can also use any text editor that supports regular expressions, such as SublimeText2, to get all the data, or you can use Python; see my answer below. – Ishikawa Yoshi – 2012-08-27T14:35:46.487

Answers

2

You can use a Python script for this.

This script prints any link that starts with http:

import re

# Print every http:// URL that sits between a '>' and a '<' on the same line.
with open('sitemap.xml', 'r') as f:
    for line in f:
        for url in re.findall(r'>(http://.+)<', line):
            print(url)

In your case, the following script finds all URLs wrapped in <loc> tags:

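import re

# Print the URL inside each <loc>...</loc> element (one element per line).
# The pattern matches http:// only; adjust it if your links use https.
with open('sitemap.xml', 'r') as f:
    for line in f:
        for url in re.findall(r'<loc>(http://.+)</loc>', line):
            print(url)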

Here is a nice tool for experimenting with regular expressions if you are not familiar with them.

If you need to load a remote file, you can use the following code:

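import re
from urllib.request import urlopen  # Python 3; in Python 2 this lived in urllib2

# Fetch the sitemap over HTTP and print the URL inside each <loc> element.
with urlopen('http://server.com/sitemap.xml') as f:
    for raw_line in f:
        line = raw_line.decode('utf-8')  # the response yields bytes; UTF-8 assumed
        for url in re.findall(r'<loc>(http://.+)</loc>', line):
            print(url)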
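
The line-by-line scripts above assume that each <loc> element sits on its own line and that every link starts with http://. As a minimal sketch, assuming the whole sitemap fits in memory, the same regex idea can be applied to the full text with a non-greedy match, so that several <loc> elements on one line (as in a minified sitemap) are still separated. The name extract_locs and the sample string are illustrative and not part of the original answer.

```python
import re

# Non-greedy match between <loc> and </loc>; accepts both http and https.
# DOTALL lets a <loc> element span a line break.
LOC_PATTERN = re.compile(r'<loc>\s*(https?://.+?)\s*</loc>', re.DOTALL)

def extract_locs(text):
    """Return every URL found inside <loc>...</loc> in the given sitemap text."""
    return LOC_PATTERN.findall(text)

if __name__ == '__main__':
    sample = ('<url><loc>http://domain.com/pag1</loc></url>'
              '<url><loc>https://domain.com/pag2</loc></url>')
    for url in extract_locs(sample):
        print(url)
```

To run it on a file, read the content first, for example with open('sitemap.xml').read(), and pass the result to extract_locs.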

Ishikawa Yoshi

Posted 2012-08-27T11:11:58.893

Reputation: 825

Very good answer! A reminder that if your links are in HTTPS, change http to https in the code. – George Chalhoub – 2018-05-20T14:04:48.253

How do I load a remote file like http://server.com/sitemap.xml? I am not very familiar with Python. – Akshat Mittal – 2012-08-28T14:09:37.870

You mean load it with Python? – Ishikawa Yoshi – 2012-08-28T14:14:57.310

Yup, Like you have used f = open('sitemap.xml','r') to open the file, How to open a remote file on http server? – Akshat Mittal – 2012-08-28T14:16:30.743

I updated my post; you need to use the urllib2 module. – Ishikawa Yoshi – 2012-08-28T14:22:11.030

Shows error AttributeError: 'list' object has no attribute 'findall' – Akshat Mittal – 2012-08-28T14:33:14.307

Did you import the re module? – Ishikawa Yoshi – 2012-08-28T14:37:57.237

let us continue this discussion in chat

– Ishikawa Yoshi – 2012-08-28T14:38:53.240

9

If you're on a Linux box or something with the grep tool, you can just run:

grep -Po 'http(s?)://[^ \"()\<>]*' sitemap.xml
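
For machines without grep (for example the Windows setups mentioned in the question), roughly the same pattern can be run from Python. This is only a sketch and not a drop-in replacement for grep -Po: the name find_urls is made up for illustration, and, like the grep command, it matches any URL-looking string in the text, not only those inside <loc>.

```python
import re

# Same idea as the grep pattern: "http" or "https", then "://", then any run
# of characters that are not whitespace, a double quote, a parenthesis or an
# angle bracket. grep sees one line at a time, so newlines never match there;
# excluding all whitespace here gives the same effect on multi-line text.
URL_PATTERN = re.compile(r'https?://[^\s"()<>]*')

def find_urls(text):
    """Return every substring of text that matches URL_PATTERN, in order."""
    return URL_PATTERN.findall(text)

if __name__ == '__main__':
    sample = '<loc>http://domain.com/pag1</loc> <loc>https://domain.com/pag2</loc>'
    print('\n'.join(find_urls(sample)))
```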

bobmagoo

Posted 2012-08-27T11:11:58.893

Reputation: 764

Thank you. For anybody who wants to save the output to a file: grep -Po 'http(s?)://[^ \"()\<>]*' sitemap.xml > links.txt – trante – 2014-09-18T11:51:00.147

+1 This is actually a very simple but powerful solution. – HelloWorld – 2015-05-10T13:06:52.070

This worked, but with a lot of mistakes (incomplete URLs). – Akshat Mittal – 2012-08-28T13:46:31.777

Weird, I just ran this over Google's sitemap.xml file and didn't see any issues. Which ones did it miss?

– bobmagoo – 2012-08-29T17:46:11.300

This missed many URLs that contained "?" and "+". – Akshat Mittal – 2012-08-30T09:50:58.820

2

This can be done with a single sed command, which seems to be more robust than the grep solution:

sed '/<loc>/!d; s/[[:space:]]*<loc>\(.*\)<\/loc>/\1/' inputfile > outputfile

(found at: linuxquestions.org)

LarS

Posted 2012-08-27T11:11:58.893

Reputation: 220

Your solution works perfectly. – Baptiste Donaux – 2016-04-21T19:36:10.847

tried it as sed '/<loc>/!d; s/[[:space:]]<loc>(.)</loc>/\1/' sitemap.xml > links.txt but it outputs the same xml content. it worked with the above grep command but I am trying to figure out why it did not work – Mike – 2017-04-26T08:27:13.857

I think it's because you did not escape the parentheses: ( and ) must be written as \( and \). – LarS – 2017-04-26T21:03:56.053

1

Using XSLT, you can select the locations with an XPath expression such as:

/url/loc
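
As a sketch of what such a path query can look like in practice, the snippet below uses Python's standard xml.etree.ElementTree module, which supports a limited XPath subset. It assumes the file is a complete sitemap whose elements live in the http://www.sitemaps.org/schemas/sitemap/0.9 namespace (the fragment in the question omits the root element, so this is an assumption). The name locs_from_sitemap and the sample document are illustrative.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps; the prefix "s" is an arbitrary local alias.
NS = {'s': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def locs_from_sitemap(xml_text):
    """Parse sitemap XML and return the text of every <url>/<loc> element."""
    root = ET.fromstring(xml_text)
    # './/s:url/s:loc' selects each <loc> that is a direct child of a <url>.
    return [loc.text.strip() for loc in root.findall('.//s:url/s:loc', NS) if loc.text]

if __name__ == '__main__':
    sample = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        '<url><loc>http://domain.com/pag1</loc></url>'
        '<url><loc>http://domain.com/pag2</loc></url>'
        '</urlset>'
    )
    for url in locs_from_sitemap(sample):
        print(url)
```

To read from a file instead of a string, ET.parse('sitemap.xml').getroot() can take the place of ET.fromstring. Unlike the regex-based approaches, a parser also decodes entities such as &amp; inside URLs.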

Siva Charan

Posted 2012-08-27T11:11:58.893

Reputation: 4 026

Could you maybe expand your answer and show the XSLT instructions and the XPath queries needed? – slhck – 2012-08-27T14:44:59.780

@slhck Exactly what I wanted to say. The answer should be more explanatory. – Akshat Mittal – 2012-08-28T09:28:26.450

I read a bit more about this and got it working at last. Upvoting, but it is not really a good enough answer to be chosen. – Akshat Mittal – 2012-08-28T13:56:08.593

0

The XSLT solution:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:s="http://www.sitemaps.org/schemas/sitemap/0.9">

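  <xsl:output method="text" />
  <!-- Drop whitespace-only text nodes so that only the URLs are printed -->
  <xsl:strip-space elements="*" />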

  <xsl:template match="s:url">
    <xsl:value-of select="s:loc" />
    <xsl:text>
</xsl:text>
  </xsl:template>

</xsl:stylesheet>

Jan Tomka

Posted 2012-08-27T11:11:58.893

Reputation: 101

For years I've been using regex etc. for this, but XSLT is so cool in this case :) For complete noobs in XSLT (like me), it'd be nice to add that the only thing you have to do is save this code as stylesheet.xsl and add a line to your XML document that links to the stylesheet: <?xml-stylesheet type="text/xsl" version="1.0" href="stylesheet.xsl"?> Then open your XML in a browser (it won't work when opened as a local file; you have to get it via HTTP). – Łukasz Rysiak – 2016-08-08T12:41:56.187