What are good link extractors out there?

Link extractor (I don't know a better name for it): a utility which can take a .htm file and give me the links from it, without counting them or anything like that, just the direct links. It would be useful for files in which a number of HTML links are scattered through text, and so on ...

Does anyone know of any, by any chance?

Rook

Posted 2009-11-16T19:35:01.530

Reputation: 21 622

Answers

Firefox, with the Web Developer add-on, can do this. Open the HTML file and display the Web Developer toolbar.

In the Information drop-down menu, select "View Link Information". It will open a new tab with a list of all the links in the HTML file.

The Firefox Accessibility Extension can also display a list of links in a window, but it may be overkill, since it comes with many other features intended for people with disabilities.

Snark

Posted 2009-11-16T19:35:01.530

Reputation: 30 147

I've needed a quick-n-dirty version of this a time or two in the past. My solution is generally this:

1. Search and replace "http://" with "\r\nhttp://" (this moves every http URL onto its own line).
2. Find/grep or otherwise filter on all lines that start with "http://" (a regex something like "^http://").
3. Sort the filtered results, with the option to delete duplicate lines.

That's my quick-n-dirty solution, but I haven't used an actual tool for this before. Although, I suppose I could wrap this up in a .bat or AutoHotkey script. I just haven't needed it often enough for that.
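The three steps above could be wrapped up roughly like this in Python. This is only a sketch: the function name quick_links and the sample text are made up, and cutting each line at the first quote, angle bracket, or whitespace is an extra assumption about where a URL ends (the steps as described keep the whole line).

```python
import re

def quick_links(text):
    """Quick-and-dirty http URL extraction following the three steps above."""
    # Step 1: push every "http://" onto the start of its own line.
    broken = text.replace("http://", "\nhttp://")
    urls = []
    for line in broken.splitlines():
        # Step 2: keep only the lines that start with "http://".
        if line.startswith("http://"):
            # Assumption: a URL ends at the first quote, angle bracket, or space.
            urls.append(re.split(r'["\'<>\s]', line, maxsplit=1)[0])
    # Step 3: sort and drop duplicate lines.
    return sorted(set(urls))

sample = ('<a href="http://b.example/x">B</a> text http://a.example and '
          '<a href="http://b.example/x">B again</a>')
print(quick_links(sample))
```

Like the manual version, this only sees "http://" links; other schemes would need their own search string.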

JMD

Posted 2009-11-16T19:35:01.530

Reputation: 4 427

Yeah, I know what you mean. Unfortunately, that's pretty much what I have been doing until now. Only now I have about 200 .htm files from which I have to get links, to compare some references ... long story short, I was hoping for some batch utility which can take all of them and give me all the links in one text file for me to rip apart. – Rook – 2009-11-16T19:49:11.440

Also, the links are not only http, but also ftp, telnet and mail. The worst thing is that I had a tool like that before, but now I can't find it anymore. – Rook – 2009-11-16T19:50:28.270

A quick Google turned up several options, including some free ones. I tend to prefer open source over "freeware", so I would probably search SourceForge.net for "URL extractor" as well. – JMD – 2009-11-16T19:57:19.007
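For the batch case raised in these comments (around 200 .htm files, with ftp, telnet and mail links as well as http), a rough sketch using Python's standard html.parser module might look like the following. It is not one of the tools mentioned in this thread; the names HrefCollector, extract_links and collect_from_directory, and the links.txt output file, are made up for illustration. It collects every href value on an a tag regardless of scheme.

```python
from html.parser import HTMLParser
from pathlib import Path

class HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html_text):
    """Return the href values found in one HTML document, in order."""
    parser = HrefCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links

def collect_from_directory(directory, out_file="links.txt"):
    """Write the sorted, de-duplicated links from all .htm/.html files."""
    all_links = set()
    for path in Path(directory).rglob("*.htm*"):
        if path.is_file():
            # Assumption: files are roughly UTF-8; undecodable bytes are replaced.
            text = path.read_text(encoding="utf-8", errors="replace")
            all_links.update(extract_links(text))
    result = sorted(all_links)
    Path(out_file).write_text("\n".join(result) + "\n", encoding="utf-8")
    return result

# Hypothetical usage: collect_from_directory("pages", "links.txt")
```

Using a parser rather than a search on "http://" means mailto:, ftp:// and telnet:// links come out without any extra patterns, at the cost of missing URLs that appear only as plain text.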

href="(?<url>(((ht|f)tp(s?))\://)?((([a-zA-Z0-9_\-]{2,}\.)+[a-zA-Z]{2,})|((?:(?:25[0-5]|2[0-4]\d|[01]\d\d|\d?\d)(?(\.?\d)\.)){4}))(:[a-zA-Z0-9]+)?(/[a-zA-Z0-9\-\._\?\,\'/\\\+&%\$#\=~]*)?)"

Would be a regex that could achieve this.
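Python's re module does not accept that pattern as written (it spells named groups (?P<url>...) rather than (?<url>...)), so the sketch below uses a much simpler stand-in, not a translation: it takes whatever sits between the double quotes after href=, without validating the host or path. The names HREF_RE and href_urls, the sample text, and the case-insensitive flag are assumptions for illustration.

```python
import re

# Simplified stand-in: capture everything between the double quotes after href=.
# Unlike the pattern above, it does no validation of scheme, host, or path.
HREF_RE = re.compile(r'href="(?P<url>[^"\s]+)"', re.IGNORECASE)

def href_urls(html_text):
    """Return the contents of every double-quoted href attribute."""
    return [m.group("url") for m in HREF_RE.finditer(html_text)]

sample = ('<a href="http://www.example.com/a?b=1">A</a> '
          '<A HREF="ftp://ftp.example.org/f.zip">F</A>')
print(href_urls(sample))
```

The looser pattern trades precision for coverage: it also returns relative links and mailto: addresses, which the stricter pattern would skip.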

Rich Bradshaw

Posted 2009-11-16T19:35:01.530

Reputation: 6 324

Download TextCrawler (it is freeware), install it, and launch it once the installation has finished. In the Filename/Filter box, type "*.htm *.html *.php" or whatever extensions the HTML files you are parsing have. In the Start Location box, browse to the directory where the files are. By default it also scans subdirectories; if you don't want that, click Options and unselect "Scan Subfolders". In the Find box, type:

<a.*?href\s*=\s*["'](.*?)['"].*?>(.*?)</a>

Make sure "Use Regular Expressions" has a checkmark next to it. Then click Find. It will show you all the links grouped by the files they are in. You can also click on Extract which will pop up a window with all the links from all the files. Since you stated that you want the links I figured you want the whole

<a href="something.php">Something</a>

so that you can see where the link points to and what the description is. If you only want the link without the whole tag, change the RegEx to

href=[\"\'](http:\/\/|\.\/|\/)?\w+(\.\w+)*(\/\w+(\.\w+)?)*(\/|\?\w*=\w*(&\w*=\w*)*)?[\"\']

which will return

href="something.php"

Let me know if this answers your question. TextCrawler is an awesome application, and since it is free it's worth a try.
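To see what the first pattern captures without installing anything, it can be tried in Python, whose re module accepts it unchanged. This is only a sketch: the names ANCHOR_RE and anchors and the sample markup are made up, and the case-insensitive flag is an assumption rather than a known TextCrawler default.

```python
import re

# The whole-tag pattern from above; group 1 is the href, group 2 the link text.
ANCHOR_RE = re.compile(
    r'''<a.*?href\s*=\s*["'](.*?)['"].*?>(.*?)</a>''',
    re.IGNORECASE,  # assumption: match <A HREF=...> as well
)

def anchors(html_text):
    """Return (href, link text) pairs for each anchor the pattern matches."""
    return ANCHOR_RE.findall(html_text)

sample = ('<p><a href="something.php">Something</a> and '
          "<a class=\"x\" href='other.html'>Other</a></p>")
for href, text in anchors(sample):
    print(href, "->", text)
```

Because "." does not match newlines by default here, anchors that span several lines are skipped by this sketch.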

Marcin

Posted 2009-11-16T19:35:01.530

Reputation: 3 414