How to export all hyperlinks on a webpage?


I need a way to export all hyperlinks on a webpage (a single webpage, not an entire website), with a way to specify which links to export, for example only hyperlinks starting with https://superuser.com/questions/ and excluding everything else.
Exporting to a text file is preferred, and the results should be listed one URL per line:

https://superuser.com/questions/1  
https://superuser.com/questions/2  
https://superuser.com/questions/3
[...]

user598527

Posted 2017-02-01T16:48:38.643


@JeffZeitlin: I have tried Invoke-WebRequest in PowerShell 5. I use both Windows and Linux; a native terminal/PowerShell method is preferred. – user598527 – 2017-02-01T16:57:45.177

Please note that Super User is not a free script/code-writing service. If you tell us what you have tried so far (include the scripts/code you are already using) and where you are stuck, then we can try to help with specific problems. You should also read How do I ask a good question?. – DavidPostill – 2017-02-01T16:58:23.020

If Invoke-WebRequest is not returning the HTML for the page you are interested in, you will need to troubleshoot that first. Once your Invoke-WebRequest succeeds, you should be able to parse the resulting HTML to extract what you want. Do not expect us to write the script for you, as DavidPostill indicates; you will need to 'show your work'. – Jeff Zeitlin – 2017-02-01T16:59:56.610

Answers


If you are running Linux or a Unix-like system (such as FreeBSD or macOS), you can open a terminal session and run this command:

wget -O - http://example.com/webpage.htm | \
    sed 's/href=/\nhref=/g' | \
    grep 'href="http://specify.com' | \
    sed 's/.*href="//g;s/".*//g' > out.txt

In typical HTML there may be multiple <a href> tags on one line, so they have to be split apart first: the first sed inserts a newline before every occurrence of href, which guarantees at most one link per line.
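To see that splitting step in isolation, you can feed it a tiny hand-written line (the sample HTML below is made up purely for illustration; note that \n in the replacement text requires GNU sed):

# two anchors enter on one line; each href= starts a new line on the way out
printf '<p><a href="/a">x</a> <a href="/b">y</a></p>\n' | sed 's/href=/\nhref=/g'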
To extract links from multiple similar pages, for example all questions on the first 10 pages of this site, use a for loop.

for i in $(seq 1 10); do
    wget -O - "http://superuser.com/questions?page=$i" | \
    sed 's/href=/\nhref=/g' | \
    grep -E 'href="http://superuser.com/questions/[0-9]+' | \
    sed 's/.*href="//g;s/".*//g' >> out.txt
done
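Since a question's link can appear more than once on a listing page, and the loop appends to out.txt on every run, you may want to de-duplicate the combined output afterwards; a minimal follow-up step, assuming out.txt is the file produced above:

# keep exactly one copy of each URL; -o lets sort overwrite its own input safely
sort -u out.txt -o out.txt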

Remember to replace http://example.com/webpage.htm with your actual page URL and http://specify.com with the URL prefix you want to keep.
You can specify not only a fixed prefix for the URLs to export, but also a regular expression pattern if you use egrep or grep -E, as in the loop above.
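For instance, to accept both HTTP and HTTPS links in a single pass, the filter stage could be swapped for a pattern like this (the https? alternation and the escaped dot are the only changes from the loop above):

grep -E 'href="https?://superuser\.com/questions/[0-9]+'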
If you're running Windows, consider taking advantage of Cygwin. Don't forget to select the Wget, grep, and sed packages.

iBug


This is almost the method that I use to batch-download music from KHInsider without buying their VIP service. Just manually extract the links and place them in a download manager like IDM. – iBug – 2017-02-02T02:52:26.737


If you are okay with using Firefox for this, you can use the add-on Snap Links Plus:

  1. Hold down the right mouse button and drag a selection around the links.

  2. When they are highlighted, press and hold Control while letting go of the right mouse button.

Yisroel Tech


Wouldn't work well due to the selection method; the source page can be hundreds of pages long. – user598527 – 2017-02-01T17:01:34.610

So really, no page-based method will work, since the "source page" (https://superuser.com/questions/) is only one page, and you want it to save from all "hundreds of pages" (like https://superuser.com/questions?page=2). – Yisroel Tech – 2017-02-01T17:05:09.443

That page was only an example. – user598527 – 2017-02-01T17:08:02.800

But still, what do you mean by "hundreds of pages"? If you need to press something to load more pages, then it isn't really one page. – Yisroel Tech – 2017-02-01T17:09:47.420

"Approximately", for example this page is that long (though it doesn't have hyperlinks, used as an example due to low size): https://easylist-downloads.adblockplus.org/easylist.txt There are more sites I may want to export links from.

– user598527 – 2017-02-01T17:15:56.797

Oh, got you. This extension for Chrome seems to do the job: https://chrome.google.com/webstore/detail/link-klipper-extract-all/fahollcgofmpnehocdgofnhkkchiekoo?hl=en – Yisroel Tech – 2017-02-01T17:20:29.327