Listing links from pages — with error handling

In my previous question, Debunking Stroustrup's debunking of the myth “C++ is for large, complicated, programs only”, many people complained that the comparison with Stroustrup's program wasn't fair, because almost none of the programs included error checking. Okay, you asked for it. Let's start the most anti-code-golf challenge on the Code Golf site: let's check for errors! Let's write an application that is suitable for giving to end users. And while doing it, let's write clean and simple code.

Input

A list of URLs. They can come from command-line arguments, from multi-line text input — whatever you want. The only requirement is that they aren't hard-coded. For example:

http://www.stroustrup.com/C++.html example.com https://example.com

Output

A list of URLs taken from the href attributes of anchors (<a href="...">). Remove fragments from URLs (the #heading part); construct absolute URLs from relative ones (file2.html on http://example.com/path/file1.html => http://example.com/path/file2.html); keep only HTTP, HTTPS and FTP URLs; remove duplicates; sort alphabetically. For example:

ftp://ftp.research.att.com/pub/c++std/WP/CD2
http://acts.nersc.gov/pooma
...
http://www.iana.org/domains/example
...
http://www.stroustrup.com/3rd.html
http://www.stroustrup.com/3rd_tour2.pdf
...
http://www-h.eng.cam.ac.uk/help/tpl/languages/C++.html
https://www.youtube.com/watch?v=jDqQudbtuqo&feature=youtu.be
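
The output rules above can be sketched with Python's standard urllib.parse module. This is only an illustration under stated assumptions, not a reference solution: the names normalize_links and ALLOWED_SCHEMES are made up here, and "alphabetically" is interpreted as plain code-point ordering of the URL strings.

```python
from urllib.parse import urldefrag, urljoin, urlsplit

# Schemes the challenge asks to keep (the constant name is an assumption).
ALLOWED_SCHEMES = {"http", "https", "ftp"}


def normalize_links(base_url, hrefs):
    """Return sorted, de-duplicated absolute URLs built from raw href values."""
    result = set()
    for href in hrefs:
        absolute = urljoin(base_url, href.strip())  # resolve relative references
        absolute, _fragment = urldefrag(absolute)   # drop the "#heading" part
        if urlsplit(absolute).scheme.lower() in ALLOWED_SCHEMES:
            result.add(absolute)                    # the set removes duplicates
    return sorted(result)                           # code-point order
```

For the base URL http://example.com/path/file1.html and the hrefs file2.html, /top.html, #intro, mailto:a@example.com and file2.html#part, this returns http://example.com/path/file1.html, http://example.com/path/file2.html and http://example.com/top.html, in that order.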

Error handling

If any of the pages can't be downloaded, output an error that includes the error message and the URL. If it's a console program, set a non-zero exit code.

The following cases are considered errors:

Unresolvable domain.
Unresponsive website.
HTTP status codes of 400 and above (for example, 404 for a missing file).
A MIME content type other than text/html or application/xhtml+xml.
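
As a rough sketch of how these four cases might map onto Python's standard urllib, assuming a 10-second timeout counts as "unresponsive": PageError, check_content_type and fetch are hypothetical names, and a real answer would likely report more detail.

```python
import urllib.error
import urllib.request

ACCEPTED_TYPES = {"text/html", "application/xhtml+xml"}


class PageError(Exception):
    """Download failure whose message includes the URL (hypothetical helper)."""

    def __init__(self, url, message):
        super().__init__(f"{url}: {message}")


def check_content_type(url, content_type):
    """Raise PageError unless the MIME type is one the challenge accepts."""
    mime = content_type.split(";")[0].strip().lower()  # ignore "; charset=..."
    if mime not in ACCEPTED_TYPES:
        raise PageError(url, f"unsupported content type {mime or '(none)'}")


def fetch(url, timeout=10):
    """Download a page, turning each failure case into a PageError."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            check_content_type(url, response.headers.get("Content-Type", ""))
            return response.read()
    except urllib.error.HTTPError as error:   # status codes 400 and above
        raise PageError(url, f"HTTP {error.code} {error.reason}") from error
    except urllib.error.URLError as error:    # e.g. unresolvable domain
        raise PageError(url, str(error.reason)) from error
    except OSError as error:                  # e.g. timeout while reading
        raise PageError(url, str(error) or "connection failed") from error
```

HTTPError is caught before URLError because it is a subclass of it. A console program could catch PageError per URL, print the message to standard error, and return a non-zero exit code if any page failed.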

User-friendly

If no URLs are supplied, provide usage instructions.
HTTP and HTTPS should be supported (http://example.com/, https://example.com/).
If no scheme is provided, assume HTTP (example.com => http://example.com/).
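
A minimal sketch of these user-facing rules in Python, assuming the URLs arrive as command-line arguments. The program name in the usage text, the exit code 2, and the function names are all assumptions; the "://" test is a simplification that would misjudge a scheme-less argument containing "://" in its query string.

```python
import sys
from urllib.parse import urlsplit, urlunsplit

USAGE = "usage: links URL [URL ...]"  # hypothetical program name


def with_default_scheme(url):
    """Assume HTTP when no scheme is given and add "/" for an empty path."""
    if "://" not in url:                   # simplification, see lead-in
        url = "http://" + url
    parts = urlsplit(url)
    if not parts.path:
        parts = parts._replace(path="/")   # example.com -> http://example.com/
    return urlunsplit(parts)


def main(argv):
    """Print usage when no URLs are supplied, else the normalized URLs."""
    if len(argv) < 2:
        print(USAGE, file=sys.stderr)
        return 2                           # assumed "bad usage" exit code
    for url in map(with_default_scheme, argv[1:]):
        print(url)
    return 0
```

A script would typically end with sys.exit(main(sys.argv)) so that the return value becomes the process exit code.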

Additional features

If your program includes any additional features (such as proper HTML parsing instead of regular expressions, support for FTP, international domain names, etc.), please describe them in your answer.

Rules

Usage of third-party libraries is disallowed. For shell, assume GNU. Exception: boost for C++ is allowed (good luck).
Your code should be readable, maintainable and pleasing to the eye (as much as your language permits). Showcase the good side of your language, not the bad one.
This isn't code golf; please avoid short code if it hurts readability.
This isn't code trolling or an "enterprise" application; please don't overengineer or include unneeded features.
The regular expressions you choose don't really matter. No matter how sophisticated they are, they're guaranteed to fail in some cases. Proper HTML parsers should be used in a real application, but very few languages include them in the standard library, so requiring one would be too limiting.
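
Python happens to be one of the languages whose standard library does ship an HTML parser, so the parser route can be illustrated without third-party code. The sketch below is only that, a sketch: AnchorCollector and extract_hrefs are hypothetical names, and html.parser is a lenient tokenizer, not a full HTML5 tree builder.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect href values of <a> tags (hypothetical helper name)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names before this call.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_hrefs(html_text):
    """Return href values in document order, skipping anchors without one."""
    collector = AnchorCollector()
    collector.feed(html_text)
    collector.close()
    return collector.hrefs
```

Unlike a naive regular expression, this skips anchors that appear inside HTML comments and handles mixed-case tags and either quote style.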

Overall, this challenge is about satisfying annoying but realistic requirements that make your pretty one-line code much longer.

Athari

Posted 2015-01-20T15:45:45.827

Reputation: 2 319

Huh. Way too anti-code-golf or what? :) – Athari – 2015-01-20T15:54:48.923

Way too similar to the previous challenge. – John Dvorak – 2015-01-20T15:55:59.357

@JanDvorak If you try writing code, you'll notice it's totally different. It's the difference between writing 2+2 and a calculator. – Athari – 2015-01-20T15:57:25.227

@JanDvorak You can compare http://codegolf.stackexchange.com/a/44732 and http://codegolf.stackexchange.com/a/44283 :) – Athari – 2015-01-20T16:17:07.077