In my previous question, Debunking Stroustrup's debunking of the myth “C++ is for large, complicated, programs only”, a lot of people complained that the comparison with Stroustrup's program wasn't fair, because almost none of the programs included error checking. Okay, you asked for it. Let's start the most anti-code-golf challenge on the Code Golf site: let's check for errors! Let's write an application that is suitable for giving to end users. And while doing it, let's write clean and simple code.
Input
A list of URLs. They can come from command-line arguments, from multi-line text input — whatever you want. The only requirement is that they aren't hard-coded. For example:
http://www.stroustrup.com/C++.html example.com https://example.com
Output
A list of URLs from hrefs of anchors (<a href="...">). Remove fragments from URLs (the #heading part); construct absolute URLs from relative URLs (file2.html from http://example.com/path/file1.html => http://example.com/path/file2.html); leave only HTTP, HTTPS and FTP URLs; remove duplicates; sort alphabetically. For example:
ftp://ftp.research.att.com/pub/c++std/WP/CD2
http://acts.nersc.gov/pooma
...
http://www.iana.org/domains/example
...
http://www.stroustrup.com/3rd.html
http://www.stroustrup.com/3rd_tour2.pdf
...
http://www-h.eng.cam.ac.uk/help/tpl/languages/C++.html
https://www.youtube.com/watch?v=jDqQudbtuqo&feature=youtu.be
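The output rules above (resolve relative URLs, drop fragments, keep only HTTP, HTTPS and FTP, remove duplicates, sort) could be sketched in Python with only the standard library. This is a minimal sketch, not a reference solution: the function name normalize_links and the sample hrefs are made up for illustration. Note that sorted() compares by code point, so it may order some URLs differently from a locale-aware sort (for instance, "www-h" sorts before "www." here).

```python
from urllib.parse import urldefrag, urljoin, urlsplit

ALLOWED_SCHEMES = {"http", "https", "ftp"}


def normalize_links(base_url, hrefs):
    """Return sorted, de-duplicated absolute URLs for hrefs found on base_url.

    normalize_links is a hypothetical helper name, not part of the challenge.
    """
    links = set()  # a set removes duplicates for us
    for href in hrefs:
        try:
            # Resolve the href against the page it came from, then drop "#...".
            absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
            scheme = urlsplit(absolute).scheme.lower()
        except ValueError:
            continue  # malformed URL: skip it rather than crash
        if scheme in ALLOWED_SCHEMES:
            links.add(absolute)
    return sorted(links)  # code-point order, not locale-aware


if __name__ == "__main__":
    page = "http://example.com/path/file1.html"
    sample_hrefs = [
        "file2.html",                  # relative to the page's directory
        "/file3.html",                 # relative to the site root
        "#top",                        # fragment only: resolves to the page itself
        "file2.html#intro",            # duplicate once the fragment is removed
        "mailto:someone@example.com",  # filtered out by scheme
        "ftp://example.com/pub/",
    ]
    for url in normalize_links(page, sample_hrefs):
        print(url)
```

Here file2.html resolves inside /path/, while /file3.html resolves against the site root.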
Error handling
If any of the pages can't be downloaded, output an error that includes the error message and the URL. If it's a console program, set a non-zero exit code.
The following cases are considered errors:
- Unresolvable domain.
- Unresponsive website.
- HTTP status codes of 400 and above (for example, 404 for a missing file).
- MIME content type different from text/html or application/xhtml+xml.
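One possible way to map the error cases above onto Python's standard library is sketched below. The names DownloadError, check_content_type, download and main are hypothetical, and the 10-second timeout is an assumption standing in for "unresponsive website". The sketch relies on urllib.request.urlopen raising HTTPError for 4xx and 5xx responses and URLError for failures such as an unresolvable domain.

```python
import sys
import urllib.error
import urllib.request

ACCEPTED_TYPES = {"text/html", "application/xhtml+xml"}


class DownloadError(Exception):
    """A page could not be used; the message always includes the URL."""


def check_content_type(url, content_type):
    """Raise DownloadError unless content_type is one of the accepted types."""
    if content_type not in ACCEPTED_TYPES:
        raise DownloadError(f"{url}: unsupported content type {content_type!r}")


def download(url, timeout=10):
    """Return the raw page bytes, or raise DownloadError with URL and reason."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # get_content_type() strips parameters such as "; charset=utf-8".
            check_content_type(url, response.headers.get_content_type())
            return response.read()
    except urllib.error.HTTPError as error:   # HTTP 4xx and 5xx responses
        raise DownloadError(f"{url}: HTTP {error.code} {error.reason}") from error
    except urllib.error.URLError as error:    # e.g. unresolvable domain
        raise DownloadError(f"{url}: {error.reason}") from error
    except OSError as error:                  # e.g. a timeout while reading
        raise DownloadError(f"{url}: {error}") from error


def main(urls):
    """Try every URL, report failures on stderr, return a process exit code."""
    exit_code = 0
    for url in urls:
        try:
            download(url)
        except DownloadError as error:
            print(f"error: {error}", file=sys.stderr)
            exit_code = 1  # keep going, but remember that something failed
    return exit_code
```

A console program would pass the return value of main to sys.exit so that the shell sees the failure.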
User-friendly
- If no URLs are supplied, provide usage instructions.
- HTTP and HTTPS should be supported (http://example.com/, https://example.com/).
- If scheme isn't provided, assume HTTP (example.com => http://example.com/).
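The user-friendliness rules above might look like this in Python. This is a sketch under assumptions: the script name links.py, the usage text, the exit status 2 for a usage error, and the helper name with_default_scheme are all invented for illustration. The check looks for "://" rather than parsing the scheme, so that input such as localhost:8080 is not mistaken for a URL with the scheme "localhost".

```python
import sys
from urllib.parse import urlsplit

USAGE = "usage: links.py URL [URL ...]"  # links.py is a hypothetical script name


def with_default_scheme(url):
    """Assume HTTP when the user typed a bare host such as example.com."""
    url = url.strip()
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    if not parts.path:
        # Turn http://example.com into http://example.com/
        parts = parts._replace(path="/")
    return parts.geturl()


def main(argv):
    """Print usage when no URLs are given; otherwise echo the normalized URLs."""
    if len(argv) < 2:
        print(USAGE, file=sys.stderr)
        return 2  # assumed convention: 2 signals a usage error
    for url in argv[1:]:
        print(with_default_scheme(url))
    return 0
```

In a real script, the entry point would be sys.exit(main(sys.argv)) under an `if __name__ == "__main__":` guard.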
Additional features
If your program includes any additional features (like proper HTML parsing instead of regular expressions, support for FTP and international domain names, etc.), please mention them in your answer.
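As an illustration of "proper HTML parsing instead of regular expressions", Python ships an HTML parser in its standard library (html.parser). The sketch below collects href values from anchor tags; the class name AnchorCollector and the function extract_hrefs are hypothetical. The parser lowercases tag and attribute names, decodes entities such as &amp; in attribute values, and does not report anchors that sit inside comments.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag, in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # tag and attribute names arrive lowercased, so <A HREF=...> matches too.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def extract_hrefs(html_text):
    """Return the raw (still possibly relative) hrefs found in html_text."""
    collector = AnchorCollector()
    collector.feed(html_text)
    collector.close()
    return collector.hrefs


if __name__ == "__main__":
    sample = (
        '<p><A HREF="a.html?x=1&amp;y=2">first</A> '
        '<!-- <a href="hidden.html">commented out</a> --> '
        '<a name="top">no href</a> '
        "<a href='b.html#frag'>second</a></p>"
    )
    print(extract_hrefs(sample))
```

The hrefs come back exactly as written in the page, so resolving relative URLs and stripping fragments remains a separate step.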
Rules
- Usage of third-party libraries is disallowed. For shell, assume GNU. Exception: boost for C++ is allowed (good luck).
- Your code should be readable, maintainable and pleasing to the eye (as much as your language permits). Showcase the good side of your language, not the bad.
- This isn't code golf; please avoid short code if it hurts readability.
- This isn't code trolling or an "enterprise" application; please don't overengineer or include unneeded features.
- The regular expressions you choose don't really matter. No matter how sophisticated they are, they're guaranteed to fail in some cases. Proper HTML parsers should be used in a real application, but very few languages include them in the standard library, so requiring them would be too limiting.
Overall, this challenge is about satisfying annoying, but realistic requirements which make your pretty one-line code much longer.
Huh. Way too anti-code-golf or what? :) – Athari – 2015-01-20T15:54:48.923
Way too similar to the previous challenge. – John Dvorak – 2015-01-20T15:55:59.357
@JanDvorak If you try writing code, you'll notice it's totally different. It's the difference between writing 2+2 and a calculator. – Athari – 2015-01-20T15:57:25.227
@JanDvorak You can compare http://codegolf.stackexchange.com/a/44732 and http://codegolf.stackexchange.com/a/44283 :) – Athari – 2015-01-20T16:17:07.077