Listing links from pages — with error handling

-3

In my previous question, Debunking Stroustrup's debunking of the myth “C++ is for large, complicated, programs only”, a lot of people complained that the comparison with Stroustrup's program isn't fair, because almost none of the programs include error checking. Okay, you asked for it. Let's start the most anti-code-golf challenge on the Code Golf site: let's check for errors! Let's write an application that is suitable for giving to end users. And while doing it, let's write clean and simple code.

Input

A list of URLs. They can come from command-line arguments, from multi-line text input — whatever you want. The only requirement is that they aren't hard-coded. For example:

http://www.stroustrup.com/C++.html example.com https://example.com

Output

A list of URLs taken from the hrefs of anchors (<a href="...">). Remove fragments from URLs (the #heading part); construct absolute URLs from relative ones (file2.html from http://example.com/path/file1.html => http://example.com/path/file2.html); keep only HTTP, HTTPS and FTP URLs; remove duplicates; sort alphabetically. For example:

ftp://ftp.research.att.com/pub/c++std/WP/CD2
http://acts.nersc.gov/pooma
...
http://www.iana.org/domains/example
...
http://www.stroustrup.com/3rd.html
http://www.stroustrup.com/3rd_tour2.pdf
...
http://www-h.eng.cam.ac.uk/help/tpl/languages/C++.html
https://www.youtube.com/watch?v=jDqQudbtuqo&feature=youtu.be
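
The output rules above can be sketched with Python's standard library as follows. This is only an illustration under stated assumptions: the names `ALLOWED_SCHEMES` and `normalize_links` are hypothetical, "alphabetically" is taken to mean plain string ordering, and malformed hrefs are silently skipped, which the challenge does not specify.

```python
from urllib.parse import urldefrag, urljoin, urlsplit

# Schemes the challenge asks to keep; everything else (mailto:, javascript:, ...) is dropped.
ALLOWED_SCHEMES = {"http", "https", "ftp"}

def normalize_links(base_url, hrefs):
    """Return sorted, de-duplicated absolute URLs built from raw href values."""
    result = set()
    for href in hrefs:
        try:
            absolute = urljoin(base_url, href.strip())  # resolve relative references
            absolute, _fragment = urldefrag(absolute)   # drop the "#heading" part
            scheme = urlsplit(absolute).scheme.lower()
        except ValueError:
            continue  # assumption: skip hrefs that cannot be parsed at all
        if scheme in ALLOWED_SCHEMES:
            result.add(absolute)
    # Assumption: "alphabetically" means ordinary string (code point) ordering.
    return sorted(result)
```

For instance, with the base URL http://example.com/path/file1.html, the href file2.html resolves to http://example.com/path/file2.html, while /file3.html#top resolves to http://example.com/file3.html.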

Error handling

If any of the pages can't be downloaded, output an error that includes the error message and the URL. If it's a console program, set a non-zero exit code.

The following cases are considered errors:

  • Unresolvable domain.
  • Unresponsive website.
  • HTTP status codes of 400 and above (for example, 404 for a missing file).
  • MIME content type different from text/html or application/xhtml+xml.
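
One possible way to cover these four cases in Python, using only `urllib` from the standard library, is sketched below. The helper names `check_response` and `fetch_html` and the 10-second timeout are assumptions made for illustration. The sketch treats status 400 and above as errors and handles HTTP and HTTPS downloads only.

```python
import urllib.error
import urllib.request

ALLOWED_CONTENT_TYPES = {"text/html", "application/xhtml+xml"}

def check_response(status, content_type_header):
    """Return an error message for an unacceptable response, or None if it is fine."""
    if status >= 400:
        return f"HTTP status {status}"
    # "text/html; charset=utf-8" -> "text/html"
    mime = (content_type_header or "").split(";")[0].strip().lower()
    if mime not in ALLOWED_CONTENT_TYPES:
        return f"unsupported content type {mime or '(none)'}"
    return None

def fetch_html(url, timeout=10):
    """Download an HTTP(S) page; raise RuntimeError that names the URL on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            problem = check_response(response.status, response.headers.get("Content-Type"))
            if problem:
                raise RuntimeError(f"{url}: {problem}")
            # Assumption: fall back to UTF-8 when the server declares no charset.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except urllib.error.HTTPError as e:  # urlopen raises this for 4xx/5xx responses
        raise RuntimeError(f"{url}: HTTP status {e.code}") from e
    except OSError as e:  # URLError (DNS failure, refused connection) and timeouts
        raise RuntimeError(f"{url}: {e}") from e
```

Wrapping every failure in one exception type that carries the URL keeps the reporting code in the caller trivial: it can print the message to standard error and return a non-zero exit code.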

User-friendly

  • If no URLs are supplied, provide usage instructions.
  • HTTP and HTTPS should be supported (http://example.com/, https://example.com/).
  • If the scheme isn't provided, assume HTTP (example.com => http://example.com/).
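
A minimal Python sketch of these user-friendliness rules might look like this. The names `with_default_scheme` and `main` are hypothetical, and detecting a missing scheme by looking for "://" is a simplifying assumption rather than a complete URL grammar.

```python
import sys
from urllib.parse import urlsplit, urlunsplit

def with_default_scheme(url):
    """Assume HTTP when no scheme is given: example.com -> http://example.com/."""
    if "://" not in url:  # simplifying assumption about what "no scheme" looks like
        url = "http://" + url
    parts = urlsplit(url)
    # Give bare hosts an explicit root path, as in the example above.
    return urlunsplit(parts._replace(path=parts.path or "/"))

def main(argv):
    """Return the process exit code: 1 with usage text if no URLs were given."""
    if len(argv) < 2:
        print(f"Usage: {argv[0]} url1 url2 ... urlN", file=sys.stderr)
        return 1
    for url in map(with_default_scheme, argv[1:]):
        print(url)  # placeholder: a full program would download and process the URL
    return 0
```

A script would typically end with `sys.exit(main(sys.argv))` so that the return value becomes the exit code seen by the shell.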

Additional features

If your program includes any additional features (such as proper HTML parsing instead of regular expressions, support for FTP or international domain names, etc.), please describe them in your answer.

Rules

  • Usage of third-party libraries is disallowed. For shell, assume GNU. Exception: boost for C++ is allowed (good luck).
  • Your code should be readable, maintainable and pleasing to the eye (as much as your language permits). Showcase the good side of your language, not bad.
  • It isn't code golf; please avoid short code if it hurts readability.
  • It also isn't an "enterprise" application; please don't overengineer or include unneeded features.
  • The regular expressions you choose don't really matter. No matter how sophisticated they are, they're guaranteed to fail in some cases. Proper HTML parsers should be used in a real application, but very few languages include them in the standard library, so requiring them would be too limiting.
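
Python is one of the languages whose standard library does include an HTML parser, so the regex-versus-parser point can be illustrated without third-party code. The sketch below uses `html.parser.HTMLParser`; the class name `AnchorHrefParser` and the function `extract_hrefs` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser

class AnchorHrefParser(HTMLParser):
    """Collect the href attribute of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this method.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)
                break  # only the first href of a tag is kept

def extract_hrefs(html):
    """Return href values of anchors in document order, duplicates included."""
    parser = AnchorHrefParser()
    parser.feed(html)
    parser.close()
    return parser.hrefs
```

Unlike a pattern such as `href="([^"]+)"`, this ignores href attributes on other elements (for example <link>) and accepts single-quoted or upper-case attributes.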

Overall, this challenge is about satisfying annoying but realistic requirements that make your pretty one-line code much longer.

Athari

Posted 2015-01-20T15:45:45.827

Reputation: 2 319

Huh. Way too anti-code-golf or what? :) – Athari – 2015-01-20T15:54:48.923

Way too similar to the previous challenge. – Jan Dvorak – 2015-01-20T15:55:59.357

@JanDvorak If you try writing code, you'll notice it's totally different. It's the difference between writing 2+2 and a calculator. – Athari – 2015-01-20T15:57:25.227

@JanDvorak You can compare http://codegolf.stackexchange.com/a/44732 and http://codegolf.stackexchange.com/a/44283 :) – Athari – 2015-01-20T16:17:07.077

Answers

1

C#

That's 6 times more lines of code than for the previous question.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Reflection;
using System.Text.RegularExpressions;

class Program {
    ISet<string> AllowedContentTypes = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "text/html", "application/xhtml+xml" };
    ISet<string> AllowedUriSchemes = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { Uri.UriSchemeHttp, Uri.UriSchemeHttps, Uri.UriSchemeFtp };

    void ListLinks(string[] links) {
        if (links.Length == 0) {
            string exeName = Path.GetFileName(Assembly.GetEntryAssembly().CodeBase);
            Console.Error.WriteLine("Usage: {0} url1 url2 ... urlN", exeName);
            Environment.Exit(1);
        }
        try {
            links
                .Select(link => new UriBuilder(link).Uri)
                .Select(GetHrefsFromUri).SelectMany(h => h)
                .Distinct().OrderBy(h => h).ToList()
                .ForEach(Console.WriteLine);
        }
        catch (Exception e) {
            for (Exception ei = e; ei != null; ei = ei.InnerException)
                Console.Error.WriteLine("Error: " + ei.Message);
            Environment.Exit(1);
        }
    }

    IEnumerable<string> GetHrefsFromUri(Uri link) {
        string html;
        try {
            // Dispose the response so the underlying connection is released.
            using (WebResponse response = WebRequest.Create(link).GetResponse()) {
                // ContentType throws for FTP responses, so check it for HTTP(S) only.
                if ((link.Scheme == Uri.UriSchemeHttp || link.Scheme == Uri.UriSchemeHttps) &&
                        !AllowedContentTypes.Contains(response.ContentType.Split(';')[0]))
                    throw new Exception("Downloaded file must be HTML or XHTML.");
                // GetResponseStream can return null; fall back to an empty stream.
                using (var stream = new StreamReader(response.GetResponseStream() ?? Stream.Null))
                    html = stream.ReadToEnd();
            }
        }
        catch (Exception e) {
            // Built-in messages omit the URL, so wrap them in one that includes it.
            throw new Exception(string.Format("Failed to download {0}.", link), e);
        }
        // Resolve each href against the page URL, filter by scheme, strip the fragment.
        return Regex.Matches(html, @"href=""([^""]+)""").Cast<Match>()
            .Select(m => new Uri(link, m.Groups[1].ToString()))
            .Where(u => AllowedUriSchemes.Contains(u.Scheme))
            .Select(u => u.GetLeftPart(UriPartial.Query));
    }

    static void Main(string[] args) { new Program().ListLinks(args); }
}

Many annoying requirements indeed. I couldn't use WebClient and had to use WebRequest, which has several quirks, such as the ContentType property throwing an exception for FTP requests and the GetResponseStream method returning null in some cases. Its error messages don't include URLs, so I had to wrap exceptions in my own and print the whole InnerException chain. When using MatchCollection with LINQ, a Cast call is required. The Uri classes turned out to be pretty featureful, though.

Athari

Posted 2015-01-20T15:45:45.827

Reputation: 2 319