Debunking Stroustrup's debunking of the myth “C++ is for large, complicated, programs only”

Score: 160

Stroustrup has recently posted a series of posts debunking popular myths about C++. The fifth myth is: “C++ is for large, complicated, programs only”. To debunk it, he wrote a simple C++ program that downloads a web page and extracts links from it. Here it is:

#include <string>
#include <set>
#include <iostream>
#include <sstream>
#include <regex>
#include <boost/asio.hpp>

using namespace std;

set<string> get_strings(istream& is, regex pat)
{
    set<string> res;
    smatch m;
    for (string s; getline(is, s);)  // read a line
        if (regex_search(s, m, pat))
            res.insert(m[0]);              // save match in set
    return res;
}

void connect_to_file(iostream& s, const string& server, const string& file)
// open a connection to server and attach the requested file to s
// skip headers
{
    if (!s)
        throw runtime_error{ "can't connect\n" };

    // Request to read the file from the server:
    s << "GET " << "http://" + server + "/" + file << " HTTP/1.0\r\n";
    s << "Host: " << server << "\r\n";
    s << "Accept: */*\r\n";
    s << "Connection: close\r\n\r\n";

    // Check that the response is OK:
    string http_version;
    unsigned int status_code;
    s >> http_version >> status_code;

    string status_message;
    getline(s, status_message);
    if (!s || http_version.substr(0, 5) != "HTTP/")
        throw runtime_error{ "Invalid response\n" };

    if (status_code != 200)
        throw runtime_error{ "Response returned with status code" };

    // Discard the response headers, which are terminated by a blank line:
    string header;
    while (getline(s, header) && header != "\r")
        ;
}

int main()
{
    try {
        string server = "www.stroustrup.com";
        boost::asio::ip::tcp::iostream s{ server, "http" };  // make a connection
        connect_to_file(s, server, "C++.html");    // check and open file

        regex pat{ R"((http://)?www([./#\+-]\w*)+)" }; // URL
        for (auto x : get_strings(s, pat))    // look for URLs
            cout << x << '\n';
    }
    catch (std::exception& e) {
        std::cout << "Exception: " << e.what() << "\n";
        return 1;
    }
}

Let's show Stroustrup what a small and readable program actually looks like.

  1. Download http://www.stroustrup.com/C++.html
  2. List all links:

    http://www-h.eng.cam.ac.uk/help/tpl/languages/C++.html
    http://www.accu.org
    http://www.artima.co/cppsource
    http://www.boost.org
    ...
    

You can use any language, but no third-party libraries are allowed.

Winner

The C++ answer won by votes, but it relies on a semi-third-party library (which is disallowed by the rules) and, along with its close competitor Bash, on a hacked-together HTTP client (it won't work with HTTPS, gzip, redirects, etc.). So Wolfram is the clear winner. Another solution that comes close in terms of size and readability is PowerShell (with the improvement from the comments), but it hasn't received much attention. Mainstream languages (Python, C#) came pretty close too.

Athari

Posted 10 years ago

Reputation: 2 319

3 Comments purged as they were all either obsolete or off-topic. – Doorknob – 10 years ago

1 Clarification: Shall the list of links be as incomplete as Stroustrup's, i.e. skip any non-http links that don't include www (including the https, ftp, local and anchor ones on that very site) and report false positives, i.e. non-linked mentions of http:// as well (not here, but in general)? – Tobias Kienzler – 10 years ago

Why is pointing out that ALL of the posted answers don't apply to what the OP asked obsolete or off-topic? – Dunk – 10 years ago

43 To each his own, I've been called worse. If the OP's goal wasn't to try and somehow prove that Stroustrup is wrong, then I'd agree with your assessment. But the entire premise of the question is to show how "your favorite language" can do the same thing as these 50 lines of C++ in far fewer lines of code. The problem is that none of the examples do the same thing. In particular, none of the answers perform any error checking, none of the answers provide reusable functions, and most of the answers don't provide a complete program. The Stroustrup example provides all of that. – Dunk – 10 years ago

19 What's sad is his web page isn't even valid UTF-8. Now I've gotta work around that, despite his server advertising Content-Type: text/html; charset=UTF-8... I'm gonna email him. – Cornstalks – 10 years ago

I wish I'd thought of coming here and asking this question when I read that piece. Certainly C++ is better than it was in the past, but it's by no means optimal. – Mark Ransom – 10 years ago

27 @Dunk The other examples don't provide reusable functions because they accomplish the entire functionality of those functions in a single line, and it makes no sense to make that a whole function on its own; the C++ example doesn't perform any error checking that isn't handled natively in an almost identical manner; and the phrase "complete program" is almost meaningless. – Jason – 10 years ago

16 "You can use any language, but no third-party libraries are allowed." I don't think that's a fair requirement, considering boost/asio is used up there, which is a third-party library. I mean, how will languages that don't include URL/TCP fetching as part of their standard library compete? – greatwolf – 10 years ago

@greatwolf They don't. That's the point. – Athari – 10 years ago

1 @Jason - upvote for "C++ example doesn't perform any error checking that isn't handled natively in almost an identical manner". – None – 10 years ago

1 Virtually all the answers fail the task (including the original, lol) because they don't pick up relative links! I did mine here: http://forum.dlang.org/thread/tqegmjcofcnwapqitrdo@forum.dlang.org#post-nxcpwmyjfbfbjxqtmrzd:40forum.dlang.org – Adam D. Ruppe – 10 years ago

4 Is nobody gonna talk about using regexes to parse HTML? Really? I mean, Stroustrup does it himself, but at least his regex doesn't rely on the HTML attribute using " and only ever " to delimit its value. 9 out of 10 answers here would fail on <a href='http://htmlparsing.com/regexes.html'> – funkwurm – 10 years ago
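funkwurm's objection is easy to demonstrate with a few lines of Python (the HTML fragment below is made up, and the pattern is a simplified form of the double-quote-only regexes used in several of the answers):

```python
import re

# Made-up fragment: one double-quoted and one single-quoted href
html = ('<a href="http://www.boost.org">ok</a> '
        "<a href='http://htmlparsing.com/regexes.html'>missed</a>")

# Simplified double-quote-only pattern, as used by several answers below
links = re.findall(r'"(https?://.*?)"', html)
print(links)  # the single-quoted link is silently dropped
```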

@funkwurm Problems with the provided solutions have been mentioned many times; you just need to look through the comments. The famous "parsing HTML with regex" answer from SO has been brought up too. Many comments have been removed by the mod, though. – Athari – 10 years ago

@undergroundmonorail BF++ does. It is giving me strange and deviant thoughts. – ymbirtt – 10 years ago

2 I have to admit, I'm surprised by Stroustrup's claim that most people believe that C++ is used for large programs. I (probably incorrectly) believe the opposite - that for large programs, it's worthwhile to use a language like Java or C# that makes it harder to shoot yourself in the foot! – Kevin – 10 years ago

11 He's... parsing... html... with... regex... twitch – Riot – 10 years ago

1 It's a little odd to me that Stroustrup's challenge is to write C++ code that imports no third-party code, and the first line (or so; I'm not going to page back and lose my post thus far) is an import of Boost's asio library. It kind of makes the OP's opinion suspect. But in any case, comparing different languages on this task is very much comparing apples and oranges. It doesn't really make much sense to use a hammer to tap in a pin, but it can be done; it doesn't make much sense to write assembly code to extract URLs from a web page, but it can be done. I suspect you could write a RoR program – None – 10 years ago

2 This code snippet appears in a hacking scene in the Netflix series "Limitless"; Season 1, Episode 10, ~13:05. Proof: http://i.imgur.com/7a16H8y.png – Dan – 8 years ago

Answers

Score: 116

Wolfram

This feels like complete cheating:

Import["http://www.stroustrup.com/C++.html", "Hyperlinks"]

So just add some honest parsing on top:

Cases[
 Import["http://www.stroustrup.com/C++.html", "XMLObject"],
 XMLElement["a", {___, "href" -> link_, ___}, ___] :> 
  link /; StringMatchQ[link, RegularExpression["((http://)?www([./#\\+-]\\w*)+)"]]
, Infinity]

swish

Posted 10 years ago

Reputation: 7 484

50 Nope, I don't see any cheating here. This challenge is about bringing out the best of your language. And that first line is the epitome of "small and readable". – Martin Ender – 10 years ago

An answer that can ignore the silly arguments about catching ftp links. Brilliant. – Seth Battin – 10 years ago

Came here to offer this exact solution, pleased to see others have appreciated it as well. – Michael Stern – 10 years ago

@MartinBüttner In that case you might want to consider downvoting http://meta.codegolf.stackexchange.com/a/1078/12130 – David Mulder – 10 years ago

@DavidMulder Who says I didn't? ;) – Martin Ender – 10 years ago

@MartinBüttner Ah, if you're aware of it, then you should probably enforce the community consensus, though~ (and try to change the consensus if you feel so inclined). – David Mulder – 10 years ago

6 @DavidMulder Technically, the loophole is currently not valid, since the vote breakdown is +41/-21 (and the loophole question states that loopholes are accepted if there are at least twice as many upvotes as downvotes). A close call, admittedly, but still. ;) Furthermore, this is a popularity contest, not a code golf, and in particular, it's a pop-con about showing how easily this can be done in a given language, which is why I think the loophole doesn't really apply to this challenge anyway (since the challenge basically asks for it). – Martin Ender – 10 years ago

@MartinBüttner Oh nice~, I couldn't check that. Sooo, is anything stopping people right now from adding a purpose-built function to their code-golf language forks for each new question? That seems the 'sensible' next step in that case~ (It's the problem I have with most of these challenges: languages like OpenEdge, ColdFusion and Wolfram are what I tend to call framework languages, because they contain not just a language itself, but also very high-level purpose-built features you would normally find in frameworks and/or libraries. The true strength of a language is not in the number of purpose-built functions, but in the ease of access to such a function if you need it. For example, with node.js this would be an npm install jsdom call, whereas with Mathematica you have to search the interwebs for the libraries you need.) – David Mulder – 10 years ago

@DavidMulder That's a different issue. Only languages (and language versions) which were available before a challenge was posted may be used in that challenge. If you want to discuss this further, feel free to join us in chat. – Martin Ender – 10 years ago

@MartinBüttner: Actually, it's bashing C++ for not declaring "everything and the kitchen sink" to be part of the standard library, and thus making it unsuitable for really small devices, restricted environments, and other situations... – Deduplicator – 10 years ago

I suppose this Import statement does true HTML parsing rather than the original code's naive regex matching. So it's not equivalent. It's closer to what you likely want in real life, but still, it's not doing the same as the original… – Holger – 6 years ago

Score: 115

C++

#include <boost/asio.hpp>
#include <regex>
#include <iostream>
int main() {
    std::string server = "www.stroustrup.com";
    std::string request = "GET http://" + server + "/C++.html HTTP/1.0\r\nHost: " + server + "\r\n\r\n";
    boost::asio::ip::tcp::iostream s{server, "http"};
    s << request;
    std::regex pat{R"((http://)?www([./#\+-]\w*)+)"};
    std::smatch m;
    for (std::string l; getline(s, l);)
        if (std::regex_search(l, m, pat))
            std::cout << m[0] << "\n";
}

The main shortcoming is the awkward nature of boost::asio; I'm sure it could be even shorter with a better library.

congusbongus

Posted 10 years ago

Reputation: 1 259

3 I think using boost is fair, since large parts of it have been integrated into the standard library in the past, and there have been plans to integrate asio in a future version. – usm – 10 years ago

What about ftp://ftp.research.att.com/pub/c++std/WP/CD2 and https://www.youtube.com/watch?v=jDqQudbtuqo&feature=youtu.be? – Tobias Kienzler – 10 years ago

167 Funny how "no third-party libraries" means Python may still import urllib2, C# may still be using System.Net, Haskell may still import Network.HTTP, but a C++ coder must make excuses for #include <boost/asio.hpp>, as if having a metric crapton of specialized, purpose-built C++ (and C!) libraries available to choose from is something to be ashamed of just because the committee didn't bother to force-feed you a specific one... – DevSolar – 10 years ago

19 @DevSolar almost went for creating a 2nd account to give you another upvote for that comment – user – 10 years ago

1 Ah, there's the answer I was looking for. – Navin – 10 years ago

1 using std::string; would somewhat simplify the code (and would make the long line short enough not to require the horizontal scroll bar). – Elazar – 10 years ago

15 @DevSolar System.Net isn't forced; it's just a high-quality library following all .NET recommendations, included with the language. There are alternative implementations, but having HTTP support in the standard library means writing simple apps is simple, means better interoperability between third-party libraries, means fewer dependencies, means easy implementation of facades, etc. Imagine a world without std::string; imagine how everyone uses their own string library, and imagine all the difficulties that come with it. – Athari – 10 years ago

4 @DevSolar You can consider boost a part of the standard library for C++ if you wish; it's a staging area for the standard library anyway. However, even with boost, you still have to know the inner workings of the HTTP protocol. It may look simple in this sample, but imagine you need to support HTTPS, gzip, redirects etc. Will it be as simple? A bare TCP socket quickly becomes insufficient. – Athari – 10 years ago

2 @DevSolar As a Python programmer, I'm fairly certain urllib2 is a standard package. – HarryCBurn – 10 years ago

17 @DevSolar: urllib2 is not 3rd party. It is in the stdlib, like <iostream> in C++. urllib2 in Python is always available, unlike <boost/asio.hpp> in C++. If we were allowed to use 3rd-party modules, I would use lxml or BeautifulSoup in Python. – jfs – 10 years ago

You don't need the Host header in HTTP/1.0. – jfs – 10 years ago

2 @J.F.Sebastian: You, and Iplodman as well, completely missed the point of what I was saying. – DevSolar – 10 years ago

3 @DevSolar: your comment is at best misleading. How many people of the 53 (so far) that upvoted your comment know that urllib2 is not a 3rd-party library in Python? Could you express your point without mentioning "third-party library"? – jfs – 10 years ago

4 @J.F.Sebastian: I don't think there's anything misleading in what I wrote. You're probably reading too much into it. To be more clear: yes, urllib2 is part of the Python standard library, while C++ relies on third-party libraries like Boost, OpenSSL, the OS API, or whatever else suits the purpose. This, together with several other things like native datatypes, directly executable output etc., is a design decision of the language. It's what makes C and C++ system languages, while Java, Python et al. are a different breed entirely. – DevSolar – 10 years ago

4 @J.F.Sebastian: Or, to put it differently, I don't challenge you to write a bootable operating kernel in Python or Java either, claiming that Java or Python are in any way inferior languages because they couldn't do it with the same ease as C or C++ can. They are simply designed for different purposes. – DevSolar – 10 years ago

6 @DevSolar: If you don't like the HTTP download problem, then ask why Stroustrup has chosen it, instead of writing a "bootable operating kernel", in an attempt to debunk the "C++ is for large, complicated, programs only" "myth". If an OS kernel is not large and complex by your standards, then I don't know what is. – jfs – 10 years ago

1 @J.F.Sebastian: What I don't like is people perpetuating these "my language is better than yours" flamewars like it's still the 1990s -- and with the same arguments as back then, too. EOT. – DevSolar – 10 years ago

1 @DevSolar: I don't see what that has to do with anything I said. Your (most upvoted) comment is either factually incorrect or, if we are being generous, misleading. It is a fact. Don't make false statements. – jfs – 10 years ago

6 @J.F.Sebastian: It is neither misleading nor even ambiguous. "No third party libraries" indeed means that Python may import urllib2 and C++ may not #include <boost/...> without making excuses. That is exactly what I wrote. – DevSolar – 10 years ago

1 @DevSolar: if your point is that adding 3rd-party C++ libraries to your code is as easy as import urllib2 in Python, then it is also factually incorrect or, if we are being generous, misleading. – jfs – 10 years ago

1 Let us continue this discussion in chat. – DevSolar – 10 years ago

4 @DevSolar When I read your most-upvoted comment, I thought you were saying that it wasn't really fair that other languages got to use their 3rd-party but fairly standard URL/HTTP libraries. So I'd have to agree with J.F. to some degree, or that there was at least an accidental implication of something you didn't mean. Thanks for clearing up that urllib2 and the others are in fact standard. – Peter Cordes – 10 years ago

23 Also, I think the most important point here is just that C++ doesn't standardize as much stuff in its standard libraries as other languages, but there still are widely-used, robust, portable libraries for a lot of the same tasks that are standard in languages like Python, and some of these libs are almost a de-facto standard. And some of this is the result of C++ being able to target embedded systems with small binaries and small libraries. – Peter Cordes – 10 years ago

@PeterCordes: THANK YOU. I had basically given up on trying to put my meaning into words. You nailed it. – DevSolar – 10 years ago

8 Thank you for debunking the debunking of Stroustrup's debunking. – Charles – 10 years ago

@PeterCordes: The probability of success of pip install python-package is much higher if python-package is written in pure Python, compared to C/C++ extensions. – jfs – 10 years ago

3 @DevSolar: Boost is a third-party library. None of those others you listed are, so I don't see what's funny (unfair) about it. – BlueRaja - Danny Pflughoeft – 10 years ago

2 @BlueRaja-DannyPflughoeft: The point is, to cripple those other "everything and the kitchen sink is part of the standard library" languages as much as you cripple C and C++ by arbitrarily reducing the available libraries to just those defined in the standard itself (due to them being extremely widely applicable and/or essential basic building blocks), you have to reduce them to the core as well. Which means throwing out just about all their libraries too, even if they are declared "standard". – Deduplicator – 10 years ago

1 @Deduplicator: what about the simplicity of using third-party libraries? See my last comment above. It might be easier to write an in-house library instead of using a third-party one if packaging and deployment are a nightmare for a small, simple program. – jfs – 10 years ago

Score: 86

Pure Bash on Linux/OS X (no external utilities)

HTTP client software is notoriously bloated. We don't want those kinds of dependencies. Instead we can push the appropriate headers down a TCP stream and read the result. No need to call archaic utilities like grep or sed to parse the result.

domain="www.stroustrup.com"
path="C++.html"
exec 3<> /dev/tcp/$domain/80
printf "GET /$path HTTP/1.1\r\nhost: %s\r\nConnection: close\r\n\r\n" "$domain" >&3
while read -u3; do
    if [[ "$REPLY" =~ http://[^\"]* ]]; then
        printf '%s\n' "$BASH_REMATCH"
    fi
done

Meh - I suppose it could be more readable...

Digital Trauma

Posted 10 years ago

Reputation: 64 644

1 Like this one using Unix file handles for the pipes. – javadba – 10 years ago

2 Wow, never thought one could do this without external utils. Although it seems my bash 3.2.17 on LFS is a tiny bit obsolete, so it doesn't support mapfile :) – Ruslan – 10 years ago

@Ruslan Yep, mapfile comes with bash 4.x. The same thing is totally doable with a while read loop as well. – Digital Trauma – 10 years ago

3 @Ruslan I changed it to while read instead of mapfile. More portable and more readable, I think. – Digital Trauma – 10 years ago

1 Works on OS X, too! – Alex Cohn – 10 years ago

Score: 65

Python 2

import urllib2 as u, re
s = "http://www.stroustrup.com/C++.html"
w = u.urlopen(s)
h = w.read()
l = re.findall('"((http)s?://.*?)"', h)
print l

Lame, but works

eptgrant

Posted 10 years ago

Reputation: 731

9 Why not chain a lot of those calls? l = re.findall('"((http)s?://.*?)"', u.urlopen(s).read()) – Fake Name – 10 years ago

13 It is short, but it is not idiomatic (readability counts in Python). – jfs – 10 years ago

24 Hmmm... if all my code ignored errors like this example, then 75% to 90% of my work would already be done on every project I work on. – Dunk – 10 years ago

What about ftp://ftp.research.att.com/pub/c++std/WP/CD2? And the output is not a plain list but a list of tuples... – Tobias Kienzler – 10 years ago

5 @TobiasKienzler the reference C++ code doesn't catch the ftp URLs either. – jwg – 10 years ago

@jwg Indeed. I asked for clarification on that. Generalizing the regex shouldn't be too difficult; nonetheless, I prefer avoiding regex in this case... – Tobias Kienzler – 10 years ago

20 @Dunk: Suppose the example did catch some exception (e.g. from urlopen()). What should it do with such an exception, other than crash and die? If it's going to crash and die anyway, why not just let Python handle the crashing-and-dying, and leave off the exception handling altogether? – Kevin – 10 years ago

@Kevin: My nit is that this answer (and none of the others) answers the OP's challenge as asked. The Stroustrup code example does more than the answers. This gives the misleading appearance that the answers can do the same thing in fewer lines of code. The fact is that none of the answers do the same as the original code, so it isn't an apples-to-apples comparison. As for why not let Python handle the crashing and dying: because someone might actually want to use the code in an actual program. Stroustrup's example code allows one to do so. – Dunk – 10 years ago

8 @Dunk: If I were using somebody else's Python code, I'd much rather they not catch urlopen errors than (say) catch them and call sys.exit("something's borked!"). If they do the latter, I have to catch SystemExit, which is never fun. – Kevin – 10 years ago

You don't need (http) to be in a group; (https?://.*?) works just as well. – njzk2 – 10 years ago

Use raw strings in regexes. – thefourtheye – 10 years ago
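Folding the comments' suggestions together (a raw string, dropping the extra (http) group, chaining the calls) gives something like this sketch; the network fetch is left as a comment so the extraction logic stands on its own:

```python
import re

def find_links(html):
    # Raw string and a single capture group, per the comments above
    return re.findall(r'"(https?://.*?)"', html)

# Chained fetch (Python 3 spelling; requires network access):
#   from urllib.request import urlopen
#   links = find_links(urlopen("http://www.stroustrup.com/C++.html").read().decode("utf-8", "replace"))

print(find_links('<a href="http://www.boost.org">B</a> <a href="https://www.accu.org">A</a>'))
# ['http://www.boost.org', 'https://www.accu.org']
```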

Score: 55

C#

using System;
using System.Net;
using System.Text.RegularExpressions;

class Program {
    static void Main() {
        string html = new WebClient().DownloadString("http://www.stroustrup.com/C++.html");
        foreach (Match match in Regex.Matches(html, @"https?://[^""]+"))
            Console.WriteLine(match);
    }
}

Athari

Posted 10 years ago

Reputation: 2 319

4 You can use var html, and probably var match, to shave off a few characters. – Superbest – 10 years ago

15 @Superbest I can make names single-character and get rid of the html variable altogether too, but that's not what I'm after. – Athari – 10 years ago

6 @Superbest not [tag:code-golf]. :D – Kroltan – 10 years ago

5 Well, it improves readability, too. Is there ever a reason not to use var when it won't impact code semantics? – Superbest – 10 years ago

6 @Superbest: "it improves readability" is subjective. Personally, I think explicitly stating the type of the variable improves readability (usually, as in this code here). I don't want to debate this, though; I just want to point out that alternative views exist. – Cornstalks – 10 years ago

2 If you don't know what object type is returned by a method, "var" makes reading other people's code (exploring libraries unknown to you) more difficult and time-consuming. Say you read "var name = GetName()". Is it a string? Or something else? You can't know without checking. – Traubenfuchs – 10 years ago

1 But var actually says "don't care", so GetName() may change its return type to an unrelated one, as long as it still gives me what I am looking for. – Elazar – 10 years ago

@Superbest you can't use var for Match anyway, since MatchCollection only implements ICollection. This is a major PITA if you ask me, and GetEnumerator can't be changed without breaking changes. – Lucas Trzesniewski – 10 years ago

@LucasTrzesniewski var can be used in this sample; you'll get object, which is sufficient for Console.WriteLine. – Athari – 10 years ago

@Athari oh right, I didn't read carefully enough – Lucas Trzesniewski – 10 years ago

If you care about the use of var, go here: http://stackoverflow.com/questions/41479/use-of-var-keyword-in-c-sharp – Jodrell – 10 years ago

1 Personally I think this code is more readable than some of the other answers, even if it's not as short. – Pharap – 10 years ago

Score: 53

"No third-party" is a fallacy

I think the "no third-party" assumption is a fallacy, and a specific fallacy that afflicts C++ developers, since it's so hard to make reusable code in C++. When you are developing anything at all, even a small script, you will always make use of whatever pieces of reusable code are available to you.

The thing is, in languages like Perl, Python, Ruby (to name a few), reusing someone else's code is not only easy, but it is how most people actually write code most of the time.

C++, with its nearly impossible-to-maintain ABI-compatibility requirements, makes that a much tougher job. You end up with a project like Boost, which is a monstrous repository of code with very little composability outside it.

A CPAN example

Just for the fun of it, here is a CPAN-based example with proper parsing of the HTML, instead of trying to parse HTML with a regex:

#!/usr/bin/perl
use HTML::LinkExtor;
sub callback {
   my ($tag, %links) = @_;
   print map { "$_\n" } values %links
}
$p = HTML::LinkExtor->new(\&callback, "http://www.stroustrup.com/C++.html");

Daniel Ruoso

Posted 10 years ago

Reputation: 631

That HTML answer, funny as it is, is a fallacy. "HTML is not a regular language and hence cannot be parsed by regular expressions." is true, but HTML tags definitely are a regular language, and something to turn a stream into bite-sized chunks of tags and text nodes would serve a similar purpose to a lexer. Reminds me of that supposed monster regex purportedly necessary for parsing email addresses, when what it's actually parsing is a different RFC822 element than the one we generally consider to be an email address. – Random832 – 10 years ago

From the point of view of a Perl hacker, the entire CPAN is part of Perl's standard library :) – slebetman – 10 years ago

6 Upvote for addressing the point of the 3rd-party libs, but: crap, making reusable code in C++ is as easy-cheesy as in any other language. Using, and especially finding, reusable code may be a tad harder, but the only thing that's seriously problematic is reusing compiled artifacts, and that's often a non-issue in interpreted languages like Perl, etc. – Martin Ba – 10 years ago

4 To stretch an analogy, Boost is more like CPAN - pick and choose. You don't call CPAN a "monstrous repository of code" just because there's such a lot of stuff in there you don't use? – Martin Ba – 10 years ago

22 CPAN is a 'monstrous repository of code', by any reasonable definition of those four words. – jwg – 10 years ago

1 That's not necessarily a bad thing. :) As long as you can pick and choose, at least. – cHao – 10 years ago

3 @MartinBa I disagree. C++ being a compiled language, requiring every executable to rebuild its full stack of dependencies because it's hard to maintain ABI compatibility seriously hinders code reusability. In order to produce a reusable library in C++, you have to go to really great lengths to make sure you don't force yourself into ABI-incompatible changes all the time. – Daniel Ruoso – 10 years ago

1 @jwg the difference is that CPAN is extremely composable: not only are you allowed to pick and choose which libraries to use, but they're usually made compatible with different versions of their dependencies (and the cpantesters help ensure that is true). With Boost you may select just a portion of it, but you have to use all of it from the same version. – Daniel Ruoso – 10 years ago

You're missing a backslash on your callback. – Slade – 10 years ago

1But if you allow third-party libraries in this contest then every answer becomes thirdPartyLibraryIJustMade.getStroustrupLinksAndPrintThem() – JLRishe – 10 years ago

@DanielRuoso - why does "producing a reusable library" entail any compilation for you? If the library just provides source code files, it is still a reusable library, isn't it? – Martin Ba – 10 years ago

6@MartinBa because having to rebuild the whole universe everytime you want to implement a simple task is unbearable. – Daniel Ruoso – 10 years ago

2You have to do something to rule out third-party code, because otherwise every program in every language comes out as a one-liner (plus include and main boilerplate and whatnot), because you write a third-party module that does exactly what's asked for, publish it, and use it in "your program". The tricky question is how to draw the line in order to get a reasonable comparison between programming languages, and that in turn depends on the purpose of the comparison. Stroustrup opted not to use a proper HTTP client, which would be a most peculiar decision in real code. – Steve Jessop – 10 years ago

1@SteveJessop I disagree. I have implemented and published CPAN modules for particular tasks before and those became usable by someone else. When I started this example, I searched for HTML::Parser, I didn't even know the module for extracting the links existed, but apparently extracting the links is a common-enough problem that someone went through the trouble of posting a reusable module that does just that.

This is a fundamental aspect of the programming language, and a part where other languages (Perl in particular, but also Python and Ruby) are really good at, and C++ is really poor at. – Daniel Ruoso – 10 years ago

@DanielRuoso: I mean that if Stroustrup put that code on github with a different name for the main() function, would you then accept that it's possible to download a webpage and extract the links in C++ in one line, that one line being int main() { return stroustrups_thingy(); }? Much more concise than your Perl code. Now, my C++ one-liner rather shamelessly uses a third-party library intended for the purpose of rigging this benchmark, but you think that no restrictions on doing that are needed in order to make meaningful comparisons between languages, right? ;-) – Steve Jessop – 10 years ago

... so in the end all languages are equally concise at doing anything, because all the details required to do anything can be hidden away in a benchmark-rigger library. It's like forking CJam for a particular code golf challenge so that it solves that challenge in one character. You don't have to ban it, but if you don't ban it you know what will happen. – Steve Jessop – 10 years ago

Still, I like your solution since it points out that presumably without intending to, Stroustrup chose an example challenge that a Perl programmer wouldn't even have to address, using that CPAN module instead, since precisely the same problem has been solved before and the solution published. But if a Perl programmer did choose to address it, OK, we can look at the source of the module if we want to see how wordy it is. – Steve Jessop – 10 years ago

@SteveJessop the problem is that in C++ the amount of reusable code is just not as large: see http://stackoverflow.com/questions/822581/what-c-library-should-i-use-to-implement-a-http-client http://www.mostthingsweb.com/2013/02/parsing-html-with-c/ – Daniel Ruoso – 10 years ago

Mojolicious would also allow something like use Mojo::UserAgent; say for Mojo::UserAgent->new->get("http://www.stroustrup.com/C++.html")->res->dom->find("a[href^=http]")->map(attr => "href")->each – hobbs – 10 years ago

(linebreaks could easily be added, outside of the context of an SO comment box) – hobbs – 10 years ago

47

UNIX shell

lynx -dump http://www.stroustrup.com/C++.html | grep -o '\w*://.*'

Also finds an ftp:// link :)

Another way, without relying on :// syntax:

lynx -dump -listonly http://www.stroustrup.com/C++.html | sed -n 's/^[ 0-9.]\+//p'

Ruslan

Posted 10 years ago

Reputation: 1 283

I can't work out whether to +1 because using a web browser to download a web page is the right tool for the job or to -1 because the challenge is to write a program to do blahblahblah and you just called a program to do the blahing. – David Richerby – 10 years ago

@DavidRicherby Arguably, a web browser is the right tool to view a webpage. The right tool to download a webpage is an http request. Either way, this solution is fun! – Gusdor – 10 years ago

I think it's better to replace lynx with curl or wget. They are more commonly used to download a webpage. – Pavel Strakhov – 10 years ago

@PavelStrakhov I chose lynx exactly because it can dump the links without me doing anything special :) – Ruslan – 10 years ago

@Ruslan: that's OK, Stroustrup's code doesn't do anything special with the html either. He just regexes it for urls, which doesn't require lynx :-) – Steve Jessop – 10 years ago

@SteveJessop by "special" I mean actually parsing or regexing or whatever. With lynx I just grep out the list of links (which curl and wget don't list) and remove the numbering. You may consider it cheating or whatever, but I thought it's fun to {use the tool which almost perfectly does what is required}, just fine-tuning the output. – Ruslan – 10 years ago

@Ruslan: possibly my snark at Stroustrup's code wasn't strong enough! He doesn't extract links (like lynx does), he extracts URLs. Compared with his, your code "misses" any URLs in the text that aren't links. Which normally would be a feature in your code and a bug in his, but he set the problem so maybe not :-) Not that there are any such urls on that page at the moment. – Steve Jessop – 10 years ago

Yeah, I know, this is not code-golf, but using both grep and sed is overkill. lynx -dump http://www.stroustrup.com/C++.html | grep -o '\w*://.*' – manatwork – 10 years ago

@manatwork thanks, this is much nicer. I indeed wanted to use one filter, but didn't know of -o option. Edited to include this improvement. – Ruslan – 10 years ago

"but no third-party libraries are allowed". I contend that lynx is functionally equivalent to a third-party library in this scenario. – Digital Trauma – 10 years ago

43

CSS 3

* {
  margin: 0;
  padding: 0;
}
*:not(a) {
  font: 0/0 monospace;
  color: transparent;
  background: transparent !important;
}
a {
  content: "";
}
a[href*="://"]::after {
  content: attr(href);
  float: left;
  clear: left;
  display: block;
  font: 12px monospace;
  color: black;
}

This code can be used as a user style to display only absolute links on a page as an unformatted list. It may not work correctly if your browser enforces a minimum font size.

It works correctly with http://www.stroustrup.com/C++.html (note !important on background). In order to work on other pages with more styles, it must be extended (reset more properties, mark properties as important etc.).

Alternative version which includes relative links except intrapage links starting with hashes (it relies on a hard-coded absolute link, unfortunately):

* {
  margin: 0;
  padding: 0;
}
*:not(a) {
  font: 0/0 monospace;
  color: transparent;
  background: transparent !important;
  float: none !important;
  width: auto !important;
  border: none !important;
}
a {
  content: "";
}
a::after {
  display: none;
}
a:not([href^="#"])::after {
  content: attr(href);
  float: left;
  clear: left;
  display: block;
  font: 12px monospace;
  color: black;
}
a:not([href*="://"])::after {
  content: "http://www.stroustrup.com/" attr(href);
}

Athari

Posted 10 years ago

Reputation: 2 319

This is the worst thing I've ever seen. +1 – Emmett R. – 10 years ago

This is beautiful and completely horrifying. +1 – ricdesi – 9 years ago

36

Clojure

(->> (slurp "http://www.stroustrup.com")
     (re-seq #"(?:http://)?www(?:[./#\+-]\w*)+"))

Adam

Posted 10 years ago

Reputation: 361

Slurp?! I need to learn Clojure. – 11684 – 10 years ago

@11684 - Clojure also has standard functions named spit, zipper, and lazy-cat... :-) – Bob Jarvis - Reinstate Monica – 10 years ago

Wow, I think that's gonna be a late New Year's Resolution. @BobJarvis – 11684 – 10 years ago

30

Emacs Lisp

(with-current-buffer (url-retrieve-synchronously "http://www.stroustrup.com/C++.html")
  (while (re-search-forward "https?://[^\\\"]*")
    (print (match-string 0))))

Jordon Biondo

Posted 10 years ago

Reputation: 1 030

I'm a little disappointed, given how compact and eminently readable this code is, that it does not have more votes. Well done. – Spacemoose – 10 years ago

28

Scala

"""\"(https?://.*?)\"""".r.findAllIn(scala.io.Source.fromURL("http://www.stroustrup.com/C++.html").mkString).foreach(println)

David Xu

Posted 10 years ago

Reputation: 907

pack everything in one line - C++ can do it too – quetzalcoatl – 10 years ago

What about ftp://ftp.research.att.com/pub/c++std/WP/CD2? – Tobias Kienzler – 10 years ago

@quetzalcoatl - This is one expression, not just one line. You can just delete all of the line breaks from the C++ code, but that's not the same thing as doing the whole task in a single expression. – DaoWen – 10 years ago

@DaoWen: Sorry, but the expressions-vs-lines argument is just getting silly. Add some functors and C++ can do it too. But that's just the question of which libs are considered to be "granted" and have "zero code inside". It doesn't change the fact that packing it into one line hurts readability. One can keep it as a single expression and just reformat it into a few lines to gain much and lose nothing other than .. line count. That's my point. Silly packing - C++ can do it too. If someone wants to get out of the "silly packing" box, then they should format the code for readability, not line count. – quetzalcoatl – 10 years ago

But, here's my sin there, I should have written my point clearly right away, and not try to just drop a punchline. Every day everyone learns a thing, thanks for making me think about that, DaoWen! – quetzalcoatl – 10 years ago

@TobiasKienzler: link seems broken, could you recheck? – quetzalcoatl – 10 years ago

@quetzalcoatl Tobias didn't put the link there for us to follow it. He was asking the writer of this answer why it wasn't in his results. – JLRishe – 10 years ago

You can't pack everything on one line in C++, if you have any #include, etc, you can't. – David Xu – 10 years ago

I tried compiling a one-liner C++ program.. prog.cpp:1:21: warning: extra tokens at end of #include directive #include <iostream> using namespace std; int main() { return 0; } ^ /usr/lib/gcc/i586-linux-gnu/4.9/../../../i386-linux-gnu/crt1.o: In function _start': (.text+0x18): undefined reference tomain' collect2: error: ld returned 1 exit status – David Xu – 10 years ago

25

PHP 5

<?php
preg_match_all('/"(https?:\/\/.*?)"/',file_get_contents('http://www.stroustrup.com/C++.html'),$m);
print_r($m[1]);

David Xu

Posted 10 years ago

Reputation: 907

Suggested edits: '/"((http)s?://.*?)"/' → '|"((http)s?://.*?)"|' (currently an error); remove array_unshift($m); (currently an error, you likely meant array_shift instead); print_r($m); → print_r($m[1]); (only output the urls). – primo – 10 years ago

fixed, thanks for your input – David Xu – 10 years ago

@DavidXu Except you didn't fix it...? – Shahar – 10 years ago

Now it's fixed! – David Xu – 10 years ago

25

PowerShell

Text search for all fully-qualified URLs (including JavaScript, CSS, etc.):

[string[]][regex]::Matches((iwr "http://www.stroustrup.com/C++.html"), '\w+://[^"]+')

Or to get links in anchor tags only (includes relative URLs):

(iwr "http://www.stroustrup.com/C++.html").Links | %{ $_.href }

Shorter versions from comments:

(iwr "http://www.stroustrup.com/C++.html").Links.href
(iwr "http://www.stroustrup.com/C++.html").Links.href -match ":"

Justin Dunlap

Posted 10 years ago

Reputation: 421

If anyone wonders, iwr is an alias for Invoke-WebRequest (PS3+). – Athari – 10 years ago

You could abuse PowerShell's eagerness to flatten collections and do: (iwr "http://www.stroustrup.com/C++.html").Links.href (or (iwr "http://www.stroustrup.com/C++.html").Links.href -match ":" for only absolute URIs) – Mathias R. Jessen – 10 years ago

That's pretty handy! – Justin Dunlap – 10 years ago

22

Node.js

var http = require('http');

http.get('http://www.stroustrup.com/C++.html', function (res) {
    var data = '';
    res.on('data', function (d) {
        data += d;
    }).on('end', function () {
        console.log(data.match(/"https?:\/\/.*?"/g));
    }).setEncoding('utf8');
});

c.P.u1

Posted 10 years ago

Reputation: 1 049

I wonder if require('http').get works. If it does then we can ditch the var statement and shorten another line. – Unihedron – 10 years ago

@Unihedro It does. – TimWolla – 10 years ago

@Unihedro It does, but this isn't a golfing contest. – c.P.u1 – 10 years ago

You don’t need to use any capturing groups. – Ry- – 10 years ago

I think it's JavaScript rather than a framework name. – mr5 – 10 years ago

22

D

import std.net.curl, std.stdio;
import std.algorithm, std.regex;

void main() {
foreach(_;byLine("http://www.stroustrup.com/C++.html")
    .map!((a)=>a.matchAll(regex(`<a.*?href="(.*)"`)))
    .filter!("a")){ writeln(_.front[1]); }
}

Kozzi11

Posted 10 years ago

Reputation: 321

To make the list similar to the original example, you could pipe the program's output through | sort | uniq or instead add import std.array and change the line .filter!("a")){ writeln(_.front[1]); } into this: .filter!("a").map!(a => a.front[1]).array.sort.uniq){ writeln(_); }. Note, however, that I have only tried this code and not proved it to be correct or "idiomatic". :) – Frg – 10 years ago

20

Haskell

Some troubles with "\w" in Text.Regex.Posix

import Network.HTTP
import Text.Regex.Posix
pattern = "((http://)?www([./#\\+-][a-zA-Z]*)+)"
site = "http://www.stroustrup.com/C++.html"

main = do
    file <- getResponseBody =<< simpleHTTP (getRequest site)
    let result = getAllTextMatches $ file =~ pattern
    putStr $ unlines result -- looks nicer

vlastachu

Posted 10 years ago

Reputation: 341

Why is the type of result specified explicitly? It should be fully constrained by its use in unlines. – John Dvorak – 10 years ago

This does stretch the rules a bit, seeing as neither Network.HTTP nor Text.Regex.Posix are in the base package. (Though they are in the Haskell Platform, and of course on Hackage, so...) – ceased to turn counterclockwis – 10 years ago

@JanDvorak, I started writing it in ghci (probably I should have posted it unchanged). But your note is relevant, thanks. – vlastachu – 10 years ago

@leftaroundabout, did not know. It looks like I could not have done, if had used the base package. – vlastachu – 10 years ago

network isn't in base either, so save for rolling your own socket bindings there's no practical way to do it with just base. – Lambda Fairy – 10 years ago

20

Ruby

require 'net/http'
result = Net::HTTP.get(URI.parse('http://www.stroustrup.com/C++.html'))
result.scan(/"((http)s?://.*?)"/)

Yahor Zhylinski

Posted 10 years ago

Reputation: 301

Your regex will fail, you need to use %r{"(https?://[^"]+)"}. Also you can use Net::HTTP.get('www.stroustrup.com', '/C++.html') to shorten up request (and keep it readable). So whole code can be in one line (keeping it readable): puts Net::HTTP.get("www.stroustrup.com", "/C++.html").scan(%r{"(https?://[^"]+)"}). Run it with ruby -rnet/http and you don't even need require 'net/http' line. – Hauleth – 10 years ago

18

PHP

As far as I can tell, most modern PHP installations come with DOM processing, so here's one that actually traverses the anchors inside the HTML:

foreach (@DOMDocument::loadHTMLFile('http://stroustrup.com/C++.html')->getElementsByTagName('a') as $a) {
    if (in_array(parse_url($url = $a->getAttribute('href'), PHP_URL_SCHEME), ['http', 'https'], true)) {
        echo $url, PHP_EOL;
    }
}

The inner loop could be shortened to:

preg_match('~^https?://~', $url = $a->getAttribute('href')) && printf("%s\n", $url);

Jack

Posted 10 years ago

Reputation: 311

Actually wanted to come up with this (as my first answer here). You did it first, so here's your +1 (for not using an error-prone regex)! Hint: you could use a lame 1 instead of true for the in_array strict search. You can also omit the brackets. I'm not completely sure, but IIRC you could also drop the http and only leave the :// (go without the scheme). – kaiser – 10 years ago

And: Another possibility would be to drop the if ( ) {} in favor of in_array() and print $url.PHP_EOL. But yeah, you would get another +1 (if I could) for best readability :) – kaiser – 10 years ago

Just tried your example and got an error for strict standards (PHP 5.4). Seems like in the source, there's somewhere a corrupted or wrongly formatted link with a missing semicolon. You could turn off error reporting by using @\DOMDocument. Just tried that and can confirm it works. – kaiser – 10 years ago

Nah, it's the documentation that's wrong; technically you're not supposed to call ::loadHTMLFile() statically, and adding @ only hides that artefact. – Jack – 10 years ago

Yeah, that's the thing I know. I just didn't make myself clear: Both errors (strict and missing semicolon) aren't related. Just do the DOM loop line by line to see for yourself. You can set error_reporting( E_ALL ^ E_STRICT ); and the error still is there. Edit: Here's a Gist to play around with. Just remove the @/error suppression from the loadHTMLFile() call. :) – kaiser – 10 years ago

Although they're not strictly related, the silencing operator "fixes" both issues; also, the target HTML document is crap :P – Jack – 10 years ago

This is definitely one of the most "correct" solutions, one of the only ones I could see in use in production. nice job – Jordon Biondo – 10 years ago

14

Unix Shell

wget -q -O - http://www.stroustrup.com/C++.html | sed -n '/http:/s/.*href="\([^"]*\)".*/\1/p' | sort

Though I have to admit this doesn't work if there's more than one link on a line.
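To illustrate that limitation: sed's s///p substitutes once per line by default, so a line with two anchors yields only one URL. A minimal Python sketch of the multiple-matches-per-line case (the example line and hostnames are made up):

```python
import re

# A single line containing two anchors; a once-per-line substitution
# keeps only the first href, while findall collects every match.
line = '<a href="http://a.example/">a</a> <a href="http://b.example/">b</a>'
links = re.findall(r'href="([^"]*)"', line)
print(links)  # both URLs are extracted
```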

Guntram Blohm supports Monica

Posted 10 years ago

Reputation: 241

curl http://www.stroustrup.com/C++.html saves a few characters. – l0b0 – 10 years ago

"but no third-party libraries are allowed". I guess since wget is GNU (as is bash), you could argue that it is not third-party. But curl definitely is third-party. – Digital Trauma – 10 years ago

What about ftp://ftp.research.att.com/pub/c++std/WP/CD2 and https://www.youtube.com/watch?v=jDqQudbtuqo&feature=youtu.be? – Tobias Kienzler – 10 years ago

@TobiasKienzler I guess Stroustrup's original code doesn't find them either – Ruslan – 10 years ago
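Indeed it doesn't: the regex from the question anchors on a literal www, so a URL without www in it (like that ftp:// link) can never match. A quick Python check of the original pattern:

```python
import re

# Stroustrup's URL pattern from the question.
pat = re.compile(r'(http://)?www([./#\+-]\w*)+')

print(bool(pat.search('http://www.stroustrup.com/C++.html')))            # True
print(bool(pat.search('ftp://ftp.research.att.com/pub/c++std/WP/CD2')))  # False: no "www"
```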

14

Java

import java.net.URL;
import java.util.Scanner;
import java.util.regex.*;
class M{
    public static void main(String[]v)throws Throwable{
        Matcher m = Pattern.compile( "\"((http)s?://.*?)\"" )
            .matcher(
                 new Scanner(
                         new URL( "http://www.stroustrup.com/C++.html" )
                             .openStream(),
                         "UTF-8")
                     .useDelimiter("\\A")
                     .next());
        while(m.find())
            System.out.println(m.group());
    }
}

David Xu

Posted 10 years ago

Reputation: 907

Could you properly format the code in your answers? It isn't a competition for the least readable code. You can format it to avoid horizontal scrollbars at least. – Athari – 10 years ago

If you use a Scanner you can make it processing the regex pattern for links directly and iterate over the Scanner’s results. – Holger – 10 years ago

Yep .. that's java for you. Using it for code golf is a brave undertaking. – javadba – 10 years ago

Never thought I'd see a java solution that's actually shorter than C++! – slebetman – 10 years ago

Good job, but I think it can be made less verbose with some lambdas. – Mister Smith – 10 years ago

Correction to my last comment: I must admit this is pretty much the shortest and cleanest code that can be written in Java. I've tried a SAX parser approach, which could be made even shorter with lambdas, but the web page is not XHTML and the parser throws exceptions. Regex is the only way to go. – Mister Smith – 10 years ago

If you can use JSoup, you can make it nicer still, but I don't know what the rules are around using external libraries for this. – David Conrad – 10 years ago

@javadba Java is not worse than other languages, it only needs to be used correctly. As hinted in my old comment, this answer is abusing the tool which can do the entire job for just getting the entire page as string, followed by unnecessary manual regex matching. With recent Java versions you can do it like: new Scanner(new URL("http://www.stroustrup.com/C++.html").openStream(), "UTF-8") .findAll("\"(https?://.*?)\"").forEach(m -> System.out.println(m.group(1))); though real code would use try(…) for cleanup… – Holger – 6 years ago

11

Groovy

"http://www.stroustrup.com/C++.html".toURL().text.findAll(/https?:\/\/[^"]+/).each{println it}

cfrick

Posted 10 years ago

Reputation: 313

Could be improved by using ?. operator to avoid NPEs? – Chris K – 10 years ago

@ChrisKaminski and be the first (beside Bjarne) around here to check for errors? never! beside that: i only see IO related exceptions here. where do you see a NPE? – cfrick – 10 years ago

findAll() could return null, no? Or will it return an empty list? Still a bit new to Groovy. EDIT: nm, looks like findAll() returns an empty list. Those Groovy guys were so smart. :-) – Chris K – 10 years ago

11

SQL (SQL Anywhere 16)

Define a stored procedure to fetch the web page

CREATE OR REPLACE PROCEDURE CPPWebPage()
URL 'http://www.stroustrup.com/C++.html'
TYPE 'HTTP';

Produce the result set using a single query

SELECT REGEXP_SUBSTR(Value,'"https?://[^""]+"',1,row_num) AS Link  
FROM (SELECT Value FROM CPPWebPage() WITH (Attribute LONG VARCHAR, Value LONG VARCHAR) 
      WHERE Attribute = 'Body') WebPage, 
      sa_rowgenerator( 1, 256 ) 
WHERE Link IS NOT NULL;

Limitations: This produces up to 256 links. If more links exist, then bump up the 256 to an appropriate value.

Jack at SAP Canada

Posted 10 years ago

Reputation: 111

I didn't believe there would be golf in SQL... until now. – None – 10 years ago

I get it ... "links". :-) – Jack at SAP Canada – 10 years ago

10

CoffeeScript / NodeJS

require('http').get 'http://www.stroustrup.com/C++.html', (r) ->
    dt = '';
    r.on 'data', (d) -> dt += d
    r.on 'end' , (d) -> console.log dt.match /"((http)s?:\/\/.*?)"/g

RobAu

Posted 10 years ago

Reputation: 641

I guess this is CoffeeScript/Node? I guess you should specify that... – John Dvorak – 10 years ago

Wow. That's very readable. – slebetman – 10 years ago

@slebetman it definitely is small though – John Dvorak – 10 years ago

@slebetman Yeah CoffeeScript is so much more readable than JavaScript :) I was glad to get rid of all the curly braces }:) – RobAu – 10 years ago

9

Perl

use LWP;
use feature 'say';

my $agent = new LWP::UserAgent();
my $response = $agent->get('http://www.stroustrup.com/C++.html');

say for $response->content =~ m<"(https?://.+?)">g;

primo

Posted 10 years ago

Reputation: 30 891

The code would be more clear if you avoided the field-separator and record-separator variables and just did: print map { "$_\n" } $response->content =~ m<"(https?://.+?)">g; – Daniel Ruoso – 10 years ago

@DanielRuoso agreed. – primo – 10 years ago

or even use v5.10; and say for $response->content... – Mark Reed – 10 years ago

To each his own, I suppose. Some of the backported perl6 features have been problematic (smart matching, I'm looking at you), but say is quite useful, and in my mind clearer here. (Also, there have been rather a lot of completely-unrelated-to-perl6ism improvements to perl5 in the last 13 years; it might be worth checking out.) – Mark Reed – 10 years ago

@MarkReed I agree that say is probably more readable in this case, particularly for those less familiar with perl. – primo – 10 years ago

9

R

html<-paste(readLines("http://www.stroustrup.com/C++.html"),collapse="\n")
regmatches(html,gregexpr("http[^([:blank:]|\\\"|<|&|#\n\r)]+",html))

...although R is written mainly in C... so there are probably a few lines of C code behind those 2 lines of R code.

Rusan Kax

Posted 10 years ago

Reputation: 191

That (or something similar) is true for pretty much all the answers here. – JLRishe – 10 years ago

8

Objective-C

NSString *s;
for (id m in [[NSRegularExpression regularExpressionWithPattern:@"\"((http)s?://.*?)\"" options:0 error:nil] matchesInString:(s=[NSString stringWithContentsOfURL:[NSURL URLWithString:@"http://www.stroustrup.com/C++.html"]])]){
    NSLog(@"%@",[s substringWithRange:[m range]]);
}

David Xu

Posted 10 years ago

Reputation: 907

What? Please write the Swift version. That square bracket nonsense is hurting my eyes :) – Mister Smith – 10 years ago

Hurray for []! Also, we should totally add a Smalltalk version ;) – Bersaelor – 10 years ago

@MisterSmith Swift answer now available here. – JAL – 9 years ago

7

Go

package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "os"
    "regexp"
)

func main() {
    resp, err := http.Get("http://www.stroustrup.com/C++.html")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    defer resp.Body.Close()
    data, _ := ioutil.ReadAll(resp.Body)
    results := regexp.MustCompile(`https?://[^""]+`).FindAll(data, -1)
    for _, row := range results {
        fmt.Println(string(row))
    }
}

P.S. this code reads entire source into memory, so consider using regexp.FindReaderIndex to search in stream, that'll make the app bulletproof.

Maxim Kupriianov

Posted 10 years ago

Reputation: 71

7

Tcl

package require http
set html [http::data [http::geturl http://www.stroustrup.com/C++.html]]
puts [join [regexp -inline -all {(?:http://)?www(?:[./#\+-]\w*)+} $html] \n]

Damkerng T.

Posted 10 years ago

Reputation: 171

You can get away by doing http::data inside the puts. No need to create a temporary variable. And I'd also format it by putting newlines and indenting at every [. But that's a style choice. – slebetman – 10 years ago

6

CJam

CJam does not have regex so I had to use a different approach in this one:

"http://www.stroustrup.com/C++.html"g''/'"*'"/(;2%{_"http://"#!\"https://"#!e|},N*

I first convert all ' to ", then I split on all ", take every alternate string, and finally filter that list for strings starting with http:// or https://. After that, simply print each filtered string on a new line.
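For readers who don't speak CJam, the same quote-splitting trick can be sketched in Python (illustration only; the sample HTML and hostnames are made up):

```python
# Normalize single quotes to double quotes, split on '"', and keep the
# odd-indexed pieces: those are exactly the strings that were quoted.
html = '<a href=\'http://a.example\'>x</a> <img src="y.png"> <a href="https://b.example">z</a>'
quoted = html.replace("'", '"').split('"')[1::2]
links = [s for s in quoted if s.startswith(('http://', 'https://'))]
print('\n'.join(links))
```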

Try it using the Java interpreter like

java -jar cjam-0.6.2.jar file.cjam

where file.cjam has the contents of the code above.

Optimizer

Posted 10 years ago

Reputation: 25 836

Don't know about the readable part... didn't know CJam has web functionality – Def – 10 years ago

If you want to golf it... ''/'"f/:+ for ''/'"*'"/'"f/0f=. – jimmy23013 – 10 years ago

...wait why is '"f/0f= there? Is that supposed to do something (2% for instance)? – jimmy23013 – 10 years ago

6

F#

This code could be far shorter, but I would write something like this if I ever expected to have to read or use it again, so it has many unnecessary type annotations. It demonstrates the use of an active pattern, MatchValue, to enable pattern matching against the standard CLR type Match.

open System.Net
open System.Text.RegularExpressions

let (|MatchValue|) (reMatch: Match) : string = reMatch.Value

let getHtml (uri : string) : string = 
    use webClient = WebClient() in
        let html : string = webClient.DownloadString(uri)
        html

let getLinks (uri : string) : string list =
    let html : string = getHtml uri
    let matches : MatchCollection = Regex.Matches(html, @"https?://[^""]+") 
    let links = [ for MatchValue reMatch in matches do yield reMatch ]
    links

let links = getLinks "http://www.stroustrup.com/C++.html" 
for link in links do
    Console.WriteLine(link)

Edit I made getLinks its own function

SourceSimian

Posted 10 years ago

Reputation: 61

I really like how you used type annotations. I think naming values to describe what you return is ok, but name of the function is expressive enough: getHTML and html value, getLinks and links value. Last two lines may be links |> Seq.iter (printfn "%s") – MichalMa – 10 years ago

@MichalMa I agree that the name of the function is expressive enough on its own, the html and links variables are there for pragmatic reasons: so there is somewhere to set a breakpoint. I used the for loop instead of List.iter just because I like the way it reads more, although in a repl I probably would have used List.iter. – SourceSimian – 10 years ago

6

Rebol

parse read http://www.stroustrup.com/C++.html [
    any [
        thru {<a href="} copy link to {"} (print to-string link)
    ]
]

draegtun

Posted 10 years ago

Reputation: 1 592

5

Delphi

program Links;
{$APPTYPE CONSOLE}
{$R *.res}
uses
  System.SysUtils, idHTTP, RegularExpressions;
var
  client: TidHTTP;
  match : TMatch;
begin
  client := TidHTTP.Create;
  try
    match := TRegEx.Create('<a(.*)href="(.*)">(.*)<\/a>', [roIgnoreCase, roMultiline])
      .match(client.Get('http://www.stroustrup.com/C++.html'));
    with match do
      while Success do
      begin
        if Groups.Count >= 3 then
          if copy(lowercase(Groups[2].Value), 1, 4) = 'http' then
              writeln(Groups[2].Value);
        match := NextMatch;
      end;
  finally
    client.Free;
  end;
end.

This should work in (I suppose) Delphi XE and later. It requires no components other than those already installed in a default setup (namely Indy and regular expressions). Even though Delphi is quite close to C++ in general structure, I guess this task turned out to be a bit shorter.

Tuncay Göncüoğlu

Posted 10 years ago

Reputation: 151

4

JS/jQuery

$.get('http://www.stroustrup.com/C++.html',{},function(s){
    $(s).find('a').each(function() {
        console.log($(this).attr('href'))
    })
})

David Xu

Posted 10 years ago

Reputation: 907

jQuery is a third-party library. You should use plain JavaScript. – Athari – 10 years ago

@Athari if Mr. Stroustrup allows Boost for C++, jQuery should be fine for JS :P – Nick T – 10 years ago

This will only work if executed from the stroustrup domain since cross domain XHR requests need to be handled specifically by the server. – pllee – 10 years ago

This answer has examples of jQuery making things both easier and harder. The XHR is drastically simpler, and getting the href attribute out of the DOM object is needlessly more complicated. – Seth Battin – 10 years ago

@SethBattin Could you show us a simpler approach to getting the href attributes? – JLRishe – 10 years ago

@JLRishe sure. this.href – Seth Battin – 10 years ago

@SethBattin Oh, you just meant that one little bit. Well, I wouldn't call that needlessly more complicated. Sacrificing a few extra characters for consistency isn't the end of the world. – JLRishe – 10 years ago

@JLRishe I'm not bothered by the few characters; it's still perfectly readable and short. But it's also generating a complex jquery object to wrap around an already complex DOM object, and then accessing a property through a string lookup. It could get there in one step rather than...lots of steps. Consistency is great, but that kind of line makes me think it was applied by reflex. Everything gets wrapped in $(...) all the time, whether it needs it or not, i.e. jQuery because jQuery. That's the aspect I don't like. – Seth Battin – 10 years ago

4

Ruby

Another Ruby solution:

require 'open-uri'
open('http://www.stroustrup.com/C++.html', 'r:iso-8859-1:utf-8') do |f|
  puts f.read.scan(%r{"(https?://www[^"]*)"}).sort
end

Mark Reed

Posted 10 years ago

Reputation: 667

4

Scala

object Downloader extends App {
    val s = io.Source.fromURL("http://www.stroustrup.com/C++.html", "iso-8859-1").mkString // load URL to String
    val regex = """((http://)?www([./#\+-]\w*)+)""".r                                      // create and compile regexp
    println(regex.findAllIn(s).mkString("\n"))                                             // print matches
}

v6ak

Posted 10 years ago

Reputation: 141

4

F#

open System.Net
open System.Text.RegularExpressions

let html = (new WebClient()).DownloadString("http://www.stroustrup.com/C++.html") in
    Regex.Matches(html, @"https?://[^""]+") |> Seq.cast<Match> |> Seq.iter (printfn "%A")

Taken from the C# version.

Paulo Pinto

Posted 10 years ago

Reputation: 141

4

SmallTalk (Pharo 3)

Hurray for []! Also, we should totally add a Smalltalk version ;)

@Bersaelor at Objective-C answer.

I know the basics of Smalltalk - the syntax of the language and some tutorials. I decided it's a good place for practice. I have already installed Pharo 3.0.

but no third-party libraries are allowed

Okay. In the downloaded image I found the Zinc-HTTP and Regex packages. Probably I should read about what "third-party" means.

So, code:

(ZnClient new get: 'http://www.stroustrup.com/C++.html')
   regex: '((http\://)?www([./#+-]\w*)+)' matchesDo: [ :x | Transcript show: String cr, x ].

Code with exception handlings:

|response|
[  
response := (ZnClient new url: 'http://www.stroustrup.com/C++.html'; get; response).
response isSuccess 
  ifFalse: [ Transcript show: 'Bad status : ', (response status asString) ]
  ifTrue:  [ response contents regex: '((http\://)?www([./#+-]\w*)+)' matchesDo: 
           [ :x | Transcript show: String cr, x ]]
]
    on: NameLookupFailure do: [ Transcript show: 'Connection problem...' ].

If you try to run this yourself you will get an exception, something like UTF8EncoderException: errorIllegalLeadingByte. At first I thought the package was pretty outdated, but then I realized that it downloads other sites fine. Then I realized it does not always cope well with UTF-8. While debugging I got byte = 150 (1001 0110), which is invalid as the first byte of a UTF-8 sequence. I spent some time localizing the error (moved up the call stack and found the parsed line). So:

Ehh

Lang.Next'14 Keynote: What � if anything � have we learned from C++? A 68 minute talk incl. Q&A.

You should see squares or diamonds with question marks here, depending on your browser. A trap from Stroustrup. Ideally I should have written an exception handler in the package at the point where the line is created, but I just commented out the exception call and returned a ? character.

remove exception call

Also, in the code which calculates the sequence length, I return 1 in place of the exception call (it seems browsers do the same).
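The byte in question is easy to reproduce in Python: 0x96 (150) has the 10xxxxxx bit pattern of a UTF-8 continuation byte, so it can never begin a sequence; a strict decoder raises and a lenient one substitutes U+FFFD. (The sample string is an assumption: it stands in for the page's en dashes, 0x96 being the Windows-1252 en dash.)

```python
raw = b'What \x96 if anything \x96'  # 0x96 = Windows-1252 en dash, illegal UTF-8 lead byte

try:
    raw.decode('utf-8')  # strict decoding rejects the stray byte
except UnicodeDecodeError as e:
    print('strict decode fails:', e.reason)

print(raw.decode('utf-8', errors='replace'))  # each bad byte becomes U+FFFD
```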

P.S. some notes about Pharo (if anyone interested):

  • Too much GUI
  • Friendly for beginners.
  • Nice package searching
  • Nice highlighting and code autocompletion (not nice enough for a serious IDE like IDEA, but better than what I saw a few years ago in the same Pharo). Autocompletion sometimes tries to deduce the type - a very thankless job.
  • Very chaotic GUI. In an IDE we are used to a tiled window system, but here there are only floating little windows.
  • Where are imports and namespaces? There are many entities already. Do collisions never occur?

vlastachu

Posted 10 years ago

Reputation: 341

4

Vimscript

function! Cpp()
    " grab the page in a new buffer in a new window in a new tab page
    tabedit http://www.stroustrup.com/C++.html

    " delete all lines that don't contain at least one 'http' hyperlink
    v/"http/d

    " only keep the hyperlink on every line
    %s/^.\+="\([^"]\+\)".\+$/\1
endfunction

romainl

Posted 10 years ago

Reputation: 141

3

Rust

Here is Rust solution:

extern crate reqwest;
extern crate select;
extern crate regex;

use select::document::Document;
use select::predicate::Name;
use regex::Regex;

fn main() {
    scrape_links("http://www.stroustrup.com/C++.html");
}

fn scrape_links(url: &str) {
    let resp = reqwest::get(url).unwrap();
    assert!(resp.status().is_success());
    let re = Regex::new(r"((http://)?www([./#\+-]\w*)+)").unwrap();
    Document::from_read(resp)
        .unwrap()
        .find(Name("a"))
        .filter_map(|n| n.attr("href"))
        .filter(|text| re.is_match(text))
        .for_each(|x| println!("{}", x));
}

Anton Pavlov

Posted 10 years ago

Reputation: 31

3

MATLAB

It's quite straightforward with urlread and regexp:

url = 'http://www.stroustrup.com/C++.html';
links = regexp(urlread(url), '<a href="http://([^"]*\.*)">', 'tokens');

James152

Posted 10 years ago

Reputation: 31

3

Python 2

I don't like using regex on HTML for established reasons, so here's an ungolfed HTMLParser approach:

from HTMLParser import HTMLParser
import urllib2 as u

class LinkFinder(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag=='a':
            attrdct = dict(attrs)  # attrs is a list of ('key', 'value') tuples
            if 'href' in attrdct:
                href = attrdct['href']
                print href

url = "http://www.stroustrup.com/C++.html"
contents = u.urlopen(url).read()
LinkFinder().feed(contents)

Note this also gives local links such as index.html and anchors à la #learning. If you only want absolute links, replace print href with

                if ':/' in href:  # also handles ftp, https etc.
                    print href

while for only omitting the anchors, use

                if not href.startswith('#'):
                    print href

Tobias Kienzler

Posted 10 years ago

Reputation: 179

3

Lua

Here's a Lua solution complete with error checking and duplicate URL elimination, like Stroustrup's C++ version.

Just made it in under 10 lines

local http, urlunique = require 'socket.http', {}

local body, resp, _, respmsg = http.request "http://www.stroustrup.com/C++.html"
assert(resp == 200, respmsg or resp)
for each in body:gmatch 'https?://[^%s<>"]+' do
  if not urlunique[each] then
    urlunique[each] = true
    print(each)
    end end

Here's another version using string.gsub fitting in just 6 lines!

local http, urlunique = require 'socket.http', {}

local body, resp, _, respmsg = http.request "http://www.stroustrup.com/C++.html"
assert(resp == 200, respmsg or resp)
body:gsub ('https?://[^%s<>"]+', function(r) urlunique[r] = true end)
for url in pairs(urlunique) do print(url) end

greatwolf

Posted 10 years ago

Reputation: 129

3

VBScript in Windows Script Host

That is, if this is stored in a file named links.vbs, run it via cscript /nologo links.vbs.

sub writeline( s ): WScript.StdOut.WriteLine s : end sub
function re( s ): set re = new RegExp: re.pattern = s: re.global = true: end function

set http = createobject( "Msxml2.XMLHTTP" )
http.open "GET", "http://www.stroustrup.com/C++.html", false: http.send
set links = re( "\w+://[^\""]+" ).execute( http.responseText )
for each link in links: writeline( link ): next

Addendum:

While the above lists all full links, which seems to be the goal, Stroustrup’s code additionally pares it down to unique links, and here’s a version that does that:

sub writeline( s ): WScript.StdOut.WriteLine s : end sub
function re( s ): set re = new RegExp: re.pattern = s: re.global = true: end function

set http = createobject( "Msxml2.XMLHTTP" )
http.open "GET", "http://www.stroustrup.com/C++.html", false: http.send
set links = re( "\w+://[^\""]+" ).execute( http.responseText )

set unique_links = createobject( "Scripting.Dictionary" )
on error resume next
for each link in links: unique_links.add ucase(link), link & "": next
for each link in unique_links.items(): writeline( link ): next

This reduces the number of output lines from 81 to 77.

Cheers and hth. - Alf

Posted 10 years ago

Reputation: 131

2

R

I'm super new to regex so I gave this my best shot... any improvements appreciated!

grep("(http)s?://.*?", readLines("http://www.stroustrup.com/C++.html"), value = T)

readLines() just dumps the HTML source into a character vector. I then used grep() to find the URLs. The problem I ran into was that HTML element tags as well as corresponding link text were included in the output. substring() could be used to trim some of them I guess but it wouldn't work in all cases. If anyone knows a better way please let me know - especially if I could use a better regex.

syntonicC

Posted 10 years ago

Reputation: 329

regex is not only not mandatory, it is actually a rather bad method for parsing HTML. Anyway, your code is obviously much more concise than the C++ one :) – Tobias Kienzler – 10 years ago

2

Python 2.7

Some readable Python code.

import urllib2
import re
page = urllib2.urlopen("http://www.stroustrup.com/C++.html").read()
for link in re.findall('"(http[s]?://.*?)"', page):
    print link

Logic Knight

Posted 10 years ago

Reputation: 6 622

2

(C++)--

aka C

I'm surprised no one has done C yet. The code is nice and clean.

linkfetch.c:

#include "inet_utils.h"
main(){
  char* SITE="www.stroustrup.com";
  char* PAGE="C++.html";
  char* REGX="((http://)?www([./#\\+-]\\w*)+)";
  int s = connect_to(SITE);
  FILE* f = fetch_page(s,SITE,PAGE);
  if (f) list_matches(f,REGX);
  else return printf("Can't fetch page %s/%s\n",SITE,PAGE);
  fclose(f); close(s);
}

That is, clean assuming you also write these simple utilities:

inet_utils.h

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <netdb.h>
#include <regex.h>

#define SZ 1024 //good default buffer size
#define ERROR(s) (puts(s)&&0)
#define VP(e) ((void*)(long)(e))

inline int connect_to(char* w){
  //:returns socket connected to website w.
  struct sockaddr_in a;
  int s=socket(AF_INET, SOCK_STREAM, 0);                       // make socket
  struct hostent* h = gethostbyname(w);                        // lookup host
  if (!h) return ERROR("No Such Host");                        // check err
  a.sin_family=AF_INET;                                        // set ip address
  a.sin_port=htons(80);                                        // port 80
  memcpy(&a.sin_addr.s_addr, h->h_addr, h->h_length);          // of host
  if (connect(s,(struct sockaddr*)&a,sizeof(a))<0)             // connect
    return ERROR("can't connect");                             // handle error
  return s;                                                    // return socket
}

inline FILE* fetch_page(int s, char*w, char* p){
  //:returns open file handle for page p from site w connected to socket s

  FILE*f=fopen("/tmp/wcache","w+");                   // create cache file
  size_t n; char*b=malloc(SZ);                        // allocate temp buffer
  if (!s||!f||!b) return VP(ERROR("Resource Error")); // check for errors
  sprintf(b,                                          // compose request
          "GET / HTTP/1.0\r\nHost:%s\r\nAccept:*/*\r\nConnection:close\r\n\r\n",
          w);
  send(s,b,strlen(b),0);                         // send request
  while ((n=recv(s,b,SZ,0))>0)                   // receive response
    fwrite(b,1,n,f);                             // write it to cache file
  fseek(f,n=0,SEEK_SET);                         // read from beginning
  fgets(b,SZ,f);                                 // look at first line
  if (!f||strncmp(strtok(b," "),"HTTP/",5))      // is it http?
    return VP(ERROR("Invalid Response"));        // error if not
  if (atoi(strtok(0," "))!=200)                  // check good status code
    return VP(ERROR("Bad Status Code"));         // error if not
  while (getline(&b,&n,f)>=0 && *b!='\r');       // skip headers upto blank line
  free(b);                                       // cleanup
  return f;                                      // return open handle
}

inline void list_matches(FILE* f, char* regx){
  //prints all strings from f which match regx
  regex_t r;
  size_t n = SZ; char*b=malloc(n);               // temp buffer
  if (regcomp(&r,regx,REG_NOSUB|REG_EXTENDED))   // compile regex
    puts("invalid regex");                       // handle error
  else while (getline(&b,&n,f)>0)                // fetch line
         if (!regexec(&r,b,0,0,0))               // check match
           puts(b);                              // show match
  regfree(&r); free(b);                          // cleanup
}

It compiles without warnings on out-of-the-box gcc (version 4.6.3, Ubuntu/Linaro 4.6.3-1ubuntu5).

AShelly

Posted 10 years ago

Reputation: 4 281

Whether the code is readable is highly questionable, considering the weird formatting. Comments are nice, but how about properly formatting the code? You don't normally put multiple declarations with assignments, if and return on a single line, do you? – Athari – 10 years ago

Clean and readable special purpose C usually comes by hiding ugly general purpose APIs and memory management details in "here be dragons" wrapper functions. I could have prettified the 2nd part, but it still wouldn't be nice. So I went for condensed instead. – AShelly – 10 years ago

This is ridiculous. Tell Linus about C code being unreadable by design. Properly formatted code is much easier to understand, there's nothing "magic" in your mess besides crazy formatting. http://pastebin.com/WHrqJzXb

– Athari – 10 years ago

Fine, here's some formatting. I didn't say C is unreadable by design, I said general purpose APIs are usually less readable than special purpose ones. And I point to the gymnastics needed to populate the address for connecting to a web host as an example. – AShelly – 10 years ago

2

Swift (2.2)

What? Please write the Swift version. That square bracket nonsense is hurting my eyes :)

Mister Smith on the Objective-C answer.

let x = try!NSString(contentsOfURL:NSURL(string:"http://www.stroustrup.com/C++.html")!,encoding:4)
for y in try!NSRegularExpression(pattern:"\"((http)s?://.*?)\"",options:[]).matchesInString(x as String,options:[],range:NSMakeRange(0,x.length)){print(x.substringWithRange(y.range))}

Ungolfed:

let url = NSURL(string: "http://www.stroustrup.com/C++.html")!
let html = try! NSString(contentsOfURL: url, encoding: NSUTF8StringEncoding)
let regex = try! NSRegularExpression(pattern: "\"((http)s?://.*?)\"", options: [])
let results = regex.matchesInString(html as String, options: [], range: NSMakeRange(0, html.length))
for result in results {
    print(html.substringWithRange(result.range))
}

Assumes Foundation has been implicitly imported.

I can't access the original page for some reason, so this was tested with the Google Cached version: http://webcache.googleusercontent.com/search?q=cache:USk4BseSofcJ:www.stroustrup.com/C%2B%2B.html+&cd=1&hl=en&ct=clnk&gl=us

282 bytes. Slightly shorter than the 292 byte Objective-C answer. I'm falling back onto Foundation APIs, so there may be room for improvement by using pure Swift types. The Cocoa APIs have also changed since the Objective-C answer was posted.

stringWithContentsOfURL: has been deprecated on NSString in favor of stringWithContentsOfURL:encoding:error:. We lose some bytes on the encoding parameter, but gain some back because the ErrorPointer is no longer passed in with Swift. The function now throws its NSError so instead, I'm using try! to force the execution of the NSString and NSRegularExpression initializers. I also save some bytes by passing the raw value 4 as the value of the encoding parameter instead of the constant NSUTF8StringEncoding. 19 bytes saved. But I lose some bytes by having to pass in an empty array ([]) instead of 0 to represent no options. 2 bytes lost there. I also lose two bytes for every variable declaration since Swift requires whitespace characters on either side of the = character.

I lose 10 bytes by having to cast the NSString as a Swift String when calling matchesInString. This is required because I'm using the NSString method contentsOfURL to get the web page HTML, but the NSRegularExpression method matchesInString takes in a Swift String as a parameter. The implicit conversion between NSString and String isn't available here, so I am forced to use as to explicitly convert the types.

Interestingly enough matchesInString has not been completely converted to use Swift types. It still requires its range parameter to be an NSRange struct instead of a Swift Range<String>. I have to fall back and use NSMakeRange to create the range of the string. I could save 4 bytes by using x.characters.indices of type Range<String.CharacterView.Index> instead, but Swift Range structs are not compatible with Foundation NSRange structs. Additionally, if x were a Swift String, I might be able to save a few bytes by replacing substringWithRange with a subscript on String. I haven't found a great way to do that yet, as creating two Index structs is currently longer than using substringWithRange.

JAL

Posted 10 years ago

Reputation: 304

Welcome to Programming Puzzles and Code Golf. Great answer, well explained, I'm not sure you need any help at all. Quick tip though: the convention on this site is to put your byte count in the title, just after the language name. – wizzwizz4 – 9 years ago

Oh, I've just noticed: This isn't a [tag:code-golf]: its objective is to "show Stroustrup what small and readable code is". If you could provide an ungolfed (readable) version, that would be very helpful. – wizzwizz4 – 9 years ago

@wizzwizz4 Thank you for your comments. I didn't see the byte count next to the language name in the other answers, so I just added mine below. I've also added an ungolfed version to my answer. – JAL – 9 years ago

The reason you didn't see a byte count is because this is not a [tag:code-golf] challenge. – wizzwizz4 – 9 years ago

@wizzwizz4 Ah, got it. Still learning. Thank you! – JAL – 9 years ago

Everybody is, even me. Even Dennis! – wizzwizz4 – 9 years ago

@wizzwizz4 also good to see that you're active on retrocomputing! See you around. – JAL – 9 years ago

1

ColdFusion

(using the same regex that Stroustrup uses)

<cfhttp url="http://www.stroustrup.com/C++.html" result="response" />
<cfif response.statusCode does not contain "200">
    <cfset writeOutput("Error getting the page: #response.statusCode#") />
<cfelse>
    <cftry>
        <cfset htmlLinks = REMatchNoCase("((http://)?www([./#\+-]\w*)+)",response.fileContent) />
        <cfdump var="#htmlLinks#" />
    <cfcatch>
        <cfset writeOutput("There was a problem: #cfcatch.message# #cfcatch.detail#") />
    </cfcatch>
    </cftry>
</cfelse>

Matt Gutting

Posted 10 years ago

Reputation: 121

@Athari how do you do the code highlighting? – Matt Gutting – 10 years ago

See Markdown help - Syntax highlighting for code. The syntax is usually <!-- language: lang-$LANGUAGE_NAME$ --> before the code block. If it doesn't work (support for many languages is missing), I look what CSS class is applied on StackOverflow to code blocks in questions for that language.

– Athari – 10 years ago

1

XQuery

HTML should be something similar to XML, so why not use languages designed for this job?

If the page had been "real" XHTML, we could run a query as beautiful as

doc("http://www.stroustrup.com/C++.html")//a/@href/data()

As this is crappy, broken HTML, let's use the BaseX-specific HTML parser (BaseX is an XQuery implementation):

html:parse(fetch:binary("http://www.stroustrup.com/C++.html"))//a/@href/data()

If limiting to URLs starting with http: is a must, let's do it:

html:parse(fetch:binary("http://www.stroustrup.com/C++.html"))//a/@href[starts-with(., 'http:')]/data()

Disclaimer: I am somewhat affiliated with the BaseX team as I wrote some code during my thesis. This would've been the tool of my choice for that kind of task, anyway. Other XQuery implementations provide similar HTML parsing capabilities, but I don't know their XQuery extensions by heart.

Jens Erat

Posted 10 years ago

Reputation: 261

1

Bash + AWK

wget -q -O - http://www.stroustrup.com/C++.html \
|awk '/((http:\/\/)?www([./#\+-]\w*)+)/ {print gensub(/.*((http:\/\/)?www([./#\+-]\w*)+).*/,"\\1","g")}'

I know it probably misses a few URLs, but I chose to use the same regex as Stroustrup's original code, so this should return the same output as the original piece of code.

It may be possible to add some line breaks to make the output more readable, but I don't have a Linux box available at the moment to verify that it works... (tested on Windows)

Here is a "clean" version of the AWK part:

/((http:\/\/)?www([./#\+-]\w*)+)/ {
    print gensub(/.*((http:\/\/)?www([./#\+-]\w*)+).*/,"\\1","g")
}

LeFauve

Posted 10 years ago

Reputation: 402

1

PHP 4.3+ / 5.0+

I know there are 2 different answers regarding PHP, but I'm going to show a similar approach here, using nothing but standard functions.

For this, you will need to have the following in a file named php.ini IN THE SAME DIRECTORY:

allow_url_fopen= On
allow_url_include= On

THAT PART IS IMPORTANT!
In case you can't change it (this didn't work with XAMPP), there is a default php.ini file in the PHP installation folder.
Changing the values there will solve it.
Remember to restart Apache afterwards.

Since this isn't [tag:code-golf], I made my code somewhat readable.

Here it is:

ob_start(); //creates an output buffer

//now we 'include' the file, which will output the source code.
include 'http://www.stroustrup.com/C++.html';

$html = ob_get_clean(); //stores the output buffer and closes it

$offset = 0; //initial offset to search
$links = array(); //will contain all links

//while a link is found
while($pos = strpos($html, 'href="http', $offset))
{
    //look for the closing "
    $end = strpos($html, '"', $pos + 7);
    //take it from the string, store it into the array
    $links[] = substr($html, $pos + 6, ($end - $pos) - 6);
    //increase the offset, so it doesn't find the same link again
    $offset = $end + 1;
}

print_r($links); //spits it out, with the output buffer closed

I've added some comments to try to explain the code.

No regex or DOM parsers used: only pure, cold, hard string manipulation.

For this to work on other pages, you must be sure that the href attribute values are enclosed in double quotes, or it will fail.

Ismael Miguel

Posted 10 years ago

Reputation: 6 797

1

F#

do
  use client = new System.Net.WebClient()
  let html = client.DownloadString "http://www.stroustrup.com/C++.html"
  System.Text.RegularExpressions.Regex.Matches(html, @"https?://[^""]+")
  |> Seq.cast<System.Text.RegularExpressions.Match>
  |> Seq.iter (printfn "%O")

Jon Harrop

Posted 10 years ago

Reputation: 111

0

CSS - idea

In Firefox, for example, for any page you are on, you can go to Tools | Web Developer | Style Editor, and use CSS to display anchors only:

* {display:none;}
a {display:block;}

However, the above will not work because display of parent elements overrides children.

Still working on a CSS solution, but suggestions welcome!

user15259

Posted 10 years ago

Reputation:

Maybe using positioning? Push everything off the page to the left, then push links back on to the right? – Izkata – 10 years ago

@Izkata - looks like I've been scooped by Athari who provided a CSS solution! – None – 10 years ago

You could use "* {font-size:0pt} a {font-size:8pt}" to display only the links, but you won't see the URLs – LeFauve – 10 years ago

It seems Athari nailed it :) – LeFauve – 10 years ago

0

C++/U++

Here is the U++ version (U++ is 'another' C++ library, with a slightly different approach from Boost):

#include <Core/Core.h>
#include <plugin/pcre/Pcre.h>

using namespace Upp;

CONSOLE_APP_MAIN
{
    String s = ToCharset(CHARSET_UTF8, HttpRequest("http://www.stroustrup.com/C++.html").Execute(),
                         CHARSET_ISO8859_1);
    RegExp x("href *= *\"(.*?)\"");
    while(x.GlobalMatch(s))
        Cout() << x.GetStrings()[0] << "\n";
}

(BTW, funny how most versions here are IMO not implementing the original specification: extract all links. There are many links on that page that do not start with http...)

Mirek Fidler

Posted 10 years ago

Reputation: 9

Kudos, I just posted the same below (deleted now), and likewise tried to reply to the original post ( BTW: http://www.ultimatepp.org/forums/index.php?t=msg&th=9167&start=0& ), but it won't let me post links as a first message.

– user1284631 – 10 years ago

0

Lua

More concise, without checking for errors or unique URLs. In readable form...

http = require 'socket.http'
page = http.request 'http://www.stroustrup.com/C++.html'
page:gsub('http://www[%w./#\\+-]+', print)

...and less-readable...

require'socket.http'.request'http://www.stroustrup.com/C++.html':gsub('http://www[%w./#\\+-]+',print)

thenumbernine

Posted 10 years ago

Reputation: 341

0

Javascript / DOM

var r = new XMLHttpRequest();
r.open('GET', 'http://www.stroustrup.com/C++.html', false);
r.send(null);
var div = document.createElement('div');
div.innerHTML = r.response;
[].slice.call(div.getElementsByTagName('a')).forEach( function(a) {
  if (a.href && a.getAttribute('href').charAt(0) != '#') console.log(a.href);
} );

(Tested in the Google Chrome javascript console.)

Of course, CORS blocks this by default - but it's easy to disable CORS in Google Chrome for development purposes.

Usas

Posted 10 years ago

Reputation: 1

r.response is a string. You have to set r.responseType = "document" first. But then, you get an error because the request is synchronous. – gilly3 – 10 years ago

Thanks @gilly3. I've modified it to take the text response and put it in a div's innerHTML instead. When running the browser with CORS disabled, this definitely now works. – Usas – 10 years ago

-1

Bash

GET http://www.stroustrup.com/C++.html | grep -o "https\?://[^ \"]\+"

needs libwww-perl and grep packages :)

Zoltán Szeder

Posted 10 years ago

Reputation: 11