The World's Smallest Web Browser

71

23

Backstory:

You enjoy your new programming job at a mega-multi-corporation. However, you aren't allowed to browse the web since your computer only has a CLI. They also run sweeps of all employees' hard drives, so you can't simply download a large CLI web browser. You decide to make a simple textual browser that is as small as possible so you can memorize it and type it into a temporary file every day.

Challenge:

Your task is to create a golfed web browser within a command-line interface. It should:

  • Take in a single URL via args or stdin
  • Split the directory and host components of the URL
  • Send a simple HTTP request to the host to request said directory
  • Print the contents of any <p> paragraph </p> tags
  • And either exit or ask for another page

More Info:

A simple HTTP request looks like this:

GET {{path}} HTTP/1.1
Host: {{host}}
Connection: close
\n\n

Ending newlines emphasized.

A typical response looks like:

HTTP/1.1 200 OK\n
<some headers separated by newlines>
\n\n
<html>
....rest of page
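
For example, fetching http://example.com/index.html means splitting the URL into host example.com and path /index.html and sending the following (a concrete instance of the request template above; whether that exact page exists doesn't matter for the format):

GET /index.html HTTP/1.1
Host: example.com
Connection: close
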

Rules:

  • It only needs to work on port 80 (no SSL needed)
  • You may not use netcat
  • Whatever programming language is used, only low-level TCP APIs are allowed (except netcat)
  • You may not use GUI, remember, it's a CLI
  • You may not use HTML parsers, except builtin ones (BeautifulSoup is not a builtin)
  • Bonus!! If your program loops back and asks for another URL instead of exiting, -40 chars (as long as you don't use recursion)
  • No third-party programs. Remember, you can't install anything.
  • This is code-golf, so the shortest byte count wins
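
For orientation (an illustration of the expected flow, not a competing entry), an ungolfed Python 2 sketch using only a low-level socket might look like this; the variable names and the 4096-byte buffer are arbitrary:

import re, socket, sys

url = sys.argv[1] if len(sys.argv) > 1 else raw_input()
url = url.split('//')[-1]                  # drop a leading http:// if present
host, _, path = url.partition('/')         # split host and directory components
s = socket.create_connection((host, 80))   # plain TCP on port 80
s.sendall('GET /%s HTTP/1.1\nHost:%s\nConnection:close\n\n' % (path, host))
page = ''
while 1:                                   # read until the server closes
    data = s.recv(4096)
    if not data:
        break
    page += data
for par in re.findall('<p>(.*?)</p>', page, re.S):
    print par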

TheDoctor

Posted 2015-10-26T16:26:26.493

Reputation: 7 793

7Python, import webbrowser;webbrowser.open(url) – Blue – 2015-10-26T16:36:06.673

8@muddyfish read the rules – TheDoctor – 2015-10-26T16:36:33.080

1Another fuzzy point is the request itself. Some websites will accept incomplete or non-standard requests. I suggest you include an example request (to e.g. example.com) and the expected output. – mınxomaτ – 2015-10-26T16:41:09.380

4Can you provide a sample web page of some sort for testing this? It is difficult to find places that use <p> :P – a spaghetto – 2015-10-26T16:43:19.640

@quartata try Wikipedia – TheDoctor – 2015-10-26T16:43:43.177

@quartata example.com would be perfect. It is guaranteed to never change its content and is relatively small. – mınxomaτ – 2015-10-26T16:44:05.660

@minxomat sure, when I get back on my computer – TheDoctor – 2015-10-26T16:44:06.023

@minxomat it shouldn't be too hard. I'll clarify the request soon – TheDoctor – 2015-10-26T16:49:01.027

@TheDoctor Nevermind, that was a dumb question that you've already answered ... – mınxomaτ – 2015-10-26T16:50:18.667

52

Are we allowed to parse HTML using regex? ;-)

– Digital Trauma – 2015-10-26T16:57:38.533

3The restriction to *low-level socket interfaces* seems to prohibit the TCP-level APIs of most languages which have TCP-level APIs. – Peter Taylor – 2015-10-26T16:58:23.123

@DigitalTrauma good luck – TheDoctor – 2015-10-26T16:58:30.777

@PeterTaylor I intended that to mean only a simple TCP API was allowed... Clarification soon – TheDoctor – 2015-10-26T17:00:59.110

3Wouldn't all headings h1 … h6 be important, too? If you actually aren't allowed to read you may need to hurry and rush through the content. – insertusernamehere – 2015-10-26T17:02:27.297

Do the contents of each <p>...</p> need to be printed on separate lines or can the output be dumped all in one log line? – Digital Trauma – 2015-10-26T18:29:27.923

1@DigitalTrauma it should be newline separated – TheDoctor – 2015-10-26T19:52:26.883

1Is HTTP 1.1 mandatory? A simple HTTP request is even simpler: "GET $path HTTP/0.9\r\n\r\n" – slebetman – 2015-10-27T03:05:48.373

Is IO:Socket::INET considered low-level enough?

– Dom Hastings – 2015-10-27T06:32:41.500

1@DomHastings Seems as low as the Bash and PHP entries. Open a socket, write and read. – Schwern – 2015-10-27T08:02:18.470

2Totally off-topic: any mega-multi-corporation that makes it this hard for their developers to access the internet is not worth working for IMHO. As a developer, I need Google and Stackoverflow on a daily, sometimes even hourly basis to search for solutions. Not having access to these essential tools is like not giving a commercial pilot access to his radio. – Nzall – 2015-10-27T16:40:18.760

Just download it again. Every day. – None – 2015-10-27T21:56:03.227

1Nitpicking: the newlines in an HTTP request are actually \r\n. – Josiah Keller – 2015-10-28T13:50:21.843

If the CLI is bash, wget might be preinstalled. – Cees Timmerman – 2015-10-28T16:40:16.000

@CeesTimmerman wget isn't a socket API – TheDoctor – 2015-10-28T16:43:13.307

@TheDoctor So the bold line wouldn't apply to it, hence the no install rule, which also doesn't apply if it's pre-installed. – Cees Timmerman – 2015-10-28T17:03:12.307

@CeesTimmerman But wget handles all the HTTP request internally, which isn't allowed. – TheDoctor – 2015-10-28T17:21:17.587

What's this "(as long as you don't use recursion)" about? – Bergi – 2015-10-29T01:03:19.967

@Bergi Using recursion would eventually cause a Stack Overflow given enough browsing. – TheDoctor – 2015-10-29T01:36:21.873

2@TheDoctor: That would depend on the language and its ability of tail call optimisation. A recursive approach is totally standard in Haskell or JS – Bergi – 2015-10-29T09:32:06.167

small in what sense?? code lines or executable size?? – Ehsan Sajjad – 2015-10-29T11:18:35.813

So small that you can remember it. Which is hard. Some people can remember pages of code, but I would have trouble remembering just wget, less and grep to perform this task, even though they let you build a full fledged browser in under 10 lines. – GolezTrol – 2015-10-30T06:32:50.983

Answers

63

Pure Bash (no utilities), 200 bytes - 40 bonus = 160

while read u;do
u=${u#*//}
d=${u%%/*}
exec 3<>/dev/tcp/$d/80
echo "GET /${u#*/} HTTP/1.1
host:$d
Connection:close
">&3
mapfile -tu3 A
a=${A[@]}
a=${a#*<p>}
a=${a%</p>*}
echo "${a//<\/p>*<p>/"
"}"
done

I think this is up to the spec, though of course watch out for parsing HTML using regex. I think the only thing worse than parsing HTML using regex is parsing HTML using shell pattern matching.

This now deals with <p>...</p> spanning multiple lines. Each <p>...</p> is on a separate line of output:

$ echo "http://example.com/" | ./smallbrowse.sh
This domain is established to be used for illustrative examples in documents. You may use this     domain in examples without prior coordination or asking for permission.
<a href="http://www.iana.org/domains/example">More information...</a>
$ 

Digital Trauma

Posted 2015-10-26T16:26:26.493

Reputation: 64 644

35You need to have this memorized by tomorrow. – Conor O'Brien – 2015-10-26T17:45:14.677

It doesn't give me paragraph text. Only tags inside the paragraph. example.com >>> <a href="http://www.iana.org/domains/example">More information...</a> – TheDoctor – 2015-10-26T17:53:53.643

@TheDoctor fixed for <p>...</p> spanning multiple lines - and a bit shorter too! – Digital Trauma – 2015-10-26T18:15:42.463

14+∞ for "parsing HTML using shell pattern matching" – SztupY – 2015-10-26T18:20:39.493

76: -1 because your avatar is subliminal messaging – TheDoctor – 2015-10-26T19:48:28.527

This will take any scheme like alkdfjlkdj://example.org saving a few bytes. That should be fixed. – Schwern – 2015-10-27T05:28:01.410

1...you can make TCP connections from Bash? Now I am truly terrified! – MathematicalOrchid – 2015-10-27T13:08:58.957

2Note: /dev/tcp is an optional extension and may not be present in your build of bash. You need to compile with --enable-net-redirections to have it. – Chris Down – 2015-10-27T14:45:41.820

@Schwern I'm not sure if I totally get your point. There is nothing in the spec that I see about sanitising the input URL – Digital Trauma – 2015-10-27T16:22:45.080

@ChrisDown Looking in the bash-4.3 configure script, I see opt_net_redirs=yes, i.e. while this is an option, it is enabled by default. It's also a documented feature. So I don't think the fact that this feature may be disabled requires that it needs to be handled in any special way from the [tag:code-golf] scoring point of view. The question of interpreter build-time options is interesting though and I don't think has been brought up on meta...

– Digital Trauma – 2015-10-27T16:28:26.777

@ChrisDown OK, so I posted this meta question to get the community consensus on this. I hope I've written it fairly and unbiasedly, though please feel free to edit or comment as you see fit.

– Digital Trauma – 2015-10-27T17:03:10.040

@MathematicalOrchid If that terrifies you, you should not, I repeat, should not check out testssl.sh.

– Jonas Schäfer – 2015-10-31T09:30:03.557

21

PHP, 175 bytes (215 - 40 bonus; earlier revisions: 227, 229, 239, 202, 216, 186 bytes)

Have fun browsing the web:

for(;$i=parse_url(trim(fgets(STDIN))),fwrite($f=fsockopen($h=$i[host],80),"GET $i[path] HTTP/1.1
Host:$h
Connection:Close

");preg_match_all('!<p>(.+?)</p>!si',stream_get_contents($f),$r),print join("
",$r[1])."
");

Reads URLs from STDIN like http://www.example.com/. Outputs paragraphs separated by newline "\n".
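
A sample session (the filename browse.php is an assumption; the example.com output matches the Bash answer above):

$ echo "http://www.example.com/" | php browse.php
This domain is established to be used for illustrative examples in documents. You may use this     domain in examples without prior coordination or asking for permission.
<a href="http://www.iana.org/domains/example">More information...</a>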


Ungolfed

for(; $i=parse_url(trim(fgets(STDIN))); ) {
    $h = $i['host'];
    $f = fsockopen($h, 80);

    fwrite($f, "GET " . $i['path'] . " HTTP/1.1\nHost:" . $h . "\nConnection:Close\n\n");

    $c = stream_get_contents($f);

    preg_match_all('!<p>(.+?)</p>!si', $c, $r);
    echo join("\n", $r[1]) . "\n";
}

First version supporting one URL only

$i=parse_url($argv[1]);fwrite($f=fsockopen($h=$i[host],80),"GET $i[path] HTTP/1.1\nHost:$h\nConnection:Close\n\n");while(!feof($f))$c.=fgets($f);preg_match_all('!<p>(.+?)</p>!sim',$c,$r);foreach($r[1]as$p)echo"$p\n";



Edits

  • As pointed out in the comments by Braintist, I totally forgot to include the path. That's fixed now, thanks. Added 30 bytes.
  • Saved 3 bytes by resetting $c (holds the page content) with $c=$i=parse_url(trim(fgets(STDIN))); instead of $c=''.
  • Saved 12 bytes by replacing \n with new lines (5 bytes), one while-loop with for (2 bytes), placing nearly everything into the expressions of for (2 bytes) and by replacing foreach with join (3 bytes). Thanks to Blackhole.
  • Saved 3 bytes by replacing fgets with stream_get_contents Thanks to bwoebi.
  • Saved 5 bytes by removing the re-initialization of $c, as $c isn't needed at all anymore.
  • Saved 1 byte by removing the pattern modifier m from the regex. Thanks to manatwork.

insertusernamehere

Posted 2015-10-26T16:26:26.493

Reputation: 4 551

6

Related: http://stackoverflow.com/a/1732454/4766556

– a spaghetto – 2015-10-26T19:22:20.393

So you can only ever read the home page (/) with this? – briantist – 2015-10-26T19:39:42.107

1@briantist Oh man, I totally missed that. :D Thanks, it's fixed now. – insertusernamehere – 2015-10-26T19:57:44.133

1I can't stand that Perl beats PHP, so don't forget: while is forbidden when golfing (for is often shorter but never longer), and to do a newline, just press enter (1 byte instead of 2 for \n)! Here is your (untested) code a bit more golfed (227 bytes), with the newlines replaced by ↵: for(;$c=$i=parse_url(trim(fgets(STDIN))),fwrite($f=fsockopen($h=$i[host],80),"GET $i[path] HTTP/1.1↵Host:$h↵Connection:Close↵↵");preg_match_all('!<p>(.+?)</p>!sim',$c,$r),print join('↵',$r[1]).'↵')for(;!feof($f);)$c.=fgets($f); – Blackhole – 2015-10-27T00:16:41.033

1I don't mean "forbidden" as "against the rules", I just mean that's not useful at all, since a for-loop is always better than a while-loop ;). – Blackhole – 2015-10-27T00:25:29.470

@Blackhole thanks, this reminded me to change all my line endings to linefeeds only and that PowerShell also accepts embedded line breaks in string literals, so I was able to save 14 bytes on my answer. – briantist – 2015-10-27T00:30:22.357

@Blackhole Saved at last 9 bytes (\n, for and join). Thanks again. – insertusernamehere – 2015-10-27T08:45:42.560

In your last foreach, can't you replace that with echo join($r[1],'↵');? With a newline instead of ↵ – Tschallacka – 2015-10-27T10:13:13.100

1@MichaelDibbets Actually I did that already as written in the edit. Hm. Let me see. Haha, I forgot to copy and count the final snippet. Duh :D Things like that happen, if you update your code before breakfast. Thanks for pointing it out. – insertusernamehere – 2015-10-27T10:25:21.853

Why don't you put all the expression but the while one in your first for loop, as I've suggested? This way, you can remove the angular brackets and save 2 bytes. Additionally, you can save 4 bytes by replacing this while loop with while($l=fgets($f))$c.=$l;. And this Connection:Close↵ doesn't seem to be necessary (-17 bytes). – Blackhole – 2015-10-27T11:50:50.287

@Blackhole I got it working now, putting everything into the expressions of for. Thanks for those 2 bytes. If I omit Connection:Close it breaks. This was actually one of the first things I tried before publishing the answer. – insertusernamehere – 2015-10-27T12:23:52.697

“If there are no "\n" characters in a subject string, or no occurrences of ^ or $ in a pattern, setting this modifier has no effect.” – Pattern Modifiers, m (PCRE_MULTILINE)

– manatwork – 2015-10-28T17:15:20.810

PHP has this nice function called stream_get_contents() which reads until EOF, bit shorter than your while. – bwoebi – 2015-10-28T23:31:41.503

@manatwork I wasn't sure whether some content might break if I remove it. It passed all my test cases at last, so thanks for saving 1 byte. :) – insertusernamehere – 2015-10-28T23:57:31.667

@bwoebi Good point. In the beginning I discarded the idea of using file_get_contents (low level and stuff ;) ) - it seems I suppressed stream_get_contents as well that moment. :D It saved 3 bytes, which also led to another 3 bytes I could save. Thanks a lot. – insertusernamehere – 2015-10-29T00:02:29.057

and now save yet another 5 bytes by inlining the variable $c (put the stream_get_contents as arg to preg_match_all()) for(;...;print join("↵",$r[1])."↵"))preg_match_all('!<p>(.+?)</p>!si',stream_get_contents($f), $r); – bwoebi – 2015-10-29T00:55:54.983

@bwoebi True that - it was way after my bedtime yesterday. – insertusernamehere – 2015-10-29T10:12:17.473

14

Perl, 132 bytes

155 bytes code + 17 for -ln -MIO::Socket - 40 for continually asking for URLs

As with @DigitalTrauma's answer, this parses HTML with regex; let me know if that's not acceptable. It doesn't keep parsing URLs any more... I'll look at that later... Close to Bash though! Big thanks to @Schwern for saving me 59 (!) bytes and to @skmrx for fixing the bug that allowed claiming the bonus!

m|(http://)?([^/]+)(/(\S*))?|;$s=new IO::Socket::INET"$2:80";print$s "GET /$4 HTTP/1.1
Host:$2
Connection:close

";local$/=$,;print<$s>=~m|<p>(.+?)</p>|gs

Usage

$ perl -ln -MIO::Socket -M5.010 wb.pl
example.com
This domain is established to be used for illustrative examples in documents. You may use this
    domain in examples without prior coordination or asking for permission.<a href="http://www.iana.org/domains/example">More information...</a>
example.org
This domain is established to be used for illustrative examples in documents. You may use this
    domain in examples without prior coordination or asking for permission.<a href="http://www.iana.org/domains/example">More information...</a>

Dom Hastings

Posted 2015-10-26T16:26:26.493

Reputation: 16 415

I fixed a bug and shortened the code by removing the need to declare $h and $p or have a default path. It also no longer requires a trailing / on the host. – Schwern – 2015-10-27T05:22:08.197

1We're the one to beat now. :) – Schwern – 2015-10-27T06:20:42.367

I think I'm done for the night. :) – Schwern – 2015-10-27T06:54:56.677

Since the script asks for another URL instead of exiting, you can claim an additional -40 bytes – svsd – 2015-10-27T13:13:27.547

@skmrx I kept getting errors when trying to run repeated URLs so I removed the bonus... Is it working ok for you? I'm getting an error where it appears $s is undefined on the second run... – Dom Hastings – 2015-10-27T14:21:26.330

@DomHastings oops sorry, it doesn't work for me as well. But I think I found the problem - $/ is set to undef which carries over to the next loop. Setting it to "\n" at the end works :) – svsd – 2015-10-27T14:35:37.660

@skmrx Aaah of course! Thank you! I'll amend that later on! – Dom Hastings – 2015-10-27T15:03:55.827

So close ;-) I think using MIO::Socket is perfectly fine. I don't really know much perl, but I'd be surprised if you can't claim the 40 point bonus to your advantage as well. – Digital Trauma – 2015-10-27T19:00:36.510

1@DigitalTrauma you are indeed correct! I've claimed the bonus thanks to skmrx fixing my bug with ‘$/‘ and I wouldn't be near yours if it weren't for Schwern! – Dom Hastings – 2015-10-28T06:19:51.240

13

PowerShell, 254 bytes (previously 315 294 268 262)

294 - 40 for prompt (previously 355 334 308 302)

$u=[uri]$args[0]
for(){
$h=$u.Host
$s=[Net.Sockets.TcpClient]::new($h,80).GetStream()
$r=[IO.StreamReader]::new($s)
$w=[IO.StreamWriter]::new($s)
$w.Write("GET $($u.PathAndQuery) HTTP/1.1
HOST: $h

")
$w.Flush()
($r.ReadToEnd()|sls '(?s)(?<=<p>).+?(?=</p>)'-a).Matches.Value
[uri]$u=Read-Host
}

Requires PowerShell v5

All line endings (including the ones embedded in the string) are newlines only, \n (thanks Blackhole), which is fully supported by PowerShell (but if you're testing, be careful; ISE uses \r\n).

briantist

Posted 2015-10-26T16:26:26.493

Reputation: 3 110

4+1 for making my server admin duties appear much more productive – thanby – 2015-10-27T11:20:35.757

HTTP requires CRLF, not LF! [HTTPSYNTAX] – Toothbrush – 2015-10-29T19:40:40.593

2

@toothbrush Ha! Point taken, but the tolerance provision seems to be in full effect. Clearly this task is about what works and not what's correct (otherwise we wouldn't be parsing HTML with regex and using low level TCP libraries instead of well-tested existing libraries).

– briantist – 2015-10-29T19:44:27.897

1

@briantist https://greenbytes.de/tech/webdav/rfc7230.html#rfc.section.3.5 says that "a recipient MAY recognize a single LF as a line terminator and ignore any preceding CR". I read that as meaning most web servers would implement it, and the question definitely doesn't say it must generate correct GET requests… :)

– Toothbrush – 2015-10-29T19:52:04.187

8

Groovy script, 89 and 61 bytes

Loop back for bonus: 101 - 40 = 61

System.in.eachLine{l->l.toURL().text.findAll(/<p>(?s)(.*?)<\/p>/).each{println it[3..it.length()-5]}}

With just args, 89 bytes

this.args[0].toURL().text.findAll(/<p>(?s)(.*?)<\/p>/).each{println it[3..it.length()-5]}

Rnet

Posted 2015-10-26T16:26:26.493

Reputation: 277

1Groovy outgolfed everyone. As it should be. – a spaghetto – 2015-10-28T19:09:17.670

1

@quartata If it stays that way, it'll be the first time ever, so... ;)

– Geobits – 2015-10-28T19:11:26.223

11"only low-level TCP APIs are allowed" – Digital Trauma – 2015-10-28T21:19:55.953

Yeah, I'm going to agree with @DigitalTrauma that this isn't using a low-level TCP API. The rules state that you have to split the host and path on your own. – TheDoctor – 2015-11-08T15:37:25.567

6

Bash (might be cheating but seems to be within rules) 144-40=105

while read a;do
u=${a#*//}
d=${u%%/*}
e=www.w3.org
exec 3<>/dev/tcp/$e/80
echo "GET /services/html2txt?url=$a HTTP/1.1
Host:$d
">&3
cat <&3
done

Thanks to Digital Trauma.

Since I don't need to split the URL, this also works: 122 - 40 = 82

while read a;do
d=www.w3.org
exec 3<>/dev/tcp/$d/80
echo "GET /services/html2txt?url=$a HTTP/1.1
Host:$d
">&3   
cat <&3
done

philcolbourn

Posted 2015-10-26T16:26:26.493

Reputation: 501

8

I would argue that using this online html2txt converter is a standard loophole

– Digital Trauma – 2015-10-28T16:58:45.833

1Yes. And I also use cat so your solution is safe. – philcolbourn – 2015-10-29T00:06:31.930

5

C 512 Bytes

#include <netdb.h>
int main(){char i,S[999],b[99],*p,s=socket(2,1,0),*m[]={"<p>","</p>"};long n;
gets(S);p=strchr(S,'/');*p++=0;struct sockaddr_in a={0,2,5<<12};memcpy(&a.
sin_addr,gethostbyname(S)->h_addr,4);connect(s,&a,16);send(s,b,sprintf(b,
"GET /%s HTTP/1.0\r\nHost:%s\r\nAccept:*/*\r\nConnection:close\r\n\r\n",p,S),0);
p=m[i=0];while((n=recv(s,b,98,0))>0)for(char*c=b;c<b+n;c++){while(*c==*p &&*++p)
c++;if(!*p)p=m[(i=!i)||puts("")];else{while(p>m[i]){if(i)putchar(c[m[i]-p]);p--;}
if(i)putchar(*c);}}} 

Based loosely on my entry here, it takes the web address without a leading "http://". It will not handle nested <p> pairs correctly :(

Tested extensively on www.w3.org/People/Berners-Lee/
It works when compiled with Apple LLVM version 6.1.0 (clang-602.0.53) / Target: x86_64-apple-darwin14.1.1
It has enough undefined behavior that it may not work anywhere else.
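
The core of the entry is matching "<p>"/"</p>" against the stream as it arrives instead of buffering the whole response. A loose, non-golfed Python 2 rendering of that idea (not the C code's exact state machine; the 4096-byte buffer and the names are illustrative):

import socket

host, _, path = raw_input().partition('/')
s = socket.create_connection((host, 80))
s.sendall('GET /%s HTTP/1.0\r\nHost:%s\r\nConnection:close\r\n\r\n' % (path, host))

buf, inside = '', False
while 1:
    chunk = s.recv(4096)
    if not chunk:
        break
    buf += chunk
    while 1:
        marker = '</p>' if inside else '<p>'
        i = buf.find(marker)
        if i < 0:
            break          # the marker may span a chunk boundary; wait for more data
        if inside:
            print buf[:i]  # emit one complete paragraph
        inside = not inside
        buf = buf[i + len(marker):]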

AShelly

Posted 2015-10-26T16:26:26.493

Reputation: 4 281

I was going down roughly the same track (this segfaults when compiled with gcc), but it should be possible to get under 400 bytes in C. Not sure about clang, but you shouldn't have to declare the return type of main. You can also remove the include and "access" the structs as integer arrays instead. I've also been getting responses with "GET /%s HTTP/1.1\r\n\r\n", but mileage on that may vary based on the site... – Comintern – 2015-10-29T02:19:31.653

5

Ruby, 118

147 bytes source; 11 bytes '-lprsocket'; -40 bytes for looping.

*_,h,p=$_.split'/',4
$_=(TCPSocket.new(h,80)<<"GET /#{p} HTTP/1.1
Host:#{h}
Connection:close

").read.gsub(/((\A|<\/p>).*?)?(<p>|\Z)/mi,'
').strip

Usage example:

$ ruby -lprsocket wb.rb
http://example.org/
This domain is established to be used for illustrative examples in documents. You may use this
    domain in examples without prior coordination or asking for permission.
<a href="http://www.iana.org/domains/example">More information...</a>
http://www.xkcd.com/1596/
Warning: this comic occasionally contains strong language (which may be unsuitable for children), unusual humor (which may be unsuitable for adults), and advanced mathematics (which may be unsuitable for liberal-arts majors).

This work is licensed under a
<a href="http://creativecommons.org/licenses/by-nc/2.5/">Creative Commons Attribution-NonCommercial 2.5 License</a>.


This means you're free to copy and share these comics (but not to sell them). <a rel="license" href="/license.html">More details</a>.

ezrast

Posted 2015-10-26T16:26:26.493

Reputation: 491

4

AutoIt, 347 bytes

Func _($0)
$4=StringTrimLeft
$0=$4($0,7)
$3=StringSplit($0,"/")[1]
TCPStartup()
$2=TCPConnect(TCPNameToIP($3),80)
TCPSend($2,'GET /'&$4($0,StringLen($3))&' HTTP/1.1'&@LF&'Host: '&$3&@LF&'Connection: close'&@LF&@LF)
$1=''
Do
$1&=TCPRecv($2,1)
Until @extended
For $5 In StringRegExp($1,"(?s)\Q<p>\E(.*?)(?=\Q</p>\E)",3)
ConsoleWrite($5)
Next
EndFunc

Testing

Input:

_('http://www.autoitscript.com')

Output:

You don't have permission to access /error/noindex.html
on this server.

Input:

_('http://www.autoitscript.com/site')

Output:

The document has moved <a href="https://www.autoitscript.com/site">here</a>.

Remarks

  • Doesn't support nested <p> tags
  • Supports only <p> tags (case-insensitive), will break on every other tag format
  • Loops indefinitely when any error occurs (previously: panics)

mınxomaτ

Posted 2015-10-26T16:26:26.493

Reputation: 7 398

4

C#, 727 Bytes - 40 = 687 Bytes

using System.Text.RegularExpressions;class P{static void Main(){a:var i=System.Console.ReadLine();if(i.StartsWith("http://"))i=i.Substring(7);string p="/",h=i;var l=i.IndexOf(p);
if(l>0){h=i.Substring(0,l);p=i.Substring(l,i.Length-l);}var c=new System.Net.Sockets.TcpClient(h,80);var e=System.Text.Encoding.ASCII;var d=e.GetBytes("GET "+p+@" HTTP/1.1
Host: "+h+@"
Connection: close

");var s=c.GetStream();s.Write(d,0,d.Length);byte[]b=new byte[256],o;var m=new System.IO.MemoryStream();while(true){var r=s.Read(b,0,b.Length);if(r<=0){o=m.ToArray();break;}m.Write(b,0,r);}foreach (Match x in new Regex("<p>(.+?)</p>",RegexOptions.Singleline).Matches(e.GetString(o)))System.Console.WriteLine(x.Groups[1].Value);goto a;}}

It takes a little bit of training, but it's surely memorable :)

Here is an ungolfed version:

using System.Text.RegularExpressions;
class P
{
    static void Main()
    {
    a:
        var input = System.Console.ReadLine();
        if (input.StartsWith("http://")) input = input.Substring(7);
        string path = "/", hostname = input;
        var firstSlashIndex = input.IndexOf(path);
        if (firstSlashIndex > 0)
        {
            hostname = input.Substring(0, firstSlashIndex);
            path = input.Substring(firstSlashIndex, input.Length - firstSlashIndex);
        }
        var tcpClient = new System.Net.Sockets.TcpClient(hostname, 80);
        var asciiEncoding = System.Text.Encoding.ASCII;
        var dataToSend = asciiEncoding.GetBytes("GET " + path + @" HTTP/1.1
Host: " + hostname + @"
Connection: close

");
        var stream = tcpClient.GetStream();
        stream.Write(dataToSend, 0, dataToSend.Length);
        byte[] buff = new byte[256], output;
        var ms = new System.IO.MemoryStream();
        while (true)
        {
            var numberOfBytesRead = stream.Read(buff, 0, buff.Length);
            if (numberOfBytesRead <= 0)
            {
                output = ms.ToArray();
                break;
            }
            ms.Write(buff, 0, numberOfBytesRead);
        }
        foreach (Match match in new Regex("<p>(.+?)</p>", RegexOptions.Singleline).Matches(asciiEncoding.GetString(output)))
        {
            System.Console.WriteLine(match.Groups[1].Value);
        }
        goto a;
    }
}

As you can see, there are memory leak issues as a bonus :)

Stephan Schinkel

Posted 2015-10-26T16:26:26.493

Reputation: 596

Where is the memory leak? I see no using statements around streams but that does not make a leak. – Gusdor – 2015-10-29T14:41:23.993

You can trim a few more bytes: input = input.trimStart("http://") will replace the "if" clause, and you should be able to use System.Text.Encoding.ASCII.GetBytes() directly without having to store it in asciiEncoding first. I think you'd even come out ahead with a "Using System;" line and getting rid of a handful of "System."s. – minnmass – 2015-10-30T05:04:10.297

3

JavaScript (NodeJS) - 166 bytes (previously 187)

s=require("net").connect(80,p=process.argv[2],_=>s.write("GET / HTTP/1.0\nHost: "+p+"\n\n")&s.on("data",d=>(d+"").replace(/<p>([^]+?)<\/p>/g,(_,g)=>console.log(g))));

187:

s=require("net").connect(80,p=process.argv[2],_=>s.write("GET / HTTP/1.1\nHost: "+p+"\nConnection: close\n\n")&s.on("data",d=>(d+"").replace(/<p>([^]+?)<\/p>/gm,(_,g)=>console.log(g))));

Usage:

node file.js www.example.com

Or formatted

var url = process.argv[2];
s=require("net").connect(80, url ,_=> {
     s.write("GET / HTTP/1.1\nHost: "+url+"\nConnection: close\n\n");
     s.on("data",d=>(d+"").replace(/<p>([^]+?)<\/p>/gm,(_,g)=>console.log(g)))
});

Benjamin Gruenbaum

Posted 2015-10-26T16:26:26.493

Reputation: 219

1Caveat: this will work for small pages - bigger pages emit multiple data events. – Benjamin Gruenbaum – 2015-10-28T09:04:15.177

3

Python 2 - 209 bytes (previously 212)

import socket,re
h,_,d=raw_input().partition('/')
s=socket.create_connection((h,80))
s.sendall('GET /%s HTTP/1.1\nHost:%s\n\n'%(d,h))
p=''
while h:h=s.recv(9);p+=h
for g in re.findall('<p>(.*?)</p>',p):print g

Zac Crites

Posted 2015-10-26T16:26:26.493

Reputation: 201

You can save two bytes by stripping out whitespace after the colon on while h: and before print g. – Skyler – 2015-10-28T16:25:32.330

And another byte with 'GET /%s HTTP/1.1\nHost:%s\n\n'. – Cees Timmerman – 2015-10-28T16:48:19.183

3

Python 2, 187 - 40 = 147 (141 in a REPL)

Compressed and looped version of Zac's answer:

import socket,re
while 1:h,_,d=raw_input().partition('/');s=socket.create_connection((h,80));s.sendall('GET /%s HTTP/1.1\nHost:%s\n\n'%(d,h));print re.findall('<p>(.*?)</p>',s.recv(9000))

Example:

dictionary.com
['The document has moved <a href="http://dictionary.reference.com/">here</a>.']
dictionary.reference.com
[]
paragraph.com
[]
rare.com
[]

Actually useful is this:

207 - 40 = 167

import socket,re
while 1:h,_,d=raw_input().partition('/');s=socket.create_connection((h,80));s.sendall('GET /%s HTTP/1.1\nHost:%s\n\n'%(d,h));print'\n'.join(re.findall('<p>(.*?)</p>',s.recv(9000),re.DOTALL))

Example:

example.org
This domain is established to be used for illustrative examples in documents. You may use this
    domain in examples without prior coordination or asking for permission.
<a href="http://www.iana.org/domains/example">More information...</a>
www.iana.org/domains/example
The document has moved <a href="/domains/reserved">here</a>.
www.iana.org/domains/reserved

dictionary.com
The document has moved <a href="http://dictionary.reference.com/">here</a>.
dictionary.reference.com

catb.org

      <a href="http://validator.w3.org/check/referer"><img
          src="http://www.w3.org/Icons/valid-xhtml10"
          alt="Valid XHTML 1.0!" height="31" width="88" /></a>

This is catb.org, named after (the) Cathedral and the Bazaar. Most
of it, under directory esr, is my personal site.  In theory other
people could shelter here as well, but this has yet to occur.
catb.org/jargon
The document has moved <a href="http://www.catb.org/jargon/">here</a>.
www.catb.org/jargon/
This page indexes all the WWW resources associated with the Jargon File
and its print version, <cite>The New Hacker's Dictionary</cite>. It's as
official as anything associated with the Jargon File gets.
On 23 October 2003, the Jargon File achieved the
dubious honor of being cited in the SCO-vs.-IBM lawsuit.  See the <a
href='html/F/FUD.html'>FUD</a> entry for details.
www.catb.org/jargon/html/F/FUD.html
 Defined by Gene Amdahl after he left IBM to found his own company:
   &#8220;<span class="quote">FUD is the fear, uncertainty, and doubt that IBM sales people
   instill in the minds of potential customers who might be considering
   [Amdahl] products.</span>&#8221; The idea, of course, was to persuade them to go
   with safe IBM gear rather than with competitors' equipment.  This implicit
   coercion was traditionally accomplished by promising that Good Things would
   happen to people who stuck with IBM, but Dark Shadows loomed over the
   future of competitors' equipment or software.  See
   <a href="../I/IBM.html"><i class="glossterm">IBM</i></a>.  After 1990 the term FUD was associated
   increasingly frequently with <a href="../M/Microsoft.html"><i class="glossterm">Microsoft</i></a>, and has
   become generalized to refer to any kind of disinformation used as a
   competitive weapon.
[In 2003, SCO sued IBM in an action which, among other things,
   alleged SCO's proprietary control of <a href="../L/Linux.html"><i class="glossterm">Linux</i></a>.  The SCO
   suit rapidly became infamous for the number and magnitude of falsehoods
   alleged in SCO's filings.  In October 2003, SCO's lawyers filed a <a href="http://www.groklaw.net/article.php?story=20031024191141102" target="_top">memorandum</a>
   in which they actually had the temerity to link to the web version of
   <span class="emphasis"><em>this entry</em></span> in furtherance of their claims. Whilst we
   appreciate the compliment of being treated as an authority, we can return
   it only by observing that SCO has become a nest of liars and thieves
   compared to which IBM at its historic worst looked positively
   angelic. Any judge or law clerk reading this should surf through to
   <a href="http://www.catb.org/~esr/sco.html" target="_top">my collected resources</a> on this
   topic for the appalling details.&#8212;ESR]

Cees Timmerman

Posted 2015-10-26T16:26:26.493

Reputation: 625

1

gawk, 235 - 40 = 195 bytes

{for(print"GET "substr($0,j)" HTTP/1.1\nHost:"h"\n"|&(x="/inet/tcp/0/"(h=substr($0,1,(j=index($0,"/"))-1))"/80");(x|&getline)>0;)w=w RS$0
for(;o=index(w,"<p>");w=substr(w,c))print substr(w=substr(w,o+3),1,c=index(w,"/p>")-2)
close(x)}

Golfed it down, but this is a more unforgiving version, which needs the web address without http:// at the beginning. And if you want to access the root directory you have to end the address with a /. Furthermore the <p> tags have to be lower case.

My earlier version actually didn't handle lines containing </p><p> correctly. This is now fixed.

Output for input example.com/

This domain is established to be used for illustrative examples in documents. You may use this
    domain in examples without prior coordination or asking for permission.
<a href="http://www.iana.org/domains/example">More information...</a>

Still doesn't work with Wikipedia. I think the reason is that Wikipedia uses https for everything. But I don't know.

The following version is a little more forgiving with the input and it can handle upper case tags as well.

IGNORECASE=1{
    s=substr($0,(i=index($0,"//"))?i+2:0)
    x="/inet/tcp/0/"(h=(j=index(s,"/"))?substr(s,1,j-1):s)"/80"
    print"GET "substr(s,j)" HTTP/1.1\nHost:"h"\nConnection:close\n"|&x
    while((x|&getline)>0)w=w RS$0
    for(;o=index(w,"<p>");w=substr(w,c))
        print substr(w=substr(w,o+3),1,c=index(w,"/p>")-2)
    close(x)
}

I'm not sure about the "Connection:close" line. It doesn't seem to be mandatory. I couldn't find an example that would behave differently with or without it.

Cabbie407

Posted 2015-10-26T16:26:26.493

Reputation: 1 158

1

PowerShell (4), 240 bytes

$input=Read-Host ""
$url=[uri]$input
$dir=$url.LocalPath
Do{
$res=Invoke-WebRequest -URI($url.Host+"/"+$dir) -Method Get
$res.ParsedHtml.getElementsByTagName('p')|foreach-object{write-host $_.innerText}
$dir=Read-Host ""
}While($dir -NE "")

Ungolfed (proxy is not required)

$system_proxyUri=Get-ItemProperty -Path "HKCU:\Software\Microsoft\Windows\CurrentVersion\Internet Settings" -Name ProxyServer
$proxy = [System.Net.WebRequest]::GetSystemWebProxy()
$proxyUri = $proxy.GetProxy($system_proxyUri.ProxyServer)
$input = Read-Host "Initial url"
#$input="http://stackoverflow.com/questions/tagged/powershell"
$url=[uri]$input
$dir=$url.LocalPath
Do{
$res=Invoke-WebRequest -URI($url.Host+"/"+$dir) -Method Get -Proxy($proxyUri)
$res.ParsedHtml.getElementsByTagName('p')|foreach-object{write-host $_.innerText}
$dir=Read-Host "next dir"
}While($dir -NE "")

edit* also not too hard to memorize ^^

dwana

Posted 2015-10-26T16:26:26.493

Reputation: 531

-1

Java, 620 bytes

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class JavaApplication12 {

    public static void main(String[] args) {
        try {             
            BufferedReader i = new BufferedReader(new InputStreamReader(new URL(args[0]).openStream()));
            String l;
            boolean print = false;
            while ((l = i.readLine()) != null) {
                if (l.toLowerCase().contains("<p>")) {
                    print = true;
                }
                if (print) {
                    if (l.toLowerCase().contains("</p>")) {
                        print = false;
                    }
                    System.out.println(l);
                }
            }

        } catch (Exception e) {

        }
    }

}

Shalika Ashan

Posted 2015-10-26T16:26:26.493

Reputation: 1

2Welcome to Programming Puzzles & Code Golf! Unfortunately, this submission is invalid. The question allows only low-level TCP APIs, so you cannot use InputStreamReader. – Dennis – 2015-10-30T05:35:27.247

1Oh, I am so sorry, and thank you for pointing it out. Will do better in the next answer. – Shalika Ashan – 2015-11-02T03:25:58.580