Simple script parsing text, what is wrong here?

I'm a big user of https://www.grc.com/passwords.htm to get strong passwords. However, having to go to the site and manually copy the password every time gets old fast, so I decided to write a little script to do it for me. Every time you reload the page, it presents you with new passwords in plain text, so this is the script:

curl 'https://www.grc.com/passwords.htm' | grep '63 random printable ASCII characters:' | sed 's/^.*size=2>//' | sed 's/<\/font>.*$//' | pbcopy

Since there's no identifying classes or IDs, I get the page through curl, pipe it to grep to get the line I want ("63 random printable ASCII characters:"), and then I use sed to delete everything up to the password, as well as everything after it, finally copying to the clipboard with pbcopy.

This all works fine, except for one small detail. The string I get in the end should always be 63 characters long, but it's not. It usually varies between 64, 67, 70 and 73 and I have no idea why.

Can anyone shed any light on this?

user137369

Posted 2012-12-29T18:31:27.783

Reputation: 916

I seriously wouldn't trust a server-side generated password. – Dennis – 2012-12-29T20:02:35.633

That is of course your choice, but we're not talking about amateurs here, we're talking about extensive knowledge and research. Just read the first few lines on the page, and you'll see why it's safe. – user137369 – 2012-12-29T20:14:37.210

The design seems sound. The problem is that the server generated the password. That means they know it. – Dennis – 2012-12-29T20:16:15.283

@user137369 The danger is in the remote provider storing the generated passwords (for any reason) or someone intercepting network traffic (again, for any reason). No matter how clever the generation algorithm is in getting really random bits, you're opening yourself to e.g. "dictionary" ("all passwords generated by GRC in 2012") attacks. – Daniel Beck – 2012-12-29T20:16:17.970

Please read the website (most of it is interesting for this subject), but again, even the first lines answer those concerns. Cracking 63 characters is not even realistically feasible by today's standards, and even if you could get "all passwords generated by GRC in 2012" (which you cannot), since they're used in different websites that themselves hash it, you'd need impossibly long rainbow tables and combinations to do it. Furthermore, the page can only be shown if a secure connection is available. – user137369 – 2012-12-29T20:35:18.000

As for them storing it, that would serve no purpose, but even if they did, the passwords would still be incredibly tough to crack. It's the same as trusting a service like LastPass to generate secure passwords: they could also keep those if they wanted, but it'd be pretty useless. Most of us are not the president of a nation, and there's no reason for us to be specific targets. Most website password attacks in 2012 were possible due to the websites' lack of security, and there are enough weak passwords to go around. No one bothers with the incredibly strong ones; there's no return on time/resources invested. – user137369 – 2012-12-29T20:35:52.760

Answers

As mousio already said, the problem is the HTML encoding of some special characters.

Perl can convert those reliably and easily:

curl 'https://www.grc.com/passwords.htm' | \
    grep 'ASCII characters:' | \
    perl -MHTML::Entities -ne 's/.*2>|<.*//g; print decode_entities($_)'

Dennis

Posted 2012-12-29T18:31:27.783

Reputation: 42 934

grep gives me an error (the -P option does not exist). Probably we're using different versions of it (I'm on OSX). – user137369 – 2012-12-29T20:42:54.687

Interesting. Well, the perl part is the important part. You can either keep using sed or perform the replace with perl (see updated answer). – Dennis – 2012-12-29T21:02:31.550

Although I never used perl, right now I can read that command and just about understand what it is doing, and it is much more efficient than my extensively verbose command. Thank you, marked as the new best answer. – user137369 – 2012-12-29T21:14:40.913

It might have to do with HTML encoding, where e.g. a < in the 63 characters is actually represented as the entity &lt; in the source, making your string a bit longer.
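
To see the effect in isolation, here is a minimal sketch. The three-character string is made up for illustration (it is not real GRC output), and `raw`, `decoded`, `raw_len` and `decoded_len` are just illustrative names.

```shell
# Hypothetical sample: the fragment a<b as it would appear in the HTML source.
raw='a&lt;b'

# Decode only the &lt; entity for this demonstration.
decoded=$(printf '%s' "$raw" | sed 's/&lt;/</g')

# printf '%s' avoids adding a trailing newline, which would inflate the count by one.
# tr -d ' ' strips the padding that some wc implementations put before the number.
raw_len=$(printf '%s' "$raw" | wc -c | tr -d ' ')
decoded_len=$(printf '%s' "$decoded" | wc -c | tr -d ' ')

echo "raw: $raw_len characters, decoded: $decoded_len characters"
# prints: raw: 6 characters, decoded: 3 characters
```

Each encoded < or > adds three characters this way. The newline that grep and sed emit at the end of the line would add one more, which may be where the 64 in the question comes from, though that is a guess.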

mousio

Posted 2012-12-29T18:31:27.783

Reputation: 771

Thank you. That was exactly the case. The command is a lot more verbose now, but it seems to work every time. curl "https://www.grc.com/passwords.htm" | grep "63 random printable ASCII characters:" | sed "s/^.*size=2>//" | sed "s/<\/font>.*$//" | sed "s/&quot;/\"/g" | sed "s/&apos;/'/g" | sed "s/&amp;/\&/g" | sed "s/&lt;/</g" | sed "s/&gt;/>/g" | pbcopy. If you know of a less verbose way to do it, I'd appreciate it (as long as it does not need any tools not installed by default). – user137369 – 2012-12-29T19:47:06.007

Here is something for you to consider - the resulting string of your script may often contain certain characters that are not being escaped properly in the sed operations.

For example, these characters may be suspect: brackets, single and double quotes, curly braces, exclamation points, forward slashes and backslashes, and asterisks.

I would try stripping one of these characters from the returned string in a series of tests, comparing the results to see if the removal of one of any number of these characters will bring the count to 63.

Indy Jones

Posted 2012-12-29T18:31:27.783

Reputation: 1

The sed operations were done one by one and tested as I went along. See mousio's answer, he found the problem, it had to do with the HTML characters that have special encoding. – user137369 – 2012-12-29T19:57:08.527

Based on your comment to mousio, your chain of grep/sed... can be reduced to one command:

Perl is usually installed by default

perl -ne 'next unless /63 random printable ASCII characters:/; s/^.*size=2>//; s/<\/font>.*$//; s/&quot;/"/g; s/&apos;/'\''/g; s/&amp;/&/g; s/&lt;/</g; s/&gt;/>/g; print; exit'

Not all versions of sed understand this syntax

sed -n '/63 random printable ASCII characters:/{s/^.*size=2>//; s/<\/font>.*$//; s/&quot;/"/g; s/&apos;/'\''/g; s/&amp;/\&/g; s/&lt;/</g; s/&gt;/>/g; p;q}'
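
If your sed rejects the one-liner, a form that is usually more portable is to put one command per line inside the braces. The following is only a sketch: `extract_password` is a hypothetical function name, and the sample line is made up to match the patterns used in the question, so the real page markup may differ. It also decodes &amp; last, so that a literal `&amp;lt;` in the source ends up as `&lt;` rather than `<`.

```shell
# Hypothetical helper: reads HTML on stdin, prints the decoded password line.
extract_password() {
    sed -n '/63 random printable ASCII characters:/{
s/^.*size=2>//
s/<\/font>.*$//
s/&quot;/"/g
s/&apos;/'\''/g
s/&lt;/</g
s/&gt;/>/g
s/&amp;/\&/g
p
q
}'
}

# Made-up sample line, not copied from the GRC page.
sample='<b>63 random printable ASCII characters:</b><font size=2>a&lt;b&amp;c&quot;d</font></td>'
printf '%s\n' "$sample" | extract_password
# prints: a<b&c"d
```

With the real page you would pipe curl into the function instead, e.g. `curl 'https://www.grc.com/passwords.htm' | extract_password | pbcopy`.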

glenn jackman

Posted 2012-12-29T18:31:27.783

Reputation: 18 546