Inserting Crackstation's Wordlist into a database table

1

I'm trying to insert Crackstation's 1.5 billion wordlist (15gig) into a database table. I've tried SQLite and Postgres both with no luck so far via the command line.

Both get upset that the data seems to contain some binary data values scattered throughout the file. The data doesn't seem to be just UTF-8 characters.

UltraEdit detects it as UTF-8.
However, the linux file command says:

$  file crackstation.txt -I
crackstation.txt: application/octet-stream; charset=binary

How can I get this data into Postgres or SQLite so I can analyze/query it?

Is there a way I can cleanse/convert the data first so the insert works... or a different way I can do the import into the databases so it works?

Here's what I tried:

SQLite:

sqlite> .mode csv
sqlite> .separator "\n"
sqlite> .import crackstation.txt passwords

The response to this is millions of unescaped " character messages and only 1 row inserted in the table (which happens to just be an empty string value).

Postgres:

psql> COPY passwords FROM 'crackstation.txt' WITH (FORMAT TEXT, NULL '', ENCODING 'UTF-8');

The response to this is ERROR: invalid byte sequence for encoding "UTF8": 0xf6 blah blah with a line number it was found on. Thing is, the txt file is 15gig so I've only been able to open it in UltraEdit and even this it's painfully slow to get to that line, delete it, run the psql command again, rinse and repeat... hence why I'm now asking the question.

Turgs

Posted 2015-01-13T13:27:01.757

Reputation: 153

Answers

0

If that file doesn't have any specified encoding, you might want to import it as such.

If the target is a dedicated database, in case of PostgreSQL, you may create it without enforcing any encoding.

As a SQL command:

CREATE DATABASE dbname
  encoding 'SQL_ASCII'
  lc_collate= 'C'
  lc_ctype='C'
  template=template0;

Tables created inside that database will accept non-zero bytes without checking if they correspond to characters in any encoding. String comparisons will be done byte-wise.

If the file happens to contains nul bytes too, you'll need to remove them by filtering it with tr -d '\0' or an equivalent command for your OS.

Daniel Vérité

Posted 2015-01-13T13:27:01.757

Reputation: 1 225