I'm trying to load Crackstation's 1.5-billion-entry wordlist (a 15 GB text file) into a database table. So far I've had no luck with either SQLite or Postgres from the command line.
Both complain because the file appears to have binary values scattered throughout it; the data doesn't seem to be pure UTF-8.
UltraEdit detects it as UTF-8.
However, the Linux file command says:
$ file crackstation.txt -I
crackstation.txt: application/octet-stream; charset=binary
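In case it helps, I believe GNU grep can at least locate the bad lines: in a UTF-8 locale, '.' only matches valid characters, so inverting a whole-line match should print just the lines containing invalid sequences (I haven't tested this on a file this size):

$ LC_ALL=en_US.UTF-8 grep -naxv '.*' crackstation.txt | head

(-n prints line numbers, -a treats the binary-looking file as text, -x anchors the match to whole lines, -v inverts it.)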
How can I get this data into Postgres or SQLite so I can analyze/query it?
Is there a way to cleanse/convert the data first so the insert succeeds, or a different way to run the import so that it copes with the bad bytes?
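For the cleansing route, I'm wondering whether iconv would do it; my understanding is that -c silently discards input sequences that aren't valid UTF-8, though I haven't tried it and I'm not sure whether dropping bytes out of the middle of a password matters for my analysis:

$ iconv -f UTF-8 -t UTF-8 -c crackstation.txt > crackstation-clean.txt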
Here's what I tried:
SQLite:
sqlite> .mode csv
sqlite> .separator "\n"
sqlite> .import crackstation.txt passwords
The response to this is millions of unescaped " character messages, and only one row ends up in the table (which happens to be an empty string).
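A variation I haven't tried yet: my understanding is that .import only applies CSV quote handling in csv mode, and that in ascii mode it splits on the separators literally, so something like the following might sidestep the unescaped-quote errors (a sketch, assuming no password contains a tab character):

sqlite> .mode ascii
sqlite> .separator "\t" "\n"
sqlite> .import crackstation.txt passwords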
Postgres:
psql> COPY passwords FROM 'crackstation.txt' WITH (FORMAT TEXT, NULL '', ENCODING 'UTF-8');
The response to this is:
ERROR: invalid byte sequence for encoding "UTF8": 0xf6 ...
along with the line number it was found on. Thing is, the file is 15 GB, so all I've managed so far is to open it in UltraEdit (painfully slowly), jump to that line, delete it, re-run the psql command, and repeat... hence this question.
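If pre-filtering is the way to go, I'm guessing I could pipe the file through iconv and into psql's client-side \copy so the invalid bytes never reach the server; a sketch, untested on the full file (and I realize FORMAT TEXT will still interpret backslashes inside the passwords):

$ iconv -f UTF-8 -t UTF-8 -c crackstation.txt | psql -d mydb -c "\copy passwords FROM STDIN WITH (FORMAT TEXT)"

(mydb is a placeholder for whatever database holds the passwords table.)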