Create a MySQL database with a single table which has a single field. Then import your file into the database. This will make it very easy to look up a certain line.
I don't think anything else could be faster (if `head` and `tail` already fail). In the end, any application that wants to find line n has to scan through the whole file until it has found n newlines. Without some sort of lookup table (mapping line index to byte offset into the file), no better performance can be achieved.
Given how easy it is to create a MySQL database and import data into it, I feel like this is a viable approach.
Here is how to do it:
DROP DATABASE IF EXISTS helperDb;
CREATE DATABASE `helperDb`;
CREATE TABLE `helperDb`.`helperTable`( `lineIndex` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT, `lineContent` MEDIUMTEXT , PRIMARY KEY (`lineIndex`) );
LOAD DATA INFILE '/tmp/my_large_file' INTO TABLE helperDb.helperTable (lineContent);
SELECT lineContent FROM helperTable WHERE ( lineIndex > 45000000 AND lineIndex < 45000100 );
`/tmp/my_large_file` would be the file you want to read.
The correct syntax to import a file whose lines contain tab characters (so that the tabs are not treated as field separators, which is the default) is:
LOAD DATA INFILE '/tmp/my_large_file' INTO TABLE helperDb.helperTable FIELDS TERMINATED BY '\n' (lineContent);
Another major advantage of this approach is that if you later decide to extract another set of lines, you don't have to wait hours for the processing again (unless, of course, you delete the database).
Thinking of writing a quick Python script to read the lines and print the ones I need to file, but I can imagine this taking a long time... – nicorellius – 2012-02-21T00:04:17.017
Are all the lines the same length? – Paul – 2012-02-21T00:10:15.853
@Paul - unfortunately, they are not the same length. – nicorellius – 2012-02-21T00:15:07.173
You can try `split` to make the large file easier to work with. – iglvzx – 2012-02-21T00:24:05.323
Ok. Any processing of a file that large will take time, so the answers below will help that. If you want to extract just the part you are looking for and can estimate approximately where it is you can use `dd` to get the bit you are after. For example `dd if=bigfile of=extractfile bs=1M skip=10240 count=5` will extract 5MB from the file starting from the 10GB point. – Paul – 2012-02-21T01:38:14.820
Yes, I agree with you Paul. I wrote a Python script and it definitely took forever to process the file. I have the `sed` job running now and I imagine it will take quite a while to complete. But testing with the beginning of the file appears promising. Thanks. – nicorellius – 2012-02-21T07:12:22.500