Using 'head' or 'tail' on HUGE text file - 19 GB

15

2

I have a problem with viewing chunks of a very large text file. This file, approximately 19 GB, is obviously too big to view by any traditional means.

I have tried head -n 1 and tail -n 1, with both commands piped together in various ways (to get at a piece in the middle), with no luck. My Linux machine running Ubuntu 9.10 cannot process this file.

How do I handle this file? My ultimate goal is to home in on lines 45000000 through 45000100.

nicorellius

Posted 2012-02-20T23:59:19.923

Reputation: 5 865

Thinking of writing a quick Python script to read the lines and print the ones I need to a file, but I can imagine this taking a long time... – nicorellius – 2012-02-21T00:04:17.017
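
For reference, the same scan-and-print idea fits in a single awk call; a minimal sketch, assuming the file is called bigfile and an arbitrary output name:

awk 'NR >= 45000000 && NR <= 45000100; NR > 45000100 { exit }' bigfile > savedlines

The exit clause stops the scan as soon as the range has been printed, which matters on a 19 GB file.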

Are all the lines the same length? – Paul – 2012-02-21T00:10:15.853

@Paul - unfortunately, they are not the same length. – nicorellius – 2012-02-21T00:15:07.173

You can try split to make the large file easier to work with.

– iglvzx – 2012-02-21T00:24:05.323

Ok. Any processing of a file that large will take time, so the answers below will help with that. If you want to extract just the part you are looking for and can estimate approximately where it is, you can use dd to get the bit you are after. For example, dd if=bigfile of=extractfile bs=1M skip=10240 count=5 will extract 5 MB from the file starting at the 10 GB point. – Paul – 2012-02-21T01:38:14.820

Yes, I agree with you Paul. I wrote a Python script and it definitely took forever to process the file. I have the sed job running now and I imagine it will take quite a while to complete. But testing with the beginning of the file appears promising. Thanks. – nicorellius – 2012-02-21T07:12:22.500

Answers

11

You should use sed.

sed -n -e 45000000,45000100p -e 45000101q bigfile > savedlines

This tells sed to print lines 45000000-45000100 inclusive, and to quit on line 45000101.
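
If you need a different range later, the same pattern can be parameterized; a small sketch, with START and END as shell variable names of my own choosing:

START=45000000
END=45000100
sed -n -e "${START},${END}p" -e "$((END + 1))q" bigfile > savedlines

The explicit q address is what keeps sed from reading the rest of the 19 GB file once the range has been printed.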

Kyle Jones

Posted 2012-02-20T23:59:19.923

Reputation: 5 706

It's still very slow, almost like head -45000000,45000100p bigfile | tail -100 > savedlines – Dmitry Polushkin – 2015-07-27T08:30:33.857

tail+|head is faster by a good 10-15%. – Erich – 2018-03-16T19:26:49.563

4

Create a MySQL database with a single table which has a single field. Then import your file into the database. This will make it very easy to look up a certain line.

I don't think anything else could be faster (given that head and tail already fail). In the end, any application that wants to find line n has to seek through the whole file until it has found n newlines. Without some sort of lookup (an index from line number to byte offset into the file), no better performance can be achieved.
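
To illustrate what such a lookup looks like without a database, you could record the byte offset of every millionth line in one pass and then seek straight to the nearest checkpoint. A rough sketch, with line_index.txt and B as placeholder names of my own (assumes one-byte \n line endings):

awk 'NR % 1000000 == 1 { print NR, byte } { byte += length($0) + 1 }' bigfile > line_index.txt
# if the index says line 45000001 starts at byte B (0-based), then:
tail -c +$((B + 1)) bigfile | head -n 100

The second command prints lines 45000001 through 45000100 without counting newlines from the start of the file again.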

Given how easy it is to create a MySQL database and import data into it, I feel like this is a viable approach.

Here is how to do it:

DROP DATABASE IF EXISTS helperDb;
CREATE DATABASE `helperDb`;
CREATE TABLE `helperDb`.`helperTable` ( `lineIndex` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT, `lineContent` MEDIUMTEXT, PRIMARY KEY (`lineIndex`) );
LOAD DATA INFILE '/tmp/my_large_file' INTO TABLE helperDb.helperTable (lineContent);
SELECT lineContent FROM helperDb.helperTable WHERE ( lineIndex >= 45000000 AND lineIndex <= 45000100 );

/tmp/my_large_file would be the file you want to read.

The correct syntax to import a file with tab-delimited values on each line is:

LOAD DATA INFILE '/tmp/my_large_file' INTO TABLE helperDb.helperTable FIELDS TERMINATED BY '\n' (lineContent);

Another major advantage of this approach is that if you decide later on to extract another set of lines, you don't have to wait hours for the processing again (unless you delete the database, of course).
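
For example, a later range can be dumped straight to a file from the shell; a sketch, with the range and output file name chosen arbitrarily:

mysql --batch --skip-column-names -e "SELECT lineContent FROM helperDb.helperTable WHERE lineIndex BETWEEN 70000000 AND 70000100" > other_lines.txt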

Der Hochstapler

Posted 2012-02-20T23:59:19.923

Reputation: 77 228

So this is a good solution, indeed. I got it to work with the sed command below, and identified my lines. But now I have a follow up question that the database method may be better suited for. I now need to delete a couple hundred lines from the file. – nicorellius – 2012-02-21T18:18:39.550

I'm sure sed could do that as well. Of course, if you had the data in the database it would be trivial to export a new file with just the lines you want. – Der Hochstapler – 2012-02-21T18:22:20.590

Thanks again. I took the sed answer (because it gave me more immediate pleasure ;--) but gave you an up-vote because I will use your method in the future. I appreciate it. – nicorellius – 2012-02-21T18:37:50.480

I attempted to use your SQL code above and it seemed to process, but then when I ran the query to view my lines, it just gave me the first column of each tab-delimited line. Each of the lines is tab-delimited. Is there any advice you could give me to get the whole lines into the table, as expected? – nicorellius – 2012-02-21T21:05:48.450


You could try adding a FIELDS TERMINATED BY '\n' to the LOAD DATA line.

– Der Hochstapler – 2012-02-21T22:35:41.117

OK, thanks. I'm not too familiar with this syntax, but I am getting an error when using this: LOAD DATA INFILE '/tmp/my_large_file' INTO TABLE helperTable (lineContent) FIELDS TERMINATED BY '\n'; I've searched around through the docs and nothing is popping out. Any thoughts? Sorry to bother you with this. – nicorellius – 2012-02-21T23:53:56.740

I'm sorry, there was a mistake in my code. I also added the correct syntax for your case (tested this time). – Der Hochstapler – 2012-02-22T00:20:33.263

Awesome - thanks - I will test this later today. Appreciate your help. – nicorellius – 2012-02-23T00:47:22.907

1

Two good old tools for big files are join and split. You can use split with the --lines=<number> option, which cuts the file into multiple files with a certain number of lines each.

For example, split --lines=45000000 huge_file.txt. The resulting parts would be named xaa, xab, etc. Then you can head the part xab, which would include the lines you wanted. You can also 'join' the parts back into a single big file.
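
A concrete sketch of that workflow, assuming GNU split defaults and an output prefix of my own choosing:

split --lines=45000000 huge_file.txt chunk_
tail -n 1 chunk_aa > savedlines      # line 45000000, the last line of the first part
head -n 100 chunk_ab >> savedlines   # lines 45000001 through 45000100
cat chunk_* > rejoined.txt           # puts the parts back together if needed

Note that this writes roughly another 19 GB to disk, so the sed or tail|head approaches are cheaper if you only need the one range.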

Anssi

Posted 2012-02-20T23:59:19.923

Reputation: 111

Awesome, thank you, I totally forgot about the split command. – siliconrockstar – 2017-10-30T21:15:20.350

0

You have the right tools but are using them incorrectly. As previously answered over at U&L, tail -n +X file | head -n Y (note the +) is 10-15% faster than sed for Y lines starting at X. And conveniently, you don't have to explicitly exit the process as with sed.

tail will read and discard the first X-1 lines (there's no way around that), then read and print the following lines. head will read and print the requested number of lines, then exit. When head exits, tail receives a SIGPIPE signal and dies, so it won't have read more than a buffer size's worth (typically a few kilobytes) of lines from the input file.
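
Applied to the range asked about here, that would be something along the lines of (101 lines covers 45000000 through 45000100 inclusive):

tail -n +45000000 bigfile | head -n 101 > savedlines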

Erich

Posted 2012-02-20T23:59:19.923

Reputation: 226