Linux shell command to filter a text file by line length

Question

I have a 30gb disk image of a borked partition (think dd if=/dev/sda1 of=diskimage) that I need to recover some text files from. Data carving tools like foremost only work on files with well defined headers, i.e. not plain text files, so I've fallen back on my good friend strings.

strings diskimage > diskstrings.txt produced a 3gb text file containing a bunch of strings, mostly useless stuff, mixed in with the text that I actually want.

Most of the cruft tends to be really long, unbroken strings of gibberish. The stuff I'm interested in is guaranteed to be less than 16kb, so I'm going to filter the file by line length. Here's the Python script I'm using to do so:

# Copy only the lines shorter than 16 KiB (len() includes the trailing newline).
with open("infile.txt", "r") as infile, open("outfile.txt", "w") as outfile:
    for line in infile:
        if len(line) < 16384:
            outfile.write(line)

This works, but for future reference: Are there any magical one-line incantations (think awk, sed) that would filter a file by line length?

Janne Pikkarainen · Accepted Answer · 2012-01-31T08:38:22.333

33

awk '{ if (length($0) < 16384) print }' yourfile >your_output_file.txt

would print lines shorter than 16 kilobytes, as in your own example.

Or if you fancy Perl:

perl -nle 'if (length($_) < 16384) { print }' yourfile >your_output_file.txt
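
For experimenting, the threshold can be passed in as an awk variable instead of being hard-coded. This is only a minimal sketch: the file names, the `max` variable, and the tiny threshold of 20 are made up for the demo and are not part of the answer above.

```shell
# Demo only: sample data and a small threshold (hypothetical values).
workdir=$(mktemp -d)
{
  echo 'short line'
  awk 'BEGIN { for (i = 0; i < 50; i++) printf "x"; print "" }'   # one 50-character line
  echo 'another short one'
} > "$workdir/in.txt"

max=20
# Keep only lines shorter than $max characters; -v passes the shell value into awk.
awk -v max="$max" 'length($0) < max' "$workdir/in.txt" > "$workdir/out.txt"

cat "$workdir/out.txt"
```

For the real file, the same command with `max=16384` should behave like the one-liner above.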

edited Jan 31 '12 at 08:38

answered Jan 31 '12 at 08:29

Janne Pikkarainen


Well, that was embarrassingly simple. Thank you. :) – Li-aung Yip Jan 31 '12 at 08:32
Added also Perl version :-) – Janne Pikkarainen Jan 31 '12 at 08:38

And the awk script can be written as `awk 'length($0) < 16384' file > output`, as the default action is to print the line. – glenn jackman Jan 31 '12 at 16:19
Note that some preinstalled awk versions don't support diacritics – gfpacheco Feb 07 '22 at 21:14
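
If that comment is about multibyte characters (my reading; the commenter did not spell it out), the question is whether `length` counts bytes or characters, which varies between awk implementations and locales. A sketch that forces byte counting via `LC_ALL=C`, which is arguably the right unit for text carved out of a raw disk image anyway; the sample file is made up for the demo:

```shell
# 'café' encoded as UTF-8: 4 characters, 5 bytes (é is the two bytes \303\251).
sample=$(mktemp)
printf 'caf\303\251\n' > "$sample"

# In the C locale every byte counts as one character, so this reports bytes.
bytes=$(LC_ALL=C awk '{ print length($0) }' "$sample")
echo "bytes: $bytes"

# With the locale left alone, the result depends on the awk in use and the
# current locale (it may report characters or bytes).
awk '{ print length($0) }' "$sample"
```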

Dennis Williamson · Answer 2 · 2013-09-04T16:10:31.133

This is similar to Ansgar's answer, but slightly faster in my tests:

awk 'length($0) < 16384' infile >outfile

It's the same speed as the other awk answers. It relies on the implicit print of a true expression, but doesn't need to take the time to split the line as Ansgar's does.

Note that AWK gives you an if for free. The command above is equivalent to:

awk 'length($0) < 16384 {print}' infile >outfile

There's no explicit if (or its surrounding set of curly braces) as in some of the other answers.
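
A quick way to convince yourself that the three spellings are interchangeable is to run them on the same input and compare the results. A minimal sketch, with made-up sample data and the threshold shrunk to 10 for the demo:

```shell
workdir=$(mktemp -d)
printf '%s\n' 'tiny' 'a somewhat longer line of text' 'mid-size' > "$workdir/in.txt"

# Three spellings of the same filter.
awk '{ if (length($0) < 10) print }' "$workdir/in.txt" > "$workdir/a.txt"
awk 'length($0) < 10 { print }'      "$workdir/in.txt" > "$workdir/b.txt"
awk 'length($0) < 10'                "$workdir/in.txt" > "$workdir/c.txt"

# cmp is silent and exits 0 when two files are byte-for-byte identical.
cmp "$workdir/a.txt" "$workdir/b.txt" &&
  cmp "$workdir/b.txt" "$workdir/c.txt" &&
  echo 'all three agree'
```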

Here is a way to do it in sed:

sed '/.\{16384\}/d' infile >outfile

or:

sed -r '/.{16384}/d' infile >outfile

Both commands delete any line that contains 16384 or more characters.

For completeness, here's how you'd use sed to keep only the lines at or above your threshold (16384 characters or more) by deleting every line of 0 to 16383 characters:

sed '/^.\{0,16383\}$/d' infile >outfile
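
The two sed filters are complements of each other, so together they should split a file into "short" and "long" lines with nothing lost. A small sketch to check that, assuming a made-up input file and a demo threshold held in a hypothetical `limit` variable:

```shell
workdir=$(mktemp -d)
# Lines of 3, 10, 15 and 0 characters.
printf '%s\n' 'abc' 'abcdefghij' 'abcdefghijklmno' '' > "$workdir/in.txt"

limit=10
# Drop lines with $limit or more characters (keeps the short ones).
sed "/.\{$limit\}/d" "$workdir/in.txt" > "$workdir/short.txt"
# Drop lines with 0 to limit-1 characters (keeps the long ones).
sed "/^.\{0,$((limit - 1))\}\$/d" "$workdir/in.txt" > "$workdir/long.txt"

wc -l "$workdir/short.txt" "$workdir/long.txt"
```

A line of exactly `limit` characters lands in the "long" file, matching the `< 16384` convention used elsewhere in the thread.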

score 3 · Answer 3 · answered Jan 31 '12 at 09:29

Not really different from the answers already given, but shorter still:

awk -F '' 'NF < 16384' infile >outfile

answered Jan 31 '12 at 09:29

Ansgar Esztermann


Khaled · Answer 4 · 2012-01-31T08:47:39.840

2

You can use awk, for example:

$ awk '{ if (length($0) < 16384) { print } }' /path/to/text/file

This will print the lines shorter than 16K characters (16 * 1024).

You can use grep also:

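$ grep -x '.\{0,16384\}' /path/to/text/file

This will print the lines of at most 16K characters. The -x option makes the pattern match the whole line; without it (or explicit ^ and $ anchors) the pattern can match an empty substring, so every line would be printed.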
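
One detail worth checking with interval expressions in grep is anchoring. The sketch below (the sample data and the threshold of 3 are made up for the demo) compares an unanchored pattern with the same pattern under `-x`, which requires the whole line to match:

```shell
sample=$(mktemp)
printf '%s\n' 'ab' 'abcdef' > "$sample"

# Unanchored: the interval can match the empty string, so every line is selected.
unanchored=$(grep -c '.\{0,3\}' "$sample")
# -x requires the pattern to match the whole line, so only 'ab' is selected.
whole_line=$(grep -c -x '.\{0,3\}' "$sample")

echo "unanchored: $unanchored, whole-line: $whole_line"
```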

edited Jan 31 '12 at 08:47

answered Jan 31 '12 at 08:26

Khaled


Not sure `grep` is such a good idea - it's a simple regexp, to be sure, but more computationally expensive than `awk`. "A man with a problem says 'I'll use regular expressions!' Now he has two problems." ;) – Li-aung Yip Jan 31 '12 at 08:54
It is just another way of doing it. The first option I posted was using `awk`. – Khaled Jan 31 '12 at 08:58

+1 for the regexp, because it golfs better, and it does not make me read awk manpages =) – Ciro Santilli OurBigBook.com Mar 12 '14 at 21:29
