I have a very large file (~400 GB), and I need to remove the last 2 lines from it. I tried to use sed, but it ran for hours before I gave up. Is there a quick way of doing this, or am I stuck with sed?
I haven't tried this on a large file to see how fast it is, but it should be fairly quick.
To use the script to remove lines from the end of a file:
./shorten.py 2 large_file.txt
It seeks to the end of the file, checks that the last character is a newline, then reads one character at a time going backwards until it has found one more newline than the number of lines to be removed, and truncates the file just after that point. The change is made in place.
Edit: I've added a Python 2.4 version at the bottom.
Here is a version for Python 2.5/2.6:
#!/usr/bin/env python2.5
# also tested with Python 2.6
from __future__ import with_statement

import os, sys

if len(sys.argv) != 3:
    print sys.argv[0] + ": Invalid number of arguments."
    print "Usage: " + sys.argv[0] + " linecount filename"
    print "to remove linecount lines from the end of the file"
    sys.exit(2)

number = int(sys.argv[1])
file = sys.argv[2]
count = 0

with open(file, 'r+b') as f:
    f.seek(0, os.SEEK_END)
    end = f.tell()
    while f.tell() > 0:
        f.seek(-1, os.SEEK_CUR)
        char = f.read(1)
        if char != '\n' and f.tell() == end:
            print "No change: file does not end with a newline"
            sys.exit(1)
        if char == '\n':
            count += 1
            if count == number + 1:
                f.truncate()
                print "Removed " + str(number) + " lines from end of file"
                sys.exit(0)
        f.seek(-1, os.SEEK_CUR)

if count < number + 1:
    print "No change: requested removal would leave empty file"
    sys.exit(3)
Here's a Python 3 version:
#!/usr/bin/env python3.0

import os, sys

if len(sys.argv) != 3:
    print(sys.argv[0] + ": Invalid number of arguments.")
    print("Usage: " + sys.argv[0] + " linecount filename")
    print("to remove linecount lines from the end of the file")
    sys.exit(2)

number = int(sys.argv[1])
file = sys.argv[2]
count = 0

with open(file, 'r+b', buffering=0) as f:
    f.seek(0, os.SEEK_END)
    end = f.tell()
    while f.tell() > 0:
        f.seek(-1, os.SEEK_CUR)
        char = f.read(1)
        if char != b'\n' and f.tell() == end:
            print("No change: file does not end with a newline")
            sys.exit(1)
        if char == b'\n':
            count += 1
            if count == number + 1:
                f.truncate()
                print("Removed " + str(number) + " lines from end of file")
                sys.exit(0)
        f.seek(-1, os.SEEK_CUR)

if count < number + 1:
    print("No change: requested removal would leave empty file")
    sys.exit(3)
Here is a Python 2.4 version:
#!/usr/bin/env python2.4

import sys

if len(sys.argv) != 3:
    print sys.argv[0] + ": Invalid number of arguments."
    print "Usage: " + sys.argv[0] + " linecount filename"
    print "to remove linecount lines from the end of the file"
    sys.exit(2)

number = int(sys.argv[1])
file = sys.argv[2]
count = 0

# os.SEEK_CUR and os.SEEK_END were only added in Python 2.5
SEEK_CUR = 1
SEEK_END = 2

f = open(file, 'r+b')
f.seek(0, SEEK_END)
end = f.tell()
while f.tell() > 0:
    f.seek(-1, SEEK_CUR)
    char = f.read(1)
    if char != '\n' and f.tell() == end:
        print "No change: file does not end with a newline"
        f.close()
        sys.exit(1)
    if char == '\n':
        count += 1
        if count == number + 1:
            f.truncate()
            print "Removed " + str(number) + " lines from end of file"
            f.close()
            sys.exit(0)
    f.seek(-1, SEEK_CUR)
if count < number + 1:
    print "No change: requested removal would leave empty file"
    f.close()
    sys.exit(3)
Our system is running Python 2.4, and I'm not sure if any of our services rely on it. Will this work with that? – Russ Bradberry – 2010-04-06T13:28:01.677
@Russ: I've added a version for Python 2.4. – Paused until further notice. – 2010-04-06T14:38:12.630
Absolutely amazing! Worked like a charm, and in less than a second! – Russ Bradberry – 2010-04-06T14:49:42.250
You can try GNU head:
head -n -2 file
It is the best solution since it is simple. – xiao – 2012-01-16T02:48:04.200
This will show him the last two lines of the file, but not remove them from his file... and it doesn't even work on my system: head: illegal line count -- -2 – SooDesuNe – 2012-04-14T23:07:10.287
2@SooDesuNe: No it will print all lines from the beginning to 2 lines from the end, as per the manual. However, this would need to be redirected to a file, and then there is the problem with this file being giant, so it's not the perfect solution for this problem. – Daniel Andersson – 2012-04-25T21:38:00.103
+1 Why isn't this being accepted as the correct answer? It's fast, simple and does work as expected. – aefxx – 2012-09-12T12:27:06.400
@DanielAndersson Why not? You can head -n -2 file > output... – mreq – 2013-03-03T19:04:01.613
@PetrMarek and others: The problem was that it concerned a giant file. This solution would require the entire file to be fed through a pipe, rewriting all data to a new location, and the whole point of the question is to avoid that. An in-place solution is needed, such as the one in the accepted answer. – Daniel Andersson – 2013-03-04T07:31:33.127
I see my Debian Squeeze/testing systems (but not Lenny/stable) include a "truncate" command as part of the "coreutils" package.
With it you could simply do something like
truncate --size=-160 myfile
to remove 160 bytes from the end of the file (obviously you need to figure out exactly how many characters you need to remove).
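To figure out the byte count, one common trick is to measure the last two lines with tail and wc. A minimal sketch, assuming GNU coreutils' truncate and a hypothetical demo file standing in for the real one:

```shell
# Hypothetical demo file; substitute your own large file.
printf 'keep1\nkeep2\nbad1\nbad2\n' > demo.txt
# Bytes occupied by the last 2 lines (newlines included).
bytes=$(tail -n 2 demo.txt | wc -c)
# Chop exactly that many bytes off the end, in place -- no data is copied.
truncate --size=-"$bytes" demo.txt
```

Since tail seeks from the end of a regular file rather than reading it from the start, this stays fast even on a very large file.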
This will be the fastest route, since it modifies the file in place and therefore requires neither copying nor parsing the file. However, you'll still need to work out how many bytes to remove... I guess that a simple dd script will do that (you need to specify an input offset to get, for example, the last kilobyte, and then use tail -2 | LANG= wc -c, or something like that). – liori – 2010-04-06T13:11:54.907
I'm using CentOS, so no I do not have truncate. However, this is exactly what I am looking for. – Russ Bradberry – 2010-04-06T13:37:46.143
tail is efficient for large files, too -- you can use tail | wc -c to compute the number of bytes to be trimmed. – krlmlr – 2016-09-19T09:33:53.930
The problem with sed is that it is a stream editor -- it will process the entire file even if you only want to make modifications near the end. So no matter what, you are creating a new 400GB file, line by line. Any editor that operates on the whole file will probably have this problem.
If you know the number of lines, you could use head, but again this creates a new file instead of altering the existing one in place. You might get speed gains from the simplicity of the action, I guess.
You might have better luck using split to break the file into smaller pieces, editing the last one, and then using cat to combine them again, but I'm not sure if it will be any better. I would use byte counts rather than lines; otherwise it will probably be no faster at all -- you're still going to be creating a new 400 GB file.
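For completeness, the head route mentioned above might look like the following sketch (hedged: it needs GNU head for the negative -n, and the whole file still gets rewritten, so expect it to be slow at 400 GB):

```shell
# Hypothetical demo file standing in for the real 400 GB one.
printf 'keep\nbad1\nbad2\n' > original.txt
# Copy everything except the last 2 lines, then swap the files.
head -n -2 original.txt > trimmed.txt
mv trimmed.txt original.txt
```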
Try Vim... I'm not sure if it will do the trick or not, as I've never used it on such a big file, but I've used it on smaller large files in the past. Give it a try.
I do believe Vim only loads what's immediately around the buffer when editing; however, I've no idea how it saves. – Phoshi – 2010-04-05T23:30:10.930
vim hangs while it tries to load the file – Russ Bradberry – 2010-04-05T23:38:20.223
Well, if it hangs, wait for it. Start it loading, go to work, come home, see if it is done. – leeand00 – 2010-04-06T00:35:20.270
See this: http://stackoverflow.com/questions/159521/text-editor-to-open-big-giant-huge-large-text-files – leeand00 – 2010-04-06T00:38:59.427
What kind of file, and in what format? It may be easier to use something like Perl, depending on what kind of file it is - text, graphics, binary? How is it formatted - CSV, TSV...
It is pipe-delimited text; however, the last 2 lines are one column each, which will break my import, so I need them removed – Russ Bradberry – 2010-04-05T23:39:40.767
Is fixing whatever does the "import" to deal with this case an option? – timday – 2010-04-06T12:22:47.037
No, the import is Infobright's "load data infile" – Russ Bradberry – 2010-04-06T13:11:31.037
If you know the size of the file to the byte (400000000160 say) and you know that you need to remove exactly 160 characters to strip the last two lines, then something like
dd if=originalfile of=truncatedfile ibs=1 count=400000000000
should do the trick. It's been ages since I used dd in anger, though; I seem to remember things go faster with a bigger block size, but whether you can use one depends on whether the lines you want to drop fall at a nice multiple of the block size.
dd has some other options to pad text records out to a fixed size which might be useful as a preliminary pass.
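One way around the ibs=1 slowdown is GNU dd's iflag=count_bytes, which lets count be interpreted as a byte total while bs stays large. This is an assumption-laden sketch (GNU dd only, shown on a hypothetical small demo file):

```shell
# Hypothetical demo: keep the first 5 bytes ('keep\n'), drop the last 2 lines.
printf 'keep\nbad1\nbad2\n' > originalfile
# bs stays large; count is in bytes thanks to iflag=count_bytes.
dd if=originalfile of=truncatedfile bs=1M count=5 iflag=count_bytes 2>/dev/null
```

For the real file you would use the computed byte count (e.g. count=400000000000 in the example above).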
I tried this, but it was going about the same speed as sed. It had written approx 200 MB in 10 minutes; at this rate it would literally take hundreds of hours to complete. – Russ Bradberry – 2010-04-06T13:35:33.777
If "truncate" command isn't available on your system (see my other answer), look at the "man 2 truncate" for the system call to truncate a file to a specified length.
Obviously you need to know how many characters you need to truncate the file to (size minus the length of the problem two lines; don't forget to count any cr/lf characters).
And make a backup of the file before you try this!
If you prefer unix-style solutions, you can have safe and interactive line truncation using three lines of code (tested on Mac and Linux).
small + safe unix-style line truncation (asks for confirmation):
n=2; file=test.csv; tail -n $n $file &&
read -p "truncate? (y/N)" -n1 key && [ "$key" == "y" ] &&
perl -e "truncate('$file', `wc -c <$file` - `tail -n $n $file | wc -c` )"
This solution relies on a few common unix tools, but still uses perl -e "truncate(file, length)" as the closest replacement for truncate(1), which is not available on all systems.
You may also use the following comprehensive reusable shell program, which provides a usage info and features truncation confirmation, option parsing, and error handling.
comprehensive line truncation script:
#!/usr/bin/env bash

usage(){
cat <<-EOF
Usage: $0 [-n NUM] [-h] FILE
Options:
  -n NUM  number of lines to remove (default:1) from end of FILE
  -h      show this help
EOF
exit 1
}

num=1
for opt in $*; do case $opt in
    -n) num=$2; shift;;
    -h) usage; break;;
    *)  [ -f "$1" ] && file=$1; shift;;
esac done
[ -f "$file" ] || usage

bytes=`wc -c <$file`
size=`tail -n $num $file | wc -c`

echo "using perl 'truncate' to remove last $size of $bytes bytes:"
tail -n $num $file
read -p "truncate these lines? (y/N)" -n1 key && [ "$key" == "y" ] &&
    perl -e "truncate('$file', $bytes - $size )"; echo ""
echo "new tail is:"; tail $file
Here is a usage example:
$ cat data/test.csv
1 nice data
2 cool data
3 just data
GARBAGE to be removed (incl. empty lines above and below)
$ ./rmtail.sh -n 3 data/test.csv
using perl 'truncate' to remove last 60 of 96 bytes:
GARBAGE to be removed (incl. empty lines above and below)
truncate these lines? (y/N)y
new tail is:
1 nice data
2 cool data
3 just data
$ cat data/test.csv
1 nice data
2 cool data
3 just data
You can use Vim in Ex mode:
ex -sc '-,d|x' file
-,   select last 2 lines
d    delete
x    save and close
#!/bin/sh
ed "$1" << HERE
$
d
d
w
HERE
Changes are made in place. This is simpler and more efficient than the Python script.
On my system, using a text file consisting of a million lines and over 57 MB, ed took 100 times as long to execute as my Python script. I can only imagine how much greater the difference would be for the OP's file, which is 7000 times bigger. – Paused until further notice. – 2010-04-07T04:48:44.627
Modified the accepted answer to solve a similar problem. It could be tweaked a little to remove n lines.
import os

def clean_up_last_line(file_path):
    """
    Clean up the last incomplete line from a file.

    Helps with an unclean shutdown of a program that appends to a file:
    if \n is not the last character, remove the partial line.
    """
    with open(file_path, 'r+b') as f:
        f.seek(0, os.SEEK_END)
        while f.tell() > 0:  # current position is greater than zero
            f.seek(-1, os.SEEK_CUR)
            if f.read(1) == '\n':
                f.truncate()
                break
            # read(1) advanced the position by one byte, so seek back
            # again to keep scanning backwards
            f.seek(-1, os.SEEK_CUR)
And the corresponding test:
import os
import unittest

import utils  # the module providing clean_up_last_line

class CommonUtilsTest(unittest.TestCase):

    def test_clean_up_last_line(self):
        """
        Remove the last incomplete line from a huge file.
        A line is incomplete if it does not end with a line feed.
        """
        file_path = '/tmp/test_remove_last_line.txt'

        def compare_output(file_path, file_data, expected_output):
            """
            Run the same test on each input/output pair.
            """
            with open(file_path, 'w') as f:
                f.write(file_data)
            utils.clean_up_last_line(file_path)
            with open(file_path, 'r') as f:
                file_data = f.read()
            self.assertTrue(file_data == expected_output, file_data)

        ## test a multiline file
        file_data = """1362358424445914,2013-03-03 16:53:44,34.5,151.16345879,b
1362358458954466,2013-03-03 16:54:18,34.5,3.0,b
1362358630923094,2013-03-03 16:57:10,34.5,50.0,b
136235"""
        expected_output = """1362358424445914,2013-03-03 16:53:44,34.5,151.16345879,b
1362358458954466,2013-03-03 16:54:18,34.5,3.0,b
1362358630923094,2013-03-03 16:57:10,34.5,50.0,b
"""
        compare_output(file_path, file_data, expected_output)

        ## test a file with no line break
        file_data = u"""1362358424445914,2013-03-03 16:53:44,34.5,151.16345879,b"""
        expected_output = "1362358424445914,2013-03-03 16:53:44,34.5,151.16345879,b"
        compare_output(file_path, file_data, expected_output)

        ## test a file with a leading line break
        file_data = u"""\n1362358424445914,2013-03-03 16:53:44,34.5,151.16345879,b"""
        expected_output = "\n"
        compare_output(file_path, file_data, expected_output)

        ## test a file with one line break
        file_data = u"""1362358424445914,2013-03-03 16:53:44,34.5,151.16345879,b\n"""
        expected_output = """1362358424445914,2013-03-03 16:53:44,34.5,151.16345879,b\n"""
        compare_output(file_path, file_data, expected_output)

        os.remove(file_path)

if __name__ == '__main__':
    unittest.main()
You can give GNU head a try: head -n -2 file – user31894 – 2010-04-06T05:54:44.460
There were a couple of one-line Perl and Java suggestions given in http://stackoverflow.com/questions/2580335/matlab-delete-elements-of-binary-files-without-loading-entire-file – mtrw – 2010-04-07T01:02:12.343