Remove duplicate lines after the first word in a sentence (Linux terminal)

2

I'm looking to remove duplicate lines of text, comparing only the first word on each line (everything before the first space), in this format:

apples blue
apples green
apples are sometimes red
pairs green
pairs black
potato brown
lemon
...

Anything after the first space on each line would be disregarded, and then the duplicates removed.

I would end up with:

apples
pairs
potato
lemon

I was hoping this could be done in a Linux terminal, something like:

command file_in.txt single_sout.txt

Thanks guys!

mark

Posted 2014-09-19T22:42:18.840

Reputation: 33

Answers

2

Try this one-liner:

cut -d " " -f 1 file_in.txt | uniq > file_out.txt

First, cut splits each line on the delimiter given with -d " " and keeps the first field with -f 1, reading from file_in.txt. The result is piped into uniq, which removes adjacent duplicate lines; that is enough when lines sharing a first word are grouped together, as in your example. Finally, the output is redirected to file_out.txt.
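For input where the same first word can reappear later in the file, a variant may be needed. The sketch below is not part of the original answer: it builds a small sample file (the question's data plus one extra line, apples yellow, added here purely for illustration) and compares the cut | uniq pipeline with two common alternatives, awk and sort -u. The temporary directory and the output file names are assumptions made for the demo.

```shell
# Demo directory and sample input; 'apples yellow' is an extra line added
# for illustration so that a duplicate appears on a non-adjacent line.
workdir=$(mktemp -d)
printf '%s\n' 'apples blue' 'apples green' 'apples are sometimes red' \
  'pairs green' 'pairs black' 'potato brown' 'lemon' 'apples yellow' \
  > "$workdir/file_in.txt"

# cut + uniq: only collapses duplicates that sit on neighbouring lines.
cut -d " " -f 1 "$workdir/file_in.txt" | uniq > "$workdir/file_out_uniq.txt"

# awk: print the first field only the first time it is seen, keeping input order.
awk '!seen[$1]++ { print $1 }' "$workdir/file_in.txt" > "$workdir/file_out_awk.txt"

# sort -u: removes every duplicate, but the output is sorted, not in input order.
cut -d " " -f 1 "$workdir/file_in.txt" | sort -u > "$workdir/file_out_sorted.txt"
```

With this sample, the cut | uniq output ends with a second apples line, while the awk output is apples, pairs, potato, lemon. Note that awk splits on runs of spaces and tabs by default, so it can behave differently from cut -d " " on lines that contain tabs or leading spaces.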

Adam

Posted 2014-09-19T22:42:18.840

Reputation: 1 510

Will this work on large files, GBs in size? Thanks for the reply, also. – mark – 2014-09-19T23:40:46.617

For very large files it might be worth doing it in two steps: first cut -d " " -f 1 file_in.txt > file_tmp.txt, and then uniq file_tmp.txt > file_out.txt. That will help narrow down the issue if something fails. I don't know of any file size restrictions for either cut or uniq, so the only real way to find out would be to test it. Running the commands is non-destructive though, so giving it a shot won't hurt. – Adam – 2014-09-21T16:08:14.950
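The two-step approach from the comment above can be wrapped so that it is called in the command file_in.txt file_out.txt form the question asks for. This is only a sketch: first_words is a hypothetical function name, not a standard command, and the error messages are illustrative. Both cut and uniq read their input line by line, so memory use should not grow with file size, though that is worth confirming on your own data.

```shell
# first_words is a hypothetical helper name, not a standard command.
# Usage: first_words file_in.txt file_out.txt
first_words() {
  in_file=$1
  out_file=$2
  tmp_file=$(mktemp) || return 1

  # Step 1: keep only the text before the first space on each line.
  if ! cut -d " " -f 1 "$in_file" > "$tmp_file"; then
    echo "cut failed on $in_file" >&2
    rm -f "$tmp_file"
    return 1
  fi

  # Step 2: drop repeated lines; uniq only compares neighbouring lines.
  if ! uniq "$tmp_file" > "$out_file"; then
    echo "uniq failed while writing $out_file" >&2
    rm -f "$tmp_file"
    return 1
  fi

  rm -f "$tmp_file"
}
```

Keeping the two steps separate means a failure message points at the step that broke, at the cost of writing one intermediate file roughly the size of the first column.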