How to merge large .txt files of 10GB+ into one .txt file and remove duplicate lines from it the fastest way?

-1

1

Simple: my question is different because I need to merge the files into one first, and then remove the duplicate lines from that combined file, which will be over 50GB. I have large .txt files of 10GB+ each.

I want to merge them into one .txt file.

Then I want to remove all the duplicate lines from that one large combined .txt file, which will be around 50GB or 100GB.

So what can handle that kind of large file and remove the duplicates from it smoothly?

I need the fastest way, because I tried both Notepad++ and EmEditor and they become extremely slow when merging or removing duplicates; it takes forever.

I have 12GB of RAM.

DeathRival

Posted 2017-09-15T21:00:10.553

Reputation: 113

Question was closed 2017-09-29T07:53:55.383

Scripting is probably going to be the fastest, but do note that working with files this large means it is going to take a long time regardless. Therefore, finding the fastest method is really a matter of opinion. It's going to take more time to find the fastest method than to just get it done. – LPChip – 2017-09-15T21:08:51.757

That didn't help me, and I didn't understand any of it. My question is different since I'm dealing with much larger files: 10GB minimum, and they can go up to 100GB. I already have over 300GB of free space to work with. – DeathRival – 2017-09-15T21:09:39.753

OK, I have found a way from "How Does One Remove Duplicate Text Lines From Files Larger Than 4GB?", so you can delete my question if you want. What I found is: http://www.pilotedit.com/index.html. Thanks to whoever posted it.

– DeathRival – 2017-09-15T21:44:14.883

Ramhound already pointed you to a good answer. But let me add a few things. You can join several text files together using the copy command. Open a command prompt, use cd to move to the folder with your text files and then type copy file1.txt + file2.txt combined_file.txt. That will join both files and will take about 3 seconds per GB if you are working on an SSD. It will be slower on a hard disk. – SpiderPig – 2017-09-15T21:47:18.080
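For reference, a minimal sketch of that merge step as typed at a Windows command prompt; chaining more than two files with + and the /b (binary) flag are additions beyond the comment, and /b simply keeps copy from appending an end-of-file character to the combined text:

copy /b file1.txt + file2.txt + file3.txt combined_file.txt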

The sort -u command mentioned in the other thread is also very fast and can handle 0.1 GB per second. – SpiderPig – 2017-09-15T21:55:32.293
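A hedged sketch of how that scales: GNU sort does an external merge sort, spilling sorted chunks to temporary files on disk, so a 50GB or 100GB file can still be deduplicated with only 12GB of RAM. The output name, the 8G buffer size, and the temporary directory below are placeholders; -S sets the in-memory buffer and -T points the temporary files at a drive with enough free space (such as the 300GB mentioned above):

sort -u -S 8G -T /path/to/big_disk/tmp combined_file.txt > deduped.txt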

Answers

2

If you are using Linux, you could do it like this:

cat aa.txt bb.txt | sort -u > newfile.txt

Here aa.txt is the first text file and bb.txt the second one.

sort sorts the file alphabetically, and with -u (see also https://stackoverflow.com/a/9377125/7311363) it eliminates the duplicates. With > newfile.txt you write the result to newfile.txt.
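One follow-up note, not from the answer itself: sort accepts multiple input files directly, so the cat can be dropped, and for files in the 50GB to 100GB range you may want to tune where sort puts its temporary files and how much memory it uses. The buffer size, temporary directory, and thread count below are placeholder values, not anything the answer specifies:

sort -u -S 8G -T /path/to/big_disk/tmp --parallel=4 aa.txt bb.txt > newfile.txt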

chloesoe

Posted 2017-09-15T21:00:10.553

Reputation: 627

And if you aren't using Linux, boot with a LiveCD/DVD, mount your drive, and then do it. – ivanivan – 2017-09-16T00:19:06.313