I have about 2 TB worth of data files formatted like
12/20/2015 somerandomdata
12/20/2015 somerandomdata
12/20/2015 somerandomdata
12/20/2015 somerandomdata
12/21/2015 somerandomdata
12/21/2015 somerandomdata
12/21/2015 somerandomdata
12/21/2015 somerandomdata
12/22/2015 somerandomdata
12/22/2015 somerandomdata
12/22/2015 somerandomdata
12/22/2015 somerandomdata
and I want to pull out certain dates. For example, I might want to generate the files for 12/20/2015 and 12/22/2015.
12/20/2015 somerandomdata
12/20/2015 somerandomdata
12/20/2015 somerandomdata
12/20/2015 somerandomdata
and
12/22/2015 somerandomdata
12/22/2015 somerandomdata
12/22/2015 somerandomdata
12/22/2015 somerandomdata
I could easily do this with grep on Linux:
grep '12/20/2015' filein > fileout20
grep '12/22/2015' filein > fileout22
but this has two problems.
First, and more importantly, it needs to loop through the input file twice to generate the two outputs. With 2 TB of data and several dates per file, this is a significant problem. (Related: I also don't want solutions that split the file into every possible date, because I only want the data from about 10% of the dates in each input file.)
The second issue is that I need to run this on Windows. (I realize most Linux commands have a Windows equivalent via GnuWin32 or the like, so this is not as big an issue.)
Are there any ways that this could be done efficiently?
EDIT: The answers so far have one of two problems, so I'll clarify a little bit. The first problem is that I don't want to run through each of the input files more than once. So, having a loop that iterates over each of the dates will not work: with 200 dates and 8000 files, that would take 1,600,000 iterations.
The second problem is that I want the output from each input file split into one file per date.
So, with 200 dates and 8000 files, there should be 1,600,000 files, but with only 8000 iterations of the searching command.
EDIT 2: here is a solution using Linux commands. I'll probably end up just using this unless someone has a better way.
grep -e '12/20/2015' -e '12/22/2015' filein1 > intermediate
awk '{d = $1; gsub("/", "-", d); print > (d ".out")}' intermediate
This is a two-stage process that first filters on the dates and then splits the result based on date.
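The two stages can also be fused into a single awk pass per input file, which skips the intermediate file entirely. Here is a minimal sketch, assuming whitespace-separated fields, that the wanted dates are listed one per line in a file (the names `dates.txt` and `filein1` are stand-ins for the real files); the sample data at the top only exists to make the example self-contained:

```shell
# Stand-ins for the real data: a small input file and the list of wanted dates.
printf '%s\n' '12/20/2015 a' '12/21/2015 b' '12/22/2015 c' > filein1
printf '%s\n' '12/20/2015' '12/22/2015' > dates.txt

# One pass per input file: while reading dates.txt (NR == FNR), load the wanted
# dates into an array; then route each matching line of filein1 to a per-date
# output file. Slashes can't appear in file names, so they become dashes.
awk 'NR == FNR { want[$1] = 1; next }
     $1 in want { d = $1; gsub("/", "-", d); print > (d ".out") }' dates.txt filein1
```

With gawk for Windows (from GnuWin32 or similar) the same command should work unchanged, and appending more input files after `filein1` keeps it at one pass per file. With many distinct dates, gawk is the safer choice since it manages its own pool of open output files.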
I'm not too familiar with batch scripting, but I think I understand. It looks like a double nested for loop. So, if I had 40 files and 30 dates, FINDSTR would be run 1200 times. I would like something that only runs FINDSTR or something similar 40 times, otherwise the script will take way too long. – Jay – 2015-12-24T21:18:49.970
I just edited the question to clarify. – Jay – 2015-12-24T21:27:54.270
Great! I tried to be clear in my original question, but I guess I wasn't. Thanks! – Jay – 2015-12-25T17:23:25.467