Filtering and splitting a huge file based on a column (Windows)

2

I have about 2 TB worth of data files formatted like

12/20/2015 somerandomdata
12/20/2015 somerandomdata
12/20/2015 somerandomdata
12/20/2015 somerandomdata
12/21/2015 somerandomdata
12/21/2015 somerandomdata
12/21/2015 somerandomdata
12/21/2015 somerandomdata
12/22/2015 somerandomdata
12/22/2015 somerandomdata
12/22/2015 somerandomdata
12/22/2015 somerandomdata

and I want to pull out certain dates. For example, I might want to generate the files for 12/20/2015 and 12/22/2015.

12/20/2015 somerandomdata
12/20/2015 somerandomdata
12/20/2015 somerandomdata
12/20/2015 somerandomdata

and

12/22/2015 somerandomdata
12/22/2015 somerandomdata
12/22/2015 somerandomdata
12/22/2015 somerandomdata

I could easily do this with grep on Linux by running grep '12/20/2015' filein > fileout20 and grep '12/22/2015' filein > fileout22, but this approach has two problems.

First, and more importantly, it has to loop through the input file twice to generate the two outputs. With 2 TB of data and several dates per file, this is a significant problem. (Related: I also don't want solutions that split the file into every possible date, because I won't want the data from most dates, only about 10% of each input file.)

The second issue is that I need to run this on Windows. (I realize most Linux commands have a Windows equivalent via GnuWin32 or the like, so this is not as big an issue.)

Are there any ways that this could be done efficiently?

EDIT: The answers so far have one of two problems, so I'll clarify a little. The first problem is that I don't want to run through each input file more than once, so a loop that iterates over each of the dates will not work: with 200 dates and 8000 files, that would take 1,600,000 iterations.

The second problem is that I want the output of each input file split into one file per date.

So, with 200 dates and 8000 files, there should be up to 1,600,000 output files, but only 8000 runs of the search command.

EDIT 2: Here is a solution using Linux commands. I'll probably end up just using this unless someone has a better way.

grep -e '12/20/2015' -e '12/22/2015' filein1 > intermediate
awk '{d=$1; gsub("/","",d); print > (d ".out")}' intermediate

This is a two-stage process: the grep filters on the dates, and the awk splits the result by date (the gsub strips the slashes so each date can be used as a file name).
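For the Windows side, a rough single-pass equivalent in PowerShell might look like the sketch below. It is only a sketch: the paths, the date list, and the space-separated layout from the sample above are assumptions. Each input file is read once and every matching line goes straight to its date's output file, so no intermediate file is needed.

# Sketch only: one pass per input file, one StreamWriter per wanted date
$wantedDates = '12/20/2015', '12/22/2015'
$writers = @{}
foreach ($d in $wantedDates) {
    # e.g. 12/20/2015 -> C:\out\fileout_12202015.txt (output path is hypothetical)
    $writers[$d] = [System.IO.StreamWriter]::new("C:\out\fileout_" + $d.Replace('/', '') + ".txt")
}
foreach ($line in [System.IO.File]::ReadLines('C:\data\filein1.txt')) {
    $date = $line.Split(' ')[0]            # first space-separated column is the date
    if ($writers.ContainsKey($date)) {
        $writers[$date].WriteLine($line)   # write the whole line to that date's file
    }
}
$writers.Values | ForEach-Object { $_.Close() }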

Jay

Posted 2015-12-23T20:39:21.423

Reputation: 199

Answers

5

I don't want to run through each of the input files more than once.

Having a loop iterate through each of the dates will not work. I want to split each of the output files into one file per date.

Oh, why didn't you just say no loop iterations! . . .

PowerShell Solution

(Set your variables accordingly in the example below: the list of date strings, the output file path, and, if needed, the MMDDYYYY format used in the output file names.)

Select-String -Path "C:\Path\*.txt" -Pattern "12/20/2015","12/23/2015","12/30/2015" -AllMatches | ForEach-Object {
   # Use the matched date (slashes stripped) as the output file name
   $RS = $_.Matches[0].Groups[0].Value.Replace("/","")
   # Append the whole matching line to that date's file
   $_.Line | Out-File "C:\Path\$RS.txt" -Append
}

(The above is a PowerShell solution that searches the contents of all text files in a specific directory against a list of date strings. Each matching line is appended to an output file whose name is a valid, file-system-safe version of the matched date, so each date gets its own file.)
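If the list of dates is long, the same idea could presumably be fed from the DateList.txt file described further down rather than hard-coding the patterns; a sketch under that assumption:

# Sketch: read the date patterns from DateList.txt instead of hard-coding them
$dates = Get-Content "C:\Path\DateList.txt"
Select-String -Path "C:\Path\*.txt" -Pattern $dates -AllMatches | ForEach-Object {
    $RS = $_.Matches[0].Groups[0].Value.Replace("/","")
    $_.Line | Out-File "C:\Path\$RS.txt" -Append
}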


Here is a solution using Linux commands. I'll probably end up just using this

grep -e '12/20/2015' -e '12/22/2015' filein1 > intermediate

awk '{d=$1; gsub("/","",d); print > (d ".out")}' intermediate

This is a two-stage process: the grep filters on the dates, and the awk splits the result by date (the gsub strips the slashes so each date can be used as a file name).

So you're searching the contents of the text files in a certain directory against a list of strings. For every matching string found in each file, you need its lines written (and appended if applicable) to a file with a valid file name that's unique to that string, so no two different strings end up in the same output file.

As far as being a "better" way, that's really a matter of opinion depending on which statistics and resources you're measuring, I suppose. I wasn't aware you were primarily concerned with an optimally performant solution compared to the approach you were already leaning toward.

(I'll keep the batch script solution below in case anyone ever finds it useful, though.)


I want to pull out certain dates. For example, I might want to generate the files for 12/20/2015 and 12/22/2015.

From your comment:

However, unless I misunderstand findstr, there will only be one output file instead of one per date, which is not the behavior I am looking for.

This expands on the other solution so that each output file name is the original file name plus the MMDDYYYY (or YYYYMMDD, or any combination of the YYYY, MM, and DD parts) of the matching string, so the strings found in each file go to their own per-date output files, just as you explained.

Basically, you set the variables accordingly, save it as a batch script (.cmd), and run it. Most of the variables are the explicit full paths of the applicable directories and files.

Batch Script Variable Explanations

  • The SET FilePath= value is the full path of the directory containing the .TXT files you'll search.

  • The SET StringList= value is the full path and file name of a text file listing the strings to search for (e.g. 12/20/2015, 12/22/2015, etc.), one string per line (see the example below).

  • I'm going to assume the other variables and the rest of the batch script logic make sense; otherwise, let me know and I'll gladly add further bullets to clarify.


EXAMPLE BATCH SCRIPT

@ECHO ON

:: Directory containing the .TXT files to search, and the date list file
SET FilePath=C:\Path\<Location containing .TXT files to search>
SET StringList=C:\Path\DateList.txt

:: For every .txt file, check every date string in the list
FOR %%A IN ("%FilePath%\*.txt") DO FOR /F "TOKENS=*" %%B IN (%StringList%) DO CALL :FindConCat "%%~B" "%%~NFXA" "%%~NA" "%%~XA"
GOTO :EOF

:FindConCat
SET SearchStr=%~1
SET SearchFile=%~2
SET OutFName=%~3
SET FileExt=%~4
SET ParseStr=%SearchStr%
:::: --// MMDDYYYY format
SET ParseStr=%ParseStr:~0,2%%ParseStr:~3,2%%ParseStr:~6,4%
:::: --// YYYYMMDD format
:: SET ParseStr=%ParseStr:~6,4%%ParseStr:~0,2%%ParseStr:~3,2%

:: Append every line containing the date to <original name>_<date><ext>
FINDSTR /C:"%SearchStr%" "%SearchFile%">>"%FilePath%\%OutFName%_%ParseStr%%FileExt%"
GOTO :EOF

EXAMPLE STRING LIST FILE CONTENT (SET StringList=C:\Path\DateList.txt)

12/15/2015
12/22/2015
12/23/2015
12/24/2015

Pimp Juice IT

Posted 2015-12-23T20:39:21.423

Reputation: 29 425

I'm not too familiar with batch scripting, but I think I understand. It looks like a doubly nested for loop, so if I had 40 files and 30 dates, FINDSTR would run 1,200 times. I would like something that only runs FINDSTR (or something similar) 40 times; otherwise the script will take way too long. – Jay – 2015-12-24T21:18:49.970

I just edited the question to clarify. – Jay – 2015-12-24T21:27:54.270

Great! I tried to be clear in my original question, but I guess I wasn't. Thanks! – Jay – 2015-12-25T17:23:25.467

2

One-liner using PowerShell:

Get-Content c:\filein.txt | Where-Object { $_ -match "12/22/2015" } | Out-File c:\fileout22.txt
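If several dates are needed in one pass, the filter can take a regex alternation instead of a single date. This is only a sketch (the dates and paths are examples), and it still produces one combined output file rather than one file per date:

# Sketch: several dates matched in a single pass, written to one combined file
Get-Content c:\filein.txt |
    Where-Object { $_ -match "12/20/2015|12/22/2015" } |
    Out-File c:\fileout_filtered.txt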

bentek

Posted 2015-12-23T20:39:21.423

Reputation: 594

Unless I'm missing something, I will need to do this for each date that I need. So, if I have, say, 200 dates to pull out of each file, this will take 200 times as long as it should. – Jay – 2015-12-24T01:53:27.133

You can add more dates in the filter, or create a simple loop to do pretty much whatever you need. – bentek – 2015-12-24T15:49:25.653

But, a loop would take significantly longer. See my edit. – Jay – 2015-12-24T21:27:26.450

2

findstr /R /C:"12/2[02]/2015" filein > fileout

SΛLVΘ

Posted 2015-12-23T20:39:21.423

Reputation: 1 157

I don't think findstr will work very well with 2 TB of files. See this answer: What are the undocumented features and limitations of the Windows FINDSTR command?

– DavidPostill – 2015-12-23T21:16:23.107

There are about 8000 files, each about 0.25 GB, so it might not be an issue. – Jay – 2015-12-24T01:49:21.423

However, unless I misunderstand findstr, there will only be one output file instead of one per date, which is not the behavior I am looking for. – Jay – 2015-12-24T01:54:37.963