PowerShell analog of *nix grep across multiple files matched by a mask

3

1

Is there a way to optimize the PowerShell code below? It collects every line containing a given string from a large set of text files into a single output file:

$ErrorActionPreference = "Continue"
Start-Transcript -path D:\0xAC1CC07A.log -append
$OutFile = "D:\0xAC1CC07A.txt"
echo "filtering 0xAC1CC07A"
ForEach ($filenm in ((get-childitem -Path D:\FILES\* -include ubuntlive1mb_?????_201509*.txt -recurse -force))) 
{
 $filenm.fullName;
 (Get-Content $filenm) | select-string "0xAC1CC07A" | Add-Content $OutFile
}
Stop-Transcript

It does well on small workloads, but on 160K text files (over 200GB in total) it runs for more than 4 days on my Win2008R2 VM. Surprisingly, Ubuntu 14.04 on similar virtual hardware did the job within 4 hours:

grep --no-filename "0xac1cc07a" ./FILES/ubuntlive1mb_?????_201509*.txt >>./0xAC1CC07A.txt

Or more precisely:

find ./FILES -name "ubuntlive1mb_?????_201509*.txt" -type f -print0 | xargs -0 grep --no-filename "0xac1cc07a" $1 >>./0xAC1CC07A.txt

I am good at neither PowerShell nor *nix; all of the scripts above were put together by googling and copy-pasting.

The Windows box has had its file system tuned by disabling DOS file names and the directory update on listing. Ubuntu was just installed out of the box.

Anton Krouglov

Posted 2015-10-12T09:12:09.947

Reputation: 338

You'll have slightly different output (but that can be sorted afterwards, of course), but from what I've seen it's quite a bit faster to use something like Select-String "0xAC1CC07A" -Path $filenm.FullName instead of reading the contents first. – notjustme – 2015-10-12T10:19:13.133

notjustme: Sort order does not matter. From the log file it looks like the ForEach directory listing and filtering take most of the time. Maybe I have written it the wrong way? – Anton Krouglov – 2015-10-12T11:13:21.533

I meant 'sorted' as in manipulated to your liking. Listing files in PowerShell is notoriously slow. Did you understand my example? Replace your line (Get-Content $filenm) | select [etc.] with the one I suggested. If you're OK with the output you could add the | Add-Content $OutFile bit after. – notjustme – 2015-10-12T11:21:25.110

notjustme: Yes I did. I will try it and get back tomorrow with results. – Anton Krouglov – 2015-10-12T11:36:09.983

22 hours have passed and not a single file has been processed yet, as it is stuck at ForEach ($filenm in ((get-childitem -Path D:\FILES\* -include ubuntlive1mb_?????_2015090101*.txt -recurse -force))) – Anton Krouglov – 2015-10-13T10:10:44.507

As for the speed of Get-ChildItem, I believe something like Get-ChildItem -Path "D:\FILES" -Filter "ubuntlive1mb_?????_2015090101*.txt" -Recurse -Force should perform better than -Include and give the same result. – notjustme – 2015-10-13T11:49:23.790

notjustme: I will try it today. Can you please post your comments as an answer (so that you get the credit)? I am also thinking of using dir >file and then reading that file with PowerShell. – Anton Krouglov – 2015-10-14T09:18:50.923

notjustme: The -Filter flag does not seem to do the job, as it finds no files. I ran a test with no filtering: ForEach ($filenm in ((get-childitem -Path D:\FILES\* -force))) together with Select-String "0xAC1CC07A" -Path $filenm.FullName. It runs fast, but each output line carries a filename: prefix, which is not what I want. With grep the file names are removed by the --no-filename option. Is there a PowerShell analog for that? – Anton Krouglov – 2015-10-14T12:34:55.667

You noticed the difference in the path for -Filter, right? "D:\Files\*" vs "D:\Files". For the output (this is what I was talking about in my first comment) you'd have to do something like Select-String "0xAC1CC07A" -Path $filenm.FullName | Select-Object -ExpandProperty Line – notjustme – 2015-10-14T12:51:47.407

Answers

2

This very simple PowerShell script should do what you're looking for:

$OutFile = "D:\0xAC1CC07A.txt"
Get-ChildItem -Path D:\FILES\ubuntlive1mb_?????_201509*.txt -Recurse | ForEach-Object { Select-String -Path $_.FullName -Pattern "0xAC1CC07A" } | ForEach-Object { Add-Content -Path $OutFile -Value $_.Line }

This adds only the matched lines to the $OutFile text file. You could also get the file names or the line numbers of the matched lines by using the Filename, Path, and LineNumber properties instead of just the Line property.
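
A minimal sketch of how those properties could be used. The temp folder, the sample file name, and its contents are made up for illustration and are not from the original post:

```powershell
# Build a throwaway folder with one sample file (hypothetical names and data).
$dir = Join-Path ([System.IO.Path]::GetTempPath()) ("grepdemo_" + [guid]::NewGuid().ToString("N"))
New-Item -ItemType Directory -Path $dir | Out-Null
Set-Content -Path (Join-Path $dir "sample_00001_20150901.txt") -Value "first line", "id 0xAC1CC07A here", "last line"

# Select-String emits MatchInfo objects; format whichever properties you need.
$report = Get-ChildItem -Path (Join-Path $dir "sample_?????_201509*.txt") |
    ForEach-Object { Select-String -Path $_.FullName -Pattern "0xAC1CC07A" } |
    ForEach-Object { "{0}:{1}:{2}" -f $_.Filename, $_.LineNumber, $_.Line }

$report   # one string per match: file name, line number, matched text

# Clean up the demo folder.
Remove-Item -Path $dir -Recurse -Force
```

Swapping $_.Filename for $_.Path would give the full path instead of the bare file name.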

If you want to test a script which will run against many files, but don't want to wait for it to finish checking all of them, then you can use the Select-Object cmdlet to limit the number of files it will check.

Example:

Get-ChildItem -Path D:\FILES\ubuntlive1mb_?????_201509*.txt | Select-Object -First 100 | ForEach-Object { Select-String -Path $_.FullName -Pattern "0xAC1CC07A" } | ForEach-Object { Add-Content -Path $OutFile -Value $_.Line }

This will run the above script only against the first 100 text files that are returned from Get-ChildItem.

Matt

Posted 2015-10-12T09:12:09.947

Reputation: 331

It worked fast. – Anton Krouglov – 2015-10-19T06:53:39.917

2

You'll get slightly different output (which can be taken care of should there be a need), but from what I've seen it's quite a bit faster to run Select-String directly on the file instead of getting the file contents first.

Select-String "0xAC1CC07A" -Path $filenm.FullName | Add-Content $OutFile

Just remember to check the output before appending it to the file, so that you get it in the form you want.
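
As a rough sketch of one way to shape that output, Select-Object -ExpandProperty Line keeps only the matched text, much like grep --no-filename. The temp folder and sample file below are invented for the demo:

```powershell
# Throwaway folder, input file and output file (hypothetical names and data).
$dir = Join-Path ([System.IO.Path]::GetTempPath()) ("grepdemo_" + [guid]::NewGuid().ToString("N"))
New-Item -ItemType Directory -Path $dir | Out-Null
$file    = Join-Path $dir "sample.txt"
$outFile = Join-Path $dir "out.txt"
Set-Content -Path $file -Value "noise", "hit 0xAC1CC07A", "more noise"

# Search the file directly and keep only the text of each matching line.
Select-String -Pattern "0xAC1CC07A" -Path $file |
    Select-Object -ExpandProperty Line |
    Add-Content -Path $outFile

$result = Get-Content -Path $outFile
$result   # matched lines only, no file name or line number prefix

# Clean up the demo folder.
Remove-Item -Path $dir -Recurse -Force
```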

As for speed, Get-ChildItem is notoriously slow in PowerShell (since PowerShell likes to fetch objects rather than just a text representation of objects), and there are various workarounds for this.

The Get-ChildItem line in your code can be optimized, however. From what I've seen, using -Filter is roughly 3,5 times faster than using includes/excludes on a regular consumer-grade 7.2k HDD.

Get-ChildItem -Path "D:\FILES" -Filter "ubuntlive1mb_?????_2015090101*.txt" -Recurse -Force

If memory serves me right, earlier versions of PowerShell had some problems with -Filter: for example, if you wanted all files with the extension .htm, it would also pick up files with the extension .html (as if you had filtered on *.htm* and not *.htm), so you might wanna keep an eye out for that.
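
One possible guard against that, sketched under the assumption that a cheap re-check of the extension after -Filter is acceptable. The folder and the two page.* files are made up for the demo:

```powershell
# Throwaway folder with one .htm and one .html file (hypothetical names and data).
$dir = Join-Path ([System.IO.Path]::GetTempPath()) ("grepdemo_" + [guid]::NewGuid().ToString("N"))
New-Item -ItemType Directory -Path $dir | Out-Null
Set-Content -Path (Join-Path $dir "page.htm")  -Value "0xAC1CC07A in htm"
Set-Content -Path (Join-Path $dir "page.html") -Value "0xAC1CC07A in html"

# -Filter does the fast coarse selection; Where-Object drops anything it over-matched.
$lines = Get-ChildItem -Path $dir -Filter "*.htm" -Recurse -Force |
    Where-Object { $_.Extension -eq ".htm" } |
    Select-String -Pattern "0xAC1CC07A" |
    Select-Object -ExpandProperty Line

$lines   # only the line from page.htm

# Clean up the demo folder.
Remove-Item -Path $dir -Recurse -Force
```

The Where-Object step costs little compared with the directory scan, so it should not undo the speed gain from -Filter.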

notjustme

Posted 2015-10-12T09:12:09.947

Reputation: 213

@AntonKruglov Confusing when you say this is perfect but mark another answer as the solution. – notjustme – 2015-10-19T06:54:44.767

A bit misleading too, to be honest, since my Filter example is 3 to 5ish times faster than the other guy's example. – notjustme – 2015-10-19T12:45:54.203

notjustme: I have to be honest: you were the first to reply and your solution worked, but Matt's solution was a complete one and it did the job within a reasonable time frame, so I have selected his post as the answer. – Anton Krouglov – 2015-10-21T14:56:32.893

I gave you complete solutions. I guess it's a lil much to ask to have you replace one line in your original code. You want something fast, found something perfect and went with what is roughly 3-5 times slower. glhf! – notjustme – 2015-10-21T19:47:23.433