Is there a way to optimize the PowerShell code below? (It collects lines containing a given string from a bunch of text files into a single file.)
$ErrorActionPreference = "Continue"
Start-Transcript -path D:\0xAC1CC07A.log -append
$OutFile = "D:\0xAC1CC07A.txt"
echo "filtering 0xAC1CC07A"
ForEach ($filenm in ((get-childitem -Path D:\FILES\* -include ubuntlive1mb_?????_201509*.txt -recurse -force)))
{
$filenm.fullName;
(Get-Content $filenm) | select-string "0xAC1CC07A" | Add-Content $OutFile
}
Stop-Transcript
It does well on small workloads, but on 160K text files (over 200GB in total) it runs for more than 4 days on my Win2008R2 VM. Surprisingly, Ubuntu 14.04 on similar virtual hardware did the job within 4 hours:
grep --no-filename "0xac1cc07a" ./FILES/ubuntlive1mb_?????_201509*.txt >>./0xAC1CC07A.txt
Or more precisely:
find ./FILES -name "ubuntlive1mb_?????_201509*.txt" -type f -print0 | xargs -0 grep --no-filename "0xac1cc07a" >>./0xAC1CC07A.txt
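The find/xargs pipeline above can be exercised end-to-end against throwaway data; the sketch below does that in a temp directory (the file names and hit lines are made up for the demo, and the pattern is uppercased to match the demo data, since grep, unlike PowerShell's Select-String, is case-sensitive by default):

```shell
#!/bin/sh
set -e

# Build a disposable tree that mimics the question's layout.
tmp=$(mktemp -d)
mkdir -p "$tmp/FILES"
printf 'junk\n0xAC1CC07A hit one\n' > "$tmp/FILES/ubuntlive1mb_00001_20150901.txt"
printf 'noise\n0xAC1CC07A hit two\n' > "$tmp/FILES/ubuntlive1mb_00002_20150901.txt"

# Same shape as the question's pipeline: find feeds NUL-separated paths
# to xargs; --no-filename (-h) suppresses the "file:" prefix in matches.
find "$tmp/FILES" -name 'ubuntlive1mb_?????_201509*.txt' -type f -print0 \
  | xargs -0 grep --no-filename "0xAC1CC07A" >> "$tmp/0xAC1CC07A.txt"

result=$(cat "$tmp/0xAC1CC07A.txt")
echo "$result"
rm -rf "$tmp"
```

Both matching lines land in the output file with no filename prefix, which is the behavior the PowerShell version is later asked to reproduce.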
I am not good at either PowerShell or *nix; all of the above scripts were created by googling and copy-pasting.
The Windows box has been file-system optimized by disabling DOS (8.3) file names and last-access updates on directory listing. Ubuntu was just installed out of the box.
You'll have slightly different output (but that can be sorted afterwards, of course), but from what I've seen it's quite a bit faster to use something like Select-String "0xAC1CC07A" -Path $filenm.FullName instead of reading the contents first. – notjustme – 2015-10-12T10:19:13.133

notjustme: Sort order does not matter. From the log file it looks like the for-each directory listing and filtering takes most of the time - maybe I have written it in a wrong way? – Anton Krouglov – 2015-10-12T11:13:21.533

I meant 'sorted' as in manipulated to your liking. Listing files in PowerShell is notoriously slow. Did you understand my example? Replace your line (Get-Content $filenm) | select [etc.] with the one I suggested. If you're OK with the output, you could add the | Add-Content $OutFile bit after. – notjustme – 2015-10-12T11:21:25.110

notjustme: Yes, I did. I will try it and get back tomorrow with results. – Anton Krouglov – 2015-10-12T11:36:09.983

22 hours have passed and still not a single file, as it is stuck at ForEach ($filenm in ((get-childitem -Path D:\FILES\* -include ubuntlive1mb_?????_2015090101*.txt -recurse -force))) – Anton Krouglov – 2015-10-13T10:10:44.507

As for speed with Get-ChildItem, I believe something like Get-ChildItem -Path "D:\FILES" -Filter "ubuntlive1mb_?????_2015090101*.txt" -Recurse -Force should perform better than -Include and get the same result. – notjustme – 2015-10-13T11:49:23.790

notjustme: I will try it today. Can you please post your comments as an answer (in order to get credit)? I am also thinking of using dir > file and then reading that file from PowerShell. – Anton Krouglov – 2015-10-14T09:18:50.923

notjustme: The -Filter flag does not seem to do the job, as it finds no files. I have done a test with no filtering: ForEach ($filenm in ((get-childitem -Path D:\FILES\* -force))) and with Select-String "0xAC1CC07A" -Path $filenm.FullName. It does work fast, but the output lines contain a "filename:" prefix, which is not good. With grep, filenames are removed by using the --no-filename option. Is there a PowerShell analog for that? – Anton Krouglov – 2015-10-14T12:34:55.667

You noticed the difference in the path for -Filter, right? "D:\Files\*" vs "D:\Files". For the output (this is what I was talking about in my first comment) you'd have to do something like Select-String "0xAC1CC07A" -Path $filenm.FullName | Select-Object -ExpandProperty Line – notjustme – 2015-10-14T12:51:47.407
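Pulling the suggestions from this thread together, the rewritten loop might look like the sketch below. It is untested on the original data set and keeps the question's paths and pattern; the structure (-Filter without the trailing \*, Select-String reading files directly, -ExpandProperty Line to drop the prefix) is just what the comments propose, assembled in one place:

```powershell
$OutFile = "D:\0xAC1CC07A.txt"

# -Filter is evaluated by the file-system provider, so it is much cheaper
# than -Include; note the path is "D:\FILES" with no trailing \*.
Get-ChildItem -Path "D:\FILES" -Filter "ubuntlive1mb_?????_201509*.txt" -Recurse -Force |
    ForEach-Object {
        $_.FullName   # progress output, as in the original script

        # Select-String reads the file itself (no Get-Content round trip);
        # -ExpandProperty Line strips the "file:line:" prefix, mimicking
        # grep --no-filename.
        Select-String -Pattern "0xAC1CC07A" -Path $_.FullName |
            Select-Object -ExpandProperty Line |
            Add-Content $OutFile
    }
```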