I have to remove large chunks of information from tables in .xlsx files using Excel 2010. Every method I have tried so far has produced no results: it either crashed, or I killed it after waiting for a day.
Somebody else created the spreadsheets from an SQL server. I have to analyse the data, but to do that I first have to remove all of the bad records and then visualise what remains. I have asked the server guy to filter on his end, but he says only one person in the company is competent to do that, and they're too busy. I can't comment on that because I don't know anything about databases.
There are approximately 500,000 records in a typical spreadsheet. I have tried removing all of the bad values manually using a filter, but my PC crashes when I delete the bad records.
I have converted the .xlsx files into .csv files because I think these are simpler to process, and although they do seem faster, Excel still crashes.
I've written a VBA script, which I've tried to leave running for several days to no avail:
Sub delete_bad_records()
    Dim not_good() As Variant
    Dim element As Variant
    Dim cell As Range

    ' Suspend recalculation and redraw while rows are being deleted
    Application.Calculation = xlCalculationManual
    Application.ScreenUpdating = False

    not_good = Array("example_value", "another one")

    For Each element In not_good
        ' Keep searching column C until no match for this value remains
        Do
            Set cell = Columns("C").Find(What:=element, LookAt:=xlWhole)
            If cell Is Nothing Then Exit Do
            ' Delete the whole row, not just the matching cell
            cell.EntireRow.Delete
        Loop
    Next element

    Application.Calculation = xlCalculationAutomatic
    Application.ScreenUpdating = True
End Sub
And I have also written a .bat file that uses findstr as a kind of filter; however, some records mysteriously disappear and I'm not sure why. The general format is:
findstr /v "keywords" "original.csv" > "filtered1.csv"
findstr /v /l "specific phrase1" "filtered1.csv" > "filtered2.csv"
findstr /v /l "specific phrase2" "filtered2.csv" > "filtered1.csv"
If you can actually load the file fine in Excel, save it as .csv, then open the .csv file in something like Notepad++ – LPChip – 2018-06-25T12:57:02.870
Personally I would use Powershell on the CSV. Something like
Import-CSV c:\Path\To\File.csv | ?{$arrayOfFilters -notContains $_.ColumnHeader} | Export-CSV c:\path\To\NewFile.csv -NoTypeInformation
– EBGreen – 2018-06-25T13:05:55.380
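Expanding EBGreen's one-liner into a longer sketch (the paths, the ColumnHeader name, and the filter values are this thread's placeholders, not real names):

# Hypothetical bad values; swap in the real ones.
$arrayOfFilters = @('example_value', 'another one')

# Stream the CSV and keep only rows whose ColumnHeader value is not in the filter list.
Import-Csv 'c:\Path\To\File.csv' |
    Where-Object { $arrayOfFilters -notcontains $_.ColumnHeader } |
    Export-Csv 'c:\path\To\NewFile.csv' -NoTypeInformation

Import-Csv feeds the pipeline row by row, so memory use stays modest even at ~500,000 records; -NoTypeInformation suppresses the type header PowerShell would otherwise prepend to the output file.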
It's a work machine and it would be very complicated to install another program like Notepad++. What would you suggest I do once I've opened it in Notepad++? – scc268 – 2018-06-25T13:06:41.453
I bet Powershell is already on it. :) – EBGreen – 2018-06-25T13:13:24.417
@EBGreen I've checked and yes I do. Can I point to a list of filters, e.g. a text file, instead of writing out the entire array of filters in PowerShell? – scc268 – 2018-06-25T13:17:22.520
Have you thought about importing the data into MS Access and then using the query builder to write a query? You can then pass the query back to the SQL guy who can make the necessary changes. – cybernard – 2018-06-25T13:34:42.957
@cybernard Unfortunately I don't have access to Access. If I could I would just try to write my own SQL query but I'm not allowed near the database because it's not my specific job... – scc268 – 2018-06-25T13:36:28.800
How about LibreOffice Base (a free download)? Then you can import your records into a local database over which you have full admin rights. – cybernard – 2018-06-25T13:47:46.717
@cybernard it's a work machine; I can't install any program I want – scc268 – 2018-06-25T13:52:52.830
Can't you use PowerShell to iterate over the CSV file, line by line, validate the line, and output valid lines to an output file? Invalid lines would simply be skipped, so your output file will then contain only your required clean data. Similar answer: https://stackoverflow.com/questions/33511772/read-file-line-by-line-in-powershell – Kinnectus – 2018-06-25T14:12:25.137
@Kinnectus you're right. Someone should have suggested that. – EBGreen – 2018-06-25T14:36:52.520
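A minimal sketch of Kinnectus's line-by-line idea, treating the CSV as plain text (the paths and phrases are placeholders taken from this thread):

# Read and write with .NET streams so the whole 500k-line file never sits in memory.
$reader = [System.IO.File]::OpenText('C:\Path\To\File.csv')
$writer = New-Object System.IO.StreamWriter 'C:\Path\To\Clean.csv'
try {
    while ($null -ne ($line = $reader.ReadLine())) {
        # Keep the line only if it matches none of the unwanted phrases (regex alternation).
        if ($line -notmatch 'specific phrase1|specific phrase2') {
            $writer.WriteLine($line)
        }
    }
}
finally {
    $reader.Close()
    $writer.Close()
}

Note that this matches anywhere on the raw line, so a phrase that happens to appear in an unexpected column will also remove that row; filtering on a parsed column, as in the Import-Csv pipeline above, avoids that.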
@scc268 - You can use a text file to list the filter values then do something like
$arrayOfFilters = Get-Content c:\path\to\filters.txt
– EBGreen – 2018-06-25T14:37:58.007
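Combined with the earlier Import-Csv pipeline, the file-driven version might look like this (assuming filters.txt holds one unwanted value per line; the paths and the ColumnHeader name remain placeholders):

# Each line of filters.txt becomes one element of the array.
$arrayOfFilters = Get-Content 'c:\path\to\filters.txt'

Import-Csv 'c:\Path\To\File.csv' |
    Where-Object { $arrayOfFilters -notcontains $_.ColumnHeader } |
    Export-Csv 'c:\path\To\NewFile.csv' -NoTypeInformation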
You seem competent enough with VBA that you might consider using VBScript to open the csv for binary access, read in lines, and then emit to a second file any records that you want to keep. You would not need to keep the whole file in memory, and if you do status updates only every nth record, then it should run "fairly fast", but it might still take a while. The biggest issue might be data types for currency, but I do not know the content. See for example: https://stackoverflow.com/questions/6060529/read-and-write-binary-file-in-vbscript – Yorik – 2018-06-25T15:57:27.447
I am also wondering if the row delete is not part of the problem, as perhaps it triggers a cascade of recalculation/link updates/redraw/housekeeping etc. What happens if you "blank the row" instead? If you export, does it remove the blanks? If not, then a simple "remove blank lines" filter would finish the job. – Yorik – 2018-06-25T16:00:41.963
Hi @Yorik, I did try blanking the rows; however, it still froze – scc268 – 2018-06-26T06:59:56.747
Your findstr commands are each searching for two separate strings; they are not searching for a phrase. That is clearly defined in the help for the FINDSTR command. If you want to search for a phrase then you need to use the /C option. You shouldn't need three FINDSTR commands to do what you are doing. You can string together search phrases: findstr /V /L /C:"phrase one" /C:"phrase two two" /C:"phrase three peat". You can also put all your search phrases in a file and use the /G option. – Squashman – 2018-06-26T18:24:32.903