How do I clean ~500,000 records in Excel 2010 without crashing my PC?

3

I have to remove large chunks of information from tables in .xlsx files using Excel 2010. Every method I have tried so far has produced no results: Excel either crashes, or I have killed the process after leaving it running for a day.

Somebody else created the spreadsheets using an SQL server. I have to analyse the data, but in order to do that I have to remove all of the bad data and then visualise it. I have asked the server guy to filter on his end but he says only one person in the company is competent to do that and they're too busy. I can't comment on that because I don't know anything about databases.

A typical spreadsheet contains roughly 500,000 records. I have tried removing all of the bad values manually using a filter, but my PC crashes when I delete the filtered rows.

I have converted the .xlsx files to .csv files because I think they are a simpler format, and although they seem faster to work with, Excel still crashes.

I've written a VBA script, which I've left running for several days to no avail:

Sub delete_bad_records()
    ' Values in column C that mark a record as bad
    Dim not_good() As Variant
    Dim element As Variant
    Dim cell As Excel.Range
    Dim none As Boolean

    ' Turn off recalculation and screen redraw while rows are deleted
    Application.Calculation = xlCalculationManual
    Application.ScreenUpdating = False

    not_good = Array("example_value", "another one")
    Columns("C:C").Select

    ' For each bad value, keep finding and deleting matching rows
    ' until no match is left in column C
    For Each element In not_good
        none = False
        Do While Not none
            Set cell = Selection.Find(element, ActiveCell)
            If cell Is Nothing Then
                none = True
            Else
                cell.EntireRow.Delete
            End If
        Loop
    Next element

    Application.Calculation = xlCalculationAutomatic
    Application.ScreenUpdating = True

End Sub

I have also written a .bat file that uses findstr as a kind of filter; however, some records mysteriously disappear and I'm not sure why. The general format is:

findstr /v "keywords" "original.csv" > "filtered1.csv"
findstr /v /l "specific phrase1" "filtered1.csv" > "filtered2.csv"
findstr /v /l "specific phrase2" "filtered2.csv" > "filtered1.csv"

scc268

Posted 2018-06-25T12:43:57.433

Reputation: 181

If you can actually load the file fine in Excel, save it as .csv, then open the .csv file in something like Notepad++ – LPChip – 2018-06-25T12:57:02.870

Personally I would use PowerShell on the CSV. Something like Import-CSV c:\Path\To\File.csv | ?{$arrayOfFilters -notContains $_.ColumnHeader} | Export-CSV c:\path\To\NewFile.csv -NoTypeInformation – EBGreen – 2018-06-25T13:05:55.380

It's a work machine and it would be very complicated to install another program like Notepad++. What would you suggest I do once I've opened it in Notepad++? – scc268 – 2018-06-25T13:06:41.453

I bet Powershell is already on it. :) – EBGreen – 2018-06-25T13:13:24.417

@EBGreen I've checked and yes, I do have it. Can I point to a list of filters, e.g. a text file, instead of writing out the entire array of filters in PowerShell? – scc268 – 2018-06-25T13:17:22.520

Have you thought about importing the data into MS Access and then using the query builder to write a query? You can then pass the query back to the SQL guy, who can make the necessary changes. – cybernard – 2018-06-25T13:34:42.957

@cybernard Unfortunately I don't have access to Access. If I could I would just try to write my own SQL query but I'm not allowed near the database because it's not my specific job... – scc268 – 2018-06-25T13:36:28.800

How about LibreOffice Base (free download)? Then you can import your records into a local database over which you have full admin rights. – cybernard – 2018-06-25T13:47:46.717

@cybernard it's a work machine, I can't install any program I want – scc268 – 2018-06-25T13:52:52.830

Can't you use PowerShell to iterate over the CSV file line by line, validate each line, and output valid lines to an output file? Invalid lines should cause your script to skip to the next line... Your output file will then contain only your required clean data. Similar answer: https://stackoverflow.com/questions/33511772/read-file-line-by-line-in-powershell

– Kinnectus – 2018-06-25T14:12:25.137
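
That line-by-line approach might look roughly like the sketch below (the file names, the not_good.txt list of values, and the match-anywhere-in-the-line test are assumptions for illustration, not details from the thread):

    # Rough sketch: stream the CSV a line at a time and keep only lines
    # that contain none of the unwanted values (file names are placeholders).
    $badValues = Get-Content ".\not_good.txt"

    Get-Content ".\results.csv" | Where-Object {
        $line = $_
        # keep the line only if no bad value appears anywhere in it
        -not ($badValues | Where-Object { $line -like "*$_*" })
    } | Set-Content ".\results filtered.csv"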

@Kinnectus you're right. Someone should have suggested that. – EBGreen – 2018-06-25T14:36:52.520

@scc268 - You can use a text file to list the filter values, then do something like $arrayOfFilters = Get-Content c:\path\to\filters.txt – EBGreen – 2018-06-25T14:37:58.007

You seem competent enough with VBA that you might consider using vbscript to open the csv for binary access, read in lines and then emit to a second file any records that you want to keep. You would not need to keep the whole file in memory, and if you do status updates only every nth record, then it should run "fairly fast", but it might still take a while. The biggest issue might be data types for currency, but I do not know the content. see for example: https://stackoverflow.com/questions/6060529/read-and-write-binary-file-in-vbscript

– Yorik – 2018-06-25T15:57:27.447

I am also wondering if the Rows().Delete is not part of the problem, as perhaps it triggers a cascade of recalculation, link updates, redraws, housekeeping, etc. What happens if you "blank the row" instead? If you export, does it remove the blanks? If not, a simple "remove blank lines" filter would do. – Yorik – 2018-06-25T16:00:41.963

Hi @Yorik, I did try blanking the rows; however, it still froze – scc268 – 2018-06-26T06:59:56.747

Your findstr commands are searching for two separate strings; they are not searching for a single phrase. That is clearly explained in the help for the FINDSTR command. If you want to search for a phrase, you need to use the /C option. You also shouldn't need three FINDSTR commands to do what you are doing. You can string together search phrases: findstr /V /L /C:"phrase one" /C:"phrase two two" /C:"phrase three peat". You can also put all of your search phrases in a file and use the /G option. – Squashman – 2018-06-26T18:24:32.903
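
For instance, the three commands above could presumably be collapsed into a single pass, with the phrases kept one per line in a file (phrases.txt is a placeholder name):

    findstr /V /L /G:"phrases.txt" "original.csv" > "filtered.csv"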

Answers

2

Thanks to @EBGreen for suggesting PowerShell and showing me how to use it; although it seems simple now, I would never have known or thought to use PowerShell on my own!

Here's what I did:

  1. Create the .csv file
  2. Add a filter in Excel and filter out all of the entries with types I do want to keep
  3. Copy the remaining filtered list of types I do not want, then remove the duplicates
  4. Save this list as not_good.txt
  5. Run this PowerShell script:

    # Read the list of unwanted Type values, then keep only the rows whose Type is not in that list
    $not_good = Get-Content .\not_good.txt
    Import-CSV ".\results.csv" | ?{$not_good -notContains $_.Type} | Export-CSV ".\results filtered.csv" -NoTypeInformation
    

The script takes seconds, and I can then use it for all of the other spreadsheets I've got.
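
If it helps, the same filter can presumably be looped over every .csv in a folder with something like the sketch below (the " filtered" suffix on the output name and the folder layout are just my own choices):

    # Sketch: apply the same Type filter to every .csv in the current folder
    $not_good = Get-Content .\not_good.txt
    Get-ChildItem -Filter *.csv | ForEach-Object {
        $file = $_
        Import-Csv $file.FullName |
            Where-Object { $not_good -notcontains $_.Type } |
            Export-Csv ($file.BaseName + " filtered.csv") -NoTypeInformation
    }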

scc268

Posted 2018-06-25T12:43:57.433

Reputation: 181

1

If you want to load a lot of data into Excel, use Microsoft's Get & Transform tools. (For Excel 2010-2013, use the free Microsoft add-in Power Query for Excel.)

With this tool you can access many different sources (CSV, Excel files, databases, the web, ...) without needing to write code. Transforming your data or combining it with other sources is also quite easy.

visu-l

Posted 2018-06-25T12:43:57.433

Reputation: 426

Cheers for letting me know about these tools. I wasn't aware that they existed. – scc268 – 2018-06-27T13:47:00.723

You're welcome. These relatively recent Excel tools (Power Query, Power Pivot) will definitely take you to the next level as a power user. – visu-l – 2018-06-27T14:20:56.263