How to split a CSV file into multiple files based on a text string?

6

5

I have multiple text files (CSV) with a structure similar to this:

funkiana,23.55,-99.866
funkiana,24.634,-98.701
funkiana,24.717,-98.901
geminiflora,22.25,-104.6166667
geminiflora,21.21666667,-104.65
horrida,19.633,-97.367
horrida,23.61666667,-102.575
horrida,22.158,-100.979
horrida,19.506,-97.433
horrida,17,-97.56667
horrida,19.485,-97.263
horrida,19.017,-99.133
horrida,19.017,-99.15
horrida,18.91,-99.23
horrida,17.82167,-100.26333
horrida,19.507,-97.438
inaequidens,19.399,-99.314
inaequidens,23.58333,-105.8833
inaequidens,19.767,-103.7
inaequidens,20.787,-103.848

As you can see, there are three fields (species, latitude and longitude). Now, I want to split each CSV file into different CSV files, containing only the data of each species. In other words, I want a file for all the occurrences of funkiana (with lat/lon), another for geminiflora (with lat/lon) and so on.

Any ideas on how to do this? Maybe with a script or an Excel macro?

Kureno

Posted 2015-08-01T01:03:32.893

Reputation: 63

Answers

7

The fastest method I could think of is to use PowerShell:

$fullpath = "D:\myFolder\input.csv"

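# Folder of the input file; the output files are written next to it
$path = Split-Path $fullpath -parent
# The input has no header row, so supply the column names here
$data = Import-CSV -Delimiter "," -Path $fullpath -Header species,latitude,longitude

# For each distinct species: filter its rows, strip the quotes that
# ConvertTo-Csv adds, and write the result to <species>.csv
foreach ($group in $data | Group species){
    $data | Where-Object {$_.species -eq $group.name} |
        ConvertTo-Csv -NoTypeInformation |
        foreach {$_.Replace('"','')} |
        Out-File "$path\$($group.name).csv"
}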
  1. Paste the code into a new text file and save it as e.g. MySplitMacro.ps1
  2. Edit the first line and change $fullpath to the path of your CSV file
  3. Right-click the .ps1 file and click Run with PowerShell

Taking your initial example as input, the script will create 4 new CSV files in the same folder as your input file. Each CSV will contain only the rows whose first column matches one species.
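
For readers without PowerShell, the same group-and-split idea can be sketched in Python. This is only an illustration under assumptions: the function name split_csv_by_column and its parameters are made up for this example, the input is assumed to have no header row, and, unlike the script above, no header line is written to the output files.

```python
import csv
from collections import defaultdict
from pathlib import Path


def split_csv_by_column(input_path, key_index=0, delimiter=","):
    """Write one CSV per distinct value in column `key_index`, next to the input file.

    Hypothetical helper for illustration; returns the list of files written.
    """
    input_path = Path(input_path)
    groups = defaultdict(list)

    # Collect the rows for each key; rows need not be sorted or contiguous.
    with input_path.open(newline="") as src:
        for row in csv.reader(src, delimiter=delimiter):
            if row:  # skip blank lines
                groups[row[key_index]].append(row)

    # Write each group to <key>.csv in the input file's folder.
    written = []
    for key, rows in groups.items():
        out_path = input_path.parent / f"{key}.csv"
        with out_path.open("w", newline="") as dst:
            csv.writer(dst, delimiter=delimiter).writerows(rows)
        written.append(out_path)
    return written
```

Calling split_csv_by_column("input.csv") on the sample data from the question should produce funkiana.csv, geminiflora.csv, horrida.csv and inaequidens.csv.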

[Image: resulting example folder]

[Image: one resulting example file]

Tweaks

  • Change species in $data | Group species (and in $_.species inside Where-Object) to choose the column to group by
  • Change -Delimiter "," if your input file has a different delimiter like tabs "`t" or semicolons ";"
  • Change -Header species,latitude,longitude to your column names, listed in the same order as the columns in the file
  • Change $path\$($group.name).csv if you need a different output path
  • Instead of -eq $group.name to filter your results, you could also use -like "*$($group.name)*" for a wildcard comparison or -match "[A-Z]$($group.name)" for a RegEx comparison
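
To make the difference between the three comparison styles in the last tweak concrete, here is a rough Python analogue. It is a sketch, not a reimplementation of the PowerShell operators: the function name matches and its mode argument are hypothetical, and the only behaviour it tries to mirror is that exact, wildcard and regex comparisons ignore case by default.

```python
import fnmatch
import re


def matches(value, pattern, mode="eq"):
    """Return True if `value` matches `pattern` under the chosen comparison mode.

    Hypothetical helper: "eq" is an exact match, "like" uses shell-style
    wildcards (* and ?), "match" searches for a regular expression.
    All three ignore case.
    """
    if mode == "eq":
        return value.lower() == pattern.lower()
    if mode == "like":
        # The whole value must fit the wildcard pattern.
        return fnmatch.fnmatchcase(value.lower(), pattern.lower())
    if mode == "match":
        # The regex may match anywhere inside the value.
        return re.search(pattern, value, flags=re.IGNORECASE) is not None
    raise ValueError(f"unknown mode: {mode}")
```

Note that a wildcard pattern has to cover the whole value (hence the leading and trailing * in the tweak), while a regex is satisfied by a match anywhere in the value.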

Used resources

nixda

Posted 2015-08-01T01:03:32.893

Reputation: 23 233

Beautiful answer... sorry I can upvote it only once. – Gary's Student – 2015-08-01T12:30:38.927

Thank you very much! This is exactly what I wanted to do :) . I will check the resources you used. I definitely need to learn PowerShell scripting. Thank you again. – Kureno – 2015-08-01T18:53:45.763

0

What you're asking for is commonly referred to as a "control break" process. There is a "control" value; in your case, it's the species. When this value changes (or "breaks") from one row to the next, we want to do something. In your case, you want to write out a new file.

There are numerous ways to solve your problem. I would typically use a scripting language vs. Excel to solve it.

This link has a walkthrough if you're interested in learning how to write such a program/script: http://www.unix.com/tips-and-tutorials/209439-how-do-control-break-algorithm.html
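
As a minimal sketch of the control-break idea in Python (the function name control_break_split and its parameters are invented for this illustration): rows are streamed one at a time, and a new output file is started whenever the key differs from the held value. It assumes the rows for each species are contiguous, as in the sample data; a species that reappears later would overwrite its earlier file.

```python
import csv
from pathlib import Path


def control_break_split(input_path, out_dir, key_index=0):
    """Stream a CSV grouped by key and start a new file whenever the key changes.

    Hypothetical helper for illustration; returns the list of files written.
    """
    out_dir = Path(out_dir)
    hold_value = None  # key of the group currently being written
    dst = None         # open output file for that group
    writer = None
    files_written = []

    try:
        with open(input_path, newline="") as src:
            for row in csv.reader(src):
                if not row:  # skip blank lines
                    continue
                key = row[key_index]
                if key != hold_value:  # control break: the key changed
                    if dst is not None:
                        dst.close()
                    out_path = out_dir / f"{key}.csv"
                    dst = open(out_path, "w", newline="")
                    writer = csv.writer(dst)
                    files_written.append(out_path)
                    hold_value = key
                writer.writerow(row)
    finally:
        # Reaching the end of the input closes the last group.
        if dst is not None:
            dst.close()
    return files_written
```

Because only one row is held in memory at a time, this approach also suits files too large to load whole, provided they are sorted or grouped by the control value first.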

If you're on a Windows platform and don't mind using a programming language, you can use LINQPad (http://www.linqpad.net/), which has a free edition, with the following C# program (be sure to select "C# program" in the LINQPad Language dropdown):

void Main()
{
    var path = @"c:\sourceGit\speciesLatLon.txt";
    var inputLines = File.ReadAllLines(path);

    // Holds all the lines to be added to each output file
    var linesForCurrentSpeciesFile = new List<string>(); 

    // Nothing to split if the input file is empty
    if (inputLines.Length == 0) return;

    // Read first row
    int i = 0;
    var currentSpecies = GetSpecies(inputLines[i]);

    // initialize hold value
    var holdValue = currentSpecies;

    // Initialize output values
    linesForCurrentSpeciesFile.Add(inputLines[i]);

    // Read next value
    i++;

    while( i < inputLines.Length )
    {
        currentSpecies = GetSpecies(inputLines[i]);
        if (currentSpecies !=  holdValue)
        {
            // output current file
            WriteSpeciesFile(holdValue, linesForCurrentSpeciesFile);

            // Initialize new output file by clearing out the previous
            linesForCurrentSpeciesFile.Clear();

            // update hold value with the value just examined.
            holdValue = currentSpecies;
        }
        // Add the current line to the output file
        linesForCurrentSpeciesFile.Add(inputLines[i]);
        i++;
    }
    // Write the last group: reaching the end of the input also counts as a break
    WriteSpeciesFile(currentSpecies, linesForCurrentSpeciesFile);
}

// Define other methods and classes here
public string GetSpecies(string line)
{
    // return the first value of the input line
    return line.Split(',')[0];
}

public void WriteSpeciesFile(string species, List<string> content)
{
    File.WriteAllLines(string.Format(@"C:\sourceGit\{0}.csv", species), content.ToArray());
}

ptilton

Posted 2015-08-01T01:03:32.893

Reputation: 31

Thank you for your answer. I have very little experience with programming (that's one thing that I want to solve asap), but definitely I'll check your suggestion. Also, thanks for mentioning the name of the process. I had doubts about the title of the question because I didn't know how to define it :P. – Kureno – 2015-08-01T19:02:05.337

0

A simple CSV search app like CsvFileSearch will do this without the need to get complicated. It will search multiple files and save the results into another file.

Les Hardy

Posted 2015-08-01T01:03:32.893

Reputation: 1

But can it also split a file into multiple files? – Arjan – 2015-08-27T19:14:37.050

Please add more detail to your answer. – Prasanna – 2015-08-28T04:44:42.440