Process plain text data

I have a plain text file that I need examined automatically, with the results written to another plain text file. Note that I'll use plain text and CSV interchangeably, since CSVs are plain text with a common separator between data (commas, tabs, etc.).

Data examples

ABD-01A
ABD-01B
ABD-01C
AL-25A
AL-25B

Process one - recognize the first two sections of each entry (text-number), which repeat, and number each occurrence in the output CSV:

ABD-01,1
ABD-01,2
ABD-01,3
AL-25,1
AL-25,2

Process two - The first part of the data gathered above has a sort of primary key assigned to it in a different file. I would like the primary keys to be included in the newly generated file as well:

#primary key file
1,ABD-01
2,ABD-02
3,AL-02
20,AL-25

#newly generated file
1,ABD-01,1
1,ABD-01,2
1,ABD-01,3
20,AL-25,1
20,AL-25,2

So basically, what I'm asking here is whether there's any software that can help me with this on either Windows or Linux-based operating systems. Bash can help me with simpler processing, but I don't think it can handle this much. Please advise.

Oxwivi

Posted 2012-01-25T12:13:01.363

Reputation: 312

Can the same data occur multiple times in the same file? (for example, "ABD-01A, ABD-01B, AL-25A, ABD-01A"...) – user1686 – 2012-01-25T15:22:16.270

@grawity No, all the data are unique. – Oxwivi – 2012-01-25T16:27:33.193

In that case, you could use the example scripts I provided in my other comment. (Both do the same, but written in different languages so you can compare.) – user1686 – 2012-01-25T17:34:48.360

Answers

Choose your preferred language – Perl works well with text; Python, PHP and Ruby are okay, if a little slower.
Read the key file first:
- split each key,data line into key and data,
- then store both in a dict/hash/array: keys[data] = key
- also counts[data] = 0 if the language demands initialization
Read all data files:
1. use a regexp to find the "TEXT-NUMBER" data at the beginning of every line,
2. increment counts[data] by one,
3. and immediately output keys[data], the line, and counts[data].

I wrote an example script in both Perl and Python, so take your pick:

process.pl

#!/usr/bin/env perl
use strict;

sub usage {
    print "Usage: $0 <key_file> <data_file...>\n";
    return 2;
}

my $keys_file = shift(@ARGV) // exit(usage);
my @data_files = @ARGV;

my %keys;   # $data => $primarykey
my %counts; # $data => $count

# Read keys

open(my $fh, "<", $keys_file) or die "$!";
while (my $line = <$fh>) {
    chomp($line);
    my ($key, $datum) = split(/,/, $line, 2);
    if (!defined $datum) {
        warn "$keys_file: line $. not in KEY,DATA format: '$line'\n";
        next;
    }

    $keys{$datum} = $key;
    $counts{$datum} = 0;
}
close($fh);

# Read and output data

my $ex = qr/^(\w+-\d+)\w*/;
for my $data_file (@data_files) {
    open(my $fh, "<", $data_file) or die "$!";
    while (my $line = <$fh>) {
        chomp($line);
        if ($line =~ /$ex/) {
            my $datum = $1;
            if (!defined $keys{$datum}) {
                warn "no primary key found for data '$datum'\n";
                next;
            }

            # Increment count, then print processed line immediately
            $counts{$datum}++;
            print "$keys{$datum},$&,$counts{$datum}\n";
        }
        else {
            warn "$data_file: line $. does not begin with TEXT-NUMBER: '$line'\n";
        }
    }
    close($fh);
}

process.py

#!/usr/bin/env python
from __future__ import print_function
import sys
import re

def usage():
    print("Usage: %s <key_file> <data_file...>" % sys.argv[0])
    return 2

try:
    keys_file = sys.argv[1]
    data_files = sys.argv[2:]
except IndexError:
    sys.exit(usage())

keys = {}
counts = {}

# Read keys

for line in open(keys_file, "r"):
    try:
        key, datum = line.strip().split(",", 1)
    except ValueError:
        print("%s: line not in KEY,DATA format: %r" \
            % (keys_file, line.strip()), file=sys.stderr)
        continue

    keys[datum] = key
    counts[datum] = 0

# Read and output data

ex = re.compile(r'^(\w+-\d+)\w*')
for data_file in data_files:
    for line in open(data_file, "r"):
        line = line.strip()
        m = ex.match(line)
        if m:
            datum = m.group(1)
            if datum not in keys:
                print("no primary key found for data %r" % datum,
                    file=sys.stderr)
                continue

            # Increment count, then print processed line immediately
            counts[datum] += 1
            print("%s,%s,%d" % (keys[datum], m.group(0), counts[datum]))
        else:
            print("%s: line does not begin with TEXT-NUMBER: %r" \
                % (data_file, line.strip()), file=sys.stderr)

user1686

Posted 2012-01-25T12:13:01.363

Reputation: 283 655

I would do it in Python using regular expressions. Just type python in your shell to see if it is installed.

Otherwise you can use Perl. Just type perl in your shell to see if it is installed. It has built-in support for regular expressions.

utapyngo

Posted 2012-01-25T12:13:01.363

Reputation: 1 713

Can you provide some examples on how to use it? – Oxwivi – 2012-01-25T14:45:28.820

@Oxwivi: Here are a few, but a language cannot be learned from only "some examples"... "Learning Perl" and similar books might be a good start. – user1686 – 2012-01-25T15:36:08.977

@grawity Sorry for getting back to you so late; I faced unexpected issues generating the data. I ran perl process.pl */images products.csv > images, with multiple data files but no conflicts. However, the results came out without the keys, for example: ,ABD-47,2. The area reserved for the key value is empty. Also, if you will, please add an answer of your own, since it's likely I will choose yours. – Oxwivi – 2012-01-27T19:08:21.160

@grawity Only the first */images file encountered is processed. Referring directly to that file works as expected. – Oxwivi – 2012-01-27T19:11:21.950

@grawity What exactly is unmatched input supposed to mean? It found no match in the key file? – Oxwivi – 2012-01-27T19:23:12.893

@Oxwivi: The problem was that my script was written with only one input file in mind; when you ran process.pl */images products.csv, the system would in fact run perl process.pl first/images second/images third/images products.csv, and the script would blindly use only the first two names given. I uploaded a fixed version; now the command would be perl process.pl products.csv */images > images. – user1686 – 2012-01-27T19:24:02.157
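The argument order described in the comment above can be sketched in Python as follows. This is only an illustration: split_args is a made-up helper name, and the list stands in for what the shell passes to the script after expanding */images (using the directory names from the comment).

```python
def split_args(argv):
    """Return (key_file, data_files) from an argument list without the script name."""
    if len(argv) < 2:
        raise SystemExit("Usage: process.py <key_file> <data_file...>")
    # The first argument is always the key file; everything after it is data.
    return argv[0], argv[1:]

# What the script receives once the shell has expanded the glob.
expanded = ["products.csv", "first/images", "second/images", "third/images"]
key_file, data_files = split_args(expanded)
print(key_file)    # products.csv
print(data_files)  # ['first/images', 'second/images', 'third/images']
```

Putting the single key file first means the glob can expand to any number of data files without shifting its position.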

@Oxwivi: The scripts show "unmatched input" if the line does not start with a text-number section. I have written the scripts according to your given examples, but it could be that the real inputs are somehow different. (The error messages should be clearer in this version.) – user1686 – 2012-01-27T19:26:47.113

@grawity Thank you, and please post an answer of your own - else it will be left unresolved. – Oxwivi – 2012-01-28T07:36:20.200

@grawity A little bug remains from the previous script. Somehow some of the lines are ignored - a small example: NUR-01 NUR-02A NUR-02B NUR-02 becomes 191,NUR-02.jpg,1 191,NUR-02.jpg,2 - NUR-01 is entirely missing. I reviewed and found I do not need multiple input files, so if you wish I can pastebin the entire input and output files. – Oxwivi – 2012-01-28T09:25:43.033

@grawity It seems I failed to mention file names without any letters. I'm sorry. And I realized I also need the letters at the end to be preserved, if any. Your current scripts do everything I asked in the question, so your answer will definitely get chosen. Sorry for asking so much of you; I'll do the rest manually or try to change your code. – Oxwivi – 2012-01-28T11:13:24.567

@Oxwivi: Fixed, I think (although I might have misunderstood "preserve letters at the end", but try it and see; if it does not work correctly, try changing the final \w* into .* in the regexp). – user1686 – 2012-01-28T11:29:41.197
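To see what changing the final \w* into .* would do, here is a small Python sketch comparing the two patterns. The sample lines are made up, loosely based on the names mentioned in the comments, and describe is a hypothetical helper used only for this comparison.

```python
import re

# Pattern from the scripts above: the trailing \w* stops at the first
# character that is not a letter, digit or underscore.
word_tail = re.compile(r'^(\w+-\d+)\w*')
# Variant suggested in the comment: the trailing .* runs to the end of the line.
any_tail = re.compile(r'^(\w+-\d+).*')

def describe(pattern, line):
    """Return (headword, full matched text), or None if the line does not match."""
    m = pattern.match(line)
    return (m.group(1), m.group(0)) if m else None

for line in ["NUR-01", "NUR-02A", "NUR-02A.jpg"]:
    print(line, describe(word_tail, line), describe(any_tail, line))
```

Both patterns capture the same headword for the key lookup; they differ only in how much of the line ends up in the printed match, so for NUR-02A.jpg the first yields NUR-02A and the second NUR-02A.jpg.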

Process 1

perl count.pl datafile …

where count.pl is something like

#!perl
use strict;
use warnings;

my %headwordcount;

while (<>) {
  if (/^([A-Z]+-\d+)/) { $headwordcount{$1}++; }
  # else { warn "Bad data: $_"; } # uncomment line for data warnings
}

END {
  foreach (sort keys %headwordcount) {
     print "$_,$headwordcount{$_}\n";
  }
}

Untested, caveat emptor.
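For comparison with the sample output in the question, here is a minimal Python sketch (Python is used only for illustration, and the function names are made up) of the difference between one total per headword, which is what the END block above prints, and a running count per input line, which is what the question's Process one output shows.

```python
import re
from collections import Counter

HEADWORD = re.compile(r'^([A-Z]+-\d+)')  # same pattern as in count.pl

def total_counts(lines):
    """One 'headword,total' row per headword, sorted, like the END block."""
    counts = Counter()
    for line in lines:
        m = HEADWORD.match(line)
        if m:
            counts[m.group(1)] += 1
    return ["%s,%d" % (word, counts[word]) for word in sorted(counts)]

def running_counts(lines):
    """One 'headword,n' row per input line, where n is the count so far."""
    counts = Counter()
    rows = []
    for line in lines:
        m = HEADWORD.match(line)
        if m:
            counts[m.group(1)] += 1
            rows.append("%s,%d" % (m.group(1), counts[m.group(1)]))
    return rows

data = ["ABD-01A", "ABD-01B", "ABD-01C", "AL-25A", "AL-25B"]
print(total_counts(data))    # ['ABD-01,3', 'AL-25,2']
print(running_counts(data))  # ['ABD-01,1', 'ABD-01,2', 'ABD-01,3', 'AL-25,1', 'AL-25,2']
```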

Process 2

Add something like

   my %key;

   BEGIN {
     my $keyfilename = 'primary.key';
     open my $fh, '<', $keyfilename or die "Can't read '$keyfilename' - $!\n";
     while (<$fh>) {
        chomp;
        my ($key,$headword) = split(/,/, $_, 2);
        $key{$headword} = $key;        
     }
     close $fh;
   }

and change the print line in END {} to

   print "$key{$_},$_,$headwordcount{$_}\n";

Again, untested.

You'll generate warnings if the data file contains headwords not in the key file. You can check for $key{$_} being undefined and if so print w/o key.
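The undefined-key check could look like this in Python. It is a sketch only: load_keys and format_row are made-up helper names, and the key data is taken from the question's example.

```python
def load_keys(lines):
    """Parse 'KEY,HEADWORD' lines into a {headword: key} dict, skipping malformed lines."""
    keys = {}
    for line in lines:
        key, sep, headword = line.strip().partition(",")
        if sep:  # only accept lines that actually contain a comma
            keys[headword] = key
    return keys

def format_row(keys, headword, count):
    """Prefix the key when one is known; otherwise emit the row without it."""
    key = keys.get(headword)
    if key is None:
        return "%s,%d" % (headword, count)
    return "%s,%s,%d" % (key, headword, count)

keys = load_keys(["1,ABD-01", "2,ABD-02", "3,AL-02", "20,AL-25"])
print(format_row(keys, "ABD-01", 3))  # 1,ABD-01,3
print(format_row(keys, "NUR-01", 1))  # NUR-01,1
```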

RedGrittyBrick

Posted 2012-01-25T12:13:01.363

Reputation: 70 632

What is that script supposed to do? The first process I described? – Oxwivi – 2012-01-25T16:35:35.183

@Oxwivi: Yes. If you can't easily follow the code, it may be best to try a different answer. If you understand bash-scripting and/or awk, Perl should not be too difficult. – RedGrittyBrick – 2012-01-25T16:38:17.847

I did use basic bash scripts, though I can't say I followed them well (inline syntax, mainly). And do I add the new code you edited into the answer with the indentation as shown? – Oxwivi – 2012-01-25T16:41:49.360

@Oxwivi, yes, add the new stuff immediately after my %headwordcount; (apart from the replacement print line of course, which replaces the one inside the END … foreach loop). – RedGrittyBrick – 2012-01-25T16:56:47.970

Okay, thank you very much! One last thing - datafile is to be replaced with the input file, that is obvious, but is the ... following it something Perl-specific? – Oxwivi – 2012-01-25T17:05:24.340

@Oxwivi The ellipsis (…) just means you can specify several filenames, they will all be processed together. – RedGrittyBrick – 2012-01-25T18:00:15.747
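For readers more at home in Python, roughly the same "several files read as one stream" behaviour can be sketched with the standard fileinput module. read_all is a made-up helper name, and the two temporary files merely stand in for several data files.

```python
import fileinput
import os
import tempfile

def read_all(paths):
    """Yield lines (without the newline) from every file in paths, in order."""
    with fileinput.FileInput(files=paths) as stream:
        for line in stream:
            yield line.rstrip("\n")

# Two throwaway files standing in for several data files on the command line.
tmpdir = tempfile.mkdtemp()
paths = []
for name, content in [("a.txt", "ABD-01A\nABD-01B\n"), ("b.txt", "AL-25A\n")]:
    path = os.path.join(tmpdir, name)
    with open(path, "w") as fh:
        fh.write(content)
    paths.append(path)

print(list(read_all(paths)))  # ['ABD-01A', 'ABD-01B', 'AL-25A']
```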

Sorry for getting back to you so late, but I encountered unexpected issues generating the data file. Your code reports this error when run as perl count.pl */images > images: Use of uninitialized value within %key in concatenation (.) or string at count.pl line 27, <> line 897. The 'Use' line repeated itself quite a few times. Some of the keys were missing and the counts turned out weird. – Oxwivi – 2012-01-27T19:04:14.657