Generate an alphabetic index from a spreadsheet

2

I have a text file of lecture topics from my physics class that looks like this:

1/14,Galilean relativity,Einsteinian relativity,Minkowski space,Henri Poincare,,
1/16,Lorentz transformations,Velocity transformations,Proper time,Light aberration,Lorentz tensors formalism,Minkowski space
1/23,Lorentz boost,Lorentz group,Poincare group,contravariant tensor,covariant tensor,d'Alembertian

I actually store it as Google Doc, so I could export it as a tsv, ods, xslx, etc. instead of a csv.

I want to generate an index like you would see in the back of a book, listing the topics alphabetically by date, e.g.

d'Albertian 1/23
... (more entries)
Minkowski space 1/14 1/16
... (more entries)
Velocity transformations 1/16

or if you want to get fancy:

d'Albertian ................. 1/23
... (more entries)
Minkowski space ............. 1/14, 1/16
... (more entries)
Velocity transformations .... 1/16

I first thought of using LaTeX, but the indexing packages all seem to go by page number, not an arbitrary string like a date. In fact, "MakeIndex assumes that all page numbers are either arabic or lower-case roman numerals;" http://www.tex.ac.uk/ctan/indexing/makeindex/doc/makeindex.pdf. Moreover, the input file for makeindex must be a list with one topic and page number per line; it just does the sorting and formatting: http://www.troubleshooters.com/linux/lyx/makeindex.htm

So, the software I am looking for needs to do this:

  1. Accept a csv, tsv, or other spreadsheet format. This would be ideal, but I would be ok with a text format that could be fairly easily generated from a spreadsheet file, e.g. by substitutions with awk or sed.
  2. Associate the first entry of each line (e.g. the date 1/14) with each of the following entries (e.g. the topics Galilean relavity, Minkowski space, etc.).
  3. Sort all the topics for all the dates alphabetically.
  4. Combine the topics that show up more than once into a single entry, listing every date that the topic shows up in.
  5. Output the result as a standard format.

My only requirement for the software is that it run on Linux.

I'm not fussy about output format; text file, LaTeX, HTML, odf, etc. are all fine by me, as long as I can print out a paper copy.

Nathaniel M. Beaver

Posted 2013-04-25T17:26:52.650

Reputation: 312

Answers

2

Both the F# and awk solutions work well with a few tweaks. However, I've decided to go with the following python script:

#!/usr/bin/env python
import csv, sys
mydict = {}
for line in open(sys.argv[1],'r'):
    tokens = line.strip().split(',')
    item, keys = tokens[0], tokens[1:]
    for key in keys:
        if key != '':
            # If the key is already in the dictionary,
            # just add it to the set, otherwise make an
            # empty set to add the item to.
            mydict.setdefault(key, set()).add(item)
for key in sorted(mydict.keys(), key=str.lower):
    print key + ' \dotfill ' + ', '.join(mydict[key])

In this case, item corresponds to a date, and keys correspond to topics. The \dotfill is LaTeX markup for filling in horizontal space with dots.

Nathaniel M. Beaver

Posted 2013-04-25T17:26:52.650

Reputation: 312

2

Are you opposed to mono? If not then get F# interactive

http://fsharp.org/use/linux/

and use the following F# script (you could also compile it)

open System
open System.IO

let inputFile = "inputFile.csv"
let outputFile = "out.txt"

File.ReadAllLines(inputFile)
|> Seq.filter (fun i -> i.Length > 0)
|> Seq.collect
    (fun i ->
        let fields = i.Split(',')
        let date = fields.[0]
        fields.[1..] |> Array.map (fun entry -> date,entry)
    )
|> Seq.groupBy snd
|> Seq.sortBy (fun (entry,_) -> entry.ToUpper())
|> Seq.filter (fun (entry,_) -> entry <> "")
|> Seq.map 
    (fun (entry,dates) ->
        let dates = dates |> Seq.map fst |> Seq.sort
        let datestr = String.Join (", ",dates)
        String.Format("{0} ........ {1}", entry, datestr)
    )
|> (fun i -> File.WriteAllLines(outputFile,i))

This would produce a text file:

contravariant tensor ........ 1/23
covariant tensor ........ 1/23
d'Alembertian ........ 1/23
Einsteinian relativity ........ 1/14
Galilean relativity ........ 1/14
Henri Poincare ........ 1/14
Light aberration ........ 1/16
Lorentz boost ........ 1/23
Lorentz group ........ 1/23
Lorentz tensors formalism ........ 1/16
Lorentz transformations ........ 1/16
Minkowski space ........ 1/14, 1/16
Poincare group ........ 1/23
Proper time ........ 1/16
Velocity transformations ........ 1/16

Not exactly what you want but it would be easy to modify the above to produce latex markup. Just modify the line

String.Format("{0} ........ {1}", entry, datestr)

to include the desired mark up for each line. Unfortunately I don't have access to mono right this instant so It's tested with .NET.

jizugu

Posted 2013-04-25T17:26:52.650

Reputation: 81

I had trouble compiling from source, so I installed the package here: [http://fsxplat.codeplex.com/releases/view/55463]. It says error FS0222: Files in libraries or multiple-file applications must begin with a namespace or module declaration, e.g. 'namespace SomeNamespace.SubNamespace' or 'module SomeNamespace.SomeModule' – Nathaniel M. Beaver – 2013-04-26T02:00:44.587

try adding "module program" to the top of the file (no quotes) – jizugu – 2013-04-26T02:07:18.233

generate-index.fs(22,41): error FS0001: This expression was expected to have type string [] but here has type seq<string> – Nathaniel M. Beaver – 2013-04-26T02:30:17.837

Here's line 22: let datestr = String.Join (", ",dates). dates is at position (22,41) – Nathaniel M. Beaver – 2013-04-26T02:31:48.953

try: let datestr = String.Join (", ",dates |> Seq.toArray) – jizugu – 2013-04-26T02:51:41.733

That works! I also changed the last line to let datestr = String.Join (", ",dates |> Seq.toArray) and now it outputs correctly. – Nathaniel M. Beaver – 2013-04-26T02:58:58.177

2

Shell scripting and the glorious awk command:

awk -F, '
  { for (i=2;i<=NF;i++) { subject_dates[$i]=subject_dates[$i] " " $1 } }
  END { for (idx in subject_dates) { print idx, subject_dates[idx] } }
'

Ramillete

Posted 2013-04-25T17:26:52.650

Reputation: 61

Had to remove the two commas at the end of input line 1 to avoid a null-keyed line in the output. – Ramillete – 2013-04-25T19:38:14.747

Can awk sort the output? Right now I have to pipe it through sort to get it alphabetized. Also, how hard would it be to eliminate the double spaces after the topic? (Also, the first line returns all the dates, so I'm filtering it with tail -n +2.) – Nathaniel M. Beaver – 2013-04-26T02:26:57.493