I'd like to get a list of words that are strongly associated with a particular time epoch, for example the 19th, 20th and 21st centuries. I suspect that nobody used a word like "smartphone" in the 19th or 20th century and, vice versa, that nobody uses a word like "grubble" in the 21st century anymore.

Input data

I have quite a few books, laid out in the following directory structure:

On the first level, there are directories named after the year in which the books were published, i.e. from 1801 to 2015.
On the following levels, there can be individual book files or any number of nested directories (for example, grouping books by author, genre, publisher, etc.). Ultimately, there is at least one book inside every such directory. Assume that the names of these files and directories are limited to whatever is allowed on a VFAT filesystem (i.e. \ / : * ? " < > | are not allowed).

No other files exist in this structure (for example, the program file is located elsewhere, and there are no results of previous runs, temporary files, etc.).

All book files have the same plain text format and use only the ASCII character set. Words are contiguous non-empty sequences of word characters, i.e. [A-Za-z0-9_]+.

For example, the contents of this short book:

Here you go, that's APL13-X for your 4923/YR_3.

contain the following words: Here, you, go, that, s, APL13, X, for, your, 4923, YR_3.

Word comparison is case-insensitive, that is, Here, here and HERE are the same word.
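The word rule above can be sketched in Ruby, one of the languages used in the answers. This is only an illustration: the helper name `words_of` is made up here and is not part of any submission, and folding to lower case is just one way to make the comparison case-insensitive.

```ruby
# Hypothetical helper (not taken from any answer): extract the words of a
# text according to the rule above and fold them to lower case.
def words_of(text)
  # A word is a maximal run of ASCII letters, digits or underscores.
  text.scan(/[A-Za-z0-9_]+/).map(&:downcase)
end

# Print the words of the short sample book from the question.
p words_of("Here you go, that's APL13-X for your 4923/YR_3.")
```

Using `scan` on the word pattern, rather than `split` on the non-word pattern, avoids producing empty strings where several separators follow each other.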

The task

I want a program that, when started in the top directory of this structure, generates 3 files named 19, 20 and 21. Each of these files should contain the list of words that occur only in books of the corresponding century and not in a single book of any other century (words separated by newlines). Words can be in any case, but each word must appear only once in a given file.
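The core of the task is a set difference per century. A minimal, ungolfed Ruby sketch of that step follows; the helper name `exclusive_words` and the sample data are invented for illustration, and the sketch assumes the words have already been collected and case-folded. Reading the books and writing the 3 output files is left out.

```ruby
require 'set'

# Hypothetical helper: given a Hash mapping a century number to the Set of
# case-folded words seen in that century, return a Hash mapping each
# century to the words that occur in no other century.
def exclusive_words(words_by_century)
  words_by_century.each_with_object({}) do |(century, words), result|
    # Word sets of all the other centuries.
    others = words_by_century.reject { |c, _| c == century }.values
    # Subtract every other century's words from this century's words.
    result[century] = others.reduce(words) { |rest, other| rest - other }
  end
end

# Made-up sample data, not taken from the real test case.
sample = {
  19 => Set['grubble', 'the'],
  20 => Set['the', 'radio'],
  21 => Set['the', 'radio', 'smartphone']
}
exclusive_words(sample).each { |c, ws| puts "#{c}: #{ws.to_a.join(' ')}" }
```

With this sample, only grubble is left for 19 and only smartphone for 21, while 20 ends up empty because radio also occurs in the 21st-century set.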

Sample test cases

If you want a sample test case, feel free to use mine, available either as a zip file or as a git repo.

Scoring

Standard code golf rules apply. The shortest answer (written in a language released before 2015-03-16) wins once 3 weeks have passed (on 2015-04-06). Secondary nominations are "shortest answer in any non-esoteric, general purpose language" and "shortest answer per language".

P.S. Please note that "19th century" is defined as 1801..1900 inclusive, that is, 1900 still belongs to the 19th century.
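The century definition above boils down to one integer division. A small Ruby sketch, with `century` as a made-up helper name:

```ruby
# Hypothetical helper: map a publication year to its century number,
# counting years ending in 00 as the last year of their century.
def century(year)
  # Integer division; subtracting 1 first keeps 1900 in the 19th century.
  (year - 1) / 100 + 1
end

# Print the century for a few boundary years.
[1801, 1900, 1901, 2000, 2001, 2015].each { |y| puts "#{y} -> #{century(y)}" }
```

Subtracting 18 instead of adding 1 would give array indices 0, 1 and 2 for the three centuries, which can be handy when the results are kept in an array.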

GreyCat

Posted 2015-03-16T15:02:30.463

Reputation: 181

I'm assuming a full program is required, not just a function? – bcsb1001 – 2015-03-16T15:09:38.447

@bcsb1001: Yes, a full runnable program that traverses directories, finds input files, reads them, performs calculations and then outputs the result in 3 files, as per question. – GreyCat – 2015-03-16T15:17:26.863

"I'd like to get a list of words that are strongly connected with a particular time epoch": Are you going to use one of these programs on a more substantial set of txt files? If so, I'd be interested in the final output. – Nathan Merrill – 2015-03-18T20:21:44.227

@NathanMerrill: Not really, that's just a cool task story to tell :) In fact, this task was inspired by one of the tasks I've actually performed on a huge (~20 GB) array of texts - however, that was implemented in a massively parallel way on a Hadoop cluster. – GreyCat – 2015-03-18T21:07:47.860

Answers

Ruby, 231 bytes

require'set'
m=[]
Dir['**/*'].map{|n|(m[(n[0..3].to_i-1)/100-18]||=Set.new).merge(IO.read(n).split(/\W/).map{|x|x.upcase})rescue''}
u=[m[0]-m[1]-m[2],m[1]-m[0]-m[2],m[2]-m[0]-m[1]]
3.times{|i|IO.write"#{i+19}",u[i].to_a.join("\n")}

In slightly more readable form, with comments and with lvalues extracted:

require'set'
m = []
Dir['**/*'].map { |n|
  # Determine century from first 4 characters of path
  c = (n[0..3].to_i - 1) / 100
  # Make array indices 1 char long (0-1-2) instead of 2 chars (19-20-21)
  c -= 18
  # Initialize per-century array entry with empty set, if required
  m[c] ||= Set.new
  # Read file as a string, split by words, uppercase everything
  # (not lowercase, because "upcase" saves 2 bytes vs "downcase"),
  # merge into the set we've prepared; ignore any possible exceptions
  # (e.g. trying to read a directory as a string)
  m[c].merge(IO.read(n).split(/\W/).map { |x| x.upcase }) rescue ''
}

# New array of sets with unique words per century
u = [
  m[0] - m[1] - m[2],
  m[1] - m[0] - m[2],
  m[2] - m[0] - m[1]
]

# Write each entry of this array to its own file
3.times { |i|
  IO.write "#{i + 19}", u[i].to_a.join("\n")
}

GreyCat

Posted 2015-03-16T15:02:30.463

Reputation: 181

R, 284 266

Reads in every file from the specified working directory and uniquely adds its words to the appropriate century word vector. Each word vector is compared to the others, and the words that do not appear in any other century's vector are written out.

C=as.character;S=strsplit;L=tolower;U=unlist;W=list();for(f in list.files(r=T)){Y=C((strtoi(U(S(f,'/'))[1])-1)%/%100+1);W[[Y]]=unique(L(c(W[[Y]],U(S(scan(f,w=C()),'[^A-Za-z0-9_]',p=T)))))};for(i in 1:3)cat(W[[i]][!L(W[[i]])%in%L(U(W[-i]))],file=names(W)[i],sep='\n')

Caveats:

Assumes the current working directory is the top of the book structure; otherwise use setwd().
There must be at least one file per century and no additional files outside the 19th, 20th and 21st centuries.

In a more readable form:

setwd("some/directory/here");
# Commonly used functions
C=as.character;
S=strsplit;
L=tolower;
U=unlist;
# Initialise word list
W=list();
# for each file in working directory
for(f in list.files(r=T)){
    # Determine century from first part of path.
    Y=C((strtoi(U(S(f,'/'))[1])-1)%/%100+1);
    # Scan the file splitting on non letters/numbers/underscore
    # refreshing the unique list of words for the century
    W[[Y]]=unique(L(c(W[[Y]],U(S(scan(f,w=C()),'[^A-Za-z0-9_]',p=T)))))
    };
# Write out the words for each century that aren't in the other centuries
for(i in 1:length(W))
    cat(W[[i]][!L(W[[i]])%in%L(U(W[-i]))],file=names(W)[i],sep='\n')

MickyT

Posted 2015-03-16T15:02:30.463

Reputation: 11 735

Can you remove the setwd(...) invocation from your submission? Please assume that there's no need to change the working directory: the program's current working directory is already the one you would start traversal from. That would bring it down to 275. – GreyCat – 2015-03-17T01:47:37.133

Also note that there are always only 3 centuries - probably length(W) would be always 3? – GreyCat – 2015-03-17T03:24:04.947

@GreyCat Made the edits you suggested, Thanks. – MickyT – 2015-03-17T18:08:28.247

Shell script (reference solution, 879 bytes)

#!/bin/sh

# This program uses temporary files, but no need to perform any kind
# of cleanup in a real answer.

# Given a stream of lines on the stdin, rejects all lines that match
# lines in a file given in first argument. Both stdin stream and a
# given file must be sorted the same way.
reject()
{
    diff -u - "$1" | grep '^-' | cut -c2- | sed 1d
}

# Collect lists of words
for YEAR in ????; do
    CENTURY=$((($YEAR - 1) / 100 + 1))
    find "$YEAR" -type f -print0 |
        xargs -0 sed 's/\(.*\)/\L\1/g; s/[^A-Za-z0-9_]/\n/g;' >>words_$CENTURY
done

# Make lists unique and sorted
for C in 19 20 21; do
    sort -u <words_$C >words_unique_$C
done

# Output results
cat words_unique_19 | reject words_unique_20 | reject words_unique_21 >19
cat words_unique_20 | reject words_unique_19 | reject words_unique_21 >20
cat words_unique_21 | reject words_unique_19 | reject words_unique_20 >21

GreyCat

Posted 2015-03-16T15:02:30.463

Reputation: 181

Shell script, 233 bytes

J(){
diff -u - $1|grep ^-|cut -c2-|sed 1d
}
for Y in *; do
C=$(((Y-1)/100-18))
find $Y -type f -exec sed 's/\(.*\)/\L\1/;s/[^A-Za-z0-9_]/\n/g' {} +>>W$C
done
S(){
sort -u
}
S<W0>A
S<W1>B
S<W2>C
J B<A|J C>19
J A<B|J C>20
J A<C|J B>21

Basically a heavily compressed version of the reference solution. Everything that could be replaced with a 1-letter name was replaced. It still works in a minimal sh.

GreyCat

Posted 2015-03-16T15:02:30.463

Reputation: 181
