Lorem Ipsum Generator

Lorem Ipsum is placeholder text used when preparing a layout before the real content is available.

One of its key features over using arbitrary text is that it is nonsense. It isn't even valid Latin (though it is close). This prevents anyone who is shown the layout from becoming distracted by reading the text.

The other key feature is that it looks like real language. The words are the right length, and the characters occur with the right distributions.

A problem is that Lorem Ipsum is not necessarily good gobbledygook for all languages, or even for all subject matter. We would expect the distribution of letters to be different in a novel than in a scientific article.

Your task, then, is to create a tool that, given an example text, generates lorem-ipsum-style gobbledygook that suitably imitates the characteristics of that text.

The Task

Given a UTF-8 text input input-text and a number x, output x characters of gobbledygook following the style of that text.

It may be assumed that words in the input text are separated by the space character.

You may take the input in any reasonable form (though it must be able to accept UTF-8), e.g. reading from STDIN, reading from a file, as a string passed to a function, etc. Similarly for the output.
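For illustration, a minimal sketch of one such input form in Python (the language choice and the argument convention are assumptions for this example, not part of the spec): read raw bytes from STDIN and decode them explicitly, so the program accepts UTF-8 regardless of the terminal's default encoding.

import sys

# Decode explicitly rather than relying on the locale's default encoding.
input_text = sys.stdin.buffer.read().decode("utf-8")
x = int(sys.argv[1])  # the number of characters to output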

Criteria

  • Must not be real, correct text in the language of input-text.
    • In particular, it must contain no words of more than 3 characters that occur in input-text.
    • This can be trivially ensured by using a hash set to ban every word from input-text from occurring in the output (see the sketch below).
  • Must look like it could have been real text from the language of input-text.
    • That is, it must have letter and word-length distributions that are convincing to the eye of the reader.
      • It is up to the implementation whether this means unigram, bigram, trigram, or other statistics for the distributions.
      • The goal is for it to look right.
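To make the banning criterion concrete, here is a minimal sketch in Python (the function name and the unigram model are illustrative assumptions, not a reference implementation): it samples word lengths and letters from the input's own distributions, and uses a hash set to reject any real word of more than 3 characters.

import random

def gobbledygook(input_text, x):
    words = input_text.split(" ")
    banned = set(words)                      # hash set of all real words
    letters = [c for w in words for c in w]  # unigram letter distribution
    lengths = [len(w) for w in words if w]   # observed word lengths
    out = []
    while sum(len(w) + 1 for w in out) < x:
        word = "".join(random.choices(letters, k=random.choice(lengths)))
        if len(word) > 3 and word in banned:
            continue  # criterion: no real words longer than 3 characters
        out.append(word)
    return " ".join(out)[:x]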

Judging Texts

For the purposes of judging, please provide 256 characters of output based on each of the following public-domain texts:

gist with all files

Note that these are all UTF-8 encoded; in particular Die Verwandlung, being in German, uses many characters outside [a-zA-Z].

Lyndon White

Posted 2017-07-08T01:52:21.353

Reputation: 1 021

Question was closed 2017-07-12T00:11:54.643

What's the winning/scoring criterion for this challenge? – LegionMammal978 – 2017-07-08T02:24:10.230

@LegionMammal978 It is tagged popularity-contest, which means the judges will be the voters. – LyricLy – 2017-07-08T02:25:53.713

This reminds me of the time I wrote a Markov chain program to generate several pages of grammatically correct English gibberish (real words, real grammar) to submit to a "write fiction badly" contest. It (together with the entry I submitted the following year) was sufficient to get me banned forever (and subsequently forced to judge). – Draco18s no longer trusts SE – 2017-07-10T19:41:12.347

Answers

4

Python 3

import string
import random

# Read the source text from a file; the word count comes from STDIN.
with open('lorem.txt') as f:
    text = f.read()
number = int(input("Input number: "))

# Strip punctuation, then split on spaces.
dictionary = ''.join(ch for ch in text if ch not in set(string.punctuation)).split(" ")

# All distinct characters that occur in the text.
chars = list({char for word in dictionary for char in word})

# For each character, record every character observed immediately after it.
occurrences = {char: [] for char in chars}

for word in dictionary:
    for idx, char in enumerate(word):
        try:
            occurrences[char].append(word[idx + 1])
        except IndexError:
            pass

# Word-length distribution: the list of all observed lengths.
lengths = [len(x) for x in dictionary]

result = ""

# Build each word by picking a random length and starting character,
# then repeatedly sampling a successor of the last character.
for _ in range(number):
    length = random.choice(lengths)
    word = random.choice(chars)
    for _ in range(length):
        try:
            word += random.choice(occurrences[word[-1]])
        except IndexError:
            break
    result += word + " "

print(result)

Takes the text input from a file named lorem.txt and the number of words from STDIN.

Samples:

De Finibus:

bolute moresu xisi libospssev lolu ute tamqump Sercolimo mnci iaerue unt xptemp ptur xiui mumnsc ner citerum hiositius fun Set llu hibol etet mpesi amitap voserata mol iumol ospe pte ell asequmut

Die Verwandlung:

Frehwauchmi « vo ineic venzundenwasender geitende « darenenserttem Semint estt Scklode Zur g Zen»e Orer urd cher s Bet kun renaunn

Pride and Prejudice:

paime k hothenoc 5 “ 230 erth ZZWedomincab Mrevoueim Vifedrctogei Come Kidi yol qul “Thant Kie Rag 41 mf vate 12 Eled 8 Dathed 1 mb Bed “Y merie xto zasorl h’sh ghasst Wha quce skeniat zan dishond othing hare hing din

And finally, the Magna Carta:

wincand zisend s Pes CEFured indons fequ dindshay ranen Gofun OURore ibec 6248 Rofonestheasaven sul nthal 232 rend matyont r uiconlicend ucriof 7 ashesevenye Gerd BEndrouse IIns jurig 20000 UREndit mandira 0000000000s a wic 47 Enccet

It's not my best work.

All I'm doing here is a simple Markov-chain recreation of the original text, with word lengths drawn from a list of all the word lengths in the given text.

I sort of hoped for better results, but oh well.
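For reference, the occurrences table the code above builds is just a successor map; the same first-order chain can be sketched more compactly (collections.defaultdict here is an illustrative alternative, not the answer's own code):

import collections

successors = collections.defaultdict(list)
for word in dictionary:
    for a, b in zip(word, word[1:]):
        successors[a].append(b)  # keep repeats so frequent pairs stay likely
# random.choice(successors[c]) then samples P(next char | current char = c).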

LyricLy

Posted 2017-07-08T01:52:21.353

Reputation: 3 313

Looks like you are having Unicode issues with Die Verwandlung, as ¼ never occurs in the original. – Lyndon White – 2017-07-08T03:13:26.597

Sorry, I saved the file in the wrong format. Should be good now. – LyricLy – 2017-07-08T03:18:54.007

'ch for ch in text if ch not in set' – Matthew Roh – 2017-07-08T08:38:54.887

4

Clojure

This was hard! I started this last night at 9 pm thinking I could knock something out quickly, no problem. All the little problems I didn't anticipate when I started began to pile up, though, so I had to put it on hold until this morning.

It's still not great, but it does the job.

Note: For some reason, IntelliJ/Cursive's REPL (which I used to test and generate the phrases) can't accept code greater than 65k characters in length, which means I couldn't simply paste some of the sample texts into the REPL. I tried reading them from a file, but for some reason accented characters weren't read properly, despite specifying UTF-8 encoding. As a workaround, I used subsets of the texts as the samples to generate from.

  • De Finibus:

    vitsmai qomsme tainrte,e qanerim toaurtatt roliaum veaiieq foimpq smarsu mneurutds drtameitt vndnceui cesolleqd Ambrsuou hntumetni euuhcqitu eoseiuane fusipdm,t depamme vofuuet uamsel aumsatet aaee.idu, metosa etqirtai ialisua vsleataa vellpuen aalui, celaqa eultnsec celili eolisi,t qcrm,qets qolniani earure elfucnsn donate qitua,at. aimaau dmgaim aeumltn qori,et cogerentt serlitmt. uaqua,dr aososlt dulir,m oelmmc ietiqctti aupniuon semionu aerciamti qlmomi hinldt htnuaiqa cnnoot cetultad nutcliaa aolstu puouru nfiuitmiu lapopttr rxsutea ood,rut aiseuc quoe.rmi ntiemhb ueter,sb delsrias titopduto qurorcd,t vxiupsi sem,reer iaeesit, mooila eolnoa infosmese exni,uns qtnpeet peiealna lolopt duosrtun oufmttn omsarhdsa veri,rme doilrnt hulstqo cxlusa uasputua qtlpl, certppmtd qopu,tmi oonires oosoia fu,iepa eedumtn,e sesiot,t rtioeam helnora rrnoeeas aslceiist ealimtdd qolsmnas eetgoeta tomema eemurst exoacumtn tuioeq,sa vtauctdst oarittsbt lonpstoa, famade,qa empoenuti minisatie vuiuseuis bxnmii nafuiemsi plscrt vusuiiu macirs cugaesa vttsitm dumudta ruluiqu easimin vsceicnt tuqmeiuns aoitm, inaulnac vfsioe,nu iomistaa. qalisi lanurr eiutma qesbtdi eprurni qintp,iim iuiamt setuoteee qugpeau hfge,e eosqpe dicemte vamerdm noluisi euceaaa eliiutii iinnat ioiuraa aubie,uat ctret,u ruaemi,a, noq,re dupiet eugesr nsnosema ouseipon qasupeeci axtilemt aaaedmdni tssimac ouptpicii notunein qiclo,ist eetummea aelste aictd, vanerean qulurus fenumit, ciieriste dtascec eolirliee bemuam qactiiaei cirudadn qupespase eerqednb vaiimua ltmemsia mucnpu qpmttea militu outpeqct caiomeetu mid,semd ailudtm eumcpri duniics negusqat nunnsu nobod,ntu aoss,it mopaueta dubupntao Seluoiatd atteresc ouisrmn mutdri ee,iri lonoeuo meirqudi montutut qre,ain foitute vaiupmi seilpusin eaoeei,i eetldidse heguise fetbra,sd deri,ssai qirrpcua dothii qiioutub dol,iem moacdss uocr,iit eassrreti ilouptutd qtgupc qiuicet vtiusitam aesupti qaileats qobeee, qornstesi tucpmitna sunuoqdt noclunmc eopisstr aicnppts suuimtett telom, ontimtc aosoren,i stbirstre qpctd,tb aunviu eapaott dnrutmt eoiure iiliei sulootitd iepidtsb, excpraii factm,tse outittt fbgira suaapeiut atniiet vuiireo vtes,eisi qtseueai qour,ea

  • Pride and Prejudice:

    eoonr. Wosr sotd. siwlt aaaer” ohe” osxalp shrprc afel nhsn ttydt aetr ohtti aotu.n tonl bouno Inui woptur weomtf hhnars ohte oisms Laurl _nje_y thr_ Mofl,d aist! rdne snage iasi mier.n onen bolh myyier Nont,c hian ihl!i Bcs””r aoei jewesr Msuwh wodeed aiuwdd arssl” efae mosnlb fuqet onsm “e,eeo vereen roor tlrli” nm.p tsrtid auitl aigm yhru shurr sontav yene. esug hoyvot “odhg ahenke coufn hhcer mike:l mfol heae oadiai iooi fotzdf piey ofu? shn,tc Iied Ihnrg mhoihs naurn IIahkk mosi nfolc Mfekei gieu deoei iHnu ivtu huuti Bo.tsi rysnei mirvn ered are,e ooel wIae Temmoe qoyu fetee, ynhd nontui hgnw aaye fet.kd irro “gtp noatk inrn isdstt tDun thde ohulie watw teness veeee, Mhntsr sai.s Lodue Iteref niergr thst luegn tsuned wii,e, taes iemtsh lhuhs ys..s hahie eehren unet haettd fam.i thurl Lptm our hftle ar.nna ayeoeg hem.,r whnn oner foiet- Bhfe “o.mll thue yoyh cory aean Mehg ho,ncs falntn nstu tnse,. dnutni weuee olmpt ay.nnu coeso lyeteg ancy,p iffg ihele Ooreen arrn “akyan tttn cxno nomi “heh momee an.ra tosts, Efee afi! arow ynele io.z teocr aids sore ioog shrr hmrve aynri Braus guaein bouree lupa aB,.t inru hnuu todnl thdiyi hvanwa botvu hidiet Nilie toa.y wothn. nesth Ioia mhoob whdr tYen oosl yehg Molt mokdn gonetf shrle toieg tooet He,lh bu.ead datog “aru atre soym eiomar yose mome htas, Lonr “otets oidrsi hwanes yimr! anelsh Bhst tNn; fhang, talp mirn “oklbt “eod” dorhit hopc ieddn Maae Bnelp tuuovg iWne co,ye ghsee fa!rn MTenee vodnr hu.a aatrel hoehs yfnmc Meog oenec prrn ieuwae “olei

  • Die Verwandlung:

    inrter Bninrl aäshsi zictth ztgtee.n ireztnr ereoo«de dlifc su,rienn dicgi! duechre gädkean iafiu hbrhür,e öamk, auehse, fhgltr ernsme urcttm uGmartrt eacge wungu eeelle üaise,n dantrid s-ßenn Tamrsd, ierosne fifn, satfg decnt Firüncr goztfekh urcfenl zrg,t donet fartan eishgta eo,dt mrnagks Cutterc iirlo uufbea Wsnhdh hages iicrelnn wimbs wickon Pehuelö? uuertn eleezn üewnlnäe zeiüriti Pidhi Feldaet Banec ahleßelk Nanpüehn eedeuh elflt aüagn mogteäg wehhhorr for,tt h-trtlg öoßdrit Teaue aaedeee necmär vingrrha hai,h vidos ncsne,ne nurlurns Uomeer vrrecoü kcphtd uhrleßl neeke dcrhth amämtza dehbsn Meftet,e dnlwenei Gigrser zagnüe srrcse Wenstt asnbhn Zägele Eebeefn eerhte dihee Nernta. saheter Uihöu fitfu aibeeens Einre Banwf vaneine eardtg fnebtn Sihne hontn hemte, uecßred arrdhe dollnän. Juhno gasdee eeelhn wonlht Aänsüenr vlrssf Dowtlhin Ueeseasl iusnu namhie Binziae wieleet bucdrne maselort Süfiwr strzgdlg dGnite sifgh iusecti Dohutl eucl.g Bbrstnih amehn,üc sarlhe vcredrne krlnat dbrnra enimsnhs larwrn docdää karho Zaths »ersnnhf weire Tirhee drrhm?s iesetehu ialneke geihc disege larsuet, mrintm Ginhdrhe eoieu Msitpn rarzeut, eahßl düernäde eugchn enean zenüor dertnhe Inheuer dlctßcim fitimok hrrsnec säscl Seletete grßrtdce Kunenh bameänßn Wirmg heluhst Uensi gärnre döarrct wrnw,wel datbc Raree lonse mcrseuf aetoet rrncär wrcgelm saste Zirpe uewl,t wceeeskr Aismttme ketfer hnssen Pind,e, einsr?e! diflec mlrlcnft »umgeiß sär,ed nodfst Hbdßen Weensel finhedd lunnt dehsnrg fibsin.a hictfet« vrgturs lehntnm leektfrf brehrrnd brlen eorfem srsgenrn Ditreeen girzee iigeenie devr, daelmrs eanlecse iootohca durgt dalmce srzsh dihrugnt Bbtrwsg iDre,nil aeegesüt Irfcdr etcnruec mlaznegc setirkh Sehteeec kinlaret Ubetetlc gnßfeot eieiuen kantgr wackt -usltrm kereä eerben sahendfh nusarnd ialaac decfear hofüei wbmtee! srnper eegik nünnh woiieb« derlee« dregeeee danhele deetpi gmraeeg esaaez z-täe

  • Magna Carta:

    giotei ooeleeg majegcg aorses fernpt sekml mhmt,p tfaGta deoel (n8rsse andpi ufaih fnoseu efaes hhod,i wieer hogd,e boeer soin IwsoRei bed)t fogds tipdfr neylnme ouehps oedhhem Chuaee hst)sr oascs werooen ou'hin hhansn ofecdsd towsof alee, lube'a Gosisfr boceeae ahdbee laaeh tngdc hheoTct (oyce ouedoe' dnysscg beenru hocOo Jshlsee kralse juilihaerhs walreno (nlln troll onoplat dw9ed senroc aolpmdt toedhh gerdc eiype beilsr gbeibet pedns le.rsu aamtnon aoenri tsiaey torrtv oamle snvsn faaorw dhrss arrber Thvsat rafgmu thnslp pnnril rfrsrt letol whhle Ifoehog fhaeos. trthon fukmrnn hnkei otrvt bieyl armbs mneh, felycmg weter mfhte direon' Ma0slo leelo ane'r hneh, Fress snurwe seaoas poevoc, dasee heeeic, efn)lss gtct- eo9laip wnrrlit lyyei, gndosti oecoe suroe taftu bieel feolrg oortefd loatfv -rleln tacblst Nnrcr nOnnr rtimhc toahlfe Wirrn brfbo aecrhb whehri ohnmi walnn oeorlfs rhpsene cauppn woitsns toynr heapihd oanetd ofghen Nfesrtf gamerg desolsf gr.roeb i9drnf uodthn tnrehne wuae'l iutsse Eonllse poplrsa Ghoel m4teest ffest tauec i3pall, shent, gwadi sitcra ifsoo, Wivwn s1tdeur bagersa fead.,g teseus rncdrne hangss thzel aents ienus, binseae (osa.re trrrs of)io eyfgo, trrml oiado sumsn weygrch ahmlvnn tuednt utepsit iota- Ehe.lm crtlot aet)r crblras aenelf, tupmtae Eelee wtoerrd rfant ifldd fhyalt ffdog thuho tists, teadhl nadtoat tflda, Aielsd otreli trrn,mc feeneut nndcdeu hhnlv lidea, titges afrit wrkor eev)lt, fndni hrmml)d tegvee Ihr,nnf kvqoe cnentr Nedtl onmebba hhwdee eopoi aonthnl aautle uflhu. Retan shclaer Jostsi *prdz ebnnlc ferde hiepraa oa9)eat lisvyn nilos brslbwe btyson aoncs,vi,t tntter stret cifgile tuslene kivtle irmie aylrie Faala-e aneies gatoy wuomsne shyah snoio.d afeh'a drfree khvoget wicrsns caves, hheto (blfs sheoue

(ns mandelbrot.loremipsum
  (:require [clojure.string :as s]))

(defn rand-word
  "Generates a random word based on what letters appeared at what index.
  If the index doesn't exist, it pulls a random character from the sample-text."
  [length pos-map sample-text]
  (apply str
    (for [i (range length)]
      (rand-nth (get pos-map i sample-text)))))

(defn safe-rand-word
  "Helper that loops while the produced input is in the exclusion set.
  May get stuck if the sample size is too small."
  [length pos-map sample-text exclusion-set]
  (loop []
    (let [word (rand-word length pos-map sample-text)]
      (if (exclusion-set word)
        (recur)
        word))))

(defn rand-word-len
  "Generates a length based on avg-len +/- an amount in 
  the (range (- deviation) deviation))."
  [avg-len deviation]
  (+ avg-len
     (rand-nth (range (- deviation) deviation))))

(defn pos-letters
  "Records what characters occured at each index of each word."
  [words]
  ; A two dimensional reduction.
  ; The inner reduction takes an existing pos-map, and updates it with information about the word
  ; The outer reduction combines all the word pos-maps together.
  (reduce (fn [pos-map word]
            (reduce (fn [acc [i chr]]
                      (update acc i #(if % (conj % chr)
                                           [chr])))
                    pos-map

                    ; Maps with the index of each letter
                    (map vector (range) word)))
          {}
          words))

(defn gobbly [n-chars sample-text]
  (let [; Split the text into words, then place the words into a set
        ;  to ensure that they aren't generated.
        words (s/split sample-text #"\s")
        word-set (into #{} words)

        stripped-text (s/replace sample-text #"[,.?!]" "")

        ; Figure out the average word length, then calculate how much
        ;  deviation should be allowed.
        avg-word-len (/ (count sample-text) (count words))
        length-deviation (/ avg-word-len 4)

        ; Find which characters appear at which positions
        pos-map (pos-letters words)]

    ; Generate n-chars many words by...
    (doseq [n (range n-chars)]
      ; ...printing a space, then a randomly generated word.
      (print ""
        (safe-rand-word (rand-word-len avg-word-len length-deviation)
                        pos-map
                        stripped-text
                        word-set)))

    (flush)))

Use it like:

(gobbly 256 "KNOW THAT BEFORE GOD, for the health of our soul and those of our\nancestors and heirs, to the honour of God, the exaltation of the holy\nChurch, and the better ordering of our kingdom, at the advice of our\nreverend fathers Stephen, archbishop of Canterbury, primate of all\nEngland,")

Carcigenicate

Posted 2017-07-08T01:52:21.353

Reputation: 3 295

So it is based on P(letter | index in word), rather than on n-grams (as used in a Markov chain). – Lyndon White – 2017-07-09T07:49:58.457

@LyndonWhite No clue. I have 0 experience with NLP lol (or whatever study that is). This was just the first method to come to mind. Yes, it takes into consideration what place each letter occurs at in each word, and tries to replicate that. It's quite flawed though, as letters that are supposed to occur at the end of small words appear in the middle of large ones, but it's the best I could do. I suppose I could have had different pos_maps for each word length, but I didn't think of that at the time. – Carcigenicate – 2017-07-09T18:20:31.297

Can someone comment on the downvote please? Am I the only one who was downvoted at the same time the challenge was closed? – Carcigenicate – 2017-07-12T01:38:21.710

2

Python 3

I chose to assume almost nothing about the source language. In particular, I chose not to give any special meaning to anything except whitespace. As a result the code doesn't generate sentence-like constructs or valid quotations, but it also doesn't get worse on another language with a different "syntax".

The heart of my code is a Markov chain. The first part of generate() parses each word of the text, collecting the probabilities that a certain character appears after another one in the text, along with some extra data.

The second part of the code then generates words based on those probabilities, along with some validation and modification (a sketch of the core weighted draw follows this list):

  • The Markov chain only generates characters that appear after the previous generated character in the source text.
  • The lengths of the words are distributed as in the input. If the Markov chain does not know how to continue generating a word, the word is discarded.
  • A character must have been seen at its current position in some word, counted both from the start and from the end. (For example, given the input "a-bcd," a dash may only occur as the second character of a word.) This removes a lot of wonky punctuation. After 1000 tries the code may allow an invalid character.
  • Each character is made uppercase with the same probability that a character at that position is uppercase in the source.
  • "Real" words are discarded by a simple set lookup.
  • The whitespace between words is picked from the whitespace sequences in the source.
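The weighted draw that all of these steps rely on is implemented by hand as pick() in the code below; for comparison, a sketch of the same operation using the standard library's random.choices (an equivalent alternative, not the answer's own code):

import random

def pick(counts):
    # Draw one key with probability proportional to its count.
    keys = list(counts)
    return random.choices(keys, weights=[counts[k] for k in keys])[0]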

Code

def generate(text):
    import random
    def increment(dic, key, mid = None):
        if mid is not None:
            if mid not in dic:
                dic[mid] = {}
            dic = dic[mid]
        if key not in dic:
            dic[key] = 0
        dic[key] += 1
    def pick(dic):
        total = sum(dic.values())
        pos = random.randrange(total)
        for key in dic:
            if pos < dic[key]:
                return key
            pos -= dic[key]
    words = text.split()
    markov = {}
    allchrs = {}
    lens = {}
    white = {}
    poss = {}
    uppos = {}
    valpos = {}
    pos = 0
    prev = 0
    for word in words:
        increment(lens, len(word))
        if prev:
            pos += len(prev)
            whi = text[pos:text.index(word, pos)]
            pos += len(whi)
            increment(white, whi)
        prev = word
        curr = 0
        for p, char in enumerate(word):
            increment(allchrs, char.lower())
            increment(markov, char.lower(), curr)
            increment(poss, p)
            increment(valpos, p, char.lower())
            increment(valpos, p - len(word), char.lower())
            if char.isupper():
                increment(uppos, p)
            curr = char.lower()
    words = set(words)
    output = ""
    while True:
        append = pick(white) if output else ""
        wordlen = pick(lens)
        while True:
            prev = 0
            word = ""
            valid = True
            for i in range(wordlen):
                if prev not in markov:
                    valid = False
                    break
                for _ in range(1000):
                    char = pick(markov[prev])
                    if valpos[char].get(i) and valpos[char].get(i - wordlen):
                        break
                upp = random.random() < uppos.get(i, 0) / poss[i]
                word += char.upper() if upp else char
                prev = char
            if valid and word not in words:
                break
        append += word
        if len(output + append) > 256:
            break
        output += append
    return output

import os
for file in os.listdir():
    if file.endswith(".txt"):
        print(file)
        with open(file, encoding="utf-8") as stream:
            print(generate(stream.read()) + "\n\n---\n")

Output examples

De Finibus

This one is probably the best for my code. Almost resembles lorem ipsum.

do volllem ptem funsc sumoloduma arec aderodi iupen vonsteretae et, pt atisatus monce qupll remeris sasps am coria cecumaedaii ro an isii da vot esi Erenemonere lo vol lasuama min do oluaecodaer coluse ptrel bea ata, ctati vonditacetes mm oma oventquia

The Magna Carta

Looks a bit like Spanish.

bl oranr tr wighibintert han ma am mmpe "im noreervetord h
olandiale tll afin ssh iany hor 48) touro oelllll ce rc ome bl tse pale ota Edeid fonnny," disen oc nd,- kindi ccofore mamar mo hakne fr
beneth te byse me sur hindesweay
ane me rciven asho shidi

Pride and Prejudice

Again, English is hard.

hendver gh th,”

te ther y.” wherorath pprmyo it_t un? stro wam wllyserinof hish ac Wfey’sm sar.” wenenonto ther chu The, L ryo jounddle.” asis i
fto ok il warano rrithiel
amsti hes weteran ay.’sthinsti
merlisomos, asth “llle Hellye yerg

mer ady syo

Die Verwandlung

This one looks almost like German, but gets a lot of weird punctuation.

h
den. zunder an,« di grhtsck ieit erimie Len De ihttergommkor kkuc aledoht hrserogo ereschla Ateg seneichier-,« gonu imori hr sch demaßti aglfhl iht Rnde au Igeng,« ding üb ukrtl iemen dein Wehle nkt ieiei Ügenifregr En g mirgs sworer zwi blimmi er,« kb

The program's own source code

what(wils(le uenc rumid dent pp ren
    ifot inc inos(itpendi pr iss imit
                wor f uenete me orditp ds ourdinorkemit()
                    ist(whar wori nserev[preval "
                    sp iminderreralin):
            m

um
        inc

PurkkaKoodari

Posted 2017-07-08T01:52:21.353

Reputation: 16 699

This does not really look like German... XD – Mega Man – 2017-07-09T07:29:46.650

@MegaMan Well, tried my best. I think some of the words resemble German ones. – PurkkaKoodari – 2017-07-09T10:41:35.660

2

Python 3

This code is very heavily based on the text-generation example program from the Keras machine-learning library. It uses a Long Short-Term Memory (LSTM) recurrent neural network, trained on the data, to generate the output character by character. I would have written the LSTM implementation myself, but I didn't want to reinvent the wheel when there was already such a good implementation capable of performing the task at hand.

The input is taken from a file named input.txt and the output is written to out.txt. I originally just used STDOUT, but Windows CMD didn't like trying to output UTF-8.

The output could be improved for longer corpora (i.e. Pride and Prejudice among the four that were given) by increasing the number of epochs the model trains for, but the runtime for a long corpus was already quite high, so I didn't bother.

from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
import numpy as np
import random
import re

PATH = 'input.txt'
DIVERSITY = 1.0

text = open(PATH, encoding='utf8').read()
print('Corpus length:', len(text))

chars = sorted(list(set(text)))
print('Total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

regex = re.compile('[^a-zA-Z ]')
words = {word for word in regex.sub('', text).split() if len(word) > 3}
print('Total words:', len(words))

maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('NB sequences:', len(sentences))

epochs = max(1, min(20, 200000 // len(sentences)))

print('Vectorization...')
X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        X[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

print('Building model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

model.fit(X, y,
          batch_size=128,
          epochs=epochs)

start_index = random.randint(0, len(text) - maxlen - 1)

generated = ''
sentence = text[start_index: start_index + maxlen]
generated += sentence

for i in range(1000):
    x = np.zeros((1, maxlen, len(chars)))
    for t, char in enumerate(sentence):
        x[0, t, char_indices[char]] = 1.

    preds = model.predict(x, verbose=0)[0]
    next_index = sample(preds, DIVERSITY)
    next_char = indices_char[next_index]

    generated += next_char
    sentence = sentence[1:] + next_char

generated = ' '.join(word for word in generated.split(' ') if regex.sub('', word.strip()) not in words)[:256]

file = open('out.txt', 'w', encoding='utf8')
file.write(generated)
file.close()

Examples

De Finibus

uos et ndllitilum lollreaces vetnocoptam llpe calateciae dorlstem veluatirem.lu ex ept is eptlet neteceppir idararer laror pt lolor.aesuptis caut doquia fiis solopet sen et at nxt at ex ditor aem oond, eo er di iolletiu pllnetaisec cec caperi mi, nd rorxme

Pride and Prejudice

Let us at a too appear icovoim to seen adeain and misougnor. Mr. Coilins count; mander it had nearbaryt tordse sithing naly mone agains at tullnearn of alfactained, illed the eavougant one; but her tamed a retoufly gilliction of her may nos. Comberuable.

Die Verwandlung

rer du nur ers unbeilnmit und wortst zu dinmaunn.« raue gewisch, schof sau damach.« rang, den sie ihrel erspiefen.

allech gegan kammer mihr ihr, als die Gertüs»allen; den BeHohden, umschof hatee und das höbem In zu neune Stume, daß vornpiel gemößtigeß, am

Magna Carta

s by as by geards, and the improved of we renaiss-fur others of any restice of by our welr the issuen the

ard or is Gid insumed to a sarilly ragalint, marners, sim in.

B2. And oney and rentershss lard bagean, in te heish in af any out itfordedvers foroth

notjagan

Posted 2017-07-08T01:52:21.353

Reputation: 4 011

Cool idea, but I see real words in the generated Pride and Prejudice text. – Carcigenicate – 2017-07-10T18:01:38.000

@Carcigenicate Hmm, it seems my filter for the words isn't quite working; "seen" and "appear" managed to make their way into the output (the other words like "tamed" didn't actually occur in the corpus, and instead were found inside other words like "untamed"). I'll take a look and try to see what the problem is. – notjagan – 2017-07-10T18:05:36.123