Hyphenation algorithm

A hyphenation algorithm is a set of rules, especially one codified for implementation in a computer program, that decides at which points a word can be broken over two lines with a hyphen. For example, a hyphenation algorithm might decide that impeachment can be broken as impeach-ment or im-peachment but not impe-achment.

One reason the rules of word-breaking are complex is that different "dialects" of English tend to differ on hyphenation: American English tends to break words by sound, whereas British English tends to look first to the origins of the word and then to sound. There are also many exceptions, which further complicates matters.

Some rules of thumb can be found in Major Keary's "On Hyphenation – Anarchy of Pedantry".[1] Among the algorithmic approaches to hyphenation, the one implemented in the TeX typesetting system is widely used. It is thoroughly documented in the first two volumes of Computers and Typesetting and in Franklin Mark Liang's dissertation.[2] The aim of Liang's work was to make the algorithm as accurate as he practically could and to keep any exception dictionary small.

In TeX's original hyphenation patterns for American English, the exception list contains only 14 words.[3]

In TeX

Ports of the TeX hyphenation algorithm are available as libraries for several programming languages, including Haskell, JavaScript, Perl, PostScript, Python, Ruby, and C#. TeX itself can be made to show the hyphenation points it finds for a word in the log with the command \showhyphens.
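
As a minimal sketch of how \showhyphens might be used from LaTeX (the document class and the two test words are illustrative choices, not taken from the cited sources), the following file asks TeX to report the hyphenation points it finds. The result is written to the log and terminal, not to the typeset page:

```latex
\documentclass{article}
\begin{document}
% Reports the hyphenation points TeX finds for each word.
% The report goes to the log file; nothing is typeset on the page.
\showhyphens{impeachment ergonomic}
\end{document}
```

The reported break points depend on the hyphenation patterns loaded for the current language, so the same word can be reported differently under different language settings.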

In LaTeX, users can add hyphenation corrections with:

\hyphenation{words}

The \hyphenation command declares allowed hyphenation points. Here words is a list of words, separated by spaces, in which each hyphenation point is indicated by a - character. For example,

\hyphenation{fortran er-go-no-mic}

declares that in the current job "fortran" should not be hyphenated and that if "ergonomic" must be hyphenated, it will be at one of the indicated points.[4]
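
A complete file using this declaration could look like the sketch below. Placing \hyphenation in the preamble and checking the result with \showhyphens are assumptions about typical usage, not something the cited help page prescribes:

```latex
\documentclass{article}
% Exceptions declared here stay in force for the rest of the job.
% "fortran" has no hyphen, so it will never be broken;
% "ergonomic" may be broken only at the marked points.
\hyphenation{fortran er-go-no-mic}
\begin{document}
% Writes the break points now in effect to the log, for checking.
\showhyphens{fortran ergonomic}
This paragraph mentions ergonomic design and fortran code.
\end{document}
```

An exception declared this way overrides the pattern-based result for that word, and it is stored for the language that is current when the declaration is read.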

The command has limitations, however. For example, the stock \hyphenation command accepts only ASCII letters by default, so it cannot be used to correct hyphenation for words with non-ASCII characters (such as ä, é, ç), which are common in almost all languages except English. Simple workarounds exist.[5][6]
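
One commonly used workaround, sketched here on the assumption that the document is compiled with pdfLaTeX, is to load the T1 font encoding so that accented letters are treated as single characters. The cited FAQ entries may describe this or a different route, and the word chosen is purely illustrative:

```latex
\documentclass{article}
% Assumption: pdfLaTeX. T1 provides accented letters as single glyphs,
% which lets them take part in hyphenation.
\usepackage[T1]{fontenc}
% UTF-8 input is the default in LaTeX releases since 2018;
% the line is kept only for older installations.
\usepackage[utf8]{inputenc}
\hyphenation{Schrö-din-ger}
\begin{document}
% Writes the break points now in effect to the log, for checking.
\showhyphens{Schrödinger}
\end{document}
```

With Unicode-aware engines such as XeLaTeX or LuaLaTeX the two encoding packages are not needed, because non-ASCII letters are handled natively.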

References

  1. Major Keary. "On Hyphenation – Anarchy of Pedantry". PC Update. Australia: Melbourne PC User Group. Archived from the original on March 10, 2005. Retrieved October 6, 2005.
  2. Liang, Franklin Mark (Aug 1983), "Word Hy-phen-a-tion by Com-pu-ter", PhD dissertation, Stanford University Department of Computer Science, STAN-CS-83-977
  3. "The Plain TeX hyphenation tables". Retrieved June 23, 2009.
  4. "\hyphenation". Hypertext Help with LaTeX. Yale.
  5. "Accented words aren't hyphenated". TeX FAQ.
  6. "How does hyphenation work in TeX?". TeX FAQ.
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.