Cologne phonetics

Class	Phonetic algorithm
Worst-case performance	O(N)
Best-case performance	O(N)
Average performance	O(N)
Worst-case space complexity	O(N)

Cologne phonetics (also Kölner Phonetik, Cologne process) is a phonetic algorithm which assigns to words a sequence of digits, the phonetic code. The aim of this procedure is that identical sounding words have the same code assigned to them. The algorithm can be used to perform a similarity search between words. For example, it is possible in a name list to find entries like "Meier" under different spellings such as "Maier", "Mayer", or "Mayr". The Cologne phonetics is related to the well known Soundex phonetic algorithm but is optimized to match the German language. The algorithm was published in 1969 by Hans Joachim Postel.

Method

The Cologne phonetics matches each letter of a word to a digit between "0" and "8". To select the appropriate digit, at most one adjacent letter is used as a context. Some rules apply specifically to the initials of words. In this way similar sounds are supposed to be assigned the same code. The letters "W" and "V" for example, are both encoded with the number "3". The phonetic code for "Wikipedia" is "3412" (W=3, K=4, P=1, and D=2). Unlike the Soundex code, the length of the codes from the Cologne phonetics method is not limited.

Procedure

Letter	Context	Code
A, E, I, J, O, U, Y		0
H		-
B		1
P	not before H	1
D, T	not before C, S, Z	2
F, V, W		3
P	before H	3
G, K, Q		4
C	in the initial sound before A, H, K, L, O, Q, R, U, X
C	before A, H, K, O, Q, U, X except after S, Z
X	not after C, K, Q	48
L		5
M, N		6
R		7
S, Z		8
C	after S, Z
	in initial position except before A, H, K, L, O, Q, R, U, X
	not before A, H, K, O, Q, U, X
D, T	before C, S, Z
X	after C, K, Q

That for the letter "C" the rule "SC" has priority over the rule "CH" was taken into account by the addition of "except after S, Z" in line 10 of the table. This is not explicitly mentioned in the original publication but can be inferred from the examples listed there, e.g. for "Breschnew" the code "17863" is specified.

Lowercase letters are encoded accordingly; all other characters (such as hyphens) are ignored. For the umlauts Ä, Ö, Ü, as well as ß, that are not taken into account in the conversion table, it suggests itself to match them to the vowels (code "0") respective to the group S, Z (code "8").

Processing of a word is done in three steps:

Encode letter by letter from left to right according to the conversion table.
Remove all digits occurring more than once next to each other.
Remove all code "0" except at the beginning.

Example

The name Müller-Lüdenscheidt will be coded as follows:

Encode each letter: 60550750206880022
Collapse of all multiple consecutive code digits: 6050750206802
Remove all "0" digits: 65752682

Literature

Hans Joachim Postel: Die Kölner Phonetik. Ein Verfahren zur Identifizierung von Personennamen auf der Grundlage der Gestaltanalyse. in: IBM-Nachrichten, 19. Jahrgang, 1969, S. 925-931.

gollark: Was it fixed?

gollark: Hmm, it doesn't want to link that.

gollark: https://pronouny.xyz/u/</html>

gollark: They must have put more work into the cat pictures than security.

gollark: I would not be entirely unsurprised if the passwords weren't hashed.

External links

Martin Wilz: Aspekte der Kodierung phonetischer Ähnlichkeiten in deutschen Eigennamen (PDF-Datei; 502 kB). Magisterarbeit an der Philosophischen Fakultät der Universität zu Köln, 2005; enthält eine Implementierung in der Programmiersprache Perl.
Maroš Kollár: Perl-Implementierung der Kölner Phonetik und ähnlicher Verfahren als freie Software im CPAN (Comprehensive Perl Archive Network)
Andy Theiler: PHP und Oracle PL/SQL-Implementierung der Kölner Phonetik
Nicolas Zimmer: PHP-Implementation der Kölner Phonetik in einem Kommentar zum Eintrag soundex im PHP-Manual, 2008.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.