Telegraphy Golf: Decode Baudot Code

31

3

Background

In 1870 Émile Baudot invented Baudot Code, a fixed-length character encoding for telegraphy. He designed the code to be entered from a manual keyboard with just five keys; two operated with the left hand and three with the right:

Baudot 5-key keyboard

The right index, middle and ring fingers operate the I, II, and III keys, respectively, and the left index and middle fingers operate IV and . (Henceforth I'll use their Western Arabic numerals, i.e. 1, through 5.) Characters are entered as chords. To enter the letter "C," for example, the operator presses the 1, 3, and 4 keys simultaneously, whereupon a rotating brush arm reads each key in sequence and transmits a current or, for keys not depressed, no current. The result is, in modern terms, a 5-bit least-significant-bit-first binary encoding, in which our example, "C," is encoded as 10110.

5 bits??

You might be thinking that 5 bits, which can express at most 32 unique symbols, isn't enough for even all of the English letters and numerals, to say nothing of punctuation. Baudot had a trick up his sleeve, though: His character set is actually two distinct sets: Letters and Figures, and he defined two special codes to switch between them. Letter Shift, which switches to Letters mode, is activated by pressing the 5 key alone (00001), and Figure Shift is activated with the 4 key (00010).

Challenge

Your challenge is to write a program or function that decodes Baudot Code transmissions.

A real transmission would begin with some initialization bits, plus a start and stop bit before and after each character, but we're going to skip those and only worry about the 5 unique bits for each character. Input and output formats are discussed below.

Baudot's Code

There are two different versions of Baudot Code: Continental and U.K. We're going use the U.K. version, which doesn't include characters like "É" from Baudot's native French. We're also going to leave out all of the symbols in the U.K. version that aren't among the printable ASCII characters. You will only have to decode the characters in the table below, all of which are printable ASCII characters except the final three control characters that are explained below the table.

The "Ltr" column shows the characters in Letter mode and "Fig" shows the Figure mode characters:

        Encoding             Encoding
Ltr Fig  12345       Ltr Fig  12345
--- --- --------     --- --- --------
 A   1   10000        P   +   11111
 B   8   00110        Q   /   10111
 C   9   10110        R   -   00111
 D   0   11110        S       00101
 E   2   01000        T       10101
 F       01110        U   4   10100
 G   7   01010        V   '   11101
 H       11010        W   ?   01101
 I       01100        X       01001
 J   6   10010        Y   3   00100
 K   (   10011        Z   :   11001
 L   =   11011        -   .   10001
 M   )   01011        ER  ER  00011
 N       01111        FS  SP  00010
 O   5   11100        SP  LS  00001
 /       11000

The last three rows in the right column are control characters:

  • ER is Erasure. Baudot's telegraphy machines would print an asterisk-like symbol for this character to tell the reader that the preceding character should be ignored, but we're going to be even nicer to the reader and actually omit (do not print) the preceding character. It acts the same in both Letter and Figure mode.

  • FS is Figure Shift. This switches the character set from Letters to Figures. If the decoder is already in Figure mode, FS is treated as a Space (ergo SP in the "Ltr" column). When the decoder is in Figure mode it stays in Figure mode until an LS character is received.

  • LS is Letter Shift. It switches the character set from Figures to Letters. If the decoder is already in Letter mode, LS is treated as a Space. When in Letter mode the decoder stays in Letter mode until an FS character is received.

The decoder always starts in Letter mode.

Here's an example with Figure Shift, Letter Shift, and Space:

01011 10000 00100 00001 00010 10000 11100 00001 10101 11010
  M     A     Y   LS/SP FS/SP   1     5   LS/SP   T     H

This yields the message MAY 15TH. As you can see, the first 00001 (Letter Shift/Space) character acts as a space, because the decoder is already in Letter mode. The next character, 00010 (Figure Shift/Space) switches the decoder to Figure mode to print 15. Then 00001 appears again, but this time it acts as Letter Shift to put the decoder back in Letter mode.

For your convenience, here are the characters in a format that's perhaps easier to digest in an editor, sorted by code:

A,1,10000|E,2,01000|/,,11000|Y,3,00100|U,4,10100|I,,01100|O,5,11100|FS,SP,00010|J,6,10010|G,7,01010|H,,11010|B,8,00110|C,9,10110|F,,01110|D,0,11110|SP,LS,00001|-,.,10001|X,,01001|Z,:,11001|S,,00101|T,,10101|W,?,01101|V,',11101|ER,ER,00011|K,(,10011|M,),01011|L,=,11011|R,-,00111|Q,/,10111|N,,01111|P,+,11111

Input

Input will be a string, array, or list of bits in least-significant-bit-first order. Each character will be represented by a quintet of 5 bits. Bits may be in any reasonable format, e.g. a binary string, an array of 0s and 1s, a string of "0" and "1" characters, a single very large number, etc., as long as it maps directly to the bits of the transmission.

Every transmission will have at least one printable quintet and at most 255 quintets (printable or otherwise), i.e. 5–1,275 bits inclusive.

The input can contain only the bits of the transmission, with two allowed exceptions: Any number of leading or trailing 0 bits and/or, for string input, a single trailing newline may be added to the transmission. Leading or trailing bits or characters cannot be added before or after each quintet, i.e. you cannot pad each quintet to 8 bits (or take each quintet as a single number in an array—unless your language has a 5-bit integer type) or separate quintets with any additional bits, e.g. "01111\n11100".

Notes & edge cases

  1. The transmission will contain only the characters in the "Ltr" and "Fig" columns in the table above. You will never receive e.g. 01110 in Figure mode, because it is absent from the "Fig" column.

  2. It is assumed that the decoder will always be in Letter mode at the beginning of a transmission. However, the first character may be an FS character to switch to Figure mode immediately.

  3. When the decoder is in Letter mode, it may receive an LS character, and when it is in Figure mode it may receive an FS character. In either event a Space character must be printed (see Output).

  4. The ER character will never be the first character in a transmission, nor will it ever immediately follow an LS, FS, or another ER.

  5. An FS character may immediately follow an LS character and vice versa.

  6. Neither the LS nor FS character will be the last character in any transmission.

  7. The / and - characters may be received in either Letter mode (codes 11000 and 10001, respectively) or Figure mode (10111 and 00111).

Output

Output may be in any reasonable format, the most reasonable being ASCII (or UTF-8, for which all of the represented characters are the same as ASCII). Please indicate in your answer if your output is in another encoding or format.

Notes

  • The space character (see 3. above) should be an ASCII space (0x20) or your encoding's equivalent, i.e. what you get when you press the space bar.

Winning

This is . The shortest code in bytes wins.

Restrictions

  • Standard loopholes are forbidden.

  • Trailing spaces and/or a single trailing newline are allowed. Leading spaces or other characters (that are not part of the transmission) are disallowed.

  • You may not use any built-in or library functions that decode Baudot Code (or any of its descendants, e.g. Murray Code, ITA-1, etc.).

Test Cases

Input: 001101000010100111101110010101
Output: BAUDOT
Input: 11010010001001100011110111101111100
Output: HELLO
Input: 01011100000010000001000101000011100000011010111010
Output: MAY 15TH
Input: 0001000100010000001000001011101110011100101010010110101010001111100101
Output: 32 FOOTSTEPS
Input: 10110000110101011100111100001111011010000001101110
Output: GOLF
Input: 000100011000001111100000100010110111001100010110010000111111
Output: 8D =( :P
Input: 0000100001000010000100010001111011111011000011100010001
Output (4 leading spaces):     -/=/-

Jordan

Posted 2016-09-21T13:22:32.803

Reputation: 5 001

Related. – Jordan – 2016-09-21T13:22:50.177

1Note: I encoded the test cases by hand; if you see anything that looks wrong please speak up. – Jordan – 2016-09-21T13:45:17.697

1In the code table and the accompanying digest, code 00010 is listed as SP in letter mode, and FS in figure mode. According to the description, if we're in letter mode and we receive code 00010, we should shift to figure mode, but the values in the table seem to be the other way around. Also, vice versa for 00001. – Sok – 2016-09-21T14:54:57.797

@Sok You're right, I transposed the Ltr and Fig columns in each of those rows. I've fixed both the table and the digest. Thanks! – Jordan – 2016-09-21T15:38:34.050

3This man was pretty damn smart, I never knew about the compression used in telegraphy. Thanks for the history lesson man. – Magic Octopus Urn – 2016-09-21T21:04:06.183

4

@carusocomputing Right?? Baudot had no formal education beyond primary school, but not only did he invent Baudot Code, he invented a multiplexing system that allowed four operators to use a single telegraph line simultaneously. I found this 1919 pamphlet that describes his inventions in some detail super interesting: http://www.samhallas.co.uk/repository/telegraph/b6_baudot_multiplex.pdf

– Jordan – 2016-09-21T21:17:41.147

For erasure, is it permissible to use the backspace character ('\b' aka '\x08'), or must the previous character actually not be printed? The backspace character works on most terminals that I have come across, but it doesn't work on things like TIO. – Phlarx – 2018-02-05T21:56:46.467

@Phlarx Sure, I'll allow it. – Jordan – 2018-02-05T21:57:48.403

Is Baudot Code a valid output encoding? – 12Me21 – 2018-02-06T00:18:38.437

Answers

6

Pyth, 98 97 95 93 90 83 80 bytes

The code contains unprintable characters, so here is a reversible xxd hexdump:

00000000: 753f 7133 4a69 4832 5047 2b47 3f3c 334a  u?q3JiH2PG+G?<3J
00000010: 4040 6332 2e22 275a 75ae 5751 fb4e 3cd7  @@c2."'Zu.WQ.N<.
00000020: 02ce 8719 aac1 e0e0 fe1f 09e5 85bc a767  ...............g
00000030: 8e0c 1f47 508a cad1 1acb b26f 951e e5d6  ...GP......o....
00000040: 225a 4a2a 5c20 715a 3d5a 744a 637a 356b  "ZJ*\ qZ=ZtJcz5k

Try it online. Test suite.

Quite long, but the lookup table does take up most half of the space.

For 117 bytes, here's the same thing without unprintables (needs ISO-8859-1, though):

u?q3JiH2PG+G?<3J@@c2."'Zu®WQûN<×\x02Î\x87\x19ªÁààþ\x1f\tå\x85¼§g\x8e\x0c\x1fGP\x8aÊÑ\x1a˲o\x95\x1eåÖ"ZJ*\ qZ=ZtJcz5k

Or, for 93 bytes, with no compression on the lookup table:

u?q3JiH2PG+G?<3J@@c2"OVDPYSBREXGMIWFNA-JKUTCQ/ZHL5'0+3;8-2;7);?;;1.6(4;9/;:;="ZJ*\ qZ=ZtJcz5k

PurkkaKoodari

Posted 2016-09-21T13:22:32.803

Reputation: 16 699

5

JavaScript (ES6), 160 158 153 bytes

let f =
    
s=>s.replace(/.{5}/g,s=>(n='0b'+s-1)<2?m-n?(m^=1,''):' ':"? !YSBREXGMIWFNA-JKUTCQ/ZHLOVDP? ?!3 8-2 7) ?  1.6(4 9/ : =5'0+"[n+m*32],m=0).replace(/.!/g,'')

console.log(f("001101000010100111101110010101"));
console.log(f("11010010001001100011110111101111100"));
console.log(f("01011100000010000001000101000011100000011010111010"));
console.log(f("0001000100010000001000001011101110011100101010010110101010001111100101"));
console.log(f("10110000110101011100111100001111011010000001101110"));
console.log(f("000100011000001111100000100010110111001100010110010000111111"));
console.log(f("0000100001000010000100010001111011111011000011100010001"));

Arnauld

Posted 2016-09-21T13:22:32.803

Reputation: 111 334

5

Batch, 306 304 bytes

@echo off
set/pc=
set r=
set d=! !!YSBREXGMIWFNA-JKUTCQ/ZHLOVDP!! !3!8-2!7)!?!!1.6(4!9/!:!=5'0+
set s=2
:l
set/an=(s^&32)+0%c:~,2%%%6*8+0x%c:~2,3%%%14
set c=%c:~5%
if %n%==%s% set/as^^=35&goto l
call set r=%%r%%%%d:~%n%,1%%
if %r:~-1%==! set r=%r:~,-2%&goto l
if not "%c%"=="" goto l
echo %r%

Takes input on STDIN. Since Batch has no binary conversion, I have to fake it using octal and hex conversion.

  • The first two digits are converted from octal (I can't use decimal because the first digit might be 0). Possible values are 00, 01, 10 and 11. The latter two have value 8 and 9 but I want 2 or 3 so I take the remainder modulo 6.
  • The last three digits are converted from hexadecimal. Digits are either 14 or 252 times their desired value, to I take the remainder modulo 14 (252=14*18).
  • c is the coded string
  • r is the result so far
  • d is the decoding array
  • s is the the index (taking the shift state into account) of the character that switches shift state
  • n is the binary decode plus bit 5 of s, which either equals the shift state, in which case the shift state is toggled, or indexes into the decoding array to find the next character (or ! to erase)

Neil

Posted 2016-09-21T13:22:32.803

Reputation: 95 035

3

PHP, 206 Bytes

foreach(str_split($argv[1],5)as$s)($k="# f*YSBREXGMIWFNA-JKUTCQ/ZHLOVDP#l *3#8-2#7)#?##1.6(4#9/#:#=5'0+"[32*$f+bindec($s)])=="*"?array_pop($a):($k=="f"?$f=1:($k=="l"?$f=0:($k=="#"?:$a[]=$k)));echo join($a);

Jörg Hülsermann

Posted 2016-09-21T13:22:32.803

Reputation: 13 026

2

Chip, 1069 bytes

It's a big'n, but was quite fun to write.

Takes input as a string of "1"'s and "0"'s. (Though it really only looks at the low bit.)

 AZZZZ,-o.AZZZZ  AZZZZ,o-.AZZZZ
*\\\\\]oo[\/\\\**//\\\]oo[/\\\\*
*\\\\/]oo[\/\\/**//\\/]oo[/\\\/*
*\\\//]oo[\/\//**//\//]oo[/\\//*
*\\\/\]oo[\/\/\**//\/\]oo[/\\/\*
*\\//\]oo[\///\**////\]oo[/\//\*
*\\///]oo[\////**/////]oo[/\///*
*\\/\/]oo[\//\/**///\/]oo[/\/\/*
*\\/\\]oo[\//\\**///\\]oo[/\/\\*
=
        o--------K-----o
      ,oo.   z---+~S  ,oo.
     ,LooR. !ZZZZ'   ,LooR.
    ,LLooRR.        ,LLooRR.
   ,LLLooRRR.      ,LLLooRRR.
  ,LLLLooRRRR.    ,LLLLooRRRR.
 ,LLLLLooRRRRR.  ,LLLLLooRRRRR. ,~Z
,LLLLLLooRRRRRR.,LLLLLLooRRRRRR.>m'
|||||||oo||||||||||||||oo||||||)/Rz.
xxxxxxxxxxxxxxx)xxxxxxxxxxxxxxxx\^-^S
x)x))))))))))))xx)))))))))))))xx\g
xx)xxxxxxxxxxxxxxxxxxxxxxxxxxx))\f
xxxxxx))xxxxxxxxxxxxx)))))))))xx\e
xx)x))x)xxxxx))x)))))xxxxxxx)))x\d
xx))x))xxx)))xxxxx)))xxxx)))xx)x\c
xx)xx)xx))x))x)xx)xx)xx))x))x)xx\b
x)))))))x)xx)xxxx)x)xx)x)xx)xx)x\a
x)x)x))))))x)x))x)))x)))xx))x))x/f
x)x)x))))))x)x)xxx)xxxxxxxx)x)xx/e
xxxxxxxx))xxxxxx))))x)))xxx)x))x/d
xxxxx))xxxxx)x)xxx)xxx))xx))xx)x/c
xxx)xxx)xxxx)x)xxxxxx))xxx))x))x/b
x)xxx)x)x)xx)xxxxx))x)))xx))xxxx/a

Try it online!

Note: Erasure uses the ASCII backspace character (\x08), which means that they'll look funny in TIO, but they look fine in, say, xterm.

Basic Structure

At the top, above the = line, is the input decoder. It turns the input into one of 32 separate signals. These are sent from the o's above the = to those below.

The triangular mountains of L's and R's just rotate the pattern from separate rows to columns. The grid below that translates each column to its output character. For unknown signals, NUL (\x00) is produced. For the special shifts, instead of printing a character, the little blob to the right changes the mode.

The cable-car like thing between the two mountains suppresses any printing between each quintet, otherwise, this would attempt to decode all overlapping quintets too. Try replacing the ! with a space to see this for yourself. (Running in verbose mode -v may also be of interest here.)

I'm not sure how to make this smaller at the moment; it's already quite dense for its size.

Phlarx

Posted 2016-09-21T13:22:32.803

Reputation: 1 366

0

GNU sed, 334 + 1 = 335 bytes

+1 byte for -r flag. Takes input on STDIN.

Looking over old challenges I realized this one would be pretty easy with sed, and good for practice. I haven't attempted any compression, so the lookup table is more than half the code.

s|.*|#@&;01000E211000/%00100Y310100U401100I%11100O500010f 10010J601010G711010H%00110B810110C901110F%00001 l10001-.01001X%11001Z:00101S%10101T%01101W?11101V'00011<<10011K(01011M)11011L=00111R-10111Q/01111N%11111P+10000A111110D0|
:
s/@([01]{5})(.*;.*\1)(..)/\3@\2\3/
t
s/@;.*//
s/#f /@/
s/@ l/#/
s/#(.)./\1#/
s/@.(.)/\1@/
t
s/.<|[#@]//g

Try it online!

Explanation

The code works in two phases: First, it replaces each run of 5 binary digits with the corresponding two characters (letter and figure) from a lookup table. The lookup table is in the format … where is a binary digit and and are the corresponding letter and figure, respectively. % stands in for missing characters (this could be any character other than newline). FS/SP is represented by f<space> and SP/LS is <space>l. ER is represented by <<.

Then it steps through each pair with a "cursor" corresponding to the current mode—# for letter mode, @ for figure mode. The # cursor removes the second character of the pair and then advances to the next pair, and the @ removes the first and advances. In other words, #A1B8 becomes A#B8 and then AB#, and @A1B8 becomes 1@B8 and then 18@. When the # cursor encounters f<space> it deletes it and replaces itself with the @ cursor, and vice versa when @ encounters <space>l.

When no pairs remain, the final cursor is removed along with any characters followed by <.

# Setup: Append a lookup table to the line.
# Also prepends "#" and "@" which we'll use as "cursors" later.
s|.*|#@&;01000E211000/%00100Y310100U401100I%11100O500010f 10010J601010G711010H%00110B810110C901110F%00001 l10001-.01001X%11001Z:00101S%10101T%01101W?11101V'00011<<10011K(01011M)11011L=00111R-10111Q/01111N%11111P+10000A111110D0|

# Phase 1
:
  # Using "@" as a "cursor", substitute for each run of 5 binary digits the
  # two corresponding characters from the lookup table.
  s/@([01]{5})(.*;.*\1)(..)/\3@\2\3/
  t   # Loop (branch to `:`) as long as substitutions are made.

s/@;.*//       # Delete the "@" and lookup table

# Phase 2
s/#f /@/       # FS (f ) in letter mode (#); delete and switch to figure mode (@ cursor).
s/@ l/#/       # LS ( l) in figure mode (@); delete and switch to letter mode (# cursor).
s/#(.)./\1#/   # Letter mode; replace pair with first of pair; advance cursor.
s/@.(.)/\1@/   # Figure mode; replace pair with second of pair; advance cursor.
t              # If any substitutions were made, branch (loop) to `:`.

# Teardown
s/.<|[#@]//g   # Delete characters followed by < (ER) and cursor.

Jordan

Posted 2016-09-21T13:22:32.803

Reputation: 5 001