Speak digits from 0 to 9 aloud

15

4

Inspired by this question from electronics.SE, here's a challenge for you:

Write a program or subroutine that takes in a sequence of decimal digits (0 to 9) and speaks them aloud, without using an existing speech synthesis tool.

Input:

You may ask for the input digits to be provided in any reasonable format, e.g. as a string of ASCII digits, an array of integers, a BCD-encoded number, etc. If your solution is an executable program, you may take the input as a command line parameter, read it from standard input, or obtain it in any other reasonable manner.

Your program must be able to speak at least eight digits per invocation. You may assume that the first digit is not zero, unless it is the only digit.

Output:

Your program may either speak the numbers directly using an audio device, or it may output a playable sound file. The output file, if any, may be in any standard audio format, or it may consist of raw sample data. If you output raw sample data, please note the appropriate parameters for playback (sample rate, bits per sample, endianness, signed/unsigned, # of channels). Formats supported by aplay are preferred.

You are free to decide the details on how the numbers will be spoken, but your output should consist of English language digits spoken in a manner understandable to a typical English speaker, and it should be clear enough for the listener to be able to accurately transcribe a spoken eight-digit random number. No, just beeping n times doesn't count. Don't forget to include pauses between the digits.

Scoring:

Standard scoring rules apply: Your score is the length of your code in bytes or, if your code is written in Unicode text, in Unicode characters. Lowest score wins. Any language goes.

As the original question on electronics.SE was about embedded programming, I felt it would be appropriate to toss a bone to authors using low-level languages: if your solution is written in a compiled language, you may choose to count the length of the compiled executable file in bytes as your score. (Yes, precompiled bytecode, such as a Java .class file, is OK too.) If you choose to make use of this option, please include a copy of the compiled executable in your answer (e.g. as a hex dump) along with your source code and the compiler version and options you used to generate it.

An honorable mention, along with a +50 rep bounty, will be granted to the first answer that also meets the criteria of the original question, i.e. is capable of running on an embedded MCU with 4 kb of flash and 1 kb of SRAM.

Restrictions:

You may not make use of any files or network resources that are not part of your chosen language's standard runtime environment, unless you count the length of said files or resources as part of your score. (This is to disallow e.g. loading audio samples from the web.)

You may also not use any pre-existing speech synthesis tools or libraries or compilations of audio data (unless you also count their size as part of your score), even if they're included in your chosen language's standard runtime environment.

Ilmari Karonen

Posted 2013-08-02T13:53:52.483

Reputation: 19 513

Ps. I may post a solution of my own later, if I manage to make it produce something that actually sounds understandable. Don't be shy of posting your own, though; at this point, any answer is a good answer. – Ilmari Karonen – 2013-08-02T14:02:58.280

Are we allowed to download a database of spoken digits (and count its size towards the score), or do we have to record our own voice? I doubt I can generate speech samples algorithmically. – John Dvorak – 2013-08-02T15:18:13.007

umm... the "output" section does not specify we must output speech samples. Are we allowed to simply beep ten times? – John Dvorak – 2013-08-02T15:19:59.687

@PeterTaylor: If you count their size as part of your score, it's OK. I was just worried that there might be some system out there that has audio samples of digits buried somewhere in its standard runtime environment. – Ilmari Karonen – 2013-08-02T15:20:03.150

@JanDvorak: No, because that would not really be "understandable to a typical English speaker" without an explanation. – Ilmari Karonen – 2013-08-02T15:20:52.343

@IlmariKaronen I believe "beep---beep-beep---beeeeeep---beep-beep-beep-beep" is quite understandable - especially if I encode a short pause after a group of five. – John Dvorak – 2013-08-02T15:23:43.047

@JanDvorak: Maybe if you're R2D2. But no, that's not the intent of the challenge. I've edited the rules to try to clarify them on this point (and the one brought up by Peter). – Ilmari Karonen – 2013-08-02T15:26:10.603

So AppleScript's say command is out of the question? – arshajii – 2013-08-02T18:10:57.973

@arshajii it sounds like a pre-existing speech synthesis tool. So, yes. – John Dvorak – 2013-08-02T18:26:43.013

Since there seems to be a steady stream of people who don't read the question to the end and post trivial wrappers around heavyweight libraries, it might be worth editing to put even more emphasis on the "Do it yourself" aspect. – Peter Taylor – 2013-08-13T11:46:41.223

Answers

10

ruby - 3710 = 90 characters code + 3620 bytes data

require'zlib'
$><<$*[0].chars.map{|x|Zlib::Inflate.inflate File.open(x).read}.join(?0*5e3)

input: a single command line argument, the number to read

output: raw sound data, PCM 8bit/8kHz
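For players that expect a container rather than raw samples, the stream can be wrapped in a minimal WAV header. The following is only a sketch and not part of the golfed answer: wav_header is a hypothetical helper, and its defaults (8000 Hz, mono, unsigned 8-bit) are taken from the output format stated above.

```ruby
# Build the 44-byte canonical RIFF/WAVE header for uncompressed PCM.
# wav_header is a hypothetical helper, not part of the original answer.
def wav_header(data_size, sample_rate: 8000, channels: 1, bits: 8)
  block_align = channels * bits / 8
  byte_rate   = sample_rate * block_align
  [
    "RIFF", 36 + data_size, "WAVE",
    "fmt ", 16, 1, channels, sample_rate, byte_rate, block_align, bits,
    "data", data_size
  ].pack("a4Va4a4VvvVVvva4V")
end

# Example: one second of constant-level (silent) unsigned 8-bit samples.
pcm = ([0x80] * 8000).pack("C*")
wav = wav_header(pcm.bytesize) + pcm
```

aplay documents unsigned 8-bit, 8000 Hz, mono as its defaults, so the raw output should also play when piped straight into it without any header.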

This can read any input string, as long as

  • it only contains characters that are valid in file names. For only four more chars, you can enlarge that set to all characters.
  • you have the necessary files.
  • why oh you space dee oh en apostrophe tee space em i en dee space tee aitch i es period

5e3 encodes the pause between two digits: 5000 samples at 8 kHz is roughly 0.6 s. Tweak as desired.
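The relation between the sample count and the pause length is plain division by the sample rate. A small sketch, assuming the 8 kHz rate stated in the output format; pause_samples and pause_seconds are hypothetical helper names:

```ruby
# Number of samples needed for a pause of the given length in seconds.
def pause_samples(seconds, rate = 8000)
  (seconds * rate).round
end

# Length in seconds of a pause made of the given number of samples.
def pause_seconds(samples, rate = 8000)
  samples.to_f / rate
end

puts pause_seconds(5e3)   # 0.625
puts pause_samples(0.25)  # 2000
```

Note that the golfed code fills the pause with the character "0" (byte value 48) rather than a mid-scale value. Any constant level is silent in PCM, although the jump to and from that level may be audible as a faint click.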

Now, the tricky part is to get the sample files in 4K and yet be able to decompress them easily and in sufficient quality. Here is how I got them:

Next, one has to choose a sample rate and a decimation amount (bits kept per sample). Too much decimation, and the sound won't be understandable. Too little, and the files don't fit. I have settled for 8 kHz at 3 bits per sample. Here are the files: https://github.com/honnza/drops/raw/master/digits.zip

  • 8 kHz * 4b/sample and higher quality - too big
  • 8 kHz * 3b/sample - low quality, but it fits into 4K
  • 8 kHz * 2b/sample - kch kchhhhhhhhh [not understandable]
  • 2 kHz * 8b/sample - too big
  • 2 kHz * 3b/sample - kch kchhhhhhhhh
  • 1 kHz * 8b/sample - kch kchhhhhhhhh
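To get a feel for how the number of bits kept per sample affects the DEFLATE output size, one can run the same experiment on a synthetic signal. This is only a sketch: the tone-plus-noise signal is an assumption standing in for a real recording, so the byte counts will not match the figures for the actual digit files, and deflated_size is a hypothetical helper.

```ruby
require 'zlib'

# Synthetic stand-in for a recording: a 200 Hz tone plus deterministic noise,
# 0.5 s at 8 kHz, unsigned 8-bit. Real speech will compress differently.
rng = Random.new(42)
signal = Array.new(4000) do |i|
  tone = 60 * Math.sin(2 * Math::PI * 200 * i / 8000.0)
  (128 + tone + rng.rand(-20..20)).round.clamp(0, 255)
end

# Keep only the top `bits` bits of each sample, then DEFLATE at level 9.
def deflated_size(samples, bits)
  mask = (0xFF << (8 - bits)) & 0xFF
  Zlib::Deflate.deflate(samples.map { |s| s & mask }.pack("C*"), 9).bytesize
end

sizes = (2..8).map { |bits| [bits, deflated_size(signal, bits)] }.to_h
sizes.each { |bits, size| puts "#{bits} bits/sample: #{size} bytes" }
```

Because the samples stay one byte each and only their low bits are zeroed, all of the saving comes from DEFLATE finding more repetition, which is why fewer bits per sample compress better even at the same sample rate.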

Here's the decimation script:

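require 'zlib'

# Quantize every 8-bit raw recording to its top 3 bits and DEFLATE it.
# The output file is named after the first digit found in the input file name.
Dir.glob("*.raw") do |fname|
  # Read in binary mode (IO#bytes was removed in Ruby 3.0).
  # 0xE0 = 0b11100000 keeps only the top 3 bits of each sample.
  bytes = File.binread(fname).bytes.map { |x| x & 0xE0 }
  dfl = Zlib::Deflate.deflate(bytes.pack("C*"), 9)
  File.binwrite(fname[/\d/], dfl)
  puts "done #{fname}: #{dfl.size}"
end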
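The effect of the 0xE0 mask is easiest to see on a handful of sample values. A short sketch; top_bits_mask is a hypothetical helper that generalizes the mask to other bit depths, such as the 4 bits per sample mentioned further down:

```ruby
# Mask that keeps the top `bits` bits of an 8-bit sample.
def top_bits_mask(bits)
  (0xFF << (8 - bits)) & 0xFF
end

samples = [0, 31, 32, 127, 128, 255]
mask3 = top_bits_mask(3)            # 0xE0, the mask used by the script
mask4 = top_bits_mask(4)            # 0xF0
p samples.map { |s| s & mask3 }     # [0, 0, 32, 96, 128, 224]
p samples.map { |s| s & mask4 }     # [0, 16, 32, 112, 128, 240]
```

With 3 bits only eight distinct levels survive, which is where both the audible roughness and the good compression ratio come from.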

As for the original challenge: there are 476 bytes of space left for code and the file table. This might be slightly too little, depending on how tiny we can get with a DEFLATE library. If necessary, we can cut a few corners here and there by cropping the audio samples a bit more aggressively. [fo:r] or [o:] doesn't really matter but it saves bytes. I have been somewhat generous when cropping the numbers. Also, a different decimation scheme or sacrificing some decimation for downsampling might help - I'll toy with these later. Also, dropping the DEFLATE headers might save a tiny amount of space.
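The space budget works out as follows. This is a back-of-the-envelope sketch: the flash size and data size come from the post, while the 500-byte inflater figure is only the rough estimate quoted in the comments below and should be treated as an assumption.

```ruby
flash_bytes    = 4 * 1024  # 4 KiB of flash, from the original electronics.SE question
data_bytes     = 3620      # compressed digit samples, from the score above
inflater_bytes = 500       # ASSUMPTION: rough size of a small DEFLATE decoder

remaining = flash_bytes - data_bytes
shortfall = inflater_bytes - remaining

puts "left for code and file table: #{remaining} bytes"  # 476
puts "over budget with that inflater: #{shortfall} bytes" # 24
```

Under that assumption the decoder alone would overshoot by a couple of dozen bytes before any playback code is counted, which is why tighter cropping of the samples is worth considering.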

Concatenating sound samples is quite easy, but 4K is a little cramped. If you are not bound by 4k space, I suggest less decimation. 4 bits per sample actually fares quite well and is only slightly bigger.

John Dvorak

Posted 2013-08-02T13:53:52.483

Reputation: 9 048

+1, not bad. The clarity is pretty marginal, though: I tried transcribing a few random numbers and got about a 70% success rate. (I was hoping for something closer to 99%.) I'm also still a little bit on the fence about the honorable mention thing: while you've made a pretty good argument that 4K could be attainable this way, you haven't actually demonstrated it. Even if you ditched ruby for C (which seems easy enough to do; I'd be willing to take that part on faith), could you really fit a DEFLATE decoder in the remaining flash space? Plus, as I noted, the sound quality is pretty bad. – Ilmari Karonen – 2013-08-03T11:14:49.413

Ps. A few tips on better compression: You could pad all the samples to a fixed length with null bytes (which should compress well) and concatenate them into one compressed file, then decompress and slice it. Also, the KZIP trick from this answer could give you better DEFLATE compression. Finally, try editing the combined sound file to replace equivalent phonemes with exact copies.

– Ilmari Karonen – 2013-08-03T11:20:15.013
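The pad-concatenate-slice tip above could be sketched like this. It is an illustration only: the clips are hypothetical stand-ins for the ten digit recordings, and clip_for is a made-up helper name.

```ruby
require 'zlib'

# Hypothetical stand-ins for ten digit recordings of different lengths.
clips = (0..9).map { |d| ([d * 16] * (100 + 10 * d)).pack("C*") }

# Pad every clip with null bytes to the length of the longest one,
# then compress the concatenation as a single DEFLATE stream.
slot   = clips.map(&:bytesize).max
packed = Zlib::Deflate.deflate(clips.map { |c| c.ljust(slot, "\0") }.join, 9)

# Decoder side: inflate once, then slice out the fixed-size slot for a digit.
def clip_for(packed, slot, digit)
  Zlib::Inflate.inflate(packed).byteslice(digit * slot, slot)
end

puts "slot size: #{slot}, packed size: #{packed.bytesize}"
```

The fixed slot size replaces the per-file table with a single multiplication, at the cost of playing the padding after shorter digits unless their true lengths are stored separately.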

well, the original sound samples were not exactly understandable either IMO - the downsampling did little damage to that. The smallest DEFLATE library I know - the first one linked by Wikipedia - weighs about 500b. Frankly, do you want me to port the inflater to that specific device? I might get to it actually, but I've never coded for ARM before. – John Dvorak – 2013-08-03T11:25:34.533

I'm pretty surprised about the 70% success rate - I've found the numbers to be easy to understand. Which digits did you confuse the most? – John Dvorak – 2013-08-03T11:28:32.357

Porting it to a Cortex M0 is probably a bit too much to ask (although if you could do that, that'd be awesome!), but I do think that a stand-alone binary (+ data files, if any) fitting under 4k would seem a reasonable demonstration. (No need to statically link in libc for file I/O, since you wouldn't need that on an embedded device, but the DEFLATE code should certainly be counted.) Basically, something that you could post as an answer to the original question on electronics.SE and confidently say "if you compile this for your device, I bet it'll fit". – Ilmari Karonen – 2013-08-03T11:37:16.000

As for the errors, looking at my tests, it seems I consistently transcribed your "1" as a "9" and vice versa. Your "8" also sounds a bit like a "6" to me, and your "3" like a "2". (I bet I could learn to transcribe them correctly with practice, but that's again really outside the scope of the challenge, for the same reason that just emitting beeps is.) – Ilmari Karonen – 2013-08-03T11:43:16.007

How well do you fare transcribing digits from other sources? I think that beating native speakers in terms of legibility is beyond the scope of this challenge. I can upload the undecimated digit sounds. If they aren't easier to understand for you, it's kinda futile to try to develop a better compression scheme. The digits you've mentioned seem pretty distinct to me - but try these anyways: https://github.com/honnza/drops/raw/master/digits-hq.zip

– John Dvorak – 2013-08-03T13:00:02.723

I tried to download the HQ samples, but your link gives me a 404 error. :( Anyway, I just tried it with this speech synthesizer and I haven't made any transcription errors so far.

– Ilmari Karonen – 2013-08-04T15:40:05.623

I'm curious that you don't list any 4 kHz results. 4 kHz at 5 bits would be smaller than 8 kHz at 3 bits. – Peter Taylor – 2013-08-07T08:29:34.477