ruby - 3710 = 90 characters code + 3620 bytes data

require'zlib'
$><<$*[0].chars.map{|x|Zlib::Inflate.inflate File.open(x).read}.join(?0*5e3)

input: a single command line argument, the number to read

output: raw sound data, PCM 8bit/8kHz

This can read any input string, as long as

it only contains characters that are valid file names. For only four more characters of code, you can enlarge that set to all characters.
you have the necessary files.
why oh you space dee oh en apostrophe tee space em i en dee space tee aitch i es period

5e3 encodes the pause between two words. Here, 5000 samples at 8 kHz ≈ 0.6 s. Tweak as desired.
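For readers who find the golfed one-liner hard to follow, here is an ungolfed sketch of the same idea. The method name `speak` and its `dir` and `pause_samples` parameters are made up for illustration; the sketch assumes one deflated file per character, named after that character, as described above.

```ruby
require 'zlib'

# Ungolfed sketch of the one-liner above (names are illustrative, not from the answer).
# For each character of `text`, inflate the file of the same name found in `dir`,
# then join the clips with `pause_samples` bytes of constant level between them.
def speak(text, dir: ".", pause_samples: 5000)
  clips = text.chars.map do |ch|
    Zlib::Inflate.inflate(File.binread(File.join(dir, ch)))
  end
  # Like the original, the pause is a run of "0" characters (byte value 48).
  # A constant level carries no sound, so the exact value hardly matters.
  clips.join(("0" * pause_samples).b)
end

# When run as a script with an argument, write the raw sound data to stdout.
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  $stdout.binmode
  $stdout.write speak(ARGV[0])
end
```

Saved under a hypothetical name such as speak.rb, it would be run as `ruby speak.rb 42 > out.raw`, and the result played back as 8-bit PCM at 8 kHz. With the default of 5000 samples, each pause lasts 5000 / 8000 = 0.625 s.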

Now, the tricky part is to get the sample files in 4K and yet be able to decompress them easily and in sufficient quality. Here is how I got them:

Take a Text-to-speech engine able to produce sound files. Wikipedia has one.
Feed it a text containing all digits, ideally close together. I used http://en.wikipedia.org/wiki/Base_13
Downsample.
Cut out each part in a sound editor.
Save as a raw file.
Decimate each sample (discard low-order bits).
Deflate.

Now, one has to choose a sample rate and a decimation amount. Decimate too much, and the sound won't be understandable. Too little, and the files don't fit. I have settled for 8 kHz at 3 bits per sample. Here they are: https://github.com/honnza/drops/raw/master/digits.zip

8 kHz * 4 b/sample and higher quality - too big
8 kHz * 3 b/sample - low quality, but it fits into 4K
8 kHz * 2 b/sample - kch kchhhhhhhhh [not understandable]
2 kHz * 8 b/sample - too big
2 kHz * 3 b/sample - kch kchhhhhhhhh
1 kHz * 8 b/sample - kch kchhhhhhhhh
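To get a feel for why fewer bits per sample shrink the deflated files, the following sketch masks a synthetic signal to various bit depths and prints the compressed sizes. The signal (a 200 Hz tone plus a little noise) is only a stand-in, not the actual digit recordings, and the helper names `decimate` and `deflated_size` are invented here; real speech will give different numbers.

```ruby
require 'zlib'

# Keep only the top `bits` bits of each 8-bit sample (3 bits gives the mask 0xE0).
def decimate(samples, bits)
  mask = (0xFF << (8 - bits)) & 0xFF
  samples.map { |s| s & mask }
end

# Size in bytes of the decimated samples after DEFLATE at maximum compression.
def deflated_size(samples, bits)
  Zlib::Deflate.deflate(decimate(samples, bits).pack("C*"), 9).bytesize
end

# Synthetic stand-in for one second of 8 kHz audio: a 200 Hz tone plus noise.
# This is NOT the author's data; it only illustrates the trend.
rng = Random.new(1)
signal = Array.new(8000) do |i|
  tone = 100 * Math.sin(2 * Math::PI * 200 * i / 8000.0)
  (128 + tone + rng.rand(-8..8)).round.clamp(0, 255)
end

[8, 4, 3, 2].each do |bits|
  puts "#{bits} bits/sample: #{deflated_size(signal, bits)} bytes"
end
```

Note that the decimated samples still occupy one byte each before compression. The saving comes from DEFLATE having fewer distinct byte values to encode.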

Here's the decimation script:

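require 'zlib'

# Decimate every *.raw sample to 3 bits and deflate it into a file named after its digit.
Dir.glob "*.raw" do |fname|
  # Read in binary mode. IO#bytes was removed in Ruby 3.0, so go through String#bytes.
  bytes = File.binread(fname).bytes
  # Keep only the three high-order bits of each 8-bit sample.
  bytes.map! { |x| x & 0xE0 }
  dfl = Zlib::Deflate.deflate(bytes.pack("C*"), 9)
  # The output file is named after the first digit in the source file name.
  File.binwrite fname[/\d/], dfl
  puts "done #{fname}: #{dfl.size}"
end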

As for the original challenge: with 3620 bytes of data, 476 bytes of the 4K are left for the code and the file table. That might be slightly too tight, depending on how tiny we can get with a DEFLATE library. If necessary, we can cut a few corners here and there by cropping the audio samples a bit more aggressively. [fo:r] or [o:] doesn't really matter but it saves bytes. I have been somewhat generous when cropping the numbers. Also, a different decimation scheme or sacrificing some decimation for downsampling might help - I'll toy with these later. Also, dropping the DEFLATE headers might save a tiny amount of space.
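As a rough illustration of the header idea: the zlib wrapper that `Zlib::Deflate.deflate` emits adds a 2-byte header and a 4-byte Adler-32 checksum around the DEFLATE data, so ten files carry about 60 bytes of wrapper. The sketch below produces and reads raw streams instead. The helper names `raw_deflate` and `raw_inflate` are invented here, and the sample data is a placeholder rather than a real digit.

```ruby
require 'zlib'

# Compress to a raw DEFLATE stream: negative window bits tell zlib to omit
# the 2-byte zlib header and the 4-byte Adler-32 trailer.
def raw_deflate(data)
  z = Zlib::Deflate.new(Zlib::BEST_COMPRESSION, -Zlib::MAX_WBITS)
  out = z.deflate(data, Zlib::FINISH)
  z.close
  out
end

# Matching decompressor for headerless streams.
def raw_inflate(data)
  z = Zlib::Inflate.new(-Zlib::MAX_WBITS)
  out = z.inflate(data)
  z.close
  out
end

sample  = ([0xE0, 0x80, 0x20, 0x60] * 500).pack("C*") # placeholder, not a real digit
wrapped = Zlib::Deflate.deflate(sample, 9)
raw     = raw_deflate(sample)
puts "zlib-wrapped: #{wrapped.bytesize} bytes, raw: #{raw.bytesize} bytes"
```

In Ruby the longer inflate call would eat part of that saving in code characters, so the trick probably matters more for a hand-written decoder on the target device.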

Concatenating sound samples is quite easy, but 4K is a little cramped. If you are not bound by the 4K limit, I suggest less decimation. 4 bits per sample actually fares quite well and is only slightly bigger.

John Dvorak

Posted 2013-08-02T13:53:52.483

Reputation: 9 048

+1, not bad. The clarity is pretty marginal, though: I tried transcribing a few random numbers and got about a 70% success rate. (I was hoping for something closer to 99%.) I'm also still a little bit on the fence about the honorable mention thing: while you've made a pretty good argument that 4K could be attainable this way, you haven't actually demonstrated it. Even if you ditched ruby for C (which seems easy enough to do; I'd be willing to take that part on faith), could you really fit a DEFLATE decoder in the remaining flash space? Plus, as I noted, the sound quality is pretty bad. – Ilmari Karonen – 2013-08-03T11:14:49.413

Ps. A few tips on better compression: You could pad all the samples to a fixed length with null bytes (which should compress well) and concatenate them into one compressed file, then decompress and slice it. Also, the KZIP trick from this answer could give you better DEFLATE compression. Finally, try editing the combined sound file to replace equivalent phonemes with exact copies.

– Ilmari Karonen – 2013-08-03T11:20:15.013

well, the original sound samples were not exactly understandable either IMO - the downsampling did little damage to that. The smallest DEFLATE library I know - the first one linked by Wikipedia - weighs about 500b. Frankly, do you want me to port the inflater to that specific device? I might get to it actually, but I've never coded for ARM before. – John Dvorak – 2013-08-03T11:25:34.533

I'm pretty surprised about the 70% success rate - I've found the numbers to be easy to understand. Which digits did you confuse the most? – John Dvorak – 2013-08-03T11:28:32.357

Porting it to a Cortex M0 is probably a bit too much to ask (although if you could do that, that'd be awesome!), but I do think that a stand-alone binary (+ data files, if any) fitting under 4k would seem a reasonable demonstration. (No need to statically link in libc for file I/O, since you wouldn't need that on an embedded device, but the DEFLATE code should certainly be counted.) Basically, something that you could post as an answer to the original question on electronics.SE and confidently say "if you compile this for your device, I bet it'll fit". – Ilmari Karonen – 2013-08-03T11:37:16.000

As for the errors, looking at my tests, it seems I consistently transcribed your "1" as a "9" and vice versa. Your "8" also sounds a bit like a "6" to me, and your "3" like a "2". (I bet I could learn to transcribe them correctly with practice, but that's again really outside the scope of the challenge, for the same reason that just emitting beeps is.) – Ilmari Karonen – 2013-08-03T11:43:16.007

How well do you fare transcribing digits from other sources? I think that beating native speakers in terms of legibility is beyond the scope of this challenge. I can upload the undecimated digit sounds. If they aren't easier to understand for you, it's kinda futile to try to develop a better compression scheme. The digits you've mentioned seem pretty distinct to me - but try these anyways: https://github.com/honnza/drops/raw/master/digits-hq.zip

– John Dvorak – 2013-08-03T13:00:02.723

I tried to download the HQ samples, but your link gives me a 404 error. :( Anyway, I just tried it with this speech synthesizer and I haven't made any transcription errors so far.

– Ilmari Karonen – 2013-08-04T15:40:05.623

I'm curious that you don't list any 4 kHz results. 4 kHz at 5 bits would be smaller than 8 kHz at 3 bits. – Peter Taylor – 2013-08-07T08:29:34.477

Ps. I may post a solution of my own later, if I manage to make it produce something that actually sounds understandable. Don't be shy of posting your own, though; at this point, any answer is a good answer. – Ilmari Karonen – 2013-08-02T14:02:58.280

Are we allowed to download a database of spoken digits (and count its size towards the score) or do we have to record our own voice? I doubt I can generate speech samples algorithmically. – John Dvorak – 2013-08-02T15:18:13.007

umm... the "output" section does not specify we must output speech samples. Are we allowed to simply beep ten times? – John Dvorak – 2013-08-02T15:19:59.687

@PeterTaylor: If you count their size as part of your score, it's OK. I was just worried that there might be some system out there that has audio samples of digits buried somewhere in its standard runtime environment. – Ilmari Karonen – 2013-08-02T15:20:03.150

@JanDvorak: No, because that would not really be "understandable to a typical English speaker" without an explanation. – Ilmari Karonen – 2013-08-02T15:20:52.343

@IlmariKaronen I believe "beep---beep-beep---beeeeeep---beep-beep-beep-beep" is quite understandable - especially if I encode a short pause after a group of five. – John Dvorak – 2013-08-02T15:23:43.047

@JanDvorak: Maybe if you're R2D2. But no, that's not the intent of the challenge. I've edited the rules to try to clarify them on this point (and the one brought up by Peter). – Ilmari Karonen – 2013-08-02T15:26:10.603

So AppleScript's say command is out of the question? – arshajii – 2013-08-02T18:10:57.973

@arshajii it sounds like a pre-existing speech synthesis tool. So, yes. – John Dvorak – 2013-08-02T18:26:43.013

Since there seems to be a steady stream of people who don't read the question to the end and post trivial wrappers around heavyweight libraries, it might be worth editing to put even more emphasis on the "Do it yourself" aspect. – Peter Taylor – 2013-08-13T11:46:41.223

Speak digits from 0 to 9 aloud

