Musical Tweet Challenge

This is the audio version of the Twitter image encoding challenge.

Design an audio compression format that can represent at least one minute of music in 140 bytes or less of printable UTF-8-encoded text.

Implement it by writing a command-line program that takes the following 3 arguments (after the name of the program itself):

  1. The string encode or decode.
  2. The input filename.
  3. The output filename.

(If your preferred programming language lacks the ability to use command-line arguments, you may use an alternative approach, but must explain it in your answer.)

The encode operation will convert from your chosen audio format to your compressed “tweet” format, and the decode operation will convert from your “tweet” format to the original audio format. (Of course, you're expected to implement lossy compression, so the output file need not be identical to the input, just in the same format.)
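For concreteness, the required command-line interface can be sketched in Python (the codec bodies are placeholders, not an actual entry):

```python
import sys

def encode(infile, outfile):
    # placeholder: convert the chosen audio format to the "tweet" text
    pass

def decode(infile, outfile):
    # placeholder: convert the "tweet" text back to audio
    pass

def main(argv):
    if len(argv) != 4 or argv[1] not in ("encode", "decode"):
        sys.stderr.write("usage: %s {encode|decode} INPUT OUTPUT\n" % argv[0])
        return 1
    {"encode": encode, "decode": decode}[argv[1]](argv[2], argv[3])
    return 0

# sys.exit(main(sys.argv))  # uncomment when running as a script
```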

Include in your answer:

  • The source code of your program, in full. (If it's too long for this page, you may host it elsewhere and post a link to it.)
  • An explanation of how it works.
  • At least one example, with a link to the original audio file(s), the “tweet” text it compresses down to, and the audio file obtained by decoding the tweet. (Answerer is responsible for copyright “fair use” assertions.)

Rules

  • I reserve the right to close any loopholes in the contest rules at any time.
  • [Edited April 24] For the input of your encode function (and output of your decode function), you may use any reasonable, common audio format, whether it be:
    • Uncompressed waveform, like WAV.
    • Compressed waveform, like MP3.
    • “Sheet music” style, like MIDI.
  • Your compressed “tweet” format must actually encode the sounds in the input file. So, the following types of output do not count:
    • A URI or file path giving the location where the actual output is stored.
    • A key to a database table where the actual output is stored as a blob.
    • Anything similar.
  • Your program must be designed to compress generic music files, so don't do stuff that's too obviously tied to your specific example song. For example, if you're demonstrating “Twinkle, Twinkle, Little Star”, your compression routine shouldn't hard-code a specific symbol for the sequence do-do-so-so-la-la-so.
  • Your program's output should actually be able to go through Twitter and come out unscathed. I don't have a list of the exact characters that are supported, but try to stick to letters, digits, symbols, and punctuation; and avoid control characters, combining characters, BIDI markers, or other weird stuff like that.
  • You may submit more than one entry.

Judging criteria

This is a popularity contest (i.e., most net upvotes wins), but voters are urged to consider the following:

Accuracy

  • Can you still recognize the song after it's been compressed?
  • Does it sound good?
  • Can you still recognize which instruments are being played?
  • Can you still recognize the lyrics? (This is probably impossible, but it would be impressive if anyone accomplished it.)

Complexity

The choice of the example song matters here.

  • [Added April 24] This challenge will be easiest with MIDI or similar formats. However, if you take the extra effort to make it work with waveform-type formats, that deserves extra credit.
  • What's the structure? Sure, you can meet the one-minute requirement by simply repeating the same 4 measures an arbitrary number of times. But more complex song structures deserve more points.
  • Can the format handle a lot of notes being played at one time?

The code

  • Keep it as short and simple as possible. However, this is not a code golf, so readability matters more than character count.
  • Clever, complicated algorithms are OK too, as long as they're justified by improved quality of results.

dan04

Posted 2014-04-23T04:21:01.953

Reputation: 6 319

Using MIDI versus WAV is a drastically different challenge. I think you should restrict the formats to WAV only. – grovesNL – 2014-04-23T04:58:51.113

Have changed the rules to allow only waveform-type formats. – dan04 – 2014-04-23T12:56:46.467

I'm keen to see any solutions, but to be honest: Packing 60s of sound in 140 bytes means that you have less than 19 bits per second available. There are a few ultra efficient speech encoders, which operate at 300 bps, but these are only able to decode to synthesized phonemes with the aim to produce comprehensible speech and are not in any way able to encode music. – jarnbjo – 2014-04-23T14:34:12.667

Could you provide a sample file? – golfer9338 – 2014-04-23T15:00:46.737

You're asking for software with compression factors many orders of magnitude greater than the current state of the art. If you want sensible answers (i.e., not involving compositions like 4'33" or Funeral March for the Obsequies of a Deaf Man), I'd encourage you to drop the time constraint to 1 second.

– r3mainer – 2014-04-23T16:02:03.960

@squeamishossifrage he didn't say it had to sound recognizable, though. – cjfaure – 2014-04-23T16:05:05.047

Related: http://www.p01.org/releases/140bytes_music_softSynth/

– xem – 2014-04-23T17:52:33.597

As others have said, it's simply not doable unless you allow MIDI. If you are only interested in waveform-based formats, a more realistic goal might be to encode the audio for a single word in a way that it is intelligible. – skibrianski – 2014-04-24T12:18:46.280

There's an argument on chat (and following day) about whether you actually mean 140 bytes or a tweet-sized 140 characters.

– Peter Taylor – 2014-04-24T14:15:05.293

@skibrianski: I've re-allowed the MIDI format. Of course, anyone who wants to perform a Fourier analysis on a waveform file is still welcome to do so. – dan04 – 2014-04-24T14:25:21.627

@dan04: Now the challenge has essentially become "compress text and choose which notes to remove" which isn't very interesting. The direct WAV manipulation would have been more interesting if the 140-limit had been greatly increased and everyone used the same sample, bitrate, etc. – grovesNL – 2014-04-25T02:08:57.383

Answers

26

Scala

Sure, it'd be easier to encode MIDI files, but who's got a bunch of MIDI files lying around? It's not 1997!

First things first: I've decided to interpret a "Unicode byte" as a "Unicode char" and use CJK characters, because:

  • It matches the image challenge
  • Twitter is cool with it
  • I really need those bits

There are a few tricks I use to squeeze every last drop of entropy from the sources:

Firstly, music is made with notes. Moreover, we generally regard the same note in a different octave as the same note (which is why a 12-string guitar sounds right), so we only have 12 possibilities to encode. (When I output a B, for example, I actually output a chord consisting solely of B in all the octaves, a bit like a 12-string guitar.)
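Folding every octave onto the same note can be sketched as mapping a frequency to one of 12 pitch classes relative to a reference (the reference pitch and names here are illustrative, not the answer's actual code):

```python
import math

A4 = 440.0  # reference pitch; any reference works for class folding

def pitch_class(freq):
    """Map a frequency to one of 12 pitch classes (0 = A), ignoring octave."""
    semitones = 12 * math.log2(freq / A4)
    return int(round(semitones)) % 12

# The same note in different octaves folds to the same class:
assert pitch_class(440.0) == pitch_class(880.0) == pitch_class(110.0)
```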

Next, I remember from high-school music class that most note transitions are small (up or down one note). Jumps are less common. This tells us that there's probably less entropy in the jump sizes than in the notes themselves.
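That observation suggests delta-coding the pitch classes before entropy coding. A minimal sketch of the idea (not the answer's exact code):

```python
def to_jumps(notes):
    """Replace absolute pitch classes with deltas; small steps dominate,
    so the jump stream has lower entropy for an adaptive coder."""
    jumps = [notes[0]]  # first note is sent absolutely
    for prev, cur in zip(notes, notes[1:]):
        jumps.append((cur - prev) % 12)  # wrap within the 12 pitch classes
    return jumps

def from_jumps(jumps):
    notes = [jumps[0]]
    for j in jumps[1:]:
        notes.append((notes[-1] + j) % 12)
    return notes

melody = [0, 0, 7, 7, 9, 9, 7]  # do do so so la la so, as pitch classes
assert from_jumps(to_jumps(melody)) == melody
```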

So, our approach is to break our source into a number of blocks - I found 14 blocks per second worked well (side note, I'd always wondered why audio was encoded at 44100 Hz. It turns out that 44100 has lots of factors, so I could have chosen 1, 2, 3, 4, 5, 6, 7, 9, 10, 12, 14, 15, 18, 20, 21, 25, 28 or 30 blocks per second, and it would have divided cleanly). We then FFT these blocks (well, technically not fast, as the library I used isn't fast for non-power-of-2 blocks. And technically I used a Hartley transform, not Fourier).
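The divisor claim is easy to check:

```python
n = 44100  # CD sample rate: 2^2 * 3^2 * 5^2 * 7^2
divisors = [b for b in range(1, 31) if n % b == 0]

# Every listed block rate divides a second of audio into whole samples:
assert divisors == [1, 2, 3, 4, 5, 6, 7, 9, 10, 12, 14, 15, 18, 20, 21, 25, 28, 30]
assert n // 14 == 3150  # 14 blocks/s -> 3150 samples per block
```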

We then find the note that sounds loudest (I used A-weighting, with high and low cut-offs, mostly because it's easiest to implement), and either encode this note, or encode silence (silence detection is based on SNR - low SNR is silence).
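A crude sketch of that peak-picking step, with simple band limits standing in for A-weighting (all names and thresholds here are illustrative):

```python
import math

def pick_note(freqs, mags, lo=50.0, hi=5000.0, snr_db=10.0):
    """Return the loudest in-band frequency, or None for silence.
    Silence is declared when the peak barely rises above the average
    of the remaining bins (a crude SNR test)."""
    band = [(m, f) for f, m in zip(freqs, mags) if lo <= f <= hi]
    peak_mag, peak_freq = max(band)
    noise = (sum(m for m, _ in band) - peak_mag) / (len(band) - 1)
    if noise > 0 and 20 * math.log10(peak_mag / noise) < snr_db:
        return None  # low SNR: encode this block as silence
    return peak_freq

# A clear 440 Hz peak is kept; near-uniform noise reads as silence.
assert pick_note([100, 440, 880, 1760], [1.0, 100.0, 2.0, 1.0]) == 440
assert pick_note([100, 440, 880, 1760], [1.0, 1.2, 1.1, 1.0]) is None
```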

We then translate our encoded notes into jumps, and feed them to an adaptive arithmetic coder. The process for translation to text is similar to the image compression question (but involves some abusive use of BigInteger).
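The text step amounts to base conversion: treat the coder's output as one huge integer and write it in a large base using a contiguous run of CJK code points. The code point range and alphabet size below are assumptions for illustration, not the answer's exact alphabet:

```python
CJK_BASE = 0x4E00   # start of the CJK Unified Ideographs block (assumed range)
ALPHABET = 20902    # originally assigned code points; one char ~14.3 bits

def int_to_cjk(n):
    """Write a non-negative integer in base ALPHABET as CJK characters."""
    chars = []
    while True:
        n, digit = divmod(n, ALPHABET)
        chars.append(chr(CJK_BASE + digit))
        if n == 0:
            return ''.join(reversed(chars))

def cjk_to_int(s):
    n = 0
    for c in s:
        n = n * ALPHABET + (ord(c) - CJK_BASE)
    return n

payload = int.from_bytes(b'arbitrary coder output', 'big')
tweet = int_to_cjk(payload)
assert cjk_to_int(tweet) == payload
```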

So far, so good, but what if the sample has too much entropy? We use a crude psychoacoustic model to remove some. The lowest entropy jump is "no change", so we look at our FFT data to try to find blocks where the listener probably won't notice if we keep playing the previous note, by looking for blocks where the note from the previous block is almost as loud as the loudest note (where "almost" is controlled by the quality parameter).
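That "hold the previous note" pass can be sketched in a few lines (the loudness values here are just FFT magnitudes; names are illustrative):

```python
def hold_previous(block_loudness, picks, quality):
    """block_loudness[i][n]: loudness of pitch class n in block i;
    picks[i]: loudest pitch class in block i.
    Re-use the previous block's note whenever it is 'almost' as loud
    as the real winner, so the jump stream gains cheap 'no change'
    symbols. Lower quality means more aggressive holding."""
    out = [picks[0]]
    for i in range(1, len(picks)):
        prev = out[-1]
        if block_loudness[i].get(prev, 0) >= quality * block_loudness[i][picks[i]]:
            out.append(prev)      # listener probably won't notice
        else:
            out.append(picks[i])  # genuine transition: keep the pick
    return out

# The winner changes from class 0 to class 2, but class 0 is still
# 95% as loud, so at quality 0.9 we hold the old note.
loudness = [{0: 1.0, 2: 0.9}, {0: 0.95, 2: 1.0}]
assert hold_previous(loudness, [0, 2], 0.9) == [0, 0]
assert hold_previous(loudness, [0, 2], 1.0) == [0, 2]
```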

So, we've got a target of 140 characters. We start by encoding at quality 1.0 (max quality), and see how many characters that is. If it's too many, we drop to 0.95 and repeat, until we get to 140 characters (or we give up after quality 0.05). This makes the encoder an n-pass encoder, for n <= 20 (although it's massively inefficient in other areas too, so m'eh).
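The quality search itself is a simple loop; here encode_at stands in for the whole pipeline above:

```python
def fit_to_budget(encode_at, budget=140, step=0.05, floor=0.05):
    """Drop quality in fixed steps until the encoding fits, or give up.
    encode_at(q) must return the tweet text produced at quality q."""
    q = 1.0
    while q >= floor:
        tweet = encode_at(q)
        if len(tweet) <= budget:
            return q, tweet
        q = round(q - step, 2)
    return floor, encode_at(floor)

# Toy model: pretend each unit of quality costs 2000 characters.
q, tweet = fit_to_budget(lambda q: 'x' * int(q * 2000))
assert len(tweet) <= 140
```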

The encoder/decoder expects audio in s16be mono format. This can be achieved using avconv as:

#decoding ogg to s16be, and keeping only the first 60s
avconv -i input.ogg -ac 1 -ar 44100 -f s16be -t 60s input.raw
#encoding s16be to mp3
avconv -f s16be -ac 1 -ar 44100 -i output.raw output.mp3

To run the encoder:

sbt "run-main com.stackexchange.codegolf.twelvestring.TwelveString encode input.raw encoded.txt"
sbt "run-main com.stackexchange.codegolf.twelvestring.TwelveString decode encoded.txt output.raw"

Full code at https://github.com/jamespic/twelvestring.

Pitfall to note: You'll need nayuki's Arithmetic Coding library, which doesn't currently have Maven artifacts available. Instead, you'll need to locally build and install developster's fork of it.

And here are some samples. They sound awful, but just about recognisable:

  • Beethoven's 5th: original, encoded - 刲檁囉罓佖镱賑皌蝩蔲恁峕逊躹呯兲搆摼蝺筶槛庬一掛獴趤笲銗娵纜喫覤粠僭嫭裵獄鱨蠰儏咍箪浝姑椻趕挍呪白鸞盙宠埘謭擆闯脲誜忘椐笸囃庣稨俖咛脔湙弻籚砌鍖裏橈镙訁鹻塿骱踠筟七趇杅峇敖窈裞瘫峦咰呹櫬茏蛏姆臸胝婁遼憀麁黦掏毈喙眝綄鴀耢椚筤菮蟞斗俼湛营筬禴籙嬧窻丄
  • Fur Elise: original, encoded - 訖忢擫鏝拪纒铇鯪薯鉰孝暱埛痏絘僌莻暆鴈屖鳮絒婘譮蠣託騶腀饚緂柤碠瞢碩脅歙棆敗辦冏鄔酾萖苯劺誺軩忇穤锳婁伉巠桭晘酉筟緵俅怚尵鎽蜓崁飿嘔但鈐謽酝屮呤誥俊覊鶮龔癸埙煂臙牎繬肜摲炚雗頨款驦燈菩凧咁楾媡夥俰欓焵韀冊嗥燠鱧駟髉
  • Twinkle Twinkle Little Star: original, encoded - 欠悺矜莳錥鷗谴裴皽鳺憝漿箔皇殤鸧蜻猻丱
  • A fun chiptune: original, encoded - 简詐諥尘牿扫鲑龮箫颫齰蠏騁煟靸阒萎囦鮇雝愯訖芉馄鈿鬦嶏觲沠丆贀蛑蛀漥荤侲咔麑桬鲠僵擕灼攲陇亄鹘琾業纟鵼牤棌匩碆踬葫鶙懲欃铳樯柇鋡疡衻澯伅墇搢囻荸香貱夹咽蟽籐屈锂蛉袒貊屨鈦夜镤沄鍡唦魮骬乔蚖醶矕咻喸碋利褼裊匎嶮窢幘六沧鼝瞮坡葍帷锆邵旻符琨鱴郤栱烇仾椃騋荄嘵統篏珆罽

Update

I tweaked the silence threshold in the code, and re-encoded. The encodings have been updated accordingly. Also, I added another song (technically not open source, but I doubt the original copyright holder will feel their IP is under threat), just for fun:

  • Imperial March: original, encoded - 岼讶湠衢嫵焅喋藭霔憣嗗颟橳蕖匵腾嗅鈪芔区顕樭眀冂常僠寝萉乹俚戂闤灈蟑拷邢音褈霈媬盒萳璥唂焰銯艉鶱縩巻痭虊窻熲紆耺哅淙苉嘏庸锺禒旇蘄籷遪刨繱蕖嬯摺祑仰軈牰杊瘷棏郖弘卄浕眮騜阖鏴鶺艂税寛柭菸採偋隆兎豅蚦紛襈洋折踜跅軩树爺奘庄玫亳攩獼匑仡葾昐炡瞱咏斎煟价藭恐鷖璌榍脅樐嬨勀茌愋

More Updates

I've tweaked the encoder a little, and it's had a surprising impact on quality (I'd forgotten that in DHT, out-of-phase signals are effectively negative, so I was ignoring out-of-phase signals).

An earlier version of the code just took the larger of these out-of-phase signals, but we now take the RMS. Also, I added a fairly conservative window function to the encoder (Tukey, alpha 0.3), to try to combat artifacting.

Everything's updated accordingly.

James_pic

Posted 2014-04-23T04:21:01.953

Reputation: 3 988

"If it's too many, we drop to 0.95 and repeat, until we get to 140 characters" — this quality search can be improved by using binary search (or even binary search with middle point prediction)

– Display Name – 2016-08-27T05:04:36.763

I can't play the Twinkle Twinkle and the chiptune. Fur Elise is quite close, while Beethoven is barely recognizable, haha. – justhalf – 2014-04-29T13:56:21.727

Do you want to try Twinkle Twinkle and the Chiptune again? I think I've fixed the URLs. – James_pic – 2014-04-29T14:01:47.350

It works now. The Twinkle Twinkle is quite decent. But what is happening at the end? – justhalf – 2014-04-29T14:08:57.027

Yeah, I'm not totally sure what's happening at the end. I suspect it's happening somewhere in the arithmetic coding. In an earlier version of the code, the stream was terminated by an EOF symbol, but in some cases the decoder was failing to read the EOF symbol. I suspect I've not closed the BitOutputStream correctly, but I'll look into it. – James_pic – 2014-04-29T14:17:41.207

Yes, in fact that was exactly it. There was a BitOutputStream::close method I'd forgotten to call. I'll fix the code and update the outputs. – James_pic – 2014-04-29T14:22:02.333

11

Python

I don't do any special mangling with regard to UTF-8, so my submission passes the 140-byte requirement. I make no claims about the usefulness, accuracy, or efficiency of my solution.

I used a sample rate of 44100 Hz for the input and output. SAMPLES_PER_BYTE controls the quality of the conversion: the lower the number, the better the sound quality. The values I used are given in the results section.
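The relationship between SAMPLES_PER_BYTE and the 140-byte budget is straightforward: one output byte is emitted per block of samples, so for a minute at 44100 Hz (assuming the final partial block also emits a byte, as the read loop in the code does):

```python
import math

RATE = 44100
SAMPLES_PER_BYTE = 25450
SECONDS = 60

# One byte per block; the final partial block still emits a byte.
blocks = math.ceil(RATE * SECONDS / SAMPLES_PER_BYTE)
assert blocks == 104

# base64 turns every 3 bytes into 4 characters; padding is stripped.
b64_len = math.ceil(blocks * 4 / 3)
assert b64_len == 139  # just under the 140 limit
```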

Usage

Encode

The input file should be a WAV. Only the first channel is encoded.

twusic.py -e [input file] > output.b64

Decode

twusic.py -d [input file] > output.raw

Playing the Decoded Music

aplay -f U8 --rate=[rate of input file] output.raw

The Code

#!/usr/bin/env python
SAMPLES_PER_BYTE = 25450

from math import sin, pi, log
from decimal import Decimal

PI_2 = Decimal(2) * Decimal(pi)

FIXED_NOTE = Decimal('220') # A
A = Decimal('2') ** (Decimal('1') / Decimal('12'))
A_LN = A.ln()

def freq(note):
    return FIXED_NOTE * (A ** Decimal(note))

def note(freq):
    return (Decimal(freq) / FIXED_NOTE).ln() / A_LN

VOLUME_MAX = Decimal('8')
def volume(level):
    return Decimal('127') * (Decimal(level+1).ln() / VOLUME_MAX.ln())

def antivolume(level):
    x = Decimal(level) / Decimal('127')
    y = VOLUME_MAX ** x
    return y - 1

NOTES = [freq(step) for step in xrange(-16, 16)]
VOLUMES = [volume(level) for level in xrange(0, VOLUME_MAX)]


def play(stream, data):
    t = 0
    for x in data:
        x = ord(x)
        w = PI_2 * NOTES[(x&0xf8) >> 3] / Decimal(16000)
        a = float(VOLUMES[x&0x07])
        for _ in xrange(0, SAMPLES_PER_BYTE):
            stream.write(chr(int(128+(a*sin(w*t)))))
            t += 1

NOTE_MAP = {'A': 0b00000000,
    'g': 0b00001000,
    'G': 0b00010000,
    'f': 0b00011000,
    'F': 0b00100000,
    'E': 0b00101000,
    'd': 0b00110000,
    'D': 0b00111000,
    'c': 0b01000000,
    'C': 0b01001000,
    'B': 0b01010000,
    'a': 0b01011000}

def convert(notes, volume):
    result = []
    for n in notes:
        if n == ' ':
            result += '\00'
        else:
            result += chr(NOTE_MAP[n] | (volume & 0x07)) * 2
    return ''.join(result)

TWINKLE = convert('C C G G A A GG' +
                    'F F E E D D CC' +
                    'G G F F E E DD' +
                    'G G F F E E DD' +
                    'C C G G A A GG' +
                    'F F E E D D CC', 0x7)

if __name__ == '__main__':
    from base64 import b64encode, b64decode
    import numpy as np
    from numpy.fft import fft, fftfreq
    import wave
    import sys

    if len(sys.argv) != 3:
        print 'must specify -e or -d plus a filename'
        sys.exit(1)

    if sys.argv[1] == '-e':
        w = wave.open(sys.argv[2], 'rb')

        try:
            output = []
            (n_channels, sampwidth, framerate, n_frames, comptype, compname) = w.getparams()
            dtype = '<i' + str(sampwidth)

            # Find max amplitude
            frames = np.abs(np.frombuffer(w.readframes(n_frames), dtype=dtype)[::n_channels])
            max_amp = np.percentile(frames, 85)

            w.rewind()

            read = 0
            while read < n_frames:
                to_read = min(n_frames-read, SAMPLES_PER_BYTE)
                raw_frames = w.readframes(to_read)
                read += to_read

                frames = np.frombuffer(raw_frames, dtype=dtype)[::n_channels]
                absolute = np.abs(frames)
                amp = np.mean(absolute)

                amp = int(round(antivolume(min((amp / max_amp) * 127, 127))))

                result = fft(frames)
                freqs = fftfreq(len(frames))

                while True:
                    idx = np.argmax(np.abs(result)**2)
                    freq = freqs[idx]
                    hz = abs(freq * framerate)
                    if hz > 0:
                        break
                    result = np.delete(result, idx)
                    if len(result) <= 0:
                        hz = 220
                        amp = 0
                        break

                n = int(round(note(hz)))
                n &= 0x1F
                n <<= 3
                n |= amp & 0x07
                output.append(chr(n))
        finally:
            w.close()
        print b64encode(''.join(output)).rstrip('=')
    else:
        with open(sys.argv[2], 'rb') as f:
            data = f.read()
        data = data + '=' * (-len(data) % 4)  # re-pad only if needed; encode stripped the '='s
        play(sys.stdout, b64decode(data))

The Inputs

My official submission is Impromptu for Pianoforte and Beatbox by Kevin MacLeod. For this file I used a SAMPLES_PER_BYTE of 25450.

I also took the liberty of encoding Twinkle, Twinkle, Little Star with a SAMPLES_PER_BYTE of 10200. It sounds much better.

The Output

Impromptu for Pianoforte and Beatbox

aWnxQDg4mWqZWVl6W+LyOThfHOPyQThAe4x5XCqJK1EJ8Rh6jXt5XEMpk1Epe5JqTJJDSisrkkNCSqnSkkJDkiorCZHhCxsq8nlakfEp8vNb8iqLysp6MpJ7s4x7XlxdW4qKMinJKho

Link

Twinkle, Twinkle Little Star

HBobGlJSUlJSY2FlYVNRUVFCQkJCQjs5PDksKisqGxoZGVFTUVNRREFDQjs6OjoqKykpKVRRVFJDQkJCOjs6OzksKikpGxobG1JSUlNRZWFlYVNSUVFCQkJDQTw5PDorKisqGhsZGRk

Link

tecywiz121

Posted 2014-04-23T04:21:01.953

Reputation: 1 127