The McGill Billboard Project annotates various audio features of songs from a random sample of the Billboard charts. I scraped this data to produce the following file of chord progressions:

Write the shortest code that outputs the above file (or its contents).

Edit: Here's a quick description of the notation to get you started:

N denotes no chord.
A:min denotes the A minor triad chord.
C#:sus4(b7,9) and other spicy chords are completely described here.

Dustin G. Mixon

Posted 2020-01-24T01:54:32.857

Reputation: 1 833

3That's .... a really long list. Does the output have to be in order? – Jo King – 2020-01-24T04:47:22.643

are compression/decompression libraries allowed? – Hymns For Disco – 2020-01-24T06:54:07.723

1Could you include a brief description of the chord format? Understanding what C#:sus4(b7,9) exactly means may help to compress it. – Arnauld – 2020-01-24T07:42:52.223

What prevents us from reading the file and printing it? – RGS – 2020-01-24T07:55:08.327

@RGS You can do that. As per the rules described in this meta post, your score would be your_code_size + 802213 + 1.

– Arnauld – 2020-01-24T08:05:14.687

Ha ha, seriously! Clever way to ask us to find the shortest logic in pop music, will we be cited in the academic paper? :D Joke aside, what is N? Also I think it would have been better with simpler patterns extracted, or chords used in the songs rather than the whole songs.. – Kaddath – 2020-01-24T08:47:00.947

@JoKing - Yes, the output must be in order. – Dustin G. Mixon – 2020-01-24T09:51:05.733

1@HymnsForDisco - Yes, compression/decompression libraries are allowed, but presumably, the best compressions will leverage insights from music theory. – Dustin G. Mixon – 2020-01-24T09:53:19.357

1@DustinG.Mixon I'm hugely interested, being a musician, but quite discouraged by the 125000+ lines of data, I have a work too :D is there a version with songs titles just for info? Another interesting part would be the relative aspect of notes, as the same progression can be transposed, but that's another matter, the way notes are written make this a bit difficult – Kaddath – 2020-01-24T10:14:59.060

@Kaddath - For the record, the file size is not unprecedented for this community. Song titles (and other interesting information) are available at The McGill Billboard Project website under "Index."

– Dustin G. Mixon – 2020-01-24T10:23:46.063

This challenge would be more interesting if (like the Moby Dick challenge) we hade to output something close to the file, with score depending on both code size and number of errors.

– Robin Ryder – 2020-01-24T16:38:09.753

Answers

Java (JDK), 295709 47044 (gzipped file) + 943 (code) + 1 (file) + 67 (imports) = 48055 bytes

Examining the file manually (with some help from notepad++), I found that there were 976 unique entries in the file, formed of 36 unique characters (plus newlines):

#(),/123456789:ABCDEFGNXabdghijmnsu

I then looked for common patterns, and created a dictionary as follows (key = value):

:maj = ¬
:min = `
\r\nA = "
\r\nB = £
\r\nC = $
\r\nD = %
\r\nE = ^
\r\nF = &
\r\nG = *
\r\nN = _
\r\nX = -
:sus = +
:hdim = =
:dim = [
(9) = }
(#9) = ]
:7 = {
:5 = ~
:aug = ;
#11 = @
b7 = '
maj7 = <
b13 = >
:11 = ?
(11) = \
b:9 = Z
¬^¬ = H
¬%¬ = I
¬$¬ = J
¬"¬ = K
£b¬ = L
¬*¬ = M
b¬"b = O
¬&¬ = P
b¬^b = Q
+4(', = R
¬£ = S
+4(') = T
%b¬ = U
£`£ = V
`7%`7 = W
7"` = Y
"b{] = c
*`7 = e
:13^b = f
`7$ = k
%` = l
^` = o
"` = p
b`7^ = q
b{% = r
cc = t
oo = v
&#~ = w
__ = x
YY = y
&#¬ = z

I then find-and-replace using those items in order:

()->{String s = "A LONG STRING THAT I CAN'T PASTE HERE - SEE TIO LINK";
		String[] d=new String[]{"&#¬","z","YY","y","__","x","&#~","w","oo","v","cc","t","b{%","r","b`7^","q","\"`","p","^`","o","%`","l","`7$","k",":13^b","f","*`7","e","\"b{]","c","7\"`","Y","`7%`7","W","£`£","V","%b¬","U","+4(')","T","¬£","S","+4(',","R","b¬^b","Q","¬&¬","P","b¬\"b","O","¬*¬","M","£b¬","L","¬\"¬","K","¬$¬","J","¬%¬","I","¬^¬","H","b:9","Z","(11)","\\",":11","?","b13",">","maj7","<","b7","'","#11","@",":aug",";",":5","~",":7","{","(#9)","]","(9)","}",":dim","[",":hdim","=",":sus","+","\r\nX","-","\r\nN","_","\r\nG","*","\r\nF","&","\r\nE","^","\r\nD","%","\r\nC","$","\r\nB","£","\r\nA","\"",":min","`",":maj","¬"};
		for (int i=0;i<d.length;i+=2){s=s.replace(d[i+1],d[i]);}return s;}

TIO (sort of).

EDIT

By compressing the string, as suggested in the comments, this answer can then be made shorter.

Using the GZIPped version of the string in a file "f" (size 45708 bytes), the code can then be as follows:

import java.io.*; import java.nio.file.*; import java.util.zip.*; ()->{String s="",l;try{BufferedReader b=new BufferedReader(new InputStreamReader(new GZIPInputStream(new ByteArrayInputStream(Files.readAllBytes(Paths.get("f"))))));while((l=b.readLine())!=null){s+=l;}}catch(Exception e){}String[] d=new String[]{THE SAME DICTIONARY AS THE PREVIOUS CODE - REDACTED HERE TO MAKE ANSWER SHORT ENOUGH};for (int i=0;i<d.length;i+=2){s=s.replace(d[i+1],d[i]);}return s;}

simonalexander2005

Posted 2020-01-24T01:54:32.857

Reputation: 1 157

FYI - If you zip your string, it's only 47,044 bytes, so I think this would beat my answer. – Dustin G. Mixon – 2020-02-11T15:41:53.440

Python 3, 23 + 802213 (size of file we want to reproduce) + 1

Quite a naïve answer IMO

lambda s:open(s).read()

This function takes as argument the file we want to reproduce and then prints it xD

Scoring as per this meta answer, linked by @Arnauld

RGS

Posted 2020-01-24T01:54:32.857

Reputation: 5 047

3Just to be clear, the purpose of my comment was actually to dissuade you from doing so. :-) – Arnauld – 2020-01-24T08:27:53.947

3@Arnauld it is just a basic benchmark while you guys are trying to figure out the best encodings :) – RGS – 2020-01-24T08:29:13.147

3Worst Golf Score Ever :P – Kaddath – 2020-01-24T09:10:28.557

@Kaddath Just look up some Lenguage/Unary answers, those score a lot worse. ;) This ~62 trillion bytes answer of mine is actually rather short for Lenguage. Usually it's more like this 5.71728886e+3431 bytes answer, haha. ;p

– Kevin Cruijssen – 2020-01-24T13:56:16.290

I don't think this counts as a serious contender. – Grimmy – 2020-01-24T18:03:16.790

@Grimmy should I delete it? – RGS – 2020-01-24T18:38:58.903

@Grimmy It's serious all the time there are no other answers – Anush – 2020-01-24T20:07:14.773

MATLAB, 55935 = 10 + 55924 + 1

Compress the text file as a 55924-byte zip file c and then run the code

unzip('c')

We add a byte to the score to account for the extra file.

Dustin G. Mixon

Posted 2020-01-24T01:54:32.857

Reputation: 1 833

Python 3: 93943 = 91,752 + 1990 + 1

This is my messy attempt at a compression. Sadly still worse than simply zipping the file though XD.

The information density of the file is not as great as it seems and there were several observable redundancies. Because of that I have doubts that there is a reasonable way to predict the data. I am prepared to be wrong though :-). The code by itself is not golfed that much, since the main portion of the cost is the size of the compressed files. The decompressor is only 1990 bytes long. The compression itself is done by Two separate huffman trees, implied ":" between them and two one-bit streams that act as a guides. Also there are few simple substitutions in place.

First huffman tree codes the first letter
Second huffman tree codes the portion that follows the ":"
First bit file (a) codes whether the line before the ":" consists of 0-1 or 2 characters
Second bit file (u) codes whether the second character before : is "b" or "#"

Lastly, all the files are compressed via zip since on my work computer we don't have anything else. Might try other compressors later on. zip-input This archive has to be in programs working directory.

The code itself is tied tightly to the file. It does not cover full format specifications listed by the author of the question. (Also, I have hard-coded the input file size into the decoder - after I missed the proper termination one times too many.) There is definitely a room for improvements - for example few more substitutions tied to the numbers in brackets.

There might be a missing trailing newline in the output, that I somehow keep missing while trying to debug.

Huffman tree generator:

import heapq as h
import numpy as np
from bitarray import bitarray

class List:
    def __init__(self, char, freq):
        self.char = char
        self.freq = freq
        self.leftSon = None
        self.rightSon = None

    def __lt__(self, other):
        return self.freq < other.freq

    def check(self,arr,  prefix):
        if self.char is not None:
            arr.append((self.char, prefix))
        else:
            self.leftSon.check(arr, prefix + "0")
            self.rightSon.check(arr, prefix + "1")

    def encode(self):
        if self.char is None:
            toRet = "0"
            toRet += self.leftSon.encode()
            toRet += self.rightSon.encode()
        else:
            toRet = "1" + np.binary_repr(ord(self.char), 8)
        return toRet


class HuffmanTree:
    def __init__(self, inputList):
        a = []
        for x in inputList:
            h.heappush(a, List(x[0], x[1]))
        while len(a) > 1:
            first = h.heappop(a)
            second = h.heappop(a)
            new = List(None, first.freq + second.freq)
            new.leftSon = first
            new.rightSon = second
            h.heappush(a, new)
        self.root = a[0]
        self.encodeArray = []
        self.root.check(self.encodeArray, "")
        encodeDict = {}
        for x in self.encodeArray:
            encodeDict[x[0]] = x[1]
        self.encodeDict = encodeDict
        print(self.encodeArray)

    def encodeSelf(self):
        head = self.root
        bits = head.encode()
        x = bitarray()
        for b in bits:
            if b == "0":
                x.append(0b0)
            else:
                x.append(0b1)
        return x

The code to generate the support files:

import huffman
from bitarray import bitarray

a = ""
with open("chords.txt") as chr:
    a = chr.read()

def getCounts(l):
    counts = {}
    for x in l:
        if x in counts.keys():
            counts[x] += 1
        else:
            counts[x] = 1
    res = []
    for x in counts.keys():
        res.append((x, counts[x]))
    return res

def stringToBits(string):
    ret = bitarray()
    for x in string:
        if x =="0":
            ret.append(0b0)
        else:
            ret.append(0b1)
    return ret

lines = a.split("\n")
firstPart = []
secondPart = []
for i in lines:
    a = i.split(":")
    firstPart.append(a[0])
    if len(a) > 1:
        secondPart.append(a[1] + "\n")


###First part
Afile = ""
huffFile = ""
sharpFile = ""

firstPartChars = ""
fChar = []
sChar = []
for x in firstPart:
    firstPartChars += x
    if len(x) == 0:
        fChar.append("\n")
        Afile += "0"
    else:
        fChar.append(x[0])
    if len(x) == 1:
        Afile += "0"
    if len(x) == 2:
        Afile += "1"
        sChar.append(x[1])

print([x for x in set(fChar)])
print([x for x in set(sChar)])

fCharCount = getCounts(fChar)
sCharCount = getCounts(sChar)
firstHuff = huffman.HuffmanTree(fCharCount)
for char in fChar:
    code = firstHuff.encodeDict[char]
    huffFile += code

for char in sChar:
    if char == "b":
        sharpFile += "0"
    elif char == "#":
        sharpFile += "1"
    else:
        raise Exception("error in sharp file")

#######
#####Second part
sPartFile = ""
subs = {}
subs["maj"] = "~"
subs["min"] = "&"
subs["sus"] = "^"
subs["aug"] = "%"
subs["dim"] = "$"
subs["hdi"] = "@"
spPart = []
secSubPart = ""
for x in secondPart:
    c = x.replace("maj", "~").replace("min", "&").replace("sus", "^").replace("aug", "%")\
        .replace("dim", "$").replace("hdim", "@")
    spPart.append(c)
    secSubPart += c
secPartCounts = getCounts(secSubPart)
secondHuff = huffman.HuffmanTree(secPartCounts)
for x in secSubPart:
    sPartFile += secondHuff.encodeDict[x]


aBits = stringToBits(Afile)
huffBits = stringToBits(huffFile)
sharpBits = stringToBits(sharpFile)
sPartBits = stringToBits(sPartFile)
print("test")
with open("a", "wb") as t:
    aBits.tofile(t)
with open("h", "wb") as t:
    huffBits.tofile(t)
with open("u", "wb") as t:
    sharpBits.tofile(t)
with open("f", "wb") as t:
    sPartBits.tofile(t)

with open("i", "wb") as t:
    a = firstHuff.encodeSelf()
    a.tofile(t)
with open("j", "wb") as t:
    b = secondHuff.encodeSelf()
    b.tofile(t)

Finally the code itself:

from bitarray import bitarray as b
import zipfile
with zipfile.ZipFile("h.zip", 'r') as z:
    z.extractall(".")

class N:
    def __init__(s):
        s.c = None
        s.l = None
        s.r = None
    def i(s):
        return s.l == None
    def d(s, r):
        if s.i():
            return s.c
        else:
            if r.b() == 0:
                return s.l.d(r)
            else:
                return s.r.d(r)
class R:
    def __init__(s, bits):
        s.d = bits
        s.i = 0

    def b(s):
        t = s.d[s.i]
        s.i += 1
        return t

    def B(s):
        toRet = s.d[s.i:s.i + 8]
        s.i += 8
        return toRet

def rn(r):
    if r.b() == 1:
        n = N()
        n.c = r.B().tostring()
    else:
        left = rn(r)
        right = rn(r)
        n = N()
        n.l = left
        n.r = right
    return n

def bff(i):
    with open(i, "rb") as f:
        c = b()
        c.fromfile(f)
        return c

def huffRead(a):
    c = bff(a)
    r = R(c)
    n = rn(r)
    return n


fh = huffRead("i")
fp = R(bff("h"))
sh = huffRead("j")
sp = R(bff("f"))
af = R(bff("a"))
bp = R(bff("u"))

ft = ""
test = open("test", "w")
i=0
while i < 125784:
    i += 1
    tp = ""
    n = "\n"
    s = "#"
    b = "b"
    t = fh.d(fp)
    spr = True
    if af.b() == 0:
        if t == n:
            tp += t
            spr = False
        elif t == "N" or t == "X":
            tp += t + n
            spr = False
        else:
            tp += t + ":"
    else:
        tp += t
        if bp.b() == 1:
            tp += s + ":"
        else:
            tp += b + ":"
    if spr:
        x = sh.d(sp)
        while x != n:
            tp += x
            x = sh.d(sp)

        tp += x
    tp = tp.replace("~", "maj").replace("&", "min").replace("^", "sus")\
        .replace("%", "aug").replace("$", "dim").replace("@", "hdim")
    ft += tp
print(ft)

Shamis

Posted 2020-01-24T01:54:32.857

Reputation: 141

Python 3, 969 + 36208 (size of data file) = 37177 bytes

A mix of encoding and traditional compression. I store each of the roots (the "letters") and index into an array to find them, then, if it's not N, X or the empty string (which don't have a "chord type" (combination of the quality and other stuff tacked on the end) and are immediately followed by a newline), I use the next byte to index a table of the "chord types".

The data file (named "b") is available on my GitHub.

import base64,zlib,io
x,y=eval(zlib.decompress(base64.decodebytes(b"eNocyAEGgEAYBtGrxIc/7CWqKiAAsYLuf4jsg2dM71VtqgXhN1whN4TGjtA4IE+ExoXQuAdPva3XnJ+q+sAVI4SBAHqXdCQkC4xxOM53eu/3T3axB0dl5w3bu5SrvWgtsi4XF5q3uHhQD/BG1T3LOLKaFjI9rZ4tmvwroGnl/wuZ7K4+jk01anfsKV9TuSfXUW7YkR5QP+RDcWIH4gvS2CE75o69zJ68eNxacS0HRvwgXMRwxbKWQHw4DuXQerKGqx/DLtgp9R28Y+yQHXOHdU/2HJ7iqVeue+KneqkxUM+oxTDxDrni4c87jxdaQgsyzSQ+pWKhvaHXH7547D1t0+MR0oB1iCGB1IWkDu0V3l8GaBwaJwvsG/3y8NHDL5Z7gT7qpJGLabRVjqgnc7IkT7jmHdsK0uMR6gEOjIAEZkAD1iGGBiTQhBTCYGPHjHyh5dAP3uvKJM5FUExLcs3r24KpQwwNSKAJXafkUh9E4u6iEJ8q5Xgdm6alqEMMDUggda3IvG9/E+Ga5xFD49aHrx77KQRp5KK5WGp1YaUskuSZrMmGYo2Lm3qAAyMggRnQgDE0IYVeQhhsDSA5NE7WMDtmpF9wLwrmc5/UIYGwA41MV17zpVDq0IAEmpD1Qz5U54rMW1/UoQEJZHyozv24g1j6PJS//vzqkX6NvJjC1CGGxDUi/YFykqJY42pa0LPxP0cbuWEvu5mi4teChg2hS+74qKaRxhigDg1IIIVM/049NpYLAINXQ3I=")))
f=io.BytesIO(zlib.decompress(open("b","rb").read()))
while 1:
 a=x[f.read(1)[0]];print(a if not a or a in"NX"else a+y[f.read(1)[0]])

Edit: I decided not to be lazy after all and go Google correct music terms.

famous1622

Posted 2020-01-24T01:54:32.857

Reputation: 451

It doesn't seem to work: IndexError: index out of range with the provided file – G B – 2020-02-12T10:53:29.047

@GB I use the error out at the end to exit the loop – famous1622 – 2020-02-12T14:32:06.627

Ok, so I assume the output file is smaller than expected because of CR/LF. I can't verify the original because I have no access to dropbox from behind a company proxy. – G B – 2020-02-12T14:35:26.997

@GB yeah, it should be because of the CR/LF, is the end Cb:maj(9) repeated and a final N? – famous1622 – 2020-02-12T14:37:20.710

bash, 802213 + 5 + 1 = 802219 bytes

cat<b

With a file named b containing the contents of chords.txt.

S.S. Anne

Posted 2020-01-24T01:54:32.857

Reputation: 1 161

How predictable is popular music?

Answers

Java (JDK), 295709 47044 (gzipped file) + 943 (code) + 1 (file) + 67 (imports) = 48055 bytes

Python 3, 23 + 802213 (size of file we want to reproduce) + 1

MATLAB, 55935 = 10 + 55924 + 1

Python 3: 93943 = 91,752 + 1990 + 1

Python 3, 969 + 36208 (size of data file) = 37177 bytes

bash, 802213 + 5 + 1 = 802219 bytes