Lossy or Lossless?

18

2

Given an audio file, determine whether it is encoded in a lossy format or a lossless format. For the purposes of this challenge, only the following formats need to be classified:

  • Lossy: AC3, AMR, AAC, MP2, MP3, Ogg Vorbis, WMA
  • Lossless: AIFF, FLAC, TTA, WAV

Rules

  • If input is taken in the form of a filename, no assumptions should be made about the filename (e.g. the extension is not guaranteed to be correct for the format, or even present).
  • There will be no ID3 or APEv2 metadata present in input files.
  • Any two unique and distinguishable outputs may be used, such as 0 and 1, lossy and lossless, foo and bar, etc.

Test Cases

The test cases for this challenge consist of a zip file located here which contains two directories: lossy and lossless. Each directory contains several audio files that are all 0.5-second 440 Hz sine waves, encoded in various formats. All of the audio files have extensions matching the formats above, with the exception of A440.m4a (which is AAC audio in a MPEG Layer 4 container).

Mego

Posted 2017-04-15T08:47:31.420

Reputation: 32 998

"AAC audio in a MPEG Layer 4 container" raises the question: what other container formats do answers need to handle? – Peter Taylor – 2017-04-15T11:54:42.860

@PeterTaylor Only AAC was given special mention because I couldn't find a way to provide AAC audio without embedding it in an MPEG Layer 4 container via FFMPEG. The Vorbis audio is embedded in an Ogg container (as is the norm for Vorbis audio). All others are standalone formats. – Mego – 2017-04-15T18:32:47.547

Are you sure about the TTA file? According to the spec, TTA files should start with the magic number TTA1 or TTA2. FFM2 (the magic number of your file) appears to correspond to FFmpeg stream. Linux file recognizes the TTA1 header, but not the FFM2 one.

– Dennis – 2017-04-27T14:49:11.647

Also, can we assume that AAC will always be in an MPEG Layer 4 header? If not, what can we assume? – Dennis – 2017-04-27T15:16:43.037

Can we take the contents of the file as an input or does our code have to retrieve them? – Shaggy – 2017-04-27T16:31:25.023

@Shaggy The contents may be piped in, or taken as a filename/URL or any of the other default I/O methods, so long as no additional input is used. – Mego – 2017-04-27T20:18:21.480

Would my 20-byte solution be valid, so? Sorry, I'm still learning all the rules & standards around here. – Shaggy – 2017-04-28T10:09:56.457

@Shaggy It appears valid to me. – Mego – 2017-04-28T10:10:37.680

Sweet, thanks, @Mego :) – Shaggy – 2017-04-28T10:11:07.863

Answers

18

Jelly, 7 5 bytes

ƈƈeØA

Lossy formats return 0, lossless formats return 1.

Try it online! (permalinks in Gist)

Background

The formats we have to support have the following magic numbers, i.e., they start with these bytes.

Format    Header (text)       Header (hex)
-----------------------------------------------------------------------------------
AC3       .w                  0B 77
AMR       #!AMR               23 21 41 4D 52
AAC       ÿñP@..ü             FF F1 50 40 00 1F FC
  M4A     ... ftypM4A         00 00 00 20 66 74 79 70 4D 34 41 20
MP2       ÿû                  FF FB
MP3       ÿû                  FF FB
OGG       OggS                4F 67 67 53
WMA       0&²u.fÏ.¦Ù.ª.bÎl    30 26 B2 75 8E 66 CF 11 A6 D9 00 AA 00 62 CE 6C

AIFF      FORM????AIFF        46 4F 52 4D ?? ?? ?? ?? 41 49 46 46
FLAC      fLaC                66 4C 61 43
TTA       TTA1                54 54 41 31
  FFM2    FFM2                46 46 4D 32
WAV       RIFF????WAVE        52 49 46 46 ?? ?? ?? ?? 57 41 56 45

Indented entries are containers for the preceding format that appear in the test cases. ? denotes a variable byte. . denotes an unprintable byte. All other bytes are displayed as their ISO 8859-1 character.

By looking only at the second byte, we can determine the format in an easy way:

Lossless formats have an uppercase letter as their second byte, while lossy formats do not.

How it works

ƈƈeØA  Main link. No arguments.

ƈ      Read a char from STDIN and set the left argument to this character.
 ƈ     Read another char from STDIN and set the return value to this character.
   ØA  Yield the uppercase alphabet, i.e., "ABCDEFGHIJKLMNOPQRSTUVWXYZ".
  e    Exists; return 1 if the return value (second char on STDIN) belongs to the
       uppercase alphabet, 0 if not.
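For readers who don't know Jelly, the same second-byte rule can be sketched in Python. This is only an illustrative restatement, not part of the answer: the names `HEADERS`, `LOSSLESS` and `is_lossless` are made up for this sketch, and the header bytes are copied from the table above.

```python
import string

# Leading bytes of each format, copied from the table in this answer.
# Variable ("?") bytes are left out because only the second byte matters.
HEADERS = {
    "AC3": bytes.fromhex("0B 77"),
    "AMR": bytes.fromhex("23 21 41 4D 52"),
    "AAC": bytes.fromhex("FF F1 50 40 00 1F FC"),
    "M4A": bytes.fromhex("00 00 00 20 66 74 79 70 4D 34 41 20"),
    "MP2": bytes.fromhex("FF FB"),
    "MP3": bytes.fromhex("FF FB"),
    "OGG": bytes.fromhex("4F 67 67 53"),
    "WMA": bytes.fromhex("30 26 B2 75 8E 66 CF 11"),
    "AIFF": bytes.fromhex("46 4F 52 4D"),
    "FLAC": bytes.fromhex("66 4C 61 43"),
    "TTA": bytes.fromhex("54 54 41 31"),
    "FFM2": bytes.fromhex("46 46 4D 32"),
    "WAV": bytes.fromhex("52 49 46 46"),
}

# Hypothetical label set for this sketch: the lossless rows of the table.
LOSSLESS = {"AIFF", "FLAC", "TTA", "FFM2", "WAV"}


def is_lossless(data: bytes) -> bool:
    """Return True if the second byte is an ASCII uppercase letter."""
    return len(data) > 1 and chr(data[1]) in string.ascii_uppercase


if __name__ == "__main__":
    for name, header in HEADERS.items():
        print(name, int(is_lossless(header)))
```

The rule only holds for the formats listed in the challenge; it is a golfing shortcut, not a general-purpose format detector.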

Dennis

Posted 2017-04-15T08:47:31.420

Reputation: 196 637

This is a very clever solution. – Mego – 2017-04-27T23:39:55.117

10

C, 82 80 32 bytes

Inspired by @Dennis' answer, this can be reduced much further:

main(){return getchar()&200^64;}

Pipe the file data to stdin. Returns 0 for lossless, or nonzero for lossy.

Or the original longer check:

char v[5];main(){scanf("%4c",v);return*v&&strstr("fLaC FORM RIFF TTA1 FFM2",v);}

Pipe the file data to stdin. Returns nonzero (1) for lossless, or 0 for lossy.

From what I can tell, all the formats you listed have separate magic numbers (except AIFF/WAV, but those are both lossless anyway), so this just checks that magic number for a known lossless value. The *v&& is just to protect against matching files which start with a null byte (M4A).

I've included the values I found in spec sheets (fLaC = FLAC, RIFF = WAV/AIFF, TTA1 = TTA), and FORM = AIFF and FFM2 = TTA are from the sample files provided (I can only guess these are wrapper formats or later versions).
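As an ungolfed illustration of the longer check, here is a minimal Python sketch that compares the first four bytes against the same five magic values. The function name `is_lossless_by_magic` is made up for this sketch; it uses an exact comparison rather than a substring search, so the null-byte guard (`*v&&`) from the C version is not needed.

```python
# Four-byte magic numbers taken from the string in the longer C answer.
LOSSLESS_MAGICS = (b"fLaC", b"FORM", b"RIFF", b"TTA1", b"FFM2")


def is_lossless_by_magic(data: bytes) -> bool:
    """Return True if the data starts with a known lossless magic number."""
    return data[:4] in LOSSLESS_MAGICS


if __name__ == "__main__":
    import sys

    # Read the file from stdin, as the C programs do.
    print(int(is_lossless_by_magic(sys.stdin.buffer.read(4))))
```

Unlike the C programs, this prints the result instead of returning it as an exit status.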


Or a shorter feels-like-cheating alternative:

Bash + file, 61 bytes

N="$(file "$1")";[[ $N = *": d"* || $N = *IF* || $N = *FL* ]]

Takes the filename as an argument. Returns 0 for lossless, or nonzero for lossy.

Does exactly what you'd expect; asks file what the filetype is, then checks for known patterns. TTA matches : d (: data), AIFF / WAV match IF, and FLAC matches FL. None of the lossy results match any of these, and I've tested that it still works if the filenames are removed.


Testing:

for f in "$@"; do
    echo "Checking $f:";
    ./identify2 < "$f" && echo "shorter C says LOSSLESS" || echo "shorter C says LOSSY";
    ./identify < "$f" && echo "longer C says LOSSY" || echo "longer C says LOSSLESS";
    ./identify.sh "$f" && echo "file says LOSSLESS" || echo "file says LOSSY";
done;

# This can be invoked to test all files at once with:
./identify_all.sh */*

Dave

Posted 2017-04-15T08:47:31.420

Reputation: 7 519

Does your Bash solution also work if the file extension is incorrect? "the extension is not guaranteed to be correct for the format", so you should be able to give a file a wrong extension and still have it work. – mbomb007 – 2017-04-26T15:50:56.367

@mbomb007 I've just tested with the extensions mixed up and it still identifies them fine. I think file doesn't trust extensions anyway (lots of users think renaming a png to a jpeg is the same as converting it!) – Dave – 2017-04-26T17:23:21.043

7

GS2, 3 bytes

◄5ì

Lossy formats return 0, lossless formats return 1.

Try it online! (permalinks in Gist)

Background

The formats we have to support have the following magic numbers, i.e., they start with these bytes.

Format    Header (text)       Header (hex)
-----------------------------------------------------------------------------------
AC3       .w                  0B 77
AMR       #!AMR               23 21 41 4D 52
AAC       ÿñP@..ü             FF F1 50 40 00 1F FC
  M4A     ... ftypM4A         00 00 00 20 66 74 79 70 4D 34 41 20
MP2       ÿû                  FF FB
MP3       ÿû                  FF FB
OGG       OggS                4F 67 67 53
WMA       0&²u.fÏ.¦Ù.ª.bÎl    30 26 B2 75 8E 66 CF 11 A6 D9 00 AA 00 62 CE 6C

AIFF      FORM????AIFF        46 4F 52 4D ?? ?? ?? ?? 41 49 46 46
FLAC      fLaC                66 4C 61 43
TTA       TTA1                54 54 41 31
  FFM2    FFM2                46 46 4D 32
WAV       RIFF????WAVE        52 49 46 46 ?? ?? ?? ?? 57 41 56 45

Indented entries are containers for the preceding format that appear in the test cases. ? denotes a variable byte. . denotes an unprintable byte. All other bytes are displayed as their ISO 8859-1 character.

By looking only at the second byte, we can determine the format in an easy way:

Lossless formats have an uppercase letter as their second byte, while lossy formats do not.

How it works

     (implicit) Push the entire input from STDIN as a string on the stack.
◄    Push 1.
 5   Get the string's character at index 1, i.e., its second character.
  ì  Test if the character is an uppercase letter.

Dennis

Posted 2017-04-15T08:47:31.420

Reputation: 196 637

2

JavaScript (ES6), 20 bytes

c=>/^[fFRT]/.test(c)

Explanation

Takes the contents of the file as an input and returns true if the file is lossless or false if it is lossy by testing the first character of that input to see if it is an f, F, R or T.


Try It

Paste the contents of a file into the textarea.

f=
c=>/^[fFRT]/.test(c)
i.addEventListener("input",_=>console.log(f(i.value)))
<textarea id=i></textarea>


Second effort, 81 63 bytes

Fetches the contents of a file from a provided URL, which turned out to be overkill.

u=>fetch(u).then(r=>r.text()).then(t=>alert(/^[fFRT]/.test(t)))

First effort, 146 116 89 bytes

Invalid as mime types are tied to extensions and, apparently, response headers qualify as additional input.

u=>fetch(u).then(r=>alert(/aiff|flac|tta|wave|wav$/.test(r.headers.get("Content-Type"))))

Shaggy

Posted 2017-04-15T08:47:31.420

Reputation: 24 623

web servers usually generate the MIME based on the file extension, which is against the rules here. Have you checked if it works on files served without an extension? (if it does then you should probably include the name of the server you're using as part of the "language") – Dave – 2017-04-26T17:25:50.130

@Dave Pretty sure they don't. MIME and extension are not dependent on each other at all. If you change a file's extension and upload it, the MIME type is the MIME of the actual content of the file, not the extension. As it is, though, taking input as a URL is probably not allowed. I'm not sure. – mbomb007 – 2017-04-26T20:27:36.873

@mbomb007 I'm not sure why you say that; mime types are an internet thing, not a filesystem / file thing, and the servers I'm aware of will determine it based on extension using a configured lookup (for speed of serving headers; they don't want to inspect every file before serving it). Take for example Apache's AddType <mime> <extension>, or IIS's <MimeMap>. Of course a specific setup or file hosting tool could do a proper inspection, and that would merit making the server choice be part of the answer (since it's the server that's determining the filetype!) – Dave – 2017-04-26T21:25:55.183

I've done file validation with .NET, and the MIME type matched the content even when the extension was changed before upload. – mbomb007 – 2017-04-26T21:32:41.510

@mbomb007 then whichever .NET component you used must have performed file inspection either during the upload or when serving the files (I'd guess during upload for performance, but you never know). So going back to my original comment, that would make this answer something like "JavaScript + .NET ServerLibraryXYZ". As for taking input from a URL, I can see why you're hesitant but personally I'd consider it valid as long as the server choice is mentioned. Maybe there's an existing meta on it, but ultimately of course it's up to Mego. – Dave – 2017-04-26T21:44:39.843

Using the Internet is off-limits by default as a loophole. – mbomb007 – 2017-04-26T21:54:20.783

@mbomb007 The intent of that loophole is to prevent answers from fetching information from the internet that is not present in the challenge to gain an advantage. Use of the MIME type in content headers counts as loophole abuse here, because that's extra input. Fetching from a URL is fine, but using response data other than the actual file data isn't. – Mego – 2017-04-26T22:40:57.247

Yes, MIME types are tied to extensions (at least in IIS) but I had hoped using them would be sufficient to meet the requirement that we not use file extensions to determine file type. I've updated my answer to determine type from the contents of the file, instead. @Mego, do you have a link to a consensus that response headers count as extra input? – Shaggy – 2017-04-27T08:32:08.997

@Shaggy There is no consensus needed because there is no confusion - the input is only the file data (and optionally a filename, should that I/O method be used). Anything else is extra input. – Mego – 2017-04-27T08:34:19.277

Yeah, I suppose I can see the point, @Mego; the fact that the method I'm using to retrieve the file contents includes the headers in the response is irrelevant. – Shaggy – 2017-04-27T09:20:44.160

1

Chip, 11 bytes

~Z~S
t'G~aF

Shamelessly replicated Dennis' Jelly answer in Chip.

Lossless returns 0x0, lossy returns 0x1.

Try it online, links in gist (thanks Dennis for the TIO strategy here)

Explain!

~Z~S
t'

This portion is housekeeping: it Skips the first byte, and terminates after the second.

G~aF

This is the meat of the decision. Each input byte is accessed by the bits HGFEDCBA. If G is set, and F is not, that means the byte is within the range 0x40 to 0x5f (which is roughly equivalent to 'uppercase', and good enough for the task at hand).

However, for byte savings, I invert this decision from G and (not F) to (not G) or F, since or's can be implicit in Chip.

This resultant true/false value is then placed into a, which is the lowest bit of the output. (All other bits will be zero). In the TIO, I run the output through hexdump so that the values are visible.

Equivalently, in C-ish, one would say something like:

out_byte = !(in_byte & 0x40) || (in_byte & 0x20)
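To make the inversion explicit, here is a small Python sketch of the same bit logic. The function names are made up for illustration, and the bit positions follow the HGFEDCBA naming above (G = 0x40, F = 0x20). By De Morgan's law, "(not G) or F" is the negation of "G and (not F)", which is why lossless (uppercase-range) bytes come out as 0.

```python
def in_uppercase_range(byte: int) -> bool:
    """G set and F clear: 0x40-0x5F (or 0xC0-0xDF, since H is ignored)."""
    return bool(byte & 0x40) and not (byte & 0x20)


def chip_output(second_byte: int) -> int:
    """Mimic the Chip circuit: a = (not G) or F."""
    g = bool(second_byte & 0x40)  # bit G
    f = bool(second_byte & 0x20)  # bit F
    return int((not g) or f)


if __name__ == "__main__":
    # Second bytes of a few headers: 'L' (FLAC), 'I' (WAV), 'g' (Ogg), 0xFB (MP3).
    for b in (ord("L"), ord("I"), ord("g"), 0xFB):
        print(hex(b), chip_output(b))
```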

Phlarx

Posted 2017-04-15T08:47:31.420

Reputation: 1 366

1

Cubix, 16 bytes

$-!u'HIa'@/1@O<

Net form:

    $ -
    ! u
' H I a ' @ / 1
@ O < . . . . .
    . .
    . .

Try it yourself

You should input the file decimal byte values in a separated list. The separator doesn't matter, anything that is not a digit or a minus sign suffices. The code really only cares about the first byte, so you can leave out the rest of the file if you like. The program outputs 0 for lossless, and 1 for lossy. Try it here! The default input uses a FLAC header.

Explanation

The nice thing about files is that (nearly) all of them have a so-called magic. Those are the first few bytes of the file. Good software doesn't check the file extension, but rather the file magic to see if it can handle a certain file.

Dennis has found a way to use this magic to find the compression type, but the fact that he discarded the first byte made me want to try to come up with a method that used the first byte, rather than the second. After all, this community is all about saving bytes.

Here's a list of the first bytes of the different file types. I ordered them into two groups: lossy and lossless. Here are the values of their first byte in decimal, hexadecimal and binary. You might see a pattern already...

Lossy:                  Lossless:
255:0xFF:0b11111111     102:0x66:0b01100110
 79:0x4F:0b01001111      84:0x54:0b01010100
 35:0x23:0b00100011      82:0x52:0b01010010
 11:0x0B:0b00001011      70:0x46:0b01000110
  0:0x00:0b00000000

The pattern I saw was that the second bit (counted from left to right) was always on in the "lossless" bytes and the fifth bit was always off. This combination does not appear in any of the lossy formats. To "extract" this, we would simply do a binary AND (by 0b01001000 (=72)) and then compare to 0b01000000 (=64). If both are equal, the input format is lossless, otherwise it's lossy.

Sadly, Cubix doesn't have such a comparison operator, so I used subtraction: if the result of the AND is 64, this yields 0; otherwise it yields 8, -56 or -64. I'll get back to this later.
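The arithmetic can be checked with a short Python sketch. The names `FIRST_BYTES`, `masked_difference` and `classify` are made up for illustration, and the byte values are the ones from the list above.

```python
# First bytes from the list above, grouped by category.
FIRST_BYTES = {
    "lossy": [0xFF, 0x4F, 0x23, 0x0B, 0x00],
    "lossless": [0x66, 0x54, 0x52, 0x46],
}

MASK = 0b01001000    # 72: second and fifth bit, counted from the left
TARGET = 0b01000000  # 64: second bit on, fifth bit off


def masked_difference(first_byte: int) -> int:
    """(input & 72) - 64, the value the Cubix program tests for zero."""
    return (first_byte & MASK) - TARGET


def classify(first_byte: int) -> int:
    """Return 0 for lossless and 1 for lossy, like the program's output."""
    return 0 if masked_difference(first_byte) == 0 else 1


if __name__ == "__main__":
    for kind, values in FIRST_BYTES.items():
        for b in values:
            print(kind, hex(b), masked_difference(b), classify(b))
```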

First, let's start at the beginning of the program. The binary AND is done using the a command:

'HIa
'H   # Push 0b01001000 (72)
  I  # Push input
   a # Push input&72

Then, we compare to 64 using subtraction (note we hit a mirror that reflects the IP to the top face [first line, second character, pointing south] in the middle of this part).

'@-
'@  # Push 0b01000000 (64)
  - # Subtract from (input&72)
    # Yields 0 for lossless, non-zero otherwise

After the IP is turned around by the u, we use some control flow to push a 1 to the stack if (and only if) the top of the stack is non-zero:

!$1
!   # if top = 0:
 $1 #   do nothing
    # else:
  1 #   push 1

After we wrap around the cube, we hit the < instruction, which points the IP west on the fourth line. All that's left to do is output and terminate.

O@
O  # Output top of the stack as number
 @ # End program

So, the program outputs 0 for lossless, and 1 for lossy.

Luke

Posted 2017-04-15T08:47:31.420

Reputation: 4 675