I don't think you're going to find a standalone application that will fix up your particular selection of incorrectly-tagged encodings. Having a mixture of cp1252, UTF-16 and GB-18030 is quite unusual and I don't think existing software will be able to solve that automatically.
So I'd download Mutagen and write a custom Python script to automate your own decisions about how to fix up unknown encodings. For example:
    import os
    import mutagen.id3

    musicroot = ur'C:\music\wonky'
    # Candidate encodings, most restrictive first; single-byte cp1252
    # accepts almost any byte sequence, so it goes last.
    tryencodings = 'gb18030', 'cp1252'

    def findMP3s(path):
        """Recursively yield the paths of .mp3 files under `path`."""
        for child in os.listdir(path):
            child = os.path.join(path, child)
            if os.path.isdir(child):
                for mp3 in findMP3s(child):
                    yield mp3
            elif child.lower().endswith(u'.mp3'):
                yield child

    for path in findMP3s(musicroot):
        id3 = mutagen.id3.ID3(path)
        for key, value in id3.items():
            # Only touch text frames that aren't already UTF-8 (encoding 3).
            if value.encoding != 3 and isinstance(getattr(value, 'text', [None])[0], unicode):
                if value.encoding == 0:
                    # Encoding 0 claims ISO-8859-1; recover the original
                    # bytes and guess what they really were.
                    bytes = '\n'.join(value.text).encode('iso-8859-1')
                    for encoding in tryencodings:
                        try:
                            bytes.decode(encoding)
                        except UnicodeError:
                            pass
                        else:
                            break
                    else:
                        raise ValueError('None of the tryencodings work for %r key %r' % (path, key))
                    for i in range(len(value.text)):
                        value.text[i] = value.text[i].encode('iso-8859-1').decode(encoding)
                value.encoding = 3  # rewrite the frame as UTF-8
        id3.save()
The above script makes a few assumptions:
Only the tags marked as being in encoding 0 are wrong. (Ostensibly encoding 0 is ISO-8859-1, but in practice it is often a Windows default code page.)
If a tag is marked as being in UTF-8 or a UTF-16 encoding, it's assumed to be correct, and is simply converted to UTF-8 if it isn't already. Personally I haven't seen ID3s marked as UTF (encodings 1-3) in error before. Luckily encoding 0 is easy to recover into its original bytes, since ISO-8859-1 is a 1-to-1 direct mapping of the ordinal byte values.
When an encoding 0 tag is met, the script attempts to recast it as GB18030 first, then, if that's not valid, falls back to code page 1252. Single-byte encodings like cp1252 will tend to match most byte sequences, so it's best to put them at the end of the list of encodings to try.
If you have other encodings in the mix, like cp1251 Cyrillic, or a lot of cp1252 filenames with multiple accented characters in a row that get mistaken for GB18030, you'll need a cleverer guessing algorithm of some sort. Maybe look at the filename to guess what sort of characters are likely to be present?
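The round-trip in the second assumption and the trial-decoding in the third can be sketched in isolation. This is Python 3 syntax (the script above is Python 2), with a made-up GB18030 title; `guess_decode` is an illustrative helper, not part of Mutagen:

```python
# The real tag text, and the mojibake produced when its GB18030 bytes
# are mislabelled as ISO-8859-1 (encoding 0).  The title is a made-up
# example: '\u97f3\u4e50' ("music" in Chinese).
original = '\u97f3\u4e50'
raw = original.encode('gb18030')
mojibake = raw.decode('iso-8859-1')

# ISO-8859-1 maps every byte 0x00-0xFF straight to the code point of
# the same value, so re-encoding the mojibake recovers the bytes exactly.
assert mojibake.encode('iso-8859-1') == raw

def guess_decode(raw, candidates=('gb18030', 'cp1252')):
    """Trial-decode raw bytes against candidates, most restrictive first."""
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeError:
            continue
    raise ValueError('no candidate encoding fits')

text, enc = guess_decode(raw)
assert text == original and enc == 'gb18030'
```

Because cp1252 decodes nearly any byte sequence without error, it only ever wins when the stricter GB18030 decode raises.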
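One way to act on the filename hint in the last point: prefer the candidate whose decoded script matches the script of the filename. A minimal sketch (Python 3 syntax; `pick_encoding` and `has_cjk` are hypothetical helpers, and the CJK check here is just the basic U+4E00-U+9FFF range):

```python
import os

def has_cjk(s):
    """True if any character is a common CJK unified ideograph."""
    return any('\u4e00' <= ch <= '\u9fff' for ch in s)

def pick_encoding(raw, filename, candidates=('gb18030', 'cp1252')):
    """Prefer a decoding whose script matches the filename's script.

    A CJK filename makes a GB18030 guess plausible; for a purely
    Western filename, a decode that produces CJK text is suspicious.
    Falls back to plain candidate priority when nothing matches.
    """
    decoded = []
    for enc in candidates:
        try:
            decoded.append((enc, raw.decode(enc)))
        except UnicodeError:
            continue
    if not decoded:
        raise ValueError('no candidate encoding fits')
    expect_cjk = has_cjk(os.path.basename(filename))
    for enc, text in decoded:
        if has_cjk(text) == expect_cjk:
            return text, enc
    return decoded[0][1], decoded[0][0]

# These bytes decode validly under BOTH candidates, so the filename
# is what breaks the tie.
raw = '\u97f3\u4e50'.encode('gb18030')
assert pick_encoding(raw, '\u97f3\u4e50.mp3')[1] == 'gb18030'
assert pick_encoding(raw, 'western song.mp3')[1] == 'cp1252'
```

This is only a heuristic; for messier collections something statistical like character-frequency analysis would be needed.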
Comments:

mid3v2 is only half the solution. After trying it, it is definitely not good with the misidentified encodings I am suffering from, i.e. an upgraded ID3 tag still displays wrong in Amarok. Mutagen fails my requirement about »smart about figuring out the original encoding«; it blithely assumes Latin1/Windows-1252, which is standards-conformant, but useless for the messy real world. I am inclined not to accept this answer right now; I'll give a few more days' opportunity for other answers. If nothing good comes, you get accepted. – daxim – 2010-01-02T19:23:10.683

perfectly acceptable. if you're a python coder, you might consider writing a script that uses Mutagen to a) read in the existing tag, b) perform some smart conversion steps (based on what you suspect or know the source encoding to be), and c) write a new tag. sounds like it's mainly the conversion step that mid3v2 isn't too bright at, and that doesn't surprise me... but i think python has some character encoding modules (perhaps iconv or similar) that are smarter and might be useful to a DIY-er. – quack quixote – 2010-01-02T20:12:57.230

i'm not much of a python coder myself or i'd have some more useful suggestions. maybe someone on stackoverflow would have better ideas if you end up writing your own script. – quack quixote – 2010-01-02T20:14:58.240