I don't think you're going to find a standalone application that will fix up your particular selection of incorrectly-tagged encodings. Having a mixture of cp1252, UTF-16 and GB-18030 is quite unusual and I don't think existing software will be able to solve that automatically.
So I'd download Mutagen and write a custom Python script to automate your own decisions about how to fix up unknown encodings. For example:
    import os
    import mutagen.id3

    musicroot = ur'C:\music\wonky'
    # Candidate encodings, most restrictive first; single-byte cp1252
    # accepts almost any byte sequence, so it goes last.
    tryencodings = 'gb18030', 'cp1252'

    def findMP3s(path):
        """Recursively yield the paths of .mp3 files under `path`."""
        for child in os.listdir(path):
            child = os.path.join(path, child)
            if os.path.isdir(child):
                for mp3 in findMP3s(child):
                    yield mp3
            elif child.lower().endswith(u'.mp3'):
                yield child

    for path in findMP3s(musicroot):
        id3 = mutagen.id3.ID3(path)
        for key, value in id3.items():
            # Only touch text frames that aren't already UTF-8 (encoding 3).
            if value.encoding != 3 and isinstance(getattr(value, 'text', [None])[0], unicode):
                if value.encoding == 0:
                    # Encoding 0 claims ISO-8859-1; recover the original
                    # bytes and guess what they really were.
                    bytes = '\n'.join(value.text).encode('iso-8859-1')
                    for encoding in tryencodings:
                        try:
                            bytes.decode(encoding)
                        except UnicodeError:
                            pass
                        else:
                            break
                    else:
                        raise ValueError('None of the tryencodings work for %r key %r' % (path, key))
                    for i in range(len(value.text)):
                        value.text[i] = value.text[i].encode('iso-8859-1').decode(encoding)
                value.encoding = 3  # rewrite the frame as UTF-8
        id3.save()
The above script makes a few assumptions:
Only the tags marked as being in encoding 0 are wrong. (Ostensibly encoding 0 is ISO-8859-1, but in practice it is often a Windows default code page.)
If a tag is marked as being in UTF-8 or a UTF-16 encoding, it's assumed to be correct, and is simply converted to UTF-8 if it isn't already. Personally I haven't seen ID3s marked as UTF (encodings 1-3) in error before. Luckily encoding 0 is easy to recover into its original bytes, since ISO-8859-1 is a 1-to-1 direct mapping of the ordinal byte values.
When an encoding 0 tag is met, the script attempts to recast it as GB18030 first, then, if that's not valid, falls back to code page 1252. Single-byte encodings like cp1252 will tend to match most byte sequences, so it's best to put them at the end of the list of encodings to try.
If you have other encodings in the mix, like cp1251 Cyrillic, or a lot of cp1252 filenames with multiple accented characters in a row that get mistaken for GB18030, you'll need a cleverer guessing algorithm of some sort. Maybe look at the filename to guess what sort of characters are likely to be present?
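The round-trip in the second assumption and the trial-decoding in the third can be sketched in isolation. This is Python 3 syntax (the script above is Python 2), with a made-up GB18030 title; `guess_decode` is an illustrative helper, not part of Mutagen:

```python
# The real tag text, and the mojibake produced when its GB18030 bytes
# are mislabelled as ISO-8859-1 (encoding 0).  The title is a made-up
# example: '\u97f3\u4e50' ("music" in Chinese).
original = '\u97f3\u4e50'
raw = original.encode('gb18030')
mojibake = raw.decode('iso-8859-1')

# ISO-8859-1 maps every byte 0x00-0xFF straight to the code point of
# the same value, so re-encoding the mojibake recovers the bytes exactly.
assert mojibake.encode('iso-8859-1') == raw

def guess_decode(raw, candidates=('gb18030', 'cp1252')):
    """Trial-decode raw bytes against candidates, most restrictive first."""
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeError:
            continue
    raise ValueError('no candidate encoding fits')

text, enc = guess_decode(raw)
assert text == original and enc == 'gb18030'
```

Because cp1252 decodes nearly any byte sequence without error, it only ever wins when the stricter GB18030 decode raises.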
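One way to act on the filename hint in the last point: prefer the candidate whose decoded script matches the script of the filename. A minimal sketch (Python 3 syntax; `pick_encoding` and `has_cjk` are hypothetical helpers, and the CJK check here is just the basic U+4E00-U+9FFF range):

```python
import os

def has_cjk(s):
    """True if any character is a common CJK unified ideograph."""
    return any('\u4e00' <= ch <= '\u9fff' for ch in s)

def pick_encoding(raw, filename, candidates=('gb18030', 'cp1252')):
    """Prefer a decoding whose script matches the filename's script.

    A CJK filename makes a GB18030 guess plausible; for a purely
    Western filename, a decode that produces CJK text is suspicious.
    Falls back to plain candidate priority when nothing matches.
    """
    decoded = []
    for enc in candidates:
        try:
            decoded.append((enc, raw.decode(enc)))
        except UnicodeError:
            continue
    if not decoded:
        raise ValueError('no candidate encoding fits')
    expect_cjk = has_cjk(os.path.basename(filename))
    for enc, text in decoded:
        if has_cjk(text) == expect_cjk:
            return text, enc
    return decoded[0][1], decoded[0][0]

# These bytes decode validly under BOTH candidates, so the filename
# is what breaks the tie.
raw = '\u97f3\u4e50'.encode('gb18030')
assert pick_encoding(raw, '\u97f3\u4e50.mp3')[1] == 'gb18030'
assert pick_encoding(raw, 'western song.mp3')[1] == 'cp1252'
```

This is only a heuristic; for messier collections something statistical like character-frequency analysis would be needed.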
Comments:

mid3v2 is only half the solution. After trying it, it is definitely not good with the misidentified encodings I am suffering from, i.e. an upgraded ID3 tag still displays wrong in Amarok. Mutagen fails my requirement about »smart about figuring out the original encoding«; it blithely assumes Latin1/Windows-1252, which is standards-conformant, but useless for the messy real world. I am inclined not to accept this answer right now; I'll give a few more days' opportunity for other answers. If nothing good comes, you get accepted. – daxim – 2010-01-02T19:23:10.683

perfectly acceptable. if you're a python coder, you might consider writing a script that uses Mutagen to a) read in the existing tag, b) perform some smart conversion steps (based on what you suspect or know the source encoding to be), and c) write a new tag. sounds like it's mainly the conversion step that mid3v2 isn't too bright at, and that doesn't surprise me... but i think python has some character encoding modules (perhaps iconv or similar) that are smarter and might be useful to a DIY-er. – quack quixote – 2010-01-02T20:12:57.230

i'm not much of a python coder myself or i'd have some more useful suggestions. maybe someone on stackoverflow would have better ideas if you end up writing your own script. – quack quixote – 2010-01-02T20:14:58.240