46
35
How can I batch-convert the encoding of files in a directory (e.g. ANSI->UTF-8) with a command or tool?
For single files an editor helps, but how can I handle many files at once?
36
Cygwin or GnuWin32 provide Unix tools like iconv and dos2unix (and unix2dos). Under Unix/Linux/Cygwin, you'll want to use "windows-1252" as the encoding instead of ANSI (see below). (Unless you know your system is using a codepage other than 1252 as its default codepage, in which case you'll need to tell iconv the right codepage to translate from.)
Convert from one (-f) to the other (-t) with:
$ iconv -f windows-1252 -t utf-8 infile > outfile
Or in a find-all-and-conquer form:
## this will replace the original files!
$ find . -name '*.txt' -exec sh -c 'iconv -f windows-1252 -t utf-8 "$1" > "$1.tmp" && mv "$1.tmp" "$1"' sh {} \;
Alternatively:
## this will clobber the original files!
$ find . -name '*.txt' -exec iconv --verbose -f windows-1252 -t utf-8 -o {} {} \;
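If you would rather keep the originals, here is a minimal sketch that writes the converted copies into a separate directory tree instead. The function name convert_tree and its two arguments are made up for illustration, it assumes the sources really are Windows-1252, and it will not cope with file names that contain newlines.

```shell
# Hypothetical helper: convert every *.txt under src_dir from Windows-1252
# to UTF-8, writing the results under dst_dir and leaving originals untouched.
convert_tree() {
    src_dir=$1
    dst_dir=$2
    # List paths relative to src_dir so the same layout can be rebuilt in dst_dir.
    (cd "$src_dir" && find . -type f -name '*.txt') | while IFS= read -r rel; do
        mkdir -p "$dst_dir/$(dirname "$rel")"
        iconv -f windows-1252 -t utf-8 "$src_dir/$rel" > "$dst_dir/$rel" \
            || echo "conversion failed: $rel" >&2
    done
}
```

Call it as `convert_tree ./in ./out`. Because input and output are never the same file, the truncation problem that comes from reading and writing one file at once cannot occur.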
This question has been asked many times on this site, so here's some additional information about "ANSI". In an answer to a related question, CesarB mentions:
There are several encodings which are called "ANSI" in Windows. In fact, ANSI is a misnomer. iconv has no way of guessing which you want.
The ANSI encoding is the encoding used by the "A" functions in the Windows API (the "W" functions use UTF-16). Which encoding it corresponds to usually depends on your Windows system language. The most common is CP 1252 (also known as Windows-1252). So, when your editor says ANSI, it is meaning "whatever the API functions use as the default ANSI encoding", which is the default non-Unicode encoding used in your system (and thus usually the one which is used for text files).
The page he links to gives this historical tidbit (quoted from a Microsoft PDF) on the origins of CP 1252 and ISO-8859-1, another oft-used encoding:
[...] this comes from the fact that the Windows code page 1252 was originally based on an ANSI draft, which became ISO Standard 8859-1. However, in adding code points to the range reserved for control codes in the ISO standard, the Windows code page 1252 and subsequent Windows code pages originally based on the ISO 8859-x series deviated from ISO. To this day, it is not uncommon to have the development community, both within and outside of Microsoft, confuse the 8859-1 code page with Windows 1252, as well as see "ANSI" or "A" used to signify Windows code page support.
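To see the deviation in practice, here is a small sketch (it assumes your iconv knows the names windows-1252 and iso-8859-1, which GNU iconv does). Byte 0x80 sits in the range that ISO-8859-1 reserves for control codes, while Windows-1252 assigns it the euro sign.

```shell
# The same input byte (0x80, octal 200) decoded under the two encodings.
# Windows-1252 maps it to U+20AC (euro sign): UTF-8 bytes e2 82 ac.
printf '\200' | iconv -f windows-1252 -t utf-8 | od -An -tx1
# ISO-8859-1 maps it to the control character U+0080: UTF-8 bytes c2 80.
printf '\200' | iconv -f iso-8859-1 -t utf-8 | od -An -tx1
```

So picking the wrong one of the two with -f silently turns characters such as the euro sign or curly quotes into invisible control characters.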
Don't use the same filename as input and output! iconv seems to truncate files to 32,768 bytes if they exceed this size. Because it writes to the file it is trying to read from, it manages to do the job if the file is small enough; otherwise it truncates the file without any warning... – Niavlys – 2014-09-11T07:32:45.947
FYI: this question is tagged with osx, and it doesn't look like either of the convert-all commands works on Yosemite or El Cap. The iconv version Apple ships doesn't support --verbose or -o, and the other syntax redirecting stdout doesn't work for some reason and just sends it to regular stdout. – Scott McIntyre – 2016-05-09T13:02:49.350
28
With PowerShell you can do something like this:
% get-content IN.txt | out-file -encoding ENC -filepath OUT.txt
where ENC is something like unicode, ascii, utf8, or utf32. Check 'help out-file' for details.
To convert all the *.txt files in a directory to utf8, do something like this:
% foreach($i in ls -name DIR/*.txt) {
get-content DIR/$i |
out-file -encoding utf8 -filepath DIR2/$i
}
which creates a converted version of each .txt file in DIR2.
EDIT: To replace the files in all subdirectories use:
% foreach($i in ls -recurse -filter "*.java") {
$temp = get-content $i.fullname
out-file -filepath $i.fullname -inputobject $temp -encoding utf8 -force
}
Converting from ANSI to UTF via your first proposal does erase the whole content of my textfile... – Acroneos – 2015-05-09T07:24:36.960
@Acroneos: then you made a mistake: the in-file is IN.txt, the outfile is OUT.txt ... this way it is impossible to overwrite the original. if you used the same filename for IN.txt and OUT.txt then you overwrite the file you are reading from, obviously. – akira – 2015-05-10T06:06:42.240
Powershell will convert to UTF with BOM. find and iconv might be much easier. – pparas – 2017-08-23T14:25:44.767
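If you want to check whether a converted file ended up with a BOM, here is a rough sketch from a Unix-like shell (Cygwin works too). The function name has_bom is made up for illustration, and it assumes your head supports -c, which the GNU and BSD versions do.

```shell
# Hypothetical helper: succeed if the file starts with the UTF-8 BOM (EF BB BF).
has_bom() {
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}
```

Usage would be something like `has_bom OUT.txt && echo "OUT.txt has a BOM"`.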
6
The Wikipedia page on newlines has a section on conversion utilities.
This seems your best bet for a conversion using only tools Windows ships with:
TYPE unix_file | FIND "" /V > dos_file
3
The character encoding of all matching text files is detected automatically, and all matching text files are converted to utf-8 encoding:
$ find . -type f -iname '*.txt' -exec sh -c 'iconv -f $(file -bi "$1" |sed -e "s/.*[ ]charset=//") -t utf-8 -o converted "$1" && mv converted "$1"' -- {} \;
To perform these steps, a sub shell sh is used with -exec, running a one-liner with the -c flag, and passing the filename as the positional argument "$1" with -- {}. In between, the utf-8 output file is temporarily named converted.
The find command is very useful for such file management automation. Click here for more find galore.
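One thing to guard against with any batch run is converting a file that is already UTF-8 a second time, which produces gibberish. Here is a minimal sketch of such a guard. The function names is_utf8 and to_utf8_once are made up for illustration, and it relies on iconv exiting with a non-zero status when the input is not valid in the declared source encoding.

```shell
# Hypothetical helper: succeed if the file decodes cleanly as UTF-8.
# Pure ASCII files also pass, since ASCII is a subset of UTF-8.
is_utf8() {
    iconv -f utf-8 -t utf-8 "$1" > /dev/null 2>&1
}

# Hypothetical helper: convert file ($1) from the given encoding ($2) to
# UTF-8 via a temporary file, unless it is already valid UTF-8.
to_utf8_once() {
    file=$1
    from=$2
    if is_utf8 "$file"; then
        echo "skip (already valid UTF-8): $file"
    else
        iconv -f "$from" -t utf-8 "$file" > "$file.tmp" && mv "$file.tmp" "$file"
    fi
}
```

It could be plugged into find with something like `to_utf8_once "$1" windows-1252` inside the sh -c one-liner. The check is a heuristic: a legacy-encoded file can occasionally happen to be valid UTF-8 as well, though that is rare for real text.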
3
UTFCast is a Unicode converter for Windows which supports batch mode. I'm using the paid version and am quite comfortable with it.
UTFCast is a Unicode converter that lets you batch convert all text files to UTF encodings with just a click of your mouse. You can use it to convert a directory full of text files to UTF encodings including UTF-8, UTF-16 and UTF-32 to an output directory, while maintaining the directory structure of the original files. It doesn't even matter if your text file has a different extension, UTFCast can automatically detect text files and convert them.
Seems they cannot convert into the same folder, only into another destination folder. – Uwe Keim – 2016-08-09T19:49:19.243
The pro version allows in-place conversion. $20/3months. https://www.rotatingscrew.com/utfcast-version-comparison.aspx – SherylHohman – 2019-01-30T19:19:50.287
Oh, the express (free) version is useless: it only "detects" utf-8 WITH BOM!! (Everyone can do that.) Only the Pro version, which auto-renews every 3 months at $20 a pop, will auto-detect. The price is steep for a non-enterprise user. AND beware: if you try the basic version and your file is already utf-8 (without BOM), this converter will detect it as ASCII and then (re-)"convert" it to utf-8, which could result in gibberish. Be aware of this before trying the express version! They have a demo version of the Pro edition that produces no output, which is pointless IMHO because you can't verify results before buying! – SherylHohman – 2019-01-30T19:38:16.700
1
Use this Python script: https://github.com/goerz/convert_encoding.py It works on any platform. Requires Python 2.7.
1
iconv -f original_charset -t utf-8 originalfile > newfile
Run the above command in a for loop.
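A minimal sketch of such a loop, assuming a POSIX shell. The function name convert_dir and the .utf8 suffix for the output files are made up for illustration, and the charset argument has to be the real source encoding since nothing here detects it.

```shell
# Hypothetical helper: convert every *.txt directly inside dir ($1) from the
# given source charset ($2) to UTF-8, writing NAME.txt.utf8 next to each file.
convert_dir() {
    dir=$1
    original_charset=$2
    for f in "$dir"/*.txt; do
        [ -e "$f" ] || continue   # no matches: the glob pattern stays literal
        iconv -f "$original_charset" -t utf-8 "$f" > "$f.utf8"
    done
}
```

For example, `convert_dir . windows-1252`. Writing to a new name keeps the originals intact and means a second run does not pick up the already converted files.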
I'm guessing original_charset is just a placeholder here, not actually the magical "detect my encoding" feature we all might hope for. – mwfearnley – 2020-02-26T09:11:39.843
0
In my use-case, I needed automatic input encoding detection, and there were a lot of files with Windows-1250 encoding, for which the command file -bi <FILE> returns charset=unknown-8bit. This is not a valid parameter for iconv.
I have had the best results with enca.
Convert all files with the txt extension to utf-8:
find . -type f -iname '*.txt' -exec sh -c 'echo "$1" && enca "$1" -x utf-8' -- {} \;
0
There is dos2unix on unix.
There was another similar tool for Windows (another ref here).
How do I convert between Unix and Windows text files? has some more tricks
dos2unix is useful to convert line breaks, but the OP is looking for converting character encodings. – Sony Santos – 2014-04-17T03:01:43.157
0
You can use EncodingMaster. It's free, it has Windows, Linux and Mac OS X versions, and it works really well.
The website you mention is closed. – Etienne Delavennat – 2018-09-26T08:23:52.003
http://stackoverflow.com/a/24713621/242933 – ma11hew28 – 2014-07-12T13:56:19.247
1
related: http://stackoverflow.com/questions/724083/unix-newlines-to-windows-newlines-on-windows – None – 2009-08-21T09:18:25.730