12

I have Mercurial repositories served by Apache with mod_wsgi. All filenames in the repositories are encoded in windows-1251. This encoding is used for historical reasons: the repositories were converted to Mercurial from SVN, and windows-1251 is the default Windows encoding for the Russian locale.

Now the programmers want to use the Crucible tool for code review. It can't understand filenames in any encoding other than UTF-8, so I need to convert them from windows-1251 to UTF-8. Does anyone know how to do this? The Mercurial convert extension doesn't have options for converting encodings.

hgweb.config:

[web]
#encoding = UTF-8
encoding = windows-1251
#allow_archive = gz, zip, bz2
allow_archive = zip
allow_push = *
push_ssl = false

[extensions]

[collections]
/data/mercurial = /data/mercurial
Martin Geisler
  • 1,271
  • 9
  • 23
Selivanov Pavel
  • 2,126
  • 3
  • 23
  • 47
  • 1
    Russian filenames?! Don't think about file-rename? – Lazy Badger Oct 06 '11 at 22:42
  • 2
    I'm system administrator, not developer, my duty is to provide stable working service, not to learn how to use it. – Selivanov Pavel Oct 06 '11 at 23:22
  • Matt Mackall [wrote](http://permalink.gmane.org/gmane.comp.version-control.bazaar-ng.general/21737) a long time ago (this may already be outdated): `# If you'd like to serve pages with UTF-8 instead of your default locale charset, you can do so by uncommenting the following lines. Note that this will cause your .hgrc files to be interpreted in UTF-8 and all your repo files to be displayed using UTF-8.` followed by `#import os` and `#os.environ["HGENCODING"] = "UTF-8"` – Lazy Badger Oct 13 '11 at 18:30
  • As I wrote in the comments on the answer below: changing `encoding = windows-1251` to `encoding = UTF-8` just made the filenames unreadable, because they are stored in windows-1251. Setting `os.environ["HGENCODING"] = "UTF-8"` is equivalent to this setting. – Selivanov Pavel Oct 13 '11 at 23:48

3 Answers

7

You are right that the convert extension doesn't support this in a nice way currently. That is, you cannot ask it to recode from encoding X to encoding Y. However, you can ask it to rename the files one by one for you! First create a file called rename.py with

import sys
for path in sys.stdin:
    old = path[:-1] # strip newline
    new = old.decode("cp1251").encode("utf-8")
    print 'rename "%s" "%s"' % (old, new)

Then run

$ hg manifest --all | python rename.py > rename.txt

This creates your file map. You can now use

$ hg convert --filemap rename.txt cp1251-repo utf-8-repo

to convert the repository into a new repository. In the new repository, it will look like the files have always been saved using UTF-8 file names.
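
If only Python 3 is available, the rename.py script above will not run as written, since it relies on the Python 2 `print` statement and on `str.decode`. A minimal sketch of the same mapping for Python 3 follows. It assumes Python 3.5 or later (for `%` formatting on bytes), and the names `rename_directive` and `write_filemap` are made up for illustration; they are not part of Mercurial.

```python
import io


def rename_directive(raw_name, src="cp1251", dst="utf-8"):
    """Return one filemap line (as bytes) that renames raw_name,
    given in the src encoding, to the same name in the dst encoding."""
    new_name = raw_name.decode(src).encode(dst)
    return b'rename "%s" "%s"\n' % (raw_name, new_name)


def write_filemap(manifest, out, src="cp1251", dst="utf-8"):
    """Write a rename directive to out for every path in a binary
    manifest stream (one path per line)."""
    for line in manifest:
        out.write(rename_directive(line.rstrip(b"\r\n"), src, dst))


# Demo with an in-memory manifest holding one cp1251-encoded name.
manifest = io.BytesIO("док.txt\n".encode("cp1251"))
out = io.BytesIO()
write_filemap(manifest, out)
```

To use it in the pipeline above, replace the demo at the bottom with `write_filemap(sys.stdin.buffer, sys.stdout.buffer)` (after `import sys`), so that file names are read and written as raw bytes rather than as text in the locale encoding.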

Note: The file names are now stored as UTF-8 in the repository. This means that checkouts will look fine on modern Linux machines. Windows, however, does not use UTF-8 file names. The FixUtf8 extension must be used to make Mercurial convert the UTF-8 file names into UTF-16 on the fly. This will create readable file names on Windows too.
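
To illustrate the Windows problem: the stored bytes are written to disk unchanged, and a Russian-locale Windows machine interprets them in its ANSI code page (assumed here to be cp1251). The sketch below uses the file name mentioned in the comments. It only demonstrates the byte-level mismatch and does not call Mercurial or FixUtf8.

```python
name = "Пояснительная записка.docx"

# Bytes as stored in the converted (UTF-8) repository.
stored = name.encode("utf-8")

# What a cp1251 system shows when it reads those UTF-8 bytes as cp1251.
# errors="replace" guards against the one byte value cp1251 leaves undefined.
shown = stored.decode("cp1251", errors="replace")

# What Windows needs instead: the same name as UTF-16 for its Unicode API.
wide = stored.decode("utf-8").encode("utf-16-le")

print(shown == name)            # whether the misread name matches the real one
print(len(stored), len(wide))   # byte lengths of the two representations
```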

Note: Everybody will have to re-clone the new repository! Changing any part of the history inevitably changes all the changeset hashes too. So to pull this off, you need to either

  1. make everybody push to the server,
  2. convert the repositories on the server,
  3. have people re-clone

or

  1. make everybody run the above commands on their local repositories
  2. convert the repositories on the server

Either way works since the conversion is deterministic and so your users can run it themselves if they have Python available. If they only have a TortoiseHg installation, then it's probably easiest if you convert for them on your server.

I looked at making the convert extension support this more directly and have sent a patch to the Mercurial mailing list.

Martin Geisler
  • 1,271
  • 9
  • 23
  • I've posted a [full patch](http://selenic.com/pipermail/mercurial-devel/2011-December/036470.html) on the mailinglist. – Martin Geisler Dec 19 '11 at 18:23
  • Crucible evangelist has left our company, so the problem disappeared. Thank you very much for writing code to fix my problem and improvement of mercurial :) I'll try this when I'll have some free time. – Selivanov Pavel Dec 20 '11 at 10:29
  • Ah, problem solved :-) Feel free to accept my answer when you've tried it out. – Martin Geisler Dec 20 '11 at 12:16
  • There is no `hg manifest --all` in Mercurial 1.8, so I used `hg manifest -r tip`. Everything works on the Linux machine: I see proper filenames in the repo with `encoding = UTF-8` in hgweb.config, and the filenames are correct in a cloned repository. On the Windows machine I got the filename "Пояснительная записка.docx" after cloning. – Selivanov Pavel Dec 20 '11 at 16:03
  • Yes, this is to be expected — the file names are not recoded when you clone onto Windows and so your checkout becomes garbage. Use the FixUtf8 extension for now. – Martin Geisler Dec 20 '11 at 16:06
  • C:\Documents and Settings\selivanov\Рабочий стол\test-utf-8>hg addremove -s 100 Traceback (most recent call last): File "hg", line 42, in File "mercurial\dispatch.pyo", line 27, in run File "mercurial\dispatch.pyo", line 64, in dispatch File "mercurial\dispatch.pyo", line 127, in _runcatch File "mercurial\ui.pyo", line 606, in warn File "mercurial\extensions.pyo", line 182, in wrap File "s:\tortoisehg\hg-fixutf8\fixutf8.py", line 140, in f win32helper.rawprint(h, ''.join(args)) File "s:\tortoisehg\hg-fixutf8\win32helper.py", line 84, in rawprint – Selivanov Pavel Dec 20 '11 at 16:28
  • u = s.decode('utf-8') File "encodings\utf_8.pyo", line 16, in decode UnicodeDecodeError: 'utf8' codec can't decode byte 0xd0 in position 54: invalid continuation byte – Selivanov Pavel Dec 20 '11 at 16:28
  • FixUtf8 does not work for me – Selivanov Pavel Dec 20 '11 at 16:29
  • let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/2040/discussion-between-martin-geisler-and-selivanov-pavel) – Martin Geisler Dec 20 '11 at 16:34
2

I had the same problem. I needed to convert a bunch of repositories, so I wrote a script that converts all repositories given as a list.

usage:

hg_convert_filenames_encoding.py [-h] [-i INPUT_ENCODING] [-o OUTPUT_ENCODING] [-b] [-u] [repositories [repositories ...]]

You can get it from my repository on Bitbucket.

rominf
  • 121
  • 6
0

Just an extract from the Mercurial wiki, FYI:

The following are explicitly treated as binary data in an unknown encoding:

  • file contents
  • file names

These items should be treated as binary data and preserved losslessly wherever possible.

Thus, I suppose, just changing the presentation charset in `encoding =` may do the trick.

If this assumption is wrong (that's always possible), try the FixUtf8 extension and read the "Fixing existing filenames" part of its readme carefully.
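
One way to check this assumption is to test whether the stored bytes are even valid under the new charset. A minimal sketch, using a made-up sample name and assuming the names were committed as windows-1251 bytes, as the question states (`decodes_as` is a hypothetical helper, not a Mercurial function):

```python
# Bytes as they would be stored in the repository (sample name, cp1251).
raw = "записка.docx".encode("cp1251")


def decodes_as(data, charset):
    """Return True if data is valid text in the given charset."""
    try:
        data.decode(charset)
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as(raw, "cp1251"))  # are the bytes valid windows-1251?
print(decodes_as(raw, "utf-8"))   # are the same bytes valid UTF-8?
```

If the second check fails, switching `encoding =` alone cannot produce readable names, because it changes only how the bytes are interpreted, not the bytes themselves.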

Lazy Badger
  • 3,067
  • 14
  • 13
  • Changing `encoding = windows-1251` to `encoding = UTF-8` just made filenames unreadable. FixUtf8 didn't help either. – Selivanov Pavel Oct 13 '11 at 17:30
  • OK. Try to return back encoding = windows-1251 and try `AddDefaultCharset utf-8` in Apache. Another idea - change WAMP to LAMP with UTF8 as locale and do nothing with filenames but have `encoding = UTF-8` again – Lazy Badger Oct 14 '11 at 07:08
  • This _is_ LAMP. In old days repository was placed on WAMP with Subversion, later I converted it to Mercurial and moved to Linux machine, with UTF-8 locale. `encoding = windows-1251` was set because data was already in this encoding. – Selivanov Pavel Oct 14 '11 at 13:20
  • re-read Wiki! 1251 is only presentation layer, not storage – Lazy Badger Oct 14 '11 at 14:52
  • _are explicitly treated as binary data in an unknown encoding_ And this encoding is windows-1251, filenames are stored in it. I need some way to convert them to UTF-8 – Selivanov Pavel Oct 14 '11 at 19:39