How do I remove non-ascii characters from filenames?


I have several files with names containing various Unicode characters. I'd like to rename them to only contain the "printable" ASCII characters (32-126).

E.g.,

Läsmig.txt         // Before
L_smig.txt         // After
Mike’s Project.zip // Before
Mike_s Project.zip // After

Or, for bonus points, transliterate to the closest ASCII character:

Läsmig.txt
Lasmig.txt
Mike’s Project.zip
Mike's Project.zip

Ideally looking for an answer that doesn't require 3rd party tools. (Edit: Scripts encouraged; I'm just trying to avoid niche shareware apps that need to be installed to work)


PowerShell snippet that finds the files I'm interested in renaming:

gci -recurse | where {$_.Name -match "[^\u0020-\u007E]"}

A similar, unanswered Python question: https://stackoverflow.com/questions/17870055/how-to-rename-a-file-with-non-ascii-character-encoding-to-ascii
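For the linked Python question, the same idea can be sketched like this (the `asciify` helper name is mine, not from any answer): Unicode-decompose the name, drop the combining marks, then replace anything still outside printable ASCII (32–126) with `_`.

```python
import unicodedata

def asciify(name: str) -> str:
    # Decompose accented letters into a base letter plus combining marks,
    # e.g. "ä" -> "a" + U+0308 (NFKD normalization).
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks left over from the decomposition.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Replace any remaining character outside printable ASCII with "_".
    return "".join(c if 32 <= ord(c) <= 126 else "_" for c in stripped)
```

Note this matches the "simple" behavior above: `ä` becomes `a`, but characters with no decomposition (like `’` or `ß`) become `_`.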

RJFalconer

Posted 2013-08-24T19:06:30.243


Doesn't require 3rd party tools... by hand sounds about right. Define "3rd party". And Python has a cute function called casefold() that, with a little work, could give you what you want.

– Doktoro Reichard – 2013-08-24T19:11:24.300

I have the same PS script and love it; it is indispensable since I frequently still use FAT32 and Windows XP (and even DOS). However mine lets you choose where to look: gci -recurse $args[0] | where {$_.Name -match "[^\u0000-\u007F]"} – Synetech – 2013-08-24T19:34:46.220

There’s no such thing as “extended ASCII”. – kinokijuf – 2013-08-24T20:11:58.380


@kinokijuf, tell that to everybody else.

– Synetech – 2013-11-29T20:04:48.543

@Synetech ASCII is characters 0x00 to 0x7F. – kinokijuf – 2013-11-29T20:20:33.250

@kinokijuf, nobody questioned that; the ASCII set is indeed 0-127. And the “extended ASCII set” is 128-255. You may not like it, but it is the de facto term and has been for decades. Google alone returns 170,000 hits for it. – Synetech – 2013-11-29T21:57:25.400

@Synetech There are no “characters 128–255”; Windows NT has used UTF-16 since the very beginning. – kinokijuf – 2013-11-29T22:45:29.283

@kinokijuf, and of course, nothing existed before Windows NT. – Synetech – 2013-11-29T22:50:47.830

@kinokijuf, Synetech is correct. The extended ASCII code set had been in existence for over a decade when Windows NT shipped. Every DOS program known to man used the extended ASCII set. – Roger – 2013-11-30T02:00:34.287

@Roger There is no single “extended ASCII” set. There are several language-specific OEM code pages.

– kinokijuf – 2013-11-30T21:27:33.820


@Kinokijuf, there are code pages now. That's not in dispute. DOS added code page support only in DOS 3.3. However, the extended ASCII character set was built into the ROM of the original IBM PC display adapters. See this site for more info.

– Roger – 2013-12-19T19:23:14.670

@Roger There were code pages back in 16-bit times. In NT they are only for legacy support. NTFS has stored names in UTF-16 since the beginning. Talking about “extended ASCII character 0xAF” is meaningless without specifying which code page. – kinokijuf – 2013-12-20T05:30:07.793

@Synetech @Roger tell me what is the “extended ASCII” code of . – kinokijuf – 2013-12-20T05:38:24.953

Just ignore kinokijuf, he’s trying to be pedantic and purposely being obtuse. Characters 0-127 are the standard ASCII set and characters 128-255 in any code-page are and have always been called the extended set whether he likes it or not. – Synetech – 2013-12-20T06:17:47.507

@Synetech The character I posted (a random Mongolian letter) is in no code page, so clearly it is not “extended ASCII” according to your definition, yet I’m sure the OP wants it replaced. – kinokijuf – 2013-12-20T18:15:45.333

And? Instead of obstinately arguing over de facto terminology that has been used by countless people for decades and decades, you could have simply edited the question to use non-ascii from the start. ◔_◔ – Synetech – 2013-12-20T18:23:32.010

Answers


I found a similar topic here on Stack Overflow.

With the following code, most characters will be translated to their closest ASCII equivalent. I couldn't get the curly apostrophe (’) translated, though. (Maybe it does work; I can't create a filename containing it from the prompt. ;) The ß also does not get translated.

function Remove-Diacritics {
  param ([String]$src = [String]::Empty)
  # Decompose each accented letter into a base letter plus a
  # combining mark (e.g. "ä" -> "a" + U+0308).
  $normalized = $src.Normalize([Text.NormalizationForm]::FormD)
  $sb = New-Object Text.StringBuilder
  $normalized.ToCharArray() | ForEach-Object {
    # Keep everything except the combining (non-spacing) marks.
    if ([Globalization.CharUnicodeInfo]::GetUnicodeCategory($_) -ne [Globalization.UnicodeCategory]::NonSpacingMark) {
      [void]$sb.Append($_)
    }
  }
  $sb.ToString()
}

# Find every file whose name contains a character outside printable ASCII (32-126).
$files = gci -recurse | where {$_.Name -match "[^\u0020-\u007E]"}
$files | ForEach-Object {
  $newname = Remove-Diacritics $_.Name
  if ($_.Name -ne $newname) {
    # If the target name is already taken, append " (1)", " (2)", ...
    # before the extension until a free name is found.
    $num = 1
    $nextname = $_.Fullname.replace($_.Name, $newname)
    while (Test-Path -Path $nextname) {
      $next = ([io.fileinfo]$newname).BaseName + " ($num)" + ([io.fileinfo]$newname).Extension
      $nextname = $_.Fullname.replace($_.Name, $next)
      $num += 1
    }
    echo $nextname
    ren $_.Fullname $nextname
  }
}

Edit:

I added some code to check whether a filename already exists and to append (1), (2), etc. if it does. (It's not smart enough to detect an existing "(1)" in the filename being renamed, so in that case you would get "(1) (1)". But as always... everything is programmable. ;)
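The collision-numbering scheme described above can be sketched as pure string logic in Python (checking against a set of taken names instead of the filesystem; `next_free_name` is an illustrative helper, not part of the script):

```python
import os

def next_free_name(candidate: str, taken: set) -> str:
    # Append " (1)", " (2)", ... before the extension until the
    # name no longer collides with one already present.
    if candidate not in taken:
        return candidate
    base, ext = os.path.splitext(candidate)
    num = 1
    while f"{base} ({num}){ext}" in taken:
        num += 1
    return f"{base} ({num}){ext}"
```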

Edit 2:

Here is the last one for tonight...

This one uses a different function for replacing the characters. It also adds a line to change characters the encoding can't map, such as ß, to _.

function Convert-ToLatinCharacters {
  param([string]$inputString)
  # Round-trip through a legacy code page: .NET's best-fit encoder maps many
  # accented Latin letters to their base letters, and anything it cannot map
  # becomes '?', which is replaced with '_' below.
  [Text.Encoding]::ASCII.GetString([Text.Encoding]::GetEncoding("Cyrillic").GetBytes($inputString))
}

# Find every file whose name contains a character outside printable ASCII (32-126).
$files = gci -recurse | where {$_.Name -match "[^\u0020-\u007E]"}
$files | ForEach-Object {
  $newname = Convert-ToLatinCharacters $_.Name
  $newname = $newname.replace('?','_')
  if ($_.Name -ne $newname) {
    # If the target name is already taken, append " (1)", " (2)", ...
    # before the extension until a free name is found.
    $num = 1
    $nextname = $_.Fullname.replace($_.Name, $newname)
    while (Test-Path -Path $nextname) {
      $next = ([io.fileinfo]$newname).BaseName + " ($num)" + ([io.fileinfo]$newname).Extension
      $nextname = $_.Fullname.replace($_.Name, $next)
      $num += 1
    }
    echo $nextname
    ren $_.Fullname $nextname
  }
}

Rik


"The ß also does not get translated." Probably because the eszett is supposed to get mapped to ss, which is two characters. (Well, either that or to B, which would be dumb if you're not trying to use 1337-speak.) There's obviously no built-in mapping, so you would have to handle it separately. – Synetech – 2013-11-29T22:29:26.267
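As the comment notes, multi-character transliterations such as ß → ss have to be handled by hand before stripping diacritics. A Python sketch of that idea (the tiny `MANUAL` table is illustrative, not exhaustive):

```python
import unicodedata

# Hand-written transliterations that Unicode decomposition can't produce.
MANUAL = {"ß": "ss", "Æ": "AE", "æ": "ae", "’": "'"}

def transliterate(name: str) -> str:
    # Apply the manual table first, then strip diacritics as usual.
    expanded = "".join(MANUAL.get(c, c) for c in name)
    decomposed = unicodedata.normalize("NFKD", expanded)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Replace anything still outside printable ASCII with "_".
    return "".join(c if 32 <= ord(c) <= 126 else "_" for c in stripped)
```

This also covers the "bonus points" case from the question: ’ becomes ' instead of _.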

Thank you for all the time you put into this. Works a treat. – RJFalconer – 2014-06-08T18:12:27.063


I believe this will work...

$Files = gci | where {$_.Name -match "[^\u0020-\u007E]"}

$Files | ForEach-Object {
  $OldName = $_.Name
  # Replace every character outside printable ASCII (32-126) with "_".
  $NewName = $OldName -replace "[^\u0020-\u007E]", "_"
  ren $_ $NewName
}

I don't have files with that range of non-ASCII names to test against, though.

Austin T French


You can easily create some test files with right-click → New Text Document then type some ASCII characters mixed with some extended ANSI/Unicode characters. – Synetech – 2013-08-24T19:40:14.020


I just ran a test with most permutations. Not surprisingly, it worked for the most part, but you might run into errors if the ASCII-only filenames conflict with existing filenames (which could also happen if other files are renamed, e.g., resumé1.doc, resumé2.doc, resumé.doc, etc.)

– Synetech – 2013-08-24T20:02:30.707