How to find all files in a directory that contain a UTF-8 BOM (byte-order mark)?

8

11

On Windows, I need to find all files in a directory that contain a UTF-8 BOM (byte-order mark). Which tool can do that, and how?

It can be a PowerShell script, a text editor's advanced search feature, or anything else.

Borek Bernard

Posted 2012-04-30T00:33:46.123

Reputation: 11 400

Answers

15

Here is an example of a PowerShell script. It looks at the files in the root of C: (without recursing into subdirectories) and returns those whose first 3 bytes are 0xEF, 0xBB, 0xBF.

Function ContainsBOM
{   
    return $input | where {
        $contents = [System.IO.File]::ReadAllBytes($_.FullName)
        $_.Length -gt 2 -and $contents[0] -eq 0xEF -and $contents[1] -eq 0xBB -and $contents[2] -eq 0xBF }
}

get-childitem "C:\*.*" | where {!$_.PsIsContainer } | ContainsBOM

Is it necessary to "ReadAllBytes"? Maybe reading just a few first bytes would perform better?

Fair point. Here is an updated version that only reads the first 3 bytes.

Function ContainsBOM
{   
    return $input | where {
        $contents = new-object byte[] 3
        $stream = [System.IO.File]::OpenRead($_.FullName)
        $stream.Read($contents, 0, 3) | Out-Null
        $stream.Close()
        $contents[0] -eq 0xEF -and $contents[1] -eq 0xBB -and $contents[2] -eq 0xBF }
}

get-childitem "C:\*.*" | where {!$_.PsIsContainer -and $_.Length -gt 2 } | ContainsBOM
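For comparison, the same check can be sketched in Python, the language used in another answer on this page. This is a minimal sketch, not part of the original answer: the names `has_utf8_bom` and `find_bom_files` are made up for illustration, and it assumes a plain Python 3 interpreter is available. Like the updated script above, it reads only the first 3 bytes of each file, and it walks subdirectories as well.

```python
import os

UTF8_BOM = b"\xef\xbb\xbf"  # the three bytes 0xEF, 0xBB, 0xBF


def has_utf8_bom(path):
    """Return True if the file at `path` starts with the UTF-8 BOM."""
    with open(path, "rb") as f:
        # Read at most 3 bytes; shorter files simply compare unequal.
        return f.read(3) == UTF8_BOM


def find_bom_files(root):
    """Yield paths of files under `root` (recursively) that start with a UTF-8 BOM."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if has_utf8_bom(path):
                    yield path
            except OSError:
                # Skip files that cannot be opened (locked, no permission).
                continue
```

Calling `for p in find_bom_files(r"C:\Temp"): print(p)` would list the matching files; the folder name here is only a placeholder.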

vcsjones

Posted 2012-04-30T00:33:46.123

Reputation: 2 433

This saved my day! I also learned that get-childitem -recurse handles subdirectories as well. – diynevala – 2015-09-04T10:31:40.383

I wondered if there's a way to remove the BOMs using the above script? – tom_mai78101 – 2018-06-06T18:58:56.887

Cool. Before I mark it as the answer, is it necessary to "ReadAllBytes"? Maybe reading just a few first bytes would perform better? – Borek Bernard – 2012-04-30T01:27:03.047

@Borek See edit. – vcsjones – 2012-04-30T02:10:23.087

2

As a side note, here's a PowerShell script that I use to strip the UTF-8 BOM character(s) from my source files:

$files=get-childitem -Path . -Include @("*.h","*.cpp") -Recurse
foreach ($f in $files)
{
(Get-Content $f.PSPath) | 
Foreach-Object {$_ -replace "\xEF\xBB\xBF", ""} | 
Set-Content $f.PSPath
}
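A byte-level variant of the same idea can be sketched in Python, the language used in another answer on this page. This is only an illustration under stated assumptions: the function names `strip_utf8_bom` and `strip_bom_in_tree` are hypothetical, and the default extensions simply mirror the `*.h` and `*.cpp` filter above. Because the file is never decoded as text, everything after the BOM is written back byte for byte.

```python
import os

UTF8_BOM = b"\xef\xbb\xbf"  # the three bytes 0xEF, 0xBB, 0xBF


def strip_utf8_bom(path):
    """Remove a leading UTF-8 BOM from the file at `path`.

    Returns True if a BOM was found and removed, False otherwise.
    """
    with open(path, "rb") as f:
        data = f.read()
    if not data.startswith(UTF8_BOM):
        return False  # nothing to do; the file is left untouched
    with open(path, "wb") as f:
        f.write(data[len(UTF8_BOM):])
    return True


def strip_bom_in_tree(root, extensions=(".h", ".cpp")):
    """Strip BOMs from files under `root` whose names end with one of `extensions`.

    Returns the list of paths that were modified.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(extensions):
                path = os.path.join(dirpath, name)
                if strip_utf8_bom(path):
                    changed.append(path)
    return changed
```

Calling `strip_bom_in_tree(".")` processes the current directory and its subdirectories. The files are rewritten in place, so it is worth trying this on a copy first.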

Scott Smith

Posted 2012-04-30T00:33:46.123

Reputation: 184

I just got a slew of files which differed only by the fact that some had a BOM and some did not. Your answer was just what I needed to clean it all up. Thank you! – Tevya – 2018-10-26T15:57:19.690

1

If you are on an enterprise computer (like me) with restricted privileges and can't run PowerShell scripts, you can use a portable Notepad++ with the PythonScript plugin to do the task, using the following script:

import os

# Folder to process (recursively).
filePathSrc = "C:\\Temp\\UTF8"
# Binary formats to skip; compared case-insensitively.
skipExtensions = ('.jar', '.ear', '.gif', '.jpg', '.jpeg', '.xls', '.png', '.cab', '.ico')

for root, dirs, files in os.walk(filePathSrc):
    for fn in files:
        if not fn.lower().endswith(skipExtensions):
            path = os.path.join(root, fn)
            notepad.open(path)
            console.write(path + "\r\n")
            notepad.runMenuCommand("Encoding", "Convert to UTF-8 without BOM")
            notepad.save()
            notepad.close()

Credit goes to https://pw999.wordpress.com/2013/08/19/mass-convert-a-project-to-utf-8-using-notepad/

Hoàng Long

Posted 2012-04-30T00:33:46.123

Reputation: 127