Detect Character encoding of unknown characters in Notepad++, and find/replace

The black characters appear because Notepad++ cannot map those byte sequences to a valid UTF-8 code point that it can display. As you might know, each character is stored as a series of bytes. In this case, those bytes are:

In Hexadecimal

ED A0 BD ED B8 8A

In Binary

1110 1101 1010 0000 1011 1101 1110 1101 1011 1000 1000 1010

This brings us to your questions.

What is the encoding of those black characters?

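These are two three-byte sequences in the UTF-8 byte layout. More specifically, the first one, ED A0 BD, encodes what is called a 'high surrogate', and the second one, ED B8 8A, encodes a 'low surrogate'. Together, they form a UTF-16 surrogate pair in which each half was encoded separately (strict UTF-8 does not allow surrogates, which is why Notepad++ cannot display them). Now that we know these are surrogates, we can reverse the surrogate pair calculation.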

If you look up these byte sequences in a UTF-8 code point table, you will find that the first one maps to U+D83D and the second one maps to U+DE0A. Hence, the pair is D83D+DE0A. This maps to U+1F60A, which is... (drumroll) a smiley!
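If you would rather check the lookup than trust a table, the calculation can be sketched in a few lines of Python. This is only an illustration: the helper names decode_three_byte and combine_surrogates are made up for this example, and the code assumes the input really is two well-formed three-byte sequences.

```python
import unicodedata

def decode_three_byte(seq: bytes) -> int:
    """Decode one 3-byte sequence laid out as 1110xxxx 10xxxxxx 10xxxxxx."""
    b0, b1, b2 = seq
    # Keep the 4 + 6 + 6 payload bits and glue them together.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

def combine_surrogates(high: int, low: int) -> int:
    """Reverse the UTF-16 surrogate pair calculation."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

raw = bytes.fromhex("ED A0 BD ED B8 8A")
high = decode_three_byte(raw[:3])
low = decode_three_byte(raw[3:])
code_point = combine_surrogates(high, low)

print(f"U+{high:04X} + U+{low:04X} -> U+{code_point:X}")  # U+D83D + U+DE0A -> U+1F60A
print(unicodedata.name(chr(code_point)))                  # SMILING FACE WITH SMILING EYES
```

The masks 0x0F and 0x3F strip the marker bits (1110 and 10) that you can see at the start of each byte in the binary dump above.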

So, the sequence you are looking at... is a smiley.

A regular expression to find these characters

Now that we know this, we can use a regular expression to find this sequence:

\x{D83D}\x{DE0A}
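If you want to do the find/replace outside Notepad++, here is a rough sketch of the same idea in Python, working on the raw bytes. It is an assumption on my part that you want every such pair replaced by the proper four-byte UTF-8 form; the names SURROGATE_PAIR, to_real_utf8 and repair are invented for this example.

```python
import re

# A high surrogate (ED A0..AF xx) directly followed by a low surrogate (ED B0..BF xx).
SURROGATE_PAIR = re.compile(rb"\xED[\xA0-\xAF][\x80-\xBF]\xED[\xB0-\xBF][\x80-\xBF]")

def to_real_utf8(match: "re.Match[bytes]") -> bytes:
    """Turn one encoded surrogate pair into the proper 4-byte UTF-8 sequence."""
    # 'surrogatepass' lets the decoder accept the two lone surrogates.
    pair = match.group(0).decode("utf-8", "surrogatepass")
    # A round trip through UTF-16 merges the pair into one real character.
    char = pair.encode("utf-16", "surrogatepass").decode("utf-16")
    return char.encode("utf-8")

def repair(data: bytes) -> bytes:
    """Replace every encoded surrogate pair in data; leave everything else alone."""
    return SURROGATE_PAIR.sub(to_real_utf8, data)

sample = b"id,comment\n1,nice " + bytes.fromhex("ED A0 BD ED B8 8A") + b"\n"
fixed = repair(sample)
print(fixed.decode("utf-8").encode("unicode_escape"))
```

Unlike the pattern above, this one matches any surrogate pair, not just the one for this particular smiley, so other emoji stored the same way would be repaired too.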

As for the question that you added in the comments later (this is an edit):

is there any other utility I can find the characters which are not in UTF8 in my csv file.

I have now shown that the characters in your file use the UTF-8 byte layout. However, to still answer your question: if you want a tool to manually try to find the encoding of some characters, you can use this website. Here you can enter some text, specify the encoding, and transform it into another encoding to see what characters it maps to.
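If you prefer a script over a website, a minimal sketch along these lines would list the places in a file that a strict UTF-8 decoder rejects. The function name find_invalid_utf8 is made up for this example, and it assumes the file is small enough to read into memory.

```python
from typing import Iterator, Tuple

def find_invalid_utf8(data: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (offset, offending_bytes) for each span the strict decoder rejects."""
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")  # strict mode: raises on the first bad byte
            return                      # the rest of the data is clean
        except UnicodeDecodeError as err:
            start, end = pos + err.start, pos + err.end
            yield start, data[start:end]
            pos = end                   # resume right after the rejected span

# In-memory sample; for a real file, read it with open(path, "rb").read().
sample = b"ok," + bytes.fromhex("ED A0 BD ED B8 8A") + b",ok"
for offset, bad in find_invalid_utf8(sample):
    print(f"offset {offset}: {bad.hex().upper()}")
```

Note that Python's strict decoder flags the surrogate bytes from your file as invalid, so this reports them as well as any bytes that come from a different encoding altogether.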

Wouter

Posted 2017-06-06T06:19:20.557

Reputation: 1 259

is there any other utility I can find the characters which are not in UTF8 in my csv file. Kindly let me know – user2068804 – 2017-06-06T15:42:50.853

I figured it out! :) Check out my edit. The sequence you are looking at is a smiley! Problem solved. Case closed. – Wouter – 2017-06-07T09:26:23.477
