To search by Unicode codepoint (which, for this character, is the same number as its UTF-16 code unit) you'd use \x{FEC1}, and it works whether the file is encoded as UTF-8 or UTF-16.
Bear in mind that you don't need to search by the UTF-8 code, because you can search by the UTF-16 code. But to address the part of your question that asks how to search for that character by its UTF-8 code...
You can't. Well, you sort of can, but it's a hideous hack and you really shouldn't.
The obvious thing to try would be to search for \xef\xbb\x81 in your UTF-8 encoded document, but that doesn't work. (Note there's no {} here: Notepad++ expects either \xNN for 2 hex digits, or \x{NNNN} for 4 hex digits.) That's because Notepad++ doesn't actually search for byte values; it searches for Unicode codepoints. So you can search for the codepoint U+FEC1, but not for the UTF-8 bytes 0xEF 0xBB 0x81, because Notepad++ "hides" the encoding details from you. (In nearly every scenario, someone editing a text file will care far more about finding the actual character than about finding the UTF-8 bytes.)
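To make the codepoint-versus-bytes distinction concrete, here is a small Python sketch (Python is my choice for illustration; it has nothing to do with how Notepad++ is implemented). It shows the two byte encodings of U+FEC1 and that a codepoint search gives the same result no matter which encoding the text was stored in.

```python
# U+FEC1 is one codepoint; its byte representation depends on the encoding.
ch = "\uFEC1"
utf8_bytes = ch.encode("utf-8")       # three bytes
utf16_bytes = ch.encode("utf-16-be")  # one 16-bit code unit, big-endian

print(utf8_bytes.hex())   # efbb81
print(utf16_bytes.hex())  # fec1

# The sample string from the question, written with escapes to avoid ambiguity.
text = "\uFEC1\uFEC1\uFEC9\uFEC1\uFEC9\uFEC1\uFEC9"

# Simulate the same file saved as UTF-8 and as UTF-16, then loaded again.
from_utf8 = text.encode("utf-8").decode("utf-8")
from_utf16 = text.encode("utf-16").decode("utf-16")

# Searching by codepoint happens after decoding, so the encoding is irrelevant.
positions_utf8 = [i for i, c in enumerate(from_utf8) if c == ch]
positions_utf16 = [i for i, c in enumerate(from_utf16) if c == ch]
print(positions_utf8, positions_utf16)
```

Both position lists are identical, which is the same reason \x{FEC1} works in Notepad++ for either encoding.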
There's another trick you might try, which is to take that UTF-8 encoded file and choose the Encoding → Encode in ANSI menu option, at which point ﻁﻁﻉﻁﻉﻁﻉ appears to become ï»ï»ï»‰ï»ï»‰ï»ï»‰. (I say "appears to become" rather than "becomes" because... well, read on.) This is because it has taken the UTF-8 bytes of your file and reinterpreted them as "ANSI" (which is a terrible encoding name because it's completely wrong, and should really be called "Windows-1252", but that's a different question).
(By the way, the reason that ﻁﻁﻉﻁﻉﻁﻉ looks reversed in my text compared with your screenshot is that Notepad++ doesn't care that Arabic is written right-to-left, so it shows the characters left-to-right in the order they were pasted into the file. Your browser does care about presenting Arabic in proper right-to-left order, so the first two letters of that string (ﻁﻁ) appear on the right-hand side of the string, not on the left-hand side as they seem to in Notepad++.)
Digressions aside, here's why this is helpful. In the "ANSI" (really Windows-1252) encoding, each byte is a single character, so now you can search by individual bytes. If you search for \xef\xbb\x81 (which doesn't need to be a regular expression, just an "Extended" search), it will find the characters. Sort of. It will look like it's highlighting the two characters ï», but it's really highlighting three characters: ï, », and an "invisible" 0x81 character that doesn't really exist. (There is no character at the 0x81 position in the Windows-1252 encoding, as any Windows-1252 code chart will show you.)
And now you see why I said "appears to become": your UTF-8 encoded text has really become ï»_ï»_ï»‰ï»_ï»‰ï»_ï»‰, where _ represents an "invisible" character that doesn't officially exist in the Windows-1252 codepage.
Anyway, now that you've found the sequence of three characters with the byte values 0xEF, 0xBB, and 0x81 in Windows-1252, and Notepad++ has highlighted them, you can choose the Encoding → Encode in UTF-8 menu option. Your text will convert itself back to UTF-8 while Notepad++ keeps the highlight in the same place, and you'll find that one ﻁ character has been highlighted.
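If what you actually want is a search by raw UTF-8 bytes, doing it outside Notepad++ avoids the encoding round-trip entirely. The following is a minimal Python sketch, not a Notepad++ feature: the helper name find_byte_sequence is made up for illustration, and it assumes the needle starts on a character boundary (true for EF BB 81, since 0xEF is a UTF-8 lead byte).

```python
def find_byte_sequence(raw: bytes, needle: bytes) -> list[tuple[int, int]]:
    """Return (byte_offset, character_index) pairs for each match of needle.

    Hypothetical helper: assumes raw is valid UTF-8 and that every match
    begins at the start of a character, so the prefix decodes cleanly.
    """
    hits = []
    start = raw.find(needle)
    while start != -1:
        # Decode everything before the match to turn a byte offset
        # into a character index.
        char_index = len(raw[:start].decode("utf-8"))
        hits.append((start, char_index))
        start = raw.find(needle, start + 1)
    return hits


# The sample string from the question, as the bytes a UTF-8 file would contain.
text = "\uFEC1\uFEC1\uFEC9\uFEC1\uFEC9\uFEC1\uFEC9"
raw = text.encode("utf-8")

needle = bytes.fromhex("efbb81")  # the UTF-8 code from the question
hits = find_byte_sequence(raw, needle)
print(hits)
```

Each character in this sample occupies three bytes, so the byte offsets are three times the character indices.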
So why do I say that you really shouldn't do this? Because the only reason that it works is that Notepad++ didn't do the right thing when you switched codepages. The right thing to do when you find a missing character is to complain, or insert a character like the Unicode replacement character � (or a simple ? if you're in a legacy codepage that doesn't have � in it), or do something so that the user will know they had an invalid character in their text. Errors should never be silently ignored, and having a 0x81 value in Windows-1252 text is an error. The only reason this trick works is that Notepad++ does the wrong thing with invalid characters (that is, it ignores them). So you really shouldn't rely on this trick: with any update, Notepad++ could change its undocumented (and wrong) behavior and start putting proper replacement characters in wrongly-encoded text, at which point this trick would fail. Stick to searching for real Unicode codepoints, and you'll be much better off.
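For comparison, here is how a decoder that handles the invalid byte explicitly behaves. This sketch uses Python's built-in cp1252 codec, which is only a stand-in to illustrate "complain or substitute"; it says nothing about Notepad++'s internals.

```python
# The UTF-8 bytes of U+FEC1: EF BB 81.
raw = "\uFEC1".encode("utf-8")

# Strict decoding as Windows-1252 complains, because 0x81 is unassigned there.
try:
    raw.decode("cp1252")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    bad_byte = raw[exc.start]  # the byte the codec refused to decode
    print(f"cannot decode byte 0x{bad_byte:02x} at offset {exc.start}")

# Asking for substitution makes the problem visible instead of hiding it:
# the undefined byte becomes U+FFFD, the Unicode replacement character.
visible = raw.decode("cp1252", errors="replace")
print(visible)

# A byte that Windows-1252 does define decodes normally: 0x89 is the
# per-mille sign, which is why U+FEC9 shows up as three visible characters.
per_mille = "\uFEC9".encode("utf-8").decode("cp1252")
print(per_mille)
```

Either behavior would break the highlight trick, since the text would no longer round-trip back to the original UTF-8 bytes unchanged.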
By the way, the reason your original attempt ([\uFEC1]) failed is that, according to Notepad++'s regular expression syntax, \u means "an uppercase letter". (Remember that in regular expressions, brackets represent "any of these characters".) The docs further say, "See note about lower case [sic] letters," and the note about lowercase letters says "this will fall back on 'a word character' if the 'Match case' search option is off", as it is in your screenshot. Therefore, the regex [\uFEC1] is searching for "any word character, or F, or E, or C, or 1", which matches every single character in your sample text.
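The effect of that character class can be reproduced with a hand-written equivalent. Python's re module does not treat \u the way Notepad++ does (in Python, \uFEC1 in a pattern is a codepoint escape), so the sketch below spells the class out as [\wFEC1], which is my translation of the behavior described above rather than anything Notepad++ itself runs.

```python
import re

# The sample string from the question, written with escapes to avoid ambiguity.
text = "\uFEC1\uFEC1\uFEC9\uFEC1\uFEC9\uFEC1\uFEC9"

# Equivalent of what Notepad++ ended up searching for:
# any word character, or the literal letters F, E, C, or the digit 1.
too_broad = re.findall(r"[\wFEC1]", text)

# What was intended: the single codepoint U+FEC1.
intended = re.findall("\uFEC1", text)

print(len(too_broad), len(intended))  # 7 4
```

The broad class matches all seven characters because Arabic letters count as word characters, while the codepoint search matches only the four occurrences of U+FEC1.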
Phew, that turned out to be a very long answer for what I said would be "very simple". I hope this helps you understand Unicode a bit better; if so, the hour I spent typing this up will have been worth it.
Anybody voting to close should give a reason. This question asks re not just UTF 16 , but also UTF 8. – barlop – 2015-10-12T14:32:17.527
This question don't make any sense: you can easy search for \x{FEC1} – duDE – 2015-10-12T14:36:17.173
For UTF-16 \x{FEC1} has been pointed out in an answer now deleted. (and granted one could just use UTF16). But The question still stands though re UTF8 which I asked about. – barlop – 2015-10-12T14:36:28.057
This works for UTF-8 as well, just try it! – duDE – 2015-10-12T14:41:01.330
@duDE I am talking about specifying the code using UTF-8. I know you can specify the code in UTF-16 even when the file is stored in UTF-8. Look at the UTF-8 code, see my question. I want to be able to specify using That code. UTF-8 (hex) 0xEF 0xBB 0x81 (efbb81) That is why that last line of my question said " i'm interested in searching for it by its UTF-8 code too" searching BY ITS utf-8 code – barlop – 2015-10-12T17:56:09.583