How do I find this character (by Unicode search) in Notepad++: ﻁ (\uFEC1 and only that character)

6

5

How do I find this character (by Unicode search) in Notepad++: ﻁ

If I go to charmap

and I pick this character

I type FEC1 in the unicode search box and hit ENTER and it finds the character

I look it up on fileformat.info

http://www.fileformat.info/info/unicode/char/fec1/index.htm

UTF-8 (hex)   0xEF 0xBB 0x81 (efbb81)
UTF-16 (hex)  0xFEC1 (fec1)
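
For reference, those two hex values can be reproduced outside of charmap. A minimal sketch, assuming Python purely for illustration (it has nothing to do with Notepad++ itself), that encodes U+FEC1 both ways:

```python
# Encode the single character U+FEC1 in UTF-8 and in UTF-16
# (big-endian, no byte-order mark) and show the bytes as hex.
ch = "\uFEC1"

utf8_hex = ch.encode("utf-8").hex()
utf16_hex = ch.encode("utf-16-be").hex()

print(f"Codepoint: U+{ord(ch):04X}")
print("UTF-8 bytes :", utf8_hex)
print("UTF-16 bytes:", utf16_hex)
```

The three UTF-8 bytes and the single UTF-16 code unit are two different encodings of the same codepoint.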

If I enter the character into the search box literally then it finds it

But I can't see what unicode to search for to find it

I'd like to be able to search for it in both UTF-8 and UTF-16

[\uFEC1] seems to find the character, but it finds more than that character

Now, if I throw a few FEC9s in there, then I see [\uFEC1] seems to find them too

So, how do I search for \uFEC1 and only that? And I'm interested in searching for it by its UTF-8 code too.

barlop

Posted 2015-10-12T13:29:00.100

Reputation: 18 677

Anybody voting to close should give a reason. This question asks re not just UTF 16 , but also UTF 8. – barlop – 2015-10-12T14:32:17.527

This question doesn't make any sense: you can easily search for \x{FEC1} – duDE – 2015-10-12T14:36:17.173

For UTF-16 \x{FEC1} has been pointed out in an answer now deleted. (and granted one could just use UTF16). But The question still stands though re UTF8 which I asked about. – barlop – 2015-10-12T14:36:28.057

This works for UTF-8 as well, just try it! – duDE – 2015-10-12T14:41:01.330

@duDE I am talking about specifying the code using UTF-8. I know you can specify the code in UTF-16 even when the file is stored in UTF-8. Look at the UTF-8 code, see my question. I want to be able to specify using That code. UTF-8 (hex) 0xEF 0xBB 0x81 (efbb81) That is why that last line of my question said " i'm interested in searching for it by its UTF-8 code too" searching BY ITS utf-8 code – barlop – 2015-10-12T17:56:09.583

Answers

13

To search by Unicode codepoint (which, for this character, is the same number as its UTF-16 code unit) you'd use \x{FEC1}, and it works whether the file is encoded as UTF-8 or UTF-16.

Bear in mind you wouldn't need to search by the UTF-8 code, because you can search by the UTF-16 code. But to address the part of your question that asks how do you search for that character by the UTF-8 code...

You can't. Well, you sort of can, but it's a hideous hack and you really shouldn't.

The obvious thing to try would be to search for \xef\xbb\x81 in your UTF-8 encoded document, but that doesn't work. (Note there's no {} here: Notepad++ expects either \xNN for 2 hex digits, or \x{NNNN} for 4 hex digits). That's because Notepad++ doesn't actually search for byte values, it searches for Unicode codepoints. So you can search for the codepoint U+FEC1, but not for the UTF-8 bytes 0xEF 0xBB 0x81, because Notepad++ "hides" the encoding details from you. (Because in nearly every scenario, someone editing a text file will care far more about finding the actual character than about finding the UTF-8 bytes.)
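
The difference between searching codepoints and searching bytes can be sketched outside Notepad++. This is only an illustration in Python, not a description of how Notepad++ is implemented; the `find_all` helper and the sample string (the seven characters from the question) are my own assumptions:

```python
# The sample text from the question: U+FEC1 and U+FEC9 mixed together.
text = "\uFEC1\uFEC1\uFEC9\uFEC1\uFEC9\uFEC1\uFEC9"
data = text.encode("utf-8")  # the same text as raw UTF-8 bytes


def find_all(haystack, needle):
    """Return every offset at which needle occurs in haystack (str or bytes)."""
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        start = haystack.find(needle, start + 1)
    return positions


# Codepoint search: offsets count characters.
char_hits = find_all(text, "\uFEC1")

# Byte search: offsets count bytes, and the needle is the UTF-8 sequence.
byte_hits = find_all(data, b"\xef\xbb\x81")

print(char_hits)
print(byte_hits)
```

Both searches find the same four occurrences; only the unit of the offsets differs (characters versus bytes). An editor that works on decoded text only ever offers the first kind of search.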

There's another trick you might try, which is to take that UTF-8 encoded file and choose the Encoding → Encode in ANSI menu option, at which point ﻁﻁﻉﻁﻉﻁﻉ appears to become ï»ï»ï»‰ï»ï»‰ï»ï»‰. (I say "appears to become" rather than "becomes" because... well, read on.) This is because it has taken the UTF-8 bytes of your file and reinterpreted them as "ANSI" (which is a terrible encoding name because it's completely wrong, and should really be called "Windows-1252", but that's a different question).

(By the way, the reason that ﻁﻁﻉﻁﻉﻁﻉ looks backwards in my text compared to your screenshot: Notepad++ doesn't care that Arabic is written right-to-left, so it shows the characters left-to-right in the order they were pasted into the file. But your browser does care about presenting Arabic in proper right-to-left order, so the first two letters of that string (ﻁﻁ) appear on the right-hand side of the string, not on the left-hand side as they seem to in Notepad++.)

Digressions aside, here's why this is helpful. In the "ANSI" (really Windows-1252) encoding, each byte is a single character, so now you can search by individual bytes. If you search for \xef\xbb\x81 (which doesn't need to be a regular expression, just an "Extended" search), it will find the characters. Sort of. It will look like it's highlighting the two characters ï», but it's really highlighting three characters: ï, », and an "invisible" 0x81 character that doesn't really exist. (Because there is no character at the 0x81 point in Windows-1252 encoding: see for yourself.) And now you see why I said "appears to become" -- because your UTF-8 encoded text has really become ï»_ï»_ï»‰ï»_ï»‰ï»_ï»‰, where _ represents an "invisible" character that doesn't officially exist in the Windows-1252 codepage.

Anyway, now that you've found the sequence of three characters with the byte values 0xEF, 0xBB, and 0x81 in Windows-1252, and Notepad++ has highlighted them, you can choose the Encoding → Encode in UTF-8 menu option, and your text will convert itself back to UTF-8, while Notepad++ will keep the highlight in the same place -- and thus, you'll find that one character has been highlighted.

So why do I say that you really shouldn't do this? Because the only reason that it works is that Notepad++ didn't do the right thing when you switched codepages. The right thing to do when you find a missing character is to complain, or insert a character like the Unicode replacement character (or a simple ? if you're in a legacy codepage that doesn't have a replacement character in it), or do something so that the user will know they had an invalid character in their text. Errors should never be silently ignored, and having a 0x81 value in Windows-1252 text is an error. The only reason this trick works is that Notepad++ does the wrong thing with invalid characters (that is, it ignores them). So you really shouldn't rely on this trick: with any update to Notepad++, it could change its undocumented (and wrong) behavior and start putting proper replacement characters in wrongly-encoded text, at which point this trick would fail. Stick to searching for real Unicode codepoints, and you'll be much better off.
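
To show what "complaining" or "inserting a replacement character" looks like in practice, here is a small sketch using Python's `cp1252` codec as a stand-in for a stricter decoder. It is only an analogy under that assumption and says nothing about Notepad++'s internals:

```python
# The UTF-8 bytes of U+FEC1, reinterpreted as Windows-1252.
raw = "\uFEC1".encode("utf-8")  # b'\xef\xbb\x81'

# Strict decoding: 0x81 has no mapping in Python's cp1252 table,
# so the decoder raises instead of silently accepting the byte.
try:
    raw.decode("cp1252")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient decoding: the undefined byte becomes U+FFFD, the Unicode
# replacement character, so the problem stays visible in the text.
replaced = raw.decode("cp1252", errors="replace")

# For comparison, U+FEC9 ends in 0x89, which Windows-1252 does define
# (the per mille sign), so it decodes without any error.
neighbor = "\uFEC9".encode("utf-8").decode("cp1252")

print(strict_ok)
print([f"U+{ord(c):04X}" for c in replaced])
print([f"U+{ord(c):04X}" for c in neighbor])
```

Either of the first two behaviors makes the third byte visible; a tool that behaves like this would not leave an "invisible" 0x81 for the highlight trick to depend on.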

By the way, the reason why your original attempt ([\uFEC1]) failed is because, according to Notepad++'s regular expression syntax, \u means "an uppercase letter". (Remember that in regular expressions, brackets represent "any of these characters"). The docs further say, "See note about lower case [sic] letters," and the note about lowercase letters says "this will fall back on "a word character" if the "Match case" search option is off." As it is in your screenshot. Therefore, the regex [\uFEC1] is searching for "any word character, or F, or E, or C, or 1" -- which matches every single character in your sample text.
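
The effect of that character class can be imitated in Python. Note the assumption: Python's `re` module has no `\u` "uppercase letter" class (and would treat `\uFEC1` as a codepoint escape), so `\w` is used below to stand in for the "word character" fallback described above. It is an emulation of the behavior, not Notepad++'s actual regex engine:

```python
import re

# The sample text from the question: U+FEC1 and U+FEC9 mixed together.
text = "\uFEC1\uFEC1\uFEC9\uFEC1\uFEC9\uFEC1\uFEC9"

# Emulates how [\uFEC1] was interpreted: "any word character, or F, E, C, 1".
broad = re.findall(r"[\wFEC1]", text)

# Searches for the one codepoint U+FEC1 and nothing else.
exact = re.findall("\uFEC1", text)

print(len(broad), "matches for the character class")
print(len(exact), "matches for the single codepoint")
```

The character class matches all seven characters, while the single-codepoint pattern matches only the four occurrences of U+FEC1, which mirrors the difference between [\uFEC1] and \x{FEC1} in Notepad++.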

Phew, that turned out to be a very long answer for what I said would be "very simple". I hope this helps you understand Unicode a bit better; if so, the hour I spent typing this up will have been worth it.

rmunn

Posted 2015-10-12T13:29:00.100

Reputation: 301

5

Take a look: Anyone know how to use Regex in notepad++ to find Arabic characters?

Notepad++'s implementation of regular expressions requires that you use the

\x{NNNN}

notation to match Unicode characters.

In your example,

\x{FEC1} 

duDE

Posted 2015-10-12T13:29:00.100

Reputation: 14 097

Thanks, that works for UTF 16. Do you know if you can search with the UTF-8 code(that's the other part of my q)? – barlop – 2015-10-12T13:59:57.813

I didn't ask you if it works or doesn't work on your system. Notepad++ is Notepad++ so anything will work or not work for both of us. What I asked you (and it's in my question too) is if you can search with the UTF 8 code for that character, that is UTF-8 (hex) 0xEF 0xBB 0x81 (efbb81) – barlop – 2015-10-12T14:16:49.003

-1 i'll have to downvote you for not grasping this and not even understanding that you haven't understood the second half of the question. I've been pretty clear that I was also asking about UTF-8 and you haven't understood or made any effort to even see that you do not understand. It's one thing to not understand something, and to ask, it's another thing to not understand it and be completely oblivious to not understanding it. You insist that you answered my question, but I told you, there are two parts to it and you did not answer the second half(despite your insistence that you have). – barlop – 2015-10-12T14:27:58.763

I don't care about you downvoting - I just wanted to help you! – duDE – 2015-10-12T14:30:04.920

[If you are answering (which you did) THEN] you should just be trying to answer the question - and honestly. – barlop – 2015-10-12T14:30:27.803