Very strange gibberish strings with Cyrillic characters that appear in random websites and elsewhere

Question

In my work with online orders, I started noticing an extreme abnormality in a few orders. In one field that wasn't restricted there appeared a string of over 3 million characters that were totally gibberish consisting mostly of Cyrillic characters. On closer examination using Python, it turned out it was actually a list of over a thousand of such gibberish strings. I dug deeper and found more instances of that, the worst with a string of over 58 million characters consisting of over 18000 list elements.

So we have a string that consists of several lists of strings, those strings again consist of several gibberish words separated by non-breaking spaces.

An example (I added linebreaks for readability):

'Р В Р’В Р вЂ™Р’В Р В РІР‚в„ўР вЂ™Р’В Р В Р’В Р Р†Р вЂљРІвЂћСћР В РІР‚в„ўР
 вЂ™Р’В Р В Р’В Р вЂ™Р’В Р В Р вЂ Р В РІР‚С™Р Р†РІР‚С›РЎС›Р В Р’В Р Р†Р
 вЂљРІвЂћСћР В РІР‚в„ўР вЂ™Р’В Р В Р’В Р вЂ™Р’В Р В РІР‚в„ўР вЂ™Р’В Р В Р’В Р
 В РІР‚В Р В Р’В Р Р†Р вЂљРЎв„ўР В Р вЂ Р Р†Р вЂљРЎвЂєР РЋРЎвЂєР В Р’В Р
 вЂ™Р’В Р В Р вЂ Р В РІР‚С™Р Р†РІР‚С›РЎС›Р В Р’В Р Р†Р вЂљРІвЂћСћР В РІР‚в„ўР
 вЂ™Р’В Р В Р’В Р вЂ™Р’В Р В РІР‚в„ўР вЂ™Р’В Р В Р’В Р Р†Р вЂљРІвЂћСћР В
 РІР‚в„ўР вЂ™Р’В Р В Р’В Р вЂ™Р’В Р В Р’В Р Р†Р вЂљР’В Р В Р’В Р вЂ™Р’В Р В Р
 вЂ Р В РІР‚С™Р РЋРІвЂћСћР В Р’В Р В РІР‚В Р В Р вЂ Р В РІР‚С™Р РЋРІР‚С”Р В Р
 Р‹Р РЋРІР‚С”Р В Р’В Р вЂ™Р’В Р В РІР‚в„ўР вЂ™Р’В Р В Р’В Р В РІР‚В Р В Р’В Р
 Р†Р вЂљРЎв„ўР В Р вЂ Р Р†Р вЂљРЎвЂєР РЋРЎвЂєР В Р’В Р вЂ™Р’В Р В Р вЂ Р В
 РІР‚С™Р Р†РІР‚С›РЎС›Р В Р’В Р Р†Р вЂљРІвЂћСћР В РІР‚в„ўР вЂ™Р’В Р В Р’В Р вЂ

The following is a count of the 10 most common words in the 58 million character string:

Р                     2453256
В                     1926812
Р’В                    895699
вЂ™Р’В                 822674
Р†Р                    399677
РІР‚в„ўР               382349
вЂ                     235180
РІР‚С™Р                185503
вЂљРІвЂћСћР            177792
РІР‚В                  109266
Р†РІР‚С›РЎС›Р          101490

Now take e.g. the string "РІР‚в„ўР" and put it into google. I'm getting over a million seemingly random sites where those strings are inserted into the source code of the sites.

I have absolutely no idea what to make of this, does anyone know what this is?

The Р and В are cyrillic characters with the ASCII numbers 1056 and 1042 respectively, see https://asecuritysite.com/coding/asc2?val=1024%2C1280 — Khris, Aug 07 '19 at 12:32
A "string" is just a decoding of bytes into characters using some defined decoder like UTF-8 or UTF-16, and then a rendering of these characters into glyphs by your display program (i.e. your browser or editor or whatever). The bytes may not in fact be intended as characters. Displaying them as characters effectively throws away information. If they don't make sense as strings then examine them as bytes. — President James K. Polk, Aug 07 '19 at 12:52
I'd guess it's an attempted attack. Someone is trying to inject something into your site. Searching for a substring online reveals some sites which had that injected into their site. You should set a limit to how many characters are allowed in input fields. Maybe the amount is an attempt at a DOS attack? — 123, Aug 07 '19 at 13:39
The OP seems to say this is happening in -other- websites, and in that case I'd think something was wrong with their own pc.. — George M Reinstate Monica, Aug 08 '19 at 19:12
@GeorgeM Please put the string `РІР‚в„ўР` into google and report your results. — Khris, Aug 09 '19 at 05:27

score 1 · Answer 1 · edited Sep 03 '19 at 08:50

I was looking around websites with the same problem as you.

One of them is a french website and here is the text inside:

Mon banquier ne mР В Р вЂ Р В РІР‚С™Р Р†РІР‚С›РЎС›appelle plus pour mon dР В РІР‚СљР вЂ™Р’В©couvert, nous Р В РІР‚СљР вЂ™Р’В©changeons dorР В РІР‚СљР вЂ™Р’В©navant sur mes nouveaux projets

The non-alphanumeric characters (not a-z/A-Z) are replaced by the 'Cyrillic' characters. In this text there are ',é ...

In this case, it looks like an ASCII encoding problem where a multi-byte character is considered multiple uni-byte characters. So a two-byte character will become two one-byte characters.

I wandered around and found out that it could be linked to database encoding databases format. But I'm not an expert in databases and maybe someone with greater knowledge can complete the explanation.

So as @james-k-polk said, the characters you're seeing are either not supposed to be displayed as characters orz in my opinionz they are just badly converted from one format to another.

Good find. So the big question would be how é becomes "Р В РІР‚СљР вЂ™Р’В©". I'm not that knowledgeable about encodings, so I have no clue what mechanism could blow up a character so much. — Khris, Sep 03 '19 at 10:02

score 1 · Accepted Answer · answered Sep 04 '19 at 07:07

With @Deunis help I found out what is going on here.

When you take a special character that is represented by at least 2 bytes in utf8, then decode it as utf8 and encode it as cp1251 (cyrillic) it gets blown up. If you do that repeatedly the string becomes longer and longer showing the exact patterns observed on those websites. Here's an example Python code that reproduces those patterns:

def encode_decode(s,e1,e2):
    t = s.encode(e1)
    o = t.decode(e2)
    return o

e1 = "cp1251"
e2 = "utf_8"
char = 'ä'
iterations = 6

print(char)
print(40*'-')
for _ in range(iterations):
    char = encode_decode(char,e2,e1)
    print(char)
    print(40*'-')
for _ in range(iterations):
    char = encode_decode(char,e1,e2)
    print(char)
    print(40*'-')

This yields the output:

ä
----------------------------------------
Г¤
----------------------------------------
Р“В¤
----------------------------------------
Р вЂњР’В¤
----------------------------------------
Р В РІР‚СљР вЂ™Р’В¤
----------------------------------------
Р В Р’В Р Р†Р вЂљРЎС™Р В РІР‚в„ўР вЂ™Р’В¤
----------------------------------------
Р В Р’В Р вЂ™Р’В Р В Р вЂ Р В РІР‚С™Р РЋРЎв„ўР В Р’В Р Р†Р вЂљРІвЂћСћР В РІР‚в„ўР вЂ™Р’В¤
----------------------------------------
Р В Р’В Р Р†Р вЂљРЎС™Р В РІР‚в„ўР вЂ™Р’В¤
----------------------------------------
Р В РІР‚СљР вЂ™Р’В¤
----------------------------------------
Р вЂњР’В¤
----------------------------------------
Р“В¤
----------------------------------------
Г¤
----------------------------------------
ä
----------------------------------------

Where did you find the explanation of doing the iteration 6 times ? Did you recover something from your data ? — Deunis, Sep 05 '19 at 14:15
Doing it 6 times was just for the example, as you can see it's the most iterations where the result still fit a single line. I did try to recover the actual characters in my data but either most of them were just non-breaking spaces or the strings were at some point changed so that it wasn't possible to recover it anymore. — Khris, Sep 06 '19 at 05:08

Very strange gibberish strings with Cyrillic characters that appear in random websites and elsewhere

2 Answers2