While working with online orders, I noticed an extreme anomaly in a few of them. One field that had no length restriction contained a string of over 3 million characters of complete gibberish, consisting mostly of Cyrillic characters. On closer examination using Python, it turned out to be a list of over a thousand such gibberish strings. I dug deeper and found more instances, the worst being a string of over 58 million characters made up of over 18,000 list elements.

So we have a string that consists of several lists of strings, and each of those strings in turn consists of several gibberish words separated by non-breaking spaces.

An example (I added linebreaks for readability):

'Р В Р’ВР
 ’ Р В РІР‚в„ўР вР
 ‚™Р’В Р В Р’В Р Р
  вЂ Р В РІР‚љРІвЂћСћР Р
 ’ РІР‚™Р’ВР
 ’ Р В Р’ Р’РВ
 ’ Р Р†Р РР
 †Р вЂљРЎв„ўР В Р вЂ Р Р†Р вЂљРЎвЂєР Р
 ЋРЎвЂєР В Р’ Р’ РІРР
 ІР‚љРІвЂћСћР В РІРВ
 ‚™Р’В РРвЂ

The following is a count of the 10 most common words in the 58 million character string:

Р                     2453256
В                     1926812
Р’В                    895699
’В                 822674
ІР                   399677
РІР‚в„ўР               382349
†                    235180
‚Р              185503
‚в„ўР           177792
†                 109266
ІвЂћСћР         101490
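
A count like this can be produced along the following lines. This is only a minimal sketch, not necessarily the code used for the table above: the helper name `most_common_words` and the tiny `sample` string are made up for illustration, and it assumes the words are separated by ordinary or non-breaking spaces.

```python
from collections import Counter


def most_common_words(text, n=10):
    """Count space-separated tokens, treating NBSP (U+00A0) as a separator."""
    # Normalise non-breaking spaces to plain spaces before splitting.
    words = text.replace("\xa0", " ").split()
    return Counter(words).most_common(n)


# Hypothetical miniature input; the real string had tens of millions of characters.
sample = "Р\xa0В\xa0Р’В Р\xa0В"
for word, count in most_common_words(sample):
    print(f"{word:<20}{count}")
```

Replacing the non-breaking spaces explicitly keeps the splitting rule visible instead of relying on how `str.split()` classifies whitespace.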

Now take, for example, the string "РІР‚в„ўР" and search for it on Google. I get over a million seemingly random sites that have such strings embedded in their source code.

I have absolutely no idea what to make of this. Does anyone know what this is?

schroeder
Khris
  • Those are mostly not Cyrillic characters – mat Aug 07 '19 at 12:26
  • The Р and В are Cyrillic characters with the Unicode code points 1056 and 1042 respectively, see https://asecuritysite.com/coding/asc2?val=1024%2C1280 – Khris Aug 07 '19 at 12:32
  • A "string" is just a decoding of bytes into characters using some defined decoder like UTF-8 or UTF-16, and then a rendering of these characters into glyphs by your display program (i.e. your browser or editor or whatever). The bytes may not in fact be intended as characters. Displaying them as characters effectively throws away information. If they don't make sense as strings then examine them as bytes. – President James K. Polk Aug 07 '19 at 12:52
  • I'd guess it's an attempted attack. Someone is trying to inject something into your site. Searching for a substring online reveals some sites that had it injected. You should set a limit on how many characters are allowed in input fields. Maybe the sheer amount is an attempt at a DoS attack? – 123 Aug 07 '19 at 13:39
  • The OP seems to say this is happening on other websites, and in that case I'd think something was wrong with their own PC. – George M Reinstate Monica Aug 08 '19 at 19:12
  • @GeorgeM Please put the string `РІР‚в„ўР` into Google and report your results. – Khris Aug 09 '19 at 05:27

2 Answers

I looked at some websites with the same problem as yours.

One of them is a French website, and here is the text it contains:

Mon banquier ne m’appelle plus pour mon découvert, nous échangeons dorénavant sur mes nouveaux projets

The characters outside a-z/A-Z are replaced by the 'Cyrillic' characters. In this text those are characters such as ' and é.

In this case, it looks like a character encoding problem in which a multi-byte character is interpreted as several single-byte characters, so a two-byte character becomes two one-byte characters.
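
To illustrate the idea, here is a minimal sketch. It assumes the bytes are UTF-8 and that the wrong single-byte encoding is cp1251, which is a guess based on the Cyrillic look of the output. The helper name `misread` is made up for this example.

```python
def misread(text, wrong_encoding="cp1251"):
    """Encode text as UTF-8, then decode the bytes with a single-byte codec."""
    return text.encode("utf-8").decode(wrong_encoding)


# "é" and "’" need several bytes in UTF-8, while "a" needs only one.
for ch in ("é", "’", "a"):
    raw = ch.encode("utf-8")
    print(repr(ch), raw, "->", repr(misread(ch)))
```

Plain ASCII letters survive unchanged because they are a single byte with the same meaning in both encodings. Every multi-byte character turns into one wrong character per byte.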

I looked around and found that it could be related to the encoding used by a database. But I'm not an expert in databases, so maybe someone with more knowledge can complete the explanation.

So, as @james-k-polk said, the characters you're seeing are either not supposed to be displayed as characters or, in my opinion, they are just badly converted from one encoding to another.

schroeder
Deunis
  • Good find. So the big question would be how é becomes "Г©". I'm not that knowledgeable about encodings, so I have no clue what mechanism could blow up a character so much. – Khris Sep 03 '19 at 10:02
  • I found this [link](https://en.wikipedia.org/wiki/Mojibake) – Deunis Sep 03 '19 at 11:32

With @Deunis's help I found out what is going on here.

When you take a special character that is represented by at least 2 bytes in UTF-8, encode it as UTF-8, and then decode the resulting bytes as cp1251 (Cyrillic), it gets blown up into several characters. If you do that repeatedly, the string becomes longer and longer, showing the exact patterns observed on those websites. Here's some example Python code that reproduces those patterns:

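def encode_decode(s, e1, e2):
    # Encode the string with codec e1, then decode the bytes with codec e2.
    t = s.encode(e1)
    o = t.decode(e2)
    return o

e1 = "cp1251"
e2 = "utf_8"
char = 'ä'
iterations = 6

print(char)
print(40*'-')
# Garble: encode as UTF-8 and misread the bytes as cp1251; the string grows each round.
for _ in range(iterations):
    char = encode_decode(char, e2, e1)
    print(char)
    print(40*'-')
# Undo: encode as cp1251 and decode as UTF-8; each round reverses one garbling step.
for _ in range(iterations):
    char = encode_decode(char, e1, e2)
    print(char)
    print(40*'-')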

This yields the output:

ä
----------------------------------------
Г¤
----------------------------------------
Г¤
----------------------------------------
Г¤
----------------------------------------
Г¤
----------------------------------------
Г¤
----------------------------------------
Г¤
----------------------------------------
Г¤
----------------------------------------
Г¤
----------------------------------------
Г¤
----------------------------------------
Г¤
----------------------------------------
Г¤
----------------------------------------
ä
----------------------------------------
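
The reverse direction can also be tried without knowing the number of rounds in advance. The following is only a sketch under the assumption that the damage really came from repeated UTF-8 to cp1251 misdecoding. The function name `unscramble` and the `max_rounds` limit are made up for illustration.

```python
def unscramble(text, wrong="cp1251", right="utf-8", max_rounds=20):
    """Undo repeated UTF-8 -> cp1251 misdecoding until no further round succeeds."""
    rounds = 0
    while rounds < max_rounds:
        try:
            candidate = text.encode(wrong).decode(right)
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # the text no longer looks like mis-decoded UTF-8
        if candidate == text:
            break  # pure ASCII round-trips unchanged; nothing left to undo
        text = candidate
        rounds += 1
    return text, rounds


# Garble a character four times, then let the loop find its way back.
garbled = "ä"
for _ in range(4):
    garbled = garbled.encode("utf-8").decode("cp1251")
print(len(garbled))
print(unscramble(garbled))
```

The loop stops as soon as a round raises an encoding or decoding error, so a string that was modified after being garbled may only be partially recovered or not at all.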
Khris
  • Where did you find the explanation for doing the iteration 6 times? Did you recover something from your data? – Deunis Sep 05 '19 at 14:15
  • Doing it 6 times was just for the example; as you can see, it's the largest number of iterations where the result still fits on a single line. I did try to recover the actual characters in my data, but either most of them were just non-breaking spaces or the strings had been altered at some point so that recovery was no longer possible. – Khris Sep 06 '19 at 05:08