I am practicing some malware detection basics, and I noticed that almost none of the traditional string extraction tools detect Cyrillic text.

Source Code

    while (strcmp(password, user_input) != 0)
    {
        printf("Guess the password | Введите пароль::\n");
        scanf("%30s", user_input);
    }

BinText Output

[Screenshot: BinText output]

FLOSS Output

    Guess the password | 
    %30s
    Right!!
    The flag was: %s
    Unknown error
    _matherr(): %s in %s(%g, %g)  (retval=%g)

Detect It Easy Output

[Screenshot: Detect It Easy output]

I understand that the main problem is the UTF-8 encoding.

Binary Ninja can display these strings if the "Cyrillic" Unicode block is selected beforehand.

[Screenshot: Binary Ninja string settings]

"strings" (the tool) can do it too, with "strings -e S $binary" (the S encoding selects single 8-bit-byte characters instead of the default 7-bit ones):

    Guess the password | Введите пароль::
    %30s
    ¡¡Right!!
    The flag was: %s
    Unknown error
    _matherr(): %s in %s(%g, %g)  (retval=%g)

What caught my attention is that the vast majority of tools truncate such strings. FLOSS, for example, reports a string such as "Bye bye братка" as "Bye bye ", which in the field of malware analysis can be a very serious false negative, in my humble opinion.

I have written a YARA rule that hunts for the string at the hex level, and it works, but I would like to know whether there is an approach suited to the basic static analysis phase that solves this problem.
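To illustrate what "at the hex level" means here, the following is a minimal Python sketch (not the actual rule, which is not shown in this post) that prints the byte sequence a hex pattern for "братка" would need to contain. The helper name `hex_pattern` is hypothetical. It covers UTF-8 and, for comparison, UTF-16LE, since a binary could carry either encoding.

```python
def hex_pattern(text, encoding):
    """Return the encoded bytes of `text` as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


needle = "братка"

# UTF-8: every Cyrillic letter here takes two bytes (lead byte D0 or D1).
print(hex_pattern(needle, "utf-8"))
# D0 B1 D1 80 D0 B0 D1 82 D0 BA D0 B0

# UTF-16LE: every letter is a 16-bit code unit, low byte first.
print(hex_pattern(needle, "utf-16-le"))
# 31 04 40 04 30 04 42 04 3A 04 30 04
```

Either line can be placed between braces as a YARA hex string. The obvious limitation is that this only finds strings that are already known, which is exactly the problem during triage.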

I would also appreciate it if someone could share a reference or some information about why this functionality is absent from several of the most widely used tools. Or am I missing something?

Does it have to do with the variable byte width of UTF-8 and the difficulty of detecting it if you don't know specifically what you are looking for?
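To make the question concrete, here is a rough Python sketch of the difference as I understand it. It is my own assumption about how an ASCII-only scanner behaves, not the real implementation of FLOSS or any other tool. The names `ASCII_RUN`, `UTF8_RUN`, `ascii_strings` and `utf8_strings` are hypothetical. The second pattern additionally accepts well-formed two-byte UTF-8 sequences (U+0080 to U+07FF), a range that contains the Cyrillic block.

```python
import re

# Runs of 4+ printable ASCII bytes, roughly what an ASCII-only scanner accepts.
ASCII_RUN = re.compile(rb"[\x20-\x7e]{4,}")

# Printable ASCII or a well-formed 2-byte UTF-8 sequence
# (lead byte C2..DF followed by one continuation byte 80..BF).
UTF8_RUN = re.compile(rb"(?:[\x20-\x7e]|[\xc2-\xdf][\x80-\xbf]){4,}")


def ascii_strings(data):
    """Extract printable ASCII runs from raw bytes."""
    return [m.group().decode("ascii") for m in ASCII_RUN.finditer(data)]


def utf8_strings(data):
    """Extract runs of ASCII mixed with 2-byte UTF-8 characters."""
    return [m.group().decode("utf-8") for m in UTF8_RUN.finditer(data)]


# Fake binary content: the string surrounded by non-printable bytes.
sample = b"\x00\x01" + "Bye bye братка".encode("utf-8") + b"\x00\xff\x10"

print(ascii_strings(sample))  # ['Bye bye ']
print(utf8_strings(sample))   # ['Bye bye братка']
```

The ASCII-only pattern stops at the first byte above 0x7E, which reproduces the truncation. The wider pattern recovers the full string, but it would also match any random byte pair that happens to look like a two-byte sequence, so I would expect more noise on real binaries.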

Thanks a lot.

  • The BinText version I found explicitly mentions that it only detects ASCII and double-byte Unicode (UTF-16), which is no surprise, as it focuses on Windows resources and those are the common encodings for resources. Detecting and processing Unicode characters (not only UTF-8) is pretty complicated, so a lot of tools only support a subset. I think it just depends on what binaries the developers of these tools were facing. – Robert Jan 18 '22 at 10:57
