You are trying to achieve something impossible. If it were that easy, web malware would have died out decades ago.
If you want to use mathematical tools to track malicious JavaScript code, you first need to know which features JavaScript malware employs. Once you understand these features, you will see that it is impossible to capture anything meaningful in one or several mathematical equations. So let's take a look at the common features of JavaScript attacks:
- Server side polymorphism
Literally meaning "many shapes", polymorphism is a technique malware authors use to evade signature-based detectors. Polymorphism is called server-side when the engine that produces many different copies of the malware is hosted on a compromised web server (Server-Side Polymorphism: Crime-Ware as a Service Model (CaaS)). Simulated Metamorphic Encryption Generator (SMEG) version 1.0 was the first engine developed to implement the notion of polymorphism for computer viruses, in the early 1990s (Parallel analysis of polymorphic viral code using automated deduction system).
- Code obfuscation
The other common feature of malicious JavaScript code is obfuscation, which is nearly always used. This common factor does not make things any simpler, because innocuous JavaScript code also uses obfuscation (for instance, some developers do not want their own JavaScript functions to be understood by others, since the HTML and JS source of a page is easy to read). Along with server-side polymorphism, code obfuscation is a technique widely used by malware authors to circumvent antivirus scanners. A myriad of techniques can be used to obfuscate JavaScript code, such as string reversing, Unicode and Base64 encoding, string splitting, and document object model (DOM) interaction (Malware with your Mocha? Obfuscation and anti-emulation tricks in malicious JavaScript).
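To make the string-level tricks concrete, here is a minimal sketch applied to a harmless string. The helper names (`reverseString`, `toUnicodeEscapes`, `fromUnicodeEscapes`, `splitString`) are made up for this illustration, and real samples usually stack several such layers on top of each other.

```javascript
// Harmless stand-in for a payload: an arithmetic expression kept as text.
const payload = "6 * 7";

// 1. String reversing: store the text backwards, restore it at run time.
function reverseString(s) {
  return s.split("").reverse().join("");
}

// 2. Unicode encoding: write every character as a \uXXXX escape sequence.
function toUnicodeEscapes(s) {
  return s
    .split("")
    .map((ch) => "\\u" + ch.charCodeAt(0).toString(16).padStart(4, "0"))
    .join("");
}

// Decode \uXXXX escape sequences back into plain text.
function fromUnicodeEscapes(s) {
  return s.replace(/\\u([0-9a-fA-F]{4})/g, (_, hex) =>
    String.fromCharCode(parseInt(hex, 16))
  );
}

// 3. String splitting: break the text into chunks that are joined later.
function splitString(s, size) {
  const chunks = [];
  for (let i = 0; i < s.length; i += size) {
    chunks.push(s.slice(i, i + size));
  }
  return chunks;
}

const reversed = reverseString(payload);   // "7 * 6"
const escaped = toUnicodeEscapes(payload); // the five characters as \uXXXX escapes
const chunks = splitString(payload, 2);    // ["6 ", "* ", "7"]
```

None of the three stored forms contains the original text literally, which is exactly what defeats a naive signature match.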
- Code unfolding
Code unfolding is the mechanism by which new code is introduced at run time. In JavaScript, this is done by invoking functions like document.write() and eval() in order to execute obfuscated portions of code and functions (Weaknesses in Defenses Against Web-Borne Malware).
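A minimal sketch of code unfolding with eval(), using a harmless expression in place of a payload. The names `pieces` and `unfold` are hypothetical; the point is only that the executed text never appears literally in the source.

```javascript
// The "hidden" code is stored reversed and in pieces, so the source never
// contains the final expression literally. The expression is harmless.
const pieces = ["7 *", " 6"];

// Rebuild the real source text at run time.
function unfold(parts) {
  return parts.join("").split("").reverse().join("");
}

const source = unfold(pieces); // "6 * 7"

// eval() turns the rebuilt string into running code. A scanner that only
// reads the static text never sees the expression "6 * 7".
const result = eval(source); // 42

// In a browser, document.write() can inject a script element built the same
// way. It is left out here because it needs a DOM.
```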
- Heap spray
This attack mainly targets web browsers. The attacker's script fills the browser's heap with many copies of shellcode, so that when a memory corruption vulnerability diverts execution to a heap address, it is likely to land in attacker-controlled data and result in remote code execution (BuBBle: A Javascript Engine Level Countermeasure against Heap-Spraying Attacks).
- Drive-by download
Drive-by download attacks consist of downloading and executing or installing malicious programs without the user's consent. Such attacks exploit vulnerabilities in browsers, in their add-ons or plugins such as ActiveX controls, or in unpatched software such as Acrobat Reader and Adobe Flash Player (Drive-by download attacks: effect and detection methods, MSc Information Security).
- Multi execution paths
It is possible to trigger an action only if certain conditions are fulfilled. Such conditions could be the arrival of a given date or the existence of a file on the system on which the malware is intended to run. Another well-known example is a denial of service attack that must be fired only once the number of nodes in the botnet has reached a certain value. That is the notion of multiple execution paths (Exploring Multiple Execution Paths for Malware Analysis).
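The three triggers just mentioned can be sketched as follows. Everything here is hypothetical (the `env` object, its field names, and the returned strings stand in for real behavior), and forcing inputs by hand is only a crude approximation of what an analysis system that explores several paths does.

```javascript
// Hypothetical environment snapshot. A real sample would query the clock,
// the file system, or its botnet controller instead of a plain object.
const baseEnv = {
  now: Date.UTC(2020, 0, 1),
  triggerDate: Date.UTC(2021, 0, 1),
  existingFiles: ["notes.txt"],
  markerFile: "marker.dat",
  botnetSize: 10,
  minBotnetSize: 1000,
};

function chooseAction(env) {
  // Path 1: date trigger.
  if (env.now >= env.triggerDate) {
    return "date-triggered action";
  }
  // Path 2: file-existence trigger.
  if (env.existingFiles.includes(env.markerFile)) {
    return "file-triggered action";
  }
  // Path 3: botnet-size trigger.
  if (env.botnetSize >= env.minBotnetSize) {
    return "botnet-triggered action";
  }
  // Default path: behave innocuously.
  return "idle";
}

// A single run in the base environment only ever observes the default path.
const singleRun = chooseAction(baseEnv); // "idle"

// Exploring several paths: re-run with each condition forced in turn.
const forcedEnvs = [
  baseEnv,
  { ...baseEnv, now: baseEnv.triggerDate },
  { ...baseEnv, existingFiles: ["marker.dat"] },
  { ...baseEnv, botnetSize: baseEnv.minBotnetSize },
];
const observedPaths = forcedEnvs.map((env) => chooseAction(env));
```

A detector that executes the script once sees only "idle"; the interesting behavior sits on paths that a single run never takes.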
- Implicit conditionals
This technique is mainly used against dynamic detectors. The main idea is to execute a set of instructions while hiding the condition that triggers them (Weaknesses in Defenses Against Web-Borne Malware).
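Here is a small, harmless sketch of the idea: the same decision written once as an ordinary branch and twice without any if statement. The function names and the value 3 are arbitrary choices for illustration, not taken from the cited paper.

```javascript
const log = [];
function hiddenAction() {
  log.push("action ran");
}
function noop() {}

// Explicit version: the condition is visible to an analyzer as a branch.
function explicit(n) {
  if (n === 3) hiddenAction();
}

// Implicit version 1: the comparison result indexes a table of functions,
// so there is no if statement to instrument.
function viaLookup(n) {
  [noop, hiddenAction][Number(n === 3)]();
}

// Implicit version 2: the condition is encoded as an exception.
function viaException(n) {
  try {
    // When n === 3 the index is 1, the slot is undefined, and the property
    // access throws; the catch block plays the role of the "then" branch.
    [{}][Number(n === 3)].touch;
  } catch (e) {
    hiddenAction();
  }
}

viaLookup(2);    // nothing happens
viaLookup(3);    // hiddenAction runs
viaException(2); // nothing happens
viaException(3); // hiddenAction runs
```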
Given these common features and tactics used by JavaScript malware, if you want to detect this type of malware as you asked, you first need to study the state of the art in detection methods. Various methods have been developed to detect web (JavaScript) malware. We can divide them into two main categories:
- Machine learning based classifiers
- Features: Distinguishing features are extracted from HTML and JavaScript code, then used to train a machine learning classifier. The premise of this approach is that malicious webpages are likely to differ from benign ones (Thesis: Effective Analysis, Characterization, and Detection of Malicious Web Pages).
- Advantages: Lightweight approach, useful for analyzing websites in bulk.
- Drawbacks: Ineffective against obfuscated JavaScript code and useless against new malicious code patterns or zero-day attacks.
- Dynamic methods
- Features: Based on dynamic behavior analysis, these techniques are implemented using either proxies, where a page is rendered to the visitor only after its safety has been checked, or a sandboxing environment relying on honeyclients (Same thesis: Effective Analysis, Characterization, and Detection of Malicious Web Pages).
- Advantages: Efficient against zero-day attacks and obfuscated code.
- Drawbacks: Resource- and time-consuming. Sandboxing environments rely on low-interaction honeyclients, which are themselves based on virus signatures and thus suffer from the same disadvantages as static methods.
What you have tried to do belongs to the first category.
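To give you a feel for that first category, here is a rough sketch of static feature extraction. The feature set (`evalCalls`, `documentWriteCalls`, `longestStringLiteral`, `symbolRatio`) is an assumption chosen for illustration, not the feature set of any particular tool; the resulting object is what you would feed to a classifier.

```javascript
// Hypothetical feature extractor working on raw source text only.
function extractFeatures(source) {
  // Count non-overlapping matches of a global regular expression.
  const count = (re) => (source.match(re) || []).length;

  // Quoted string literals; long ones often hide encoded payloads.
  const strings = source.match(/"[^"]*"|'[^']*'/g) || [];
  const longest = strings.reduce((max, s) => Math.max(max, s.length - 2), 0);

  // Characters that are neither alphanumeric nor whitespace.
  const symbols = count(/[^A-Za-z0-9\s]/g);

  return {
    length: source.length,
    evalCalls: count(/\beval\s*\(/g),
    documentWriteCalls: count(/\bdocument\.write\s*\(/g),
    longestStringLiteral: longest,
    symbolRatio: source.length === 0 ? 0 : symbols / source.length,
  };
}

const sample = 'var s = "abc"; eval(s); document.write("x");';
const features = extractFeatures(sample);
```

Notice that every one of these features is computed on the text as delivered, which is why obfuscation and code unfolding hurt this category so badly.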
Now that you are well informed about all this, it can be useful to study some of the available tools dedicated to this purpose before implementing your own technique. Let me mention a few important tools among many others:
- Zozzle
ZOZZLE relies on Bayesian classification of features drawn from the JavaScript abstract syntax tree (AST). It is legitimately classified as a mostly static web malware detector because it embeds another engine that supervises JavaScript code execution at run time. Its authors claim that it has a very low false positive rate of 0.0003% and is able to process over one megabyte of HTML and JavaScript code per second. This tool is intended to be used as a browser plugin; its aim is to protect browsers against heap spray attacks.
How does ZOZZLE operate? The following figure summarizes its core (ZOZZLE: Fast and Precise In-Browser JavaScript Malware Detection):
Extraction and labeling phase: The classifier needs training data, which is extracted from obfuscated JavaScript code. Instead of developing an efficient de-obfuscation technique, calls to the Compile function are intercepted. The Compile function is located in the jscript.dll library. This is a smart way to obtain plain JavaScript code because Compile is called each time a <SCRIPT> or <IFRAME> tag is processed, or eval() or document.write() is called, and each such call also defines a code context. Each code context is saved to the hard drive for further analysis.
Feature selection: The JavaScript AST of each labeled code context is used to derive features that indicate whether the context is benign or malicious. The features are pre-selected with a chi-squared (χ²) test, where:
- A: malicious contexts with the feature
- B: benign contexts with the feature
- C: malicious contexts without the feature
- D: benign contexts without the feature
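As a sketch of how such a pre-selection can be computed from the four counts, the code below assumes the textbook chi-squared statistic for a 2x2 contingency table, N(AD - BC)² / ((A+B)(C+D)(A+C)(B+D)) with N = A+B+C+D. That form is my assumption; the exact expression in the ZOZZLE paper may be written differently. The names `chiSquared`, `isPredictive`, and `CRITICAL_VALUE` are hypothetical.

```javascript
// Textbook chi-squared statistic for a 2x2 contingency table (an assumption,
// see the note above). a..d correspond to the counts A..D.
function chiSquared(a, b, c, d) {
  const n = a + b + c + d;
  const denominator = (a + b) * (c + d) * (a + c) * (b + d);
  if (denominator === 0) return 0; // an empty row or column carries no signal
  const diff = a * d - b * c;
  return (n * diff * diff) / denominator;
}

// Approximate 99.9% critical value of chi-squared with one degree of freedom.
const CRITICAL_VALUE = 10.83;

function isPredictive(a, b, c, d) {
  return chiSquared(a, b, c, d) >= CRITICAL_VALUE;
}

// Feature seen in 50 of 100 malicious contexts and 5 of 100 benign ones.
const strong = chiSquared(50, 5, 50, 95); // about 50.78

// Feature seen equally often in both classes.
const useless = chiSquared(10, 10, 90, 90); // 0
```

A feature that shows up equally often in both classes scores 0 and is dropped; one that is heavily skewed toward malicious contexts scores far above the threshold and is kept.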
Classification: A Bayesian classifier is used because, even if it seems dated, in practice it gives good results and is not time-consuming.
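To show why a Bayesian classifier is cheap, here is a minimal naive Bayes sketch over boolean features. It is a simplification under stated assumptions: ZOZZLE's real features come from the AST, whereas here each feature is just a flat string, and the tiny training set, the labels, and the function names `train` and `classify` are invented for illustration.

```javascript
// Train a naive Bayes model on samples of the form
// { label: "malicious" | "benign", features: ["eval", ...] }.
function train(samples, vocabulary) {
  const model = { vocabulary, classes: {} };
  for (const label of ["malicious", "benign"]) {
    const docs = samples.filter((s) => s.label === label);
    const featureProb = {};
    for (const f of vocabulary) {
      const hits = docs.filter((s) => s.features.includes(f)).length;
      // Laplace smoothing keeps every probability strictly between 0 and 1.
      featureProb[f] = (hits + 1) / (docs.length + 2);
    }
    model.classes[label] = {
      prior: (docs.length + 1) / (samples.length + 2),
      featureProb,
    };
  }
  return model;
}

// Return the label with the highest posterior score.
function classify(model, features) {
  let best = null;
  let bestScore = -Infinity;
  for (const [label, cls] of Object.entries(model.classes)) {
    // Sum log-probabilities instead of multiplying, to avoid underflow.
    let score = Math.log(cls.prior);
    for (const f of model.vocabulary) {
      const p = cls.featureProb[f];
      score += Math.log(features.includes(f) ? p : 1 - p);
    }
    if (score > bestScore) {
      bestScore = score;
      best = label;
    }
  }
  return best;
}

const vocabulary = ["eval", "unescape", "getElementById"];
const trainingSet = [
  { label: "malicious", features: ["eval", "unescape"] },
  { label: "malicious", features: ["eval"] },
  { label: "benign", features: ["getElementById"] },
  { label: "benign", features: [] },
];

const model = train(trainingSet, vocabulary);
const verdict = classify(model, ["eval", "unescape"]); // "malicious"
```

Training is a single counting pass and classification is one sum per class, which is why this kind of classifier is fast enough to run inside a browser.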
- Prophiler
Prophiler follows the static approach to detecting web malware. It combines static analysis of features of the HTML and JavaScript code, including uniform resource locators (URLs). It then uses machine learning techniques to train a classifier that decides whether a webpage embeds malicious content or not. Prophiler does not analyze suspicious webpages any further itself; instead, it forwards them to third-party tools such as Wepawet (Prophiler: A Fast Filter for the Large-Scale Detection of Malicious Web Pages).
- SpyProxy
SpyProxy follows the dynamic analysis principles. It monitors the active content of webpages within a virtual machine before deciding to render them to the visitor or not. The architecture of SpyProxy is illustrated through this figure (SpyProxy: Execution-based Detection of Malicious Web Content):
- IceShield
ICESHIELD performs in-line dynamic code analysis, using a set of heuristics to identify attack attempts. Its authors take an inventory of the attacks that usually target the DOM properties of a website and that are performed by injecting JavaScript into the website's source code. ICESHIELD supervises the running JavaScript code by predefining a set of rules related to function calls and applying heuristics to them in order to determine whether the script is malicious or not (IceShield: Detection and Mitigation of Malicious Websites with a Frozen DOM).
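The general idea of supervising function calls with heuristic rules can be sketched like this. It is written in the spirit of that approach and is not IceShield's actual code: `fakeDocument`, `guard`, the two rules, and the use of Object.defineProperty to lock the wrapper in place are all my own assumptions for illustration.

```javascript
// Hypothetical stand-in for a DOM object, so the sketch runs without a browser.
const fakeDocument = {
  written: [],
  write(html) {
    this.written.push(html);
  },
};

const alerts = [];

// Replace target[name] with a wrapper that applies heuristic rules to the
// arguments before delegating to the original function.
function guard(target, name, rules) {
  const original = target[name];
  const wrapper = function (...args) {
    for (const rule of rules) {
      if (rule.test(args)) alerts.push({ call: name, rule: rule.id });
    }
    return original.apply(this, args);
  };
  // Lock the wrapper in place so later scripts cannot swap it back.
  Object.defineProperty(target, name, {
    value: wrapper,
    writable: false,
    configurable: false,
    enumerable: true,
  });
}

// Hypothetical heuristics: a zero-sized iframe and an unusually long argument.
const rules = [
  {
    id: "hidden-iframe",
    test: ([html]) =>
      /<iframe[^>]*(width|height)\s*=\s*["']?0/i.test(String(html)),
  },
  {
    id: "very-long-argument",
    test: ([html]) => String(html).length > 10000,
  },
];

guard(fakeDocument, "write", rules);

fakeDocument.write("<p>hello</p>");
fakeDocument.write(
  '<iframe src="http://example.invalid/" width="0" height="0"></iframe>'
);
```

After these two calls, `alerts` holds a single entry for the hidden iframe, while both strings still reach the original write function. A real tool would decide at that point whether to block the call or merely report it.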