9

I've collected a large number of "Web Shell by oRb" (a.k.a. "FilesMan" backdoor, a.k.a. antichat backdoor) files by running a WordPress honeypot, and searching pastebin. The code in the variants is obviously related. I'd like to figure out a "phylogeny" of WSO variants. I found D. Gusfield, Efficient algorithms for inferring evolutionary trees, which has a clearer explanation of the phylogeny generating algorithm in Caroline Uhler's Finding a Perfect Phylogeny.

Unfortunately, the folks who created some of the WSO variants borrowed code from other variants, so a "perfect" phylogeny doesn't exist - some variants have two or more immediate ancestors due to the malware writer equivalent of "horizontal gene transfer" as practiced by real life bacteria. The only thing I've been able to find about phylogenies that are directed acyclic graphs, is Constructing Computer Virus Phylogenies. The problem with that paper is that it doesn't seem to actually show you an algorithm - either no algorithm is present, or it's buried in the assorted lemmas, theorems, definitions and proofs.

Are there other papers that actually display an algorithm or pseudocode for an algorithm that can generate "phyloDAGs"? I'd settle for an implementation in a major programming language, even.

Bruce Ediger
  • 4,552
  • 2
  • 25
  • 26
  • 5
    When you write it, **please** come back here and answer your own question with a link to your code! – JaimeCastells Nov 13 '15 at 23:07
  • 1
    Have you tried to use Levenshtein distance (example on [StackOverflow](http://stackoverflow.com/questions/21511801/text-clustering-with-levenshtein-distances/21513338#21513338)) or something similar? – A.L Nov 18 '15 at 23:11
  • @A.L - I've used Normalized Compression Distance, and Neighbor Joinging to find phylogenies. Unfortunately, due to code chunks appearing in multiple versions, WSO doesn't have a perfect phylogeny. The biological assumptions of a single common ancestor don't hold, so if you look at actual code, Neighbor Joining based on NCD gives a false phylogeny. – Bruce Ediger Nov 18 '15 at 23:15

1 Answers1

3

Yes. You are looking for Ancestral Recombination Graph (ARG) algorithms. ARGs are the natural extension of phylogenies (trees) to DAGs. ARG algorithms tend to be exponential algorithms, while phylogeny algorithms are polynomial. This means that the dag problem appears to be significantly more difficult than the tree problem.

Please see the book:

D. Gusfield. Recombinatorics. MIT Press. 2014.

https://mitpress.mit.edu/books/recombinatorics