Chuck Leaver – The Importance Of Edit Distance, Part One

Written By Jesse Sampson And Presented By Chuck Leaver CEO Ziften


Why do attackers keep using the same techniques over and over? The simple answer is that they still work. For example, Cisco’s 2017 Cyber Security Report tells us that after years of decline, spam email with malicious attachments is once again on the rise. In that traditional attack vector, malware authors typically mask their activities by using a filename similar to a common system process.

There is not necessarily a relationship between a file’s path name and its contents: anyone who has tried to hide sensitive information by giving it a boring name like “taxes”, or changed the extension on a file attachment to get around email rules, knows this principle. Malware authors know it too, and will often name malware to resemble common system processes. For example, “iexplore.exe” is Internet Explorer, but “iexplorer.exe” with an extra “r” could be anything. It’s easy even for experts to overlook this small difference.

The opposite problem, known .exe files running in unusual locations, is easy to solve using SQL sets and string functions.
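
The post does this in SQL; as a rough illustration of the same idea in Python, the sketch below flags paths whose file name is known but whose directory is not on an expected list. The `EXPECTED_DIRS` allow-list, the function name, and the sample paths are all hypothetical and are not taken from Ziften's product.

```python
from pathlib import PureWindowsPath

# Hypothetical allow-list: executable name -> directories where it normally lives.
EXPECTED_DIRS = {
    "svchost.exe": {r"c:\windows\system32", r"c:\windows\syswow64"},
    "explorer.exe": {r"c:\windows", r"c:\windows\syswow64"},
}


def unusual_locations(paths):
    """Return observed paths whose file name is known but whose directory is unexpected."""
    flagged = []
    for raw in paths:
        path = PureWindowsPath(raw.lower())      # Windows paths are case-insensitive
        expected = EXPECTED_DIRS.get(path.name)  # None when the name is not on the list
        if expected is not None and str(path.parent) not in expected:
            flagged.append(raw)
    return flagged


# Made-up sample data for illustration only.
observed = [
    r"C:\Windows\System32\svchost.exe",
    r"C:\Users\alice\AppData\Local\Temp\svchost.exe",
    r"C:\Windows\explorer.exe",
]
print(unusual_locations(observed))
```

Names that are not on the allow-list are ignored here, which is why the near-miss case needs a different tool.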


What about the other case, finding close matches to the executable name? Most people begin hunting for near string matches by sorting data and visually scanning for discrepancies. This usually works well for a small data set, perhaps even a single system. Finding these patterns at scale, however, requires an algorithmic approach. One established technique for “fuzzy matching” is to use edit distance: the minimum number of single-character edits needed to turn one string into another.
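
To make the idea concrete, here is a minimal Python sketch of edit distance. It assumes the Levenshtein definition (insertions, deletions, and substitutions each cost 1), which is the most common one; the post does not say which variant is used in production, and the function name is illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic programming table."""
    if len(a) < len(b):
        a, b = b, a                      # keep the inner row as short as possible
    previous = list(range(len(b) + 1))   # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]


print(edit_distance("iexplore.exe", "iexplorer.exe"))
```

A name that differs from a well-known process by one or two edits is exactly the kind of near match that is hard to spot by eye.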

What’s the best approach to calculating edit distance? For Ziften, our technology stack includes HP Vertica, which makes this task easy. The web is full of data scientists and data engineers singing Vertica’s praises, so it will suffice to say that Vertica makes it easy to create custom functions that take advantage of its power, from C++ power tools to statistical modeling scalpels in R and Java.

This Git repo is maintained by Vertica enthusiasts working in industry. It’s not an official offering, but the Vertica team is certainly aware of it, and moreover is thinking every day about how to make Vertica better for data scientists: a great space to watch. Best of all, it includes a function to calculate edit distance! There are also some other tools for natural language processing here, like word stemmers and tokenizers.

By using edit distance on the top executable paths, we can quickly find the closest match to each of our top hits. This is an interesting data set because we can sort by distance to find the closest matches over the entire data set, or we can sort by frequency of the top path to see what is the closest match to our most commonly used processes. This data can also surface on contextual “report card” pages, to show, e.g., the top five closest strings for a given path. Below is a toy example to give a sense of usage, based on real data ZiftenLabs observed in a customer environment.


Setting a threshold of 0.2 seems to find good results in our experience, but the takeaway is that these can be adjusted to fit individual use cases. Did we find any malware? We notice that “teamviewer_.exe” (should be just “teamviewer.exe”), “iexplorer.exe” (should be “iexplore.exe”), and “cvshost.exe” (should be “svchost.exe”, unless perhaps you work for CVS pharmacy…) all look odd. Since we’re already in our database, it’s also trivial to get the associated MD5 hashes, Ziften suspicion scores, and other attributes to do a deeper dive.
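
The post does not say how the 0.2 threshold is normalized. The Python sketch below assumes the Levenshtein distance divided by the length of the longer name, and flags names that are close to, but not identical to, a known process name. The function names and the `known` list are hypothetical; the three odd names come from the text above.

```python
def levenshtein(a, b):
    """Levenshtein distance using a single rolling row."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        diag, row[0] = row[0], i  # diag holds the value up and to the left
        for j, cb in enumerate(b, 1):
            diag, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1, diag + (ca != cb))
    return row[-1]


def normalized(a, b):
    """Assumed normalization: distance divided by the longer string's length."""
    return levenshtein(a, b) / max(len(a), len(b), 1)


def near_misses(observed, known, threshold=0.2):
    """Pair each unknown name with its closest known name if it is within the threshold."""
    hits = []
    for name in observed:
        if name in known:
            continue                      # exact matches are not suspicious here
        best = min(known, key=lambda k: normalized(name, k))
        score = normalized(name, best)
        if score <= threshold:
            hits.append((name, best, round(score, 3)))
    return sorted(hits, key=lambda h: h[2])  # closest matches first


# Hypothetical list of expected process names.
known = ["teamviewer.exe", "iexplore.exe", "svchost.exe", "chrome.exe"]
observed = ["teamviewer_.exe", "iexplorer.exe", "cvshost.exe", "chrome.exe", "notepad.exe"]
for name, closest, score in near_misses(observed, known):
    print(name, "->", closest, score)
```

Under this assumed normalization, all three odd names fall below 0.2, while an unrelated name such as “notepad.exe” does not. A lower threshold would trade missed lookalikes for fewer false positives.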


In this particular real-life environment, it turned out that teamviewer_.exe and iexplorer.exe were portable applications, not known malware. We helped the customer with further investigation of the user and system where we observed the portable applications, since use of portable apps on a USB drive could be evidence of suspicious activity. The more troubling find was cvshost.exe. Ziften’s intelligence feeds indicate that this is a suspect file. Looking up the MD5 hash for this file on VirusTotal confirms the Ziften data, indicating that this is a potentially serious Trojan infection that may be part of a botnet or doing something even more malicious. Once the malware was found, however, it was easy to resolve the problem and make sure it stays resolved using Ziften’s ability to kill and persistently block processes by MD5 hash.

Even as we develop advanced predictive analytics to detect malicious patterns, it is important that we continue to improve our ability to hunt for known patterns and old tricks. Just because new threats emerge doesn’t mean the old ones go away!

If you liked this post, watch this space for part 2 of this series, where we will apply this approach to hostnames to detect malware droppers and other malicious websites.