Chuck Leaver – The Importance Of Edit Distance, Part Two

Written By Jesse Sampson And Presented By Chuck Leaver CEO Ziften

 

In the first post about edit distance, we looked at hunting for malicious executables with edit distance (i.e., how many character edits it takes to make two text strings match). Now let’s look at how we can use edit distance to hunt for malicious domains, and how we can build edit distance features that can be combined with other domain name features to pinpoint suspicious activity.
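As a refresher, here is a minimal sketch of the edit distance idea in Python. It uses the classic Levenshtein definition (insertions, deletions, and substitutions each cost one edit); the function name edit_distance is chosen for illustration, and Part 1 or a given database function may count edits slightly differently.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions needed to turn a into b."""
    # previous[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("google", "gooogle"))  # one inserted character
```

A domain that is a single edit away from a popular one, such as an extra repeated letter, is exactly the kind of near match the rest of this post looks for.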

Background to this Case Study

What exactly are bad actors up to with malicious domains? It might be as simple as using a near-identical spelling of a common domain to trick careless users into viewing ads or picking up adware. Legitimate sites are gradually catching on to this technique, often called typosquatting.

Other malicious domains are the product of domain generation algorithms (DGAs), which can be used to do all sorts of nefarious things, like evading countermeasures that block known compromised sites, or overwhelming domain name servers in a distributed denial-of-service (DDoS) attack. Older variants use randomly generated strings, while more advanced ones add tricks like injecting common words, further confusing defenders.

Edit distance can help with both use cases; here is how. First, we’ll exclude common domains, since these are generally safe. A list of common domain names also provides a baseline for detecting anomalies. One good source is Quantcast. For this discussion, we will stick to domain names and ignore subdomains (e.g. ziften.com, not www.ziften.com).
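The cleaning step described above might look something like the following Python sketch. The helper names registered_domain and candidates are hypothetical, and the subdomain stripping is deliberately naive: it keeps only the last two labels, so it assumes single-label TLDs and would mishandle suffixes like co.uk (a public suffix list would be needed for those).

```python
def registered_domain(hostname: str) -> str:
    """Naively reduce a hostname to its last two labels,
    e.g. www.ziften.com -> ziften.com.
    Assumption: single-label TLDs only (no co.uk-style suffixes)."""
    labels = hostname.strip().lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def candidates(observed, common):
    """Return observed domains that are not on the common list,
    de-duplicated and in first-seen order."""
    common_set = {registered_domain(d) for d in common}
    seen = set()
    result = []
    for host in observed:
        domain = registered_domain(host)
        if domain not in common_set and domain not in seen:
            seen.add(domain)
            result.append(domain)
    return result


if __name__ == "__main__":
    observed = ["www.google.com", "gooogle.com", "mail.gooogle.com"]
    print(candidates(observed, ["google.com"]))
```

Only the domains that survive this filter need to be compared against the common list in the next step.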

After data cleaning, we compare each candidate domain name (input data observed in the wild by Ziften) to its potential neighbors in the same top-level domain (the tail end of a domain name: classically .com, .org, etc., but now it can be almost anything). The basic task is to find the nearest neighbor in terms of edit distance. By finding domains that are one edit away from their nearest neighbor, we can easily spot typo-ed domains. By finding domains far from their nearest neighbor (the normalized edit distance we introduced in Part 1 is useful here), we can also find anomalous domains in the edit distance space.
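A minimal Python sketch of this nearest-neighbor search follows. It is a brute-force illustration, not the production query: nearest_neighbor is a hypothetical helper, the common list is assumed to be ordered by popularity, and the normalization (dividing by the longer name's length) is an assumption that may differ from the definition used in Part 1.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def nearest_neighbor(candidate: str, common_domains):
    """Find the closest common domain within the same TLD.
    common_domains is assumed to be ordered by popularity, so the
    1-based position doubles as the neighbor's rank. Ties keep the
    more popular neighbor. Returns None if no domain shares the TLD."""
    name, _, tld = candidate.rpartition(".")
    best = None
    for rank, domain in enumerate(common_domains, start=1):
        other, _, other_tld = domain.rpartition(".")
        if other_tld != tld:
            continue  # only compare within the same top-level domain
        distance = edit_distance(name, other)
        if best is None or distance < best["distance"]:
            best = {
                "neighbor": domain,
                "rank": rank,
                "distance": distance,
                # Assumed normalization: divide by the longer name.
                "normalized": distance / max(len(name), len(other), 1),
            }
    return best


if __name__ == "__main__":
    common = ["google.com", "facebook.com", "google.org"]
    print(nearest_neighbor("gooogle.com", common))
```

A distance of exactly 1 flags a likely typo, while a normalized distance close to 1 means the candidate looks like nothing on the common list.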

What were the Results?

Let’s look at how these results appear in real life. Take care navigating to these domain names since they could contain malicious content!

Here are a few potential typos. Typosquatters target popular domains because there is a better chance someone will visit. Several of these are suspicious according to our threat feed partners, but there are some false positives as well, with cute names like “wikipedal”.

[Image ed2-1: examples of potential typo domains]

Here are some weird-looking domains far from their nearest neighbors.

[Image ed2-2: examples of domains far from their nearest neighbors]

So now we have created two useful edit distance metrics for hunting. Not only that, we have three features to potentially add to a machine learning model: rank of nearest neighbor, distance from nearest neighbor, and whether the domain is at edit distance 1 from its neighbor, indicating a risk of typo shenanigans. Other features that could work well alongside these include lexical features such as word and n-gram distributions, entropy, and string length, as well as network features like the number of failed DNS requests.
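To make the feature list concrete, here is a hedged Python sketch of how such a feature row might be assembled. The function and key names are hypothetical, the neighbor rank and distance are assumed to come from an earlier nearest-neighbor step, and network features such as failed DNS requests are left out because they depend on data not shown here.

```python
import math
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Character-level Shannon entropy in bits; 0 for an empty string."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def char_ngrams(s: str, n: int = 2) -> Counter:
    """Counts of overlapping character n-grams (bigrams by default)."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def domain_features(name: str, neighbor_rank: int, neighbor_distance: int) -> dict:
    """Combine the three edit distance features with simple lexical ones.
    neighbor_rank and neighbor_distance are assumed to be precomputed."""
    return {
        "neighbor_rank": neighbor_rank,
        "neighbor_distance": neighbor_distance,
        "is_edit_distance_1": neighbor_distance == 1,  # typo indicator
        "length": len(name),
        "entropy": shannon_entropy(name),
        "distinct_bigrams": len(char_ngrams(name, 2)),
    }


if __name__ == "__main__":
    print(domain_features("gooogle", neighbor_rank=1, neighbor_distance=1))
```

Randomly generated names tend to score higher on entropy than dictionary-like names, which is one reason entropy pairs well with the distance features.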

Simplified Code that you can Play Around with

Here is a simplified version of the code to play with! It was written for HP Vertica, but this SQL should work with most advanced databases. Note that Vertica’s editDistance function may have a different name in other implementations (e.g. levenshtein in Postgres or UTL_MATCH.EDIT_DISTANCE in Oracle).

[Image ed2-3: simplified SQL code]