Reading is like riding a bicycle: once you master it, it feels easy and automatic, and you quickly forget how much effort it took to learn. For example, we are normally not aware that we move our eyes three or four times per second as we read, glancing at each word on a screen or page for a few hundred milliseconds. Nor do we realize that only a portion of a word is visible and in focus at any given moment [1]. Unfortunately, the speed and ease of reading can also be used against us during everyday tasks like reading email. In particular, a malicious actor can make a subtle change to an email address, transforming it into a lookalike domain that seems trustworthy and familiar when it may in fact be the first step of a phishing attack.
Lookalike domains fool us by taking advantage of the fact that, in less than a second, we’ve already glanced at the subject line and the sender’s name and decided whether the sender’s email address is familiar or not. We thought we saw microsoft.com but in fact it was rnicrosoft.com. Or perhaps we read the domain as apple.com when something feels a little odd, and on second glance we realize it’s actually app1e.com. (Yes, look again – that “l” in the second apple is actually the number 1.)
Lookalike domains not only “hack into” the rapid and perhaps automatic way we read email, but they also take advantage of the halo effect, which is the tendency for positive feelings about a person, idea, or thing to “spread” or transfer to other aspects of experience [2]. So when we think we see a familiar, trusted company or brand name in the email address, we’re more likely to assume the message itself (e.g., the content, links, attachments, etc., it includes) can be trusted. We lower our defenses and click on the message, and if it is in fact a carefully engineered phishing message, we’ve taken the bait.
What is a lookalike domain?
Lookalike domains -- let’s call them lookalikes -- fall into two categories. The first consists of domains created by modifying a known domain with letter substitutions, additions, or deletions. The second embeds the real domain (or a modified version of it) into a larger string, such as bestdeals-amazon.com, or combines embedding with modification, as in bestdeals-amazan.com. (We have selected Amazon as our example since it is one of the most spoofed retail brands.)
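The exact-embedding case can be caught with a simple scan; the following is a minimal sketch, in which the brand list and the `embedded_brand` helper are illustrative assumptions (modified embeddings like bestdeals-amazan.com require the similarity methods discussed later):

```python
KNOWN_BRANDS = ["amazon", "microsoft", "apple"]

def embedded_brand(domain, brands=KNOWN_BRANDS):
    """Return a known brand embedded in a larger domain label, if any."""
    # Strip the TLD, then look for a brand inside the remaining label.
    label = domain.rsplit(".", 1)[0]
    for brand in brands:
        # The label must contain the brand but not *be* the brand itself.
        if brand in label and label != brand:
            return brand
    return None
```

The legitimate domain itself (amazon.com) deliberately returns `None` here, since only brand names embedded in a larger string are suspicious.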
Identifying lookalike domains: The challenges
What is the lookalike “target”?
It’s fairly trivial for a human reader to recognize that an email address from the domain amazan.com -- once that third “a” is spotted -- is spoofing or targeting the domain amazon.com. Building an automated system that can identify the target domain, however, is far from trivial. In particular, asking a machine to tell us, “Which domain does amazan.com look like?” from a universe of possibilities is computationally expensive (e.g., do we search through a giant list of known domains?). In other words, identifying the target of the lookalike automatically and efficiently is a difficult task.
There are a number of potential solutions. One option, as we hinted above, is to create a “dictionary” of known good domains, and to exhaustively compare the candidate lookalike to each entry in our domain dictionary (we can of course optimize this brute-force search in a number of ways). The advantage of the dictionary look-up approach is that, once we have a target, we can implement a fast and lightweight method for comparing the candidate and target domains. The cost, on the other hand, is the time and effort spent searching for potential targets to compare against.
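A minimal sketch of the dictionary look-up strategy, using Python’s standard-library `difflib` similarity ratio as a stand-in for a real similarity metric; the domain list and function name are illustrative assumptions:

```python
import difflib

KNOWN_DOMAINS = ["amazon.com", "microsoft.com", "apple.com", "agari.com"]

def best_target(candidate, known=KNOWN_DOMAINS):
    """Exhaustively compare the candidate against every known domain."""
    # SequenceMatcher.ratio() returns a similarity in [0, 1]; higher is closer.
    scored = [(difflib.SequenceMatcher(None, candidate, d).ratio(), d)
              for d in known]
    return max(scored)  # (similarity, most similar known domain)
```

In practice the brute-force scan would be optimized (pre-filtering by length, indexing by TLD, etc.), but the shape of the approach is the same: every candidate is checked against every dictionary entry.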
Alternatively, we can construct a method that maps all our known, good domains into a manageable “metric space.” Then, given a candidate lookalike domain, we need only map the candidate into the same space, which gives us for free the targets that are nearest to the candidate. In other words, the advantage of this second strategy is that our mapping method generates one or more targets for a given candidate lookalike at virtually no cost. The downside, however, is the time and effort spent training a machine-learning model that computes this domain-mapping method.
How is similarity between a lookalike and its target measured?
So now we have our lookalike and its intended target in hand. The second challenge is defining a similarity metric. Of the following lookalike candidates, which one looks most like (is most similar to) amazon.com:
- amazan.com, arnazon.com, or amazn.com?
One approach to this question is to use edit-distance as our similarity metric. Similarity is defined as the cost (i.e., the type and number of edits) of transforming one word into another. “Lower cost” means “more similar.” For example, how many letters do we have to add/remove/substitute to go from:
- amazan.com to amazon.com?
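The classic Levenshtein algorithm answers this question directly. Here is a minimal sketch, assuming a unit cost for every edit type (real systems often assign different costs to different edits):

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]
```

For example, amazan.com and amazn.com are both one edit away from amazon.com (a substitution and a deletion, respectively), while arnazon.com is two edits away.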
The appeal of approaches like edit-distance, which operate directly on the letters, is that they are fast and easy to compute. Thus, they complement expensive target-search methods, like the dictionary look-up strategy we mentioned earlier: a low-cost similarity metric paired with a high-cost search method for identifying targets.
Edit-distance methods have two additional important features. First, they tend to use “hand-coded” rules, not only for determining which letters (or more accurately, typewritten characters) resemble each other, but also for the relative cost of each edit (e.g., how much does one addition plus two substitutions cost?). Second, edit-distance methods are best suited to smaller character sets, like the Latin alphabet; once we include the larger universe of Unicode characters, hand-curating sets of character-level lookalikes (i.e., homoglyphs or “confusables”) becomes far more challenging. With the image-based approach described next, by contrast, character features (like the fact that “a” and “o” are both round and the same height) emerge without explicitly teaching the model to detect them.
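To make the “hand-coded rules” point concrete, here is a minimal sketch of a confusables table that maps lookalike glyph sequences to their canonical characters; the table entries are a tiny illustrative sample, not a complete mapping:

```python
# Hand-coded confusables: each key is a glyph sequence that visually
# resembles the replacement character. Real tables contain thousands
# of entries, especially once Unicode homoglyphs are included.
CONFUSABLES = {
    "rn": "m",   # two characters that together resemble "m"
    "vv": "w",
    "1": "l",
    "0": "o",
}

def skeleton(domain):
    """Reduce a domain to its visual 'skeleton' by canonicalizing glyphs."""
    s = domain
    # Apply multi-character rules first so "rn" is rewritten before "n" etc.
    for glyphs, replacement in sorted(CONFUSABLES.items(),
                                      key=lambda kv: -len(kv[0])):
        s = s.replace(glyphs, replacement)
    return s
```

Two domains whose skeletons match (e.g., rnicrosoft.com and microsoft.com) are strong lookalike candidates. The drawback is exactly what the paragraph above describes: every rule must be curated by hand.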
A very different approach, which we have developed at Agari, uses image-based similarity. The image-based approach takes its inspiration from the field of reading research, and in particular, from the physiological and psychological mechanisms that support skilled reading. Instead of treating domains as strings of individual characters, we convert each string into an image, that is, a 2D array of pixels, and measure similarity, in a sense, the way our eyes “naturally” see it (e.g., how a tiger and a lion look similar, or an apple and a pear). Consider this real gift card design, and home in mostly on the typographic logo:

[Image: a genuine Amazon gift card]

The challenge, of course, is building a model that learns to “see” the differences in a fake version of the same design (compare with the image above):

[Image: a counterfeit Amazon gift card]

Setting aside the obvious differences that are not “apples to apples” (i.e., the blurriness of the fake image, its reversed orientation, the missing arrow beneath the “a” graphic, and the missing callout ribbon toward the bottom), if you look closely at the typographic amazon.com logo, you will notice that while the font appears to be the same, neither the color of the arrow nor the words “gift card” match the original exactly.
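The string-to-image conversion itself is straightforward. Here is a minimal sketch assuming the Pillow imaging library is available; the image size and default bitmap font are placeholders for whatever rendering a production system would use:

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def render_domain(domain, size=(200, 24)):
    """Render a domain string as a grayscale 2D pixel array in [0, 1]."""
    # White canvas, black text, using Pillow's built-in default font.
    img = Image.new("L", size, color=255)
    draw = ImageDraw.Draw(img)
    draw.text((2, 2), domain, fill=0, font=ImageFont.load_default())
    # Convert to a (height, width) float array: 0.0 = black, 1.0 = white.
    return np.asarray(img, dtype=np.float32) / 255.0
```

Once domains are pixel arrays, “how similar do these two domains look?” becomes an image-comparison problem rather than a character-comparison problem.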
The image-based approach has several advantages over edit-distance methods. First, we don’t have to hand-code arbitrary rules for which characters look like which others, or for measuring how similar one character is to another. Second, there are a variety of well-defined, well-understood distance metrics that measure how far apart two images are, that is, how similar or different they are.
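Two of the standard image-distance metrics can be sketched in a few lines; the function names are illustrative, and in practice these would be applied to rendered domain images (or their feature vectors) rather than toy arrays:

```python
import numpy as np

def euclidean(a, b):
    """Euclidean (L2) distance between two same-shaped pixel arrays."""
    return float(np.linalg.norm(a - b))

def cosine_distance(a, b):
    """1 - cosine similarity: 0 means identical direction, up to 2 for opposite."""
    a, b = a.ravel(), b.ravel()
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Lower distance means more visually similar, so these metrics slot directly into the thresholding logic described later in the pipeline.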
The trade-off then is: image-based methods provide a quick and powerful way to measure the similarity between a lookalike and target domain, and more importantly, they also generate one or more potential targets that mimic a given candidate lookalike, without requiring brute-force search. The catch is that building, training, and validating a model that can do this takes time and effort.
Is a lookalike domain actually malicious?
The third challenge is demonstrating malicious intent. Just because an email arrives from a domain that looks an awful lot like amazon.com or microsoft.com does not prove it’s malicious -- we need more evidence besides close similarity. Addressing that challenge is beyond the scope of this article, but some potential questions to ask are: Is the lookalike domain registered, and if so, how old is it? What do we know about the history of email from this domain? What other message features (e.g., the infrastructure used to deliver it, the subject line, the intended recipient, etc.) suggest malicious intent?
Implementing an image-based lookalike detection system
In this final section, we provide an end-to-end overview of implementing an image-based model for detecting lookalike domains, divided into five steps:
- Build an image library. We begin the process of training an image-based lookalike model by first building out an extensive dataset of well-known domains, which are converted into images like the examples above.
- Train a “bottleneck” model. Next, we need a model that takes these images as input and converts them into a highly compressed feature vector (sometimes called an “embedding”). There are a number of options for implementing a “bottleneck” encoding, including ordinary autoencoder networks and convolutional autoencoders, as well as pretrained image-processing models such as VGG (developed by the Oxford Visual Geometry Group).
- Identify lookalike targets. As we highlighted earlier, a key advantage of using an image-based method is that it gives us a set of lookalike targets “for free.” How is this accomplished? We achieve this by first pushing each of our well-known domains through the bottleneck model, which generates a set of corresponding feature vectors. Each vector is a compact representation of the original image in a high-dimensional, embedding space. Next, we gather these vectors into groups with an unsupervised clustering algorithm (e.g., k-means, Kohonen map, etc.), which creates a powerful and direct look-up method for projecting lookalikes into the embedding space: given a new feature-vector, we (1) compute which cluster it belongs to, and (2) retrieve the vectors for the well-known domains within that cluster. Note that by tuning the clustering algorithm, we have full control over how many domains are within each cluster.
- Measure lookalike-target similarity. Now that we have a manageable set of target domains to compare the candidate lookalike against, we need only compute the distance between the candidate feature vector and each of the target feature vectors that we generated in the previous step. Possible distance measures include Euclidean, cosine, Manhattan, and so on. Finally, we select the target that the lookalike candidate is nearest to, and determine whether the distance between the two is less than a predetermined threshold. Lookalikes within the threshold are positives, while those that fall farther away are negatives.
- Is the lookalike malicious? While the answer to this question is part of Agari’s proprietary lookalike algorithm, the rationale it relies on is common sense: what other evidence do we have that the sending domain is a well-known, trusted entity versus an unfamiliar and potentially malicious actor? Indeed, it’s sometimes the case that a completely legitimate “lookalike” domain happens to be very similar to another, trusted domain, so we want to carefully distinguish between these benign lookalike cases and those that raise a number of red flags.
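The “bottleneck” step above deserves a concrete illustration. A full convolutional autoencoder is beyond a short sketch, so here is a minimal linear stand-in using truncated SVD (essentially PCA); the function names and the bottleneck dimension `k` are illustrative assumptions:

```python
import numpy as np

def fit_linear_bottleneck(images, k=16):
    """Fit a linear 'bottleneck' (PCA-style) on flattened domain images.

    images: (n_domains, height*width) array of flattened pixel arrays.
    Returns the pixel mean and the top-k principal components.
    """
    mean = images.mean(axis=0)
    # SVD of the centered data; rows of Vt are principal directions.
    _, _, Vt = np.linalg.svd(images - mean, full_matrices=False)
    return mean, Vt[:k]

def encode(image_flat, mean, components):
    """Project one flattened image into the k-dimensional embedding space."""
    return (image_flat - mean) @ components.T
```

A trained convolutional autoencoder (or a pretrained network like VGG) plays the same role: it maps each rendered domain image to a compact feature vector.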
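The clustering-based target look-up (step 3 above) can be sketched with a small self-contained k-means; the farthest-point initialization and the helper names are illustrative choices, and a production system would use a tuned library implementation:

```python
import numpy as np

def kmeans(vectors, k, iters=20):
    """Minimal Lloyd's-algorithm k-means with farthest-point initialization."""
    centroids = [vectors[0]]
    for _ in range(k - 1):
        # Next centroid: the point farthest from all chosen centroids.
        d = np.min([np.linalg.norm(vectors - c, axis=1) for c in centroids],
                   axis=0)
        centroids.append(vectors[d.argmax()])
    centroids = np.array(centroids, dtype=float)
    labels = np.zeros(len(vectors), dtype=int)
    for _ in range(iters):
        # Assign each vector to its nearest centroid, then recompute means.
        d = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = vectors[labels == j].mean(axis=0)
    return centroids, labels

def lookup_targets(candidate_vec, centroids, labels, domains):
    """Steps (1) and (2) from the text: find the cluster, return its domains."""
    cluster = np.linalg.norm(centroids - candidate_vec, axis=1).argmin()
    return [dom for dom, lab in zip(domains, labels) if lab == cluster]
```

Given a candidate lookalike's feature vector, `lookup_targets` returns only the known domains in its cluster -- the "free" target set the text describes.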
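Finally, the similarity-and-threshold decision (step 4 above) reduces to a nearest-neighbor check; the threshold value would be tuned on labeled data, and the function name is an illustrative assumption:

```python
import numpy as np

def nearest_target(candidate_vec, target_vecs, target_names, threshold):
    """Pick the nearest target and apply the lookalike distance threshold."""
    d = np.linalg.norm(target_vecs - candidate_vec, axis=1)
    i = int(d.argmin())
    # Within the threshold -> flag as a lookalike positive.
    return target_names[i], float(d[i]), bool(d[i] < threshold)
```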
References
- [1] Eye movement in reading. Wikipedia. https://en.wikipedia.org/wiki/Eye_movement_in_reading
- [2] Halo effect. Wikipedia. https://en.wikipedia.org/wiki/Halo_effect