Snugglefish Provides Quick Pattern Matching Across Malware Samples
September 5, 2013
As the number of malware samples steadily increases, computer network defense (CND) teams have an ever-growing problem finding relationships among files. For example, if an analyst determines that a sample has a certain constant sequence of bytes, it is desirable to search all previously seen samples for the same occurrence. Doing so improves the analyst's odds of obtaining a better understanding of adversaries' TTPs.
To aid in such endeavors, MITRE developed an open source program we call "Snugglefish," which loosely stands for Simple N-Gram Fast Indexer & Searcher. The concept behind Snugglefish is based on a CMU CERT whitepaper titled "A Scalable Search Index for Binary Files." [Note: IEEE requires a login or payment for the paper.] If possible, we recommend you read the whitepaper.
To summarize, the paper describes a system that takes binary files and splits them into unique n-grams of a given size. It then stores these n-grams to disk for later searching. The indexing may take a while, but the searches are meant to be fast, allowing analysts to quickly search large numbers of malware files for specific sequences. The paper also goes into detail about design and implementation.
Snugglefish consists of two parts. First is indexing, which ingests files and outputs indices of the data. The second part, searching, takes the indices and a byte string as input, then searches to find files with the same string.
We describe these two parts of Snugglefish in the remainder of this post.
The indexing process takes input files and splits them up into n-grams using a 1-byte sliding window. By way of example, consider a file with content consisting of the string "google." If we set an n-gram size of 3, the result would be four 3-grams, each n-gram sliding over by one byte. See Figure 1 below.
The indexing code will discard any duplicate n-grams. So, if, for example, the sequence "ogl" appears in the file again, it will not be recorded again. Only the existence of an n-gram is recorded, not its ordering or location. The advantage is that minimal space is used; the disadvantage is that false positives increase (more on this below).
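The sliding-window extraction and deduplication described above can be sketched in a few lines of Python (a purely illustrative sketch, not Snugglefish's actual C++ implementation; the function name is ours):

```python
def unique_ngrams(data, n=3):
    """Slide a 1-byte window over data, keeping each distinct n-gram once.

    Only existence is recorded -- order and location are discarded,
    which is what makes the index compact but false-positive prone.
    """
    return {data[i:i + n] for i in range(len(data) - n + 1)}

# "google" yields four 3-grams; any repeated gram would be stored only once.
grams = unique_ngrams(b"google")
```

A Python `set` stands in here for whatever on-disk structure records n-gram existence.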
After the indexer has stored as much data as it can in memory (all of the n-grams from multiple files), it flushes everything in memory to disk, then iterates with any remaining files. The end result is one or more sets of indices that can later be used to search. Figure 2 below illustrates this concept.
After calculations and testing, we determined that a 3-byte n-gram (a 3-gram) would be ideal for our operational needs. We based our decision on the runtime complexity, disk space, and usability of the results. With 2-byte n-grams the number of false positives was too high to be usable, and with 4-byte n-grams, the space needed on disk and the runtime requirements were too high to warrant the decrease in false positives. As we're sure you've deduced, the speed of indexing depends on a number of factors, including the size of files being ingested and the CPU and I/O speeds of the system used.
The size of the indices that are produced depends on how much data is in the files to be indexed. If there are lots of unique n-grams across the dataset, your indices will be larger than if there were fewer unique n-grams. Unfortunately, it's difficult to give an accurate estimate of the growth between raw samples and indexed data.
The searching process works by taking a byte string, provided by a user, and breaking it into n-grams. Using the above example, "google," Snugglefish would use the resulting four unique 3-grams ("goo," "oog," "ogl," and "gle") to search the indices, looking for files with all four 3-grams. The result is a list of all candidate filenames containing the pattern.
The resulting files are deemed "candidates" because, at this point, we only know that the given n-grams exist in the files. Without knowing the location of each n-gram, there's uncertainty that an exact match was found. Because of these false positives, a secondary verification phase must be used to verify resulting files contain the search string. (Note: Snugglefish does not do this verification.)
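The candidate search and the separate verification pass can be sketched as follows (an illustrative Python sketch with hypothetical function names; the `verify` step stands in for the external check, such as grep, that Snugglefish itself does not perform):

```python
def unique_ngrams(data, n=3):
    """Each distinct n-gram appearing in data, via a 1-byte sliding window."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}

def candidates(index, pattern, n=3):
    """Intersect the posting sets of every n-gram in the search pattern."""
    sets = [index.get(g, set()) for g in unique_ngrams(pattern, n)]
    return set.intersection(*sets) if sets else set()

def verify(files, names, pattern):
    """Secondary pass (outside Snugglefish): confirm the literal pattern."""
    return {name for name in names if pattern in files[name]}

# Toy corpus: b.bin holds all four 3-grams of "google" but never the word.
files = {"a.bin": b"say google", "b.bin": b"gle ogl oog goo"}
index = {}
for name, data in files.items():
    for g in unique_ngrams(data):
        index.setdefault(g, set()).add(name)

hits = candidates(index, b"google")        # both files are candidates
true_hits = verify(files, hits, b"google")  # only a.bin survives
```

Note how b.bin is a false positive: it contains every 3-gram of the pattern, just not contiguously, which is exactly why the verification pass is needed.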
To be clear, Snugglefish's purpose is not to find all samples that match but instead to reduce the possible search space, thus making it more manageable for analysts. For example, if a CND team has one million samples, current searching mechanisms may take quite a while. But by using Snugglefish, the search space is significantly reduced, thus reducing the amount of time to search for true positives.
To give you an idea of how Snugglefish can improve analysts' time in searching malware samples, let's revisit the "google" example. We used a sample of about 200K files for which Snugglefish created an index. Then, using the string "google.com" as a search parameter with that index, Snugglefish returned a list of about 26,000 files in less than 1 second. We used grep as a verification step, which took about 30 seconds to complete (the majority of that time was due to feeding filenames to grep and not grep itself). The end result is that of those 26K files, only one was a true match.
At first glance, this might appear as if Snugglefish had too high a false positive rate. However, consider the original sample set of 200K files, which Snugglefish reduced by over 85% in less than one second to a more manageable list of files (26K out of 200K). Then, in under a minute, we were able to pinpoint which of those 200K had the given string. By comparison, a traditional grep of every file in the sample set took us about 5 minutes. So we were able to cut the search time from 5 minutes to under a minute. Admittedly, this is a canned example, heavily affected by the small size of the sample set. But as your collection grows, you should see increasing benefit from a utility like Snugglefish.
One thing to consider when searching (and this is mentioned in the CMU paper), is the usefulness of your search term. If you have a dataset containing primarily Portable Executable (PE) files, then searching for "This program cannot" would not be the most suitable search string since it's likely to occur in the vast majority of your files.
Although we have only shown ASCII strings in these examples, it's worth pointing out that Snugglefish makes no distinction on type. It works off a series of bytes, so you can search for arbitrary sequences of instructions or data in any file, not just strings.
Snugglefish is in early alpha release and has room for improvement. The current version is single-threaded, so performance gains could easily be achieved by multi-threading the program (especially search). Also, the referenced whitepaper describes an efficient method for compressing the data written to disk; because searching is far more I/O-bound than CPU-bound, trading some I/O for extra CPU work via compression would produce performance gains and lead to faster searches.
If you have a growing collection of malware samples in your CND environment and do not yet have a capability to quickly find potentially related samples, then consider Snugglefish.
The code is available in our GitHub repository.
If you try it out, we'd like to hear from you. You can reach us via the contacts listed on GitHub.