The timestamping problem: A case study in data-driven malware analytics
At Invincea we build machine learning models for detecting, analyzing, and attributing cyber attacks. In this blog post we present an analysis of one input that we’ve found boosts the performance of our models: the portable executable (PE) compile timestamp, a field included in every Windows .EXE and .DLL file that records when the file was compiled.
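For readers who want to inspect this field themselves, here is a minimal sketch in Python using only the standard library. It assumes a well-formed PE file and reads the TimeDateStamp field from the COFF header (the field sits 8 bytes past the "PE\0\0" signature, after the 2-byte Machine and 2-byte NumberOfSections fields); a real parser would need more validation:

```python
import struct
from datetime import datetime, timezone

def pe_timestamp(data: bytes) -> datetime:
    # e_lfanew (the file offset of the PE signature) lives at 0x3C
    # in the DOS header
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("not a PE file")
    # COFF header: Machine (2 bytes), NumberOfSections (2),
    # then TimeDateStamp (4) -- a 32-bit Unix epoch value
    (ts,) = struct.unpack_from("<I", data, e_lfanew + 8)
    return datetime.fromtimestamp(ts, tz=timezone.utc)

# Usage on a synthetic header built for illustration:
hdr = bytearray(64)                       # minimal fake DOS header
struct.pack_into("<I", hdr, 0x3C, 64)     # PE signature right after it
hdr += b"PE\0\0" + struct.pack("<HHI", 0x14C, 3, 1483228800)
print(pe_timestamp(bytes(hdr)))           # 2017-01-01 00:00:00+00:00
```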
The existence of this field presents malware authors with a problem. If they leave the compiler-assigned value in the field they reveal information to network defenders and law enforcement about when they created their malware. If they assign a concocted value, their tampering can make them easier to detect or can reveal information about them via their choice of concocted value.
Because of this dilemma PE timestamps are useful for detecting malware and gaining intelligence about our adversaries. But how, in practice, do attackers decide to timestamp their binaries and what specific opportunities do these decisions create for us as defenders? In this blog post we’ll address these questions by looking at the overall landscape of malware timestamping, by showing an individual malware family’s timestamping strategy, and by showing the utility of PE timestamps for detecting previously unseen malware samples.
Let’s start by looking at the big picture. Below is a plot showing timestamping behavior for malware first seen on VirusTotal in the last year. The horizontal axis shows when files were first seen “in the wild” and the vertical axis shows their compile timestamps. The darkness of the pixels shows how many malware samples occurred at that position in the plot.
The plot reveals at least two trends. First, adversaries sometimes forge timestamps in a patently obvious way. Case in point, some malware samples have timestamps indicating that they were compiled in the future and many other samples claim to have been compiled in the Reagan era. It is of course unlikely that malware authors have time machines or that they are still deploying malware from a quarter century ago. Indeed, the adversaries who made these samples seem to have cared little about the plausibility of these timestamps, but simply cared about obscuring the time at which they actually compiled their malware.
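These patently forged values can be flagged with a simple plausibility check. The sketch below captures the idea; the 1992 cutoff is a hypothetical threshold chosen for illustration, not a value we use in production:

```python
from datetime import datetime, timezone

# Hypothetical cutoff for illustration: treat anything older than this
# as an implausible compile date.
OLDEST_PLAUSIBLE = datetime(1992, 1, 1, tzinfo=timezone.utc)

def timestamp_plausible(compile_ts: datetime, first_seen: datetime) -> bool:
    """A timestamp is implausible if it postdates the file's first
    sighting (a 'time machine') or predates any realistic compiler era."""
    if compile_ts > first_seen:
        return False  # claims to be compiled in the future
    if compile_ts < OLDEST_PLAUSIBLE:
        return False  # Reagan-era (or earlier) timestamp
    return True
```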
A second trend, shown in the thin horizontal strips that run through this plot, suggests that some malware families re-use the same compile timestamp over and over again. This happens because, although the attackers behind these samples mutate individual samples to evade detection, those mutations leave the compile timestamp untouched.
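These horizontal strips can be surfaced programmatically by counting how often each exact timestamp recurs across a corpus. A sketch (the threshold of 100 repeats is an arbitrary illustration, and `repeated_timestamps` is a hypothetical helper name):

```python
from collections import Counter

def repeated_timestamps(timestamps, min_count=100):
    """Return compile timestamps shared by at least min_count samples --
    a hint that a polymorphic family is cloning the field verbatim."""
    counts = Counter(timestamps)
    return {ts: n for ts, n in counts.items() if n >= min_count}

# Usage with a tiny toy corpus and a lowered threshold:
hits = repeated_timestamps([1203500000] * 3 + [1400000000, 1450000000],
                           min_count=3)
print(hits)  # {1203500000: 3}
```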
A third trend suggests that many attackers do not tamper with compile timestamps at all and simply leave the true, compiler-assigned value in this field. This becomes visible if we change the range of the y-axis on our plot to just show malware with “plausible” timestamps occurring around the time we first saw the malware. I do this below, plotting malware first seen in 2016 that also has timestamps suggesting it was compiled in 2016. The thick diagonal on this plot reveals a large number of samples whose timestamps claim they were compiled just before they were first seen. While it’s impossible to be sure, it seems likely that these compile timestamps are legitimate.
When individual malware families forge their timestamps, are there systematic patterns in their timestamping strategies? This topic deserves more attention than this blog post allows for, but in short, it turns out that there often are. To take one example, let’s look at the timestamping behavior of the allaple malware family, a common worm that spreads via LAN file shares. The plot below shows examples of allaple with the samples’ first seen timestamp on the horizontal axis and their PE timestamps on the vertical axis. It appears that when it generates new polymorphic copies of itself, allaple uses a random number generator to forge timestamps spanning the late 1980s to early 1990s. While we can’t learn when allaple was actually compiled by looking at these timestamps, the plot does give insight into how the worm algorithmically mutates itself.
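To make that forging strategy concrete, here is a toy simulation of what such a mutation engine might do. This is not allaple's actual code, just an illustration of drawing timestamps uniformly from a fixed historical window, which would produce the horizontal band visible in the plot:

```python
import random
from datetime import datetime, timezone

def forge_timestamp(rng: random.Random) -> int:
    """Draw a fake compile timestamp uniformly from 1988-1993,
    mimicking the band of forged values in the allaple plot."""
    lo = int(datetime(1988, 1, 1, tzinfo=timezone.utc).timestamp())
    hi = int(datetime(1993, 12, 31, tzinfo=timezone.utc).timestamp())
    return rng.randint(lo, hi)

# Each polymorphic copy gets an independent draw:
rng = random.Random(0)
fakes = [forge_timestamp(rng) for _ in range(5)]
```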
A final question we’ll address here is how useful PE timestamps can be for malware detection. Before getting into how we can do this, look at the animated GIF below, which compares malware and benignware timestamping behavior.
While this plot shows some examples of benignware forging its PE timestamps, malware timestamping in general involves far more forgeries than benignware. For example, while most benignware seen in 2015 and 2016 has compile timestamps from the 1990s, the 2000s, and the past six years, malware timestamps vary much more wildly, occurring frequently in both the distant future and the distant past. In addition, malware reuses identical timestamps again and again, because many near-duplicate copies of the same polymorphic malware families circulate in the wild. All of this suggests that, at least in theory, we should be able to use PE timestamps to detect malware.
To test this hypothesis I built a random forest machine learning malware detector, training it on malware and benignware from 2016 and testing it on malware and benignware from the first couple of weeks of 2017. This detector uses only one feature: the PE timestamp of input binaries. The plot below shows that, by training on files seen in 2016, we can detect about 60% of previously unseen 2017 malware at a false positive rate of about 1%.
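A sketch of such a one-feature detector using scikit-learn. The timestamps and labels below are synthetic stand-ins fabricated for illustration (the real experiment trained on labeled 2016 VirusTotal data), but the shape of the pipeline is the same:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in corpus: benignware clusters in a plausible
# ~2014-2016 epoch range; malware mixes plausible values with
# wildly forged past and future ones.
benign = rng.uniform(1.40e9, 1.48e9, 1000)
malware = np.concatenate([
    rng.uniform(1.40e9, 1.48e9, 400),   # plausible-looking forgeries
    rng.uniform(0.00e9, 0.30e9, 300),   # Reagan-era and earlier
    rng.uniform(1.60e9, 2.00e9, 300),   # "future" timestamps
])

X = np.concatenate([benign, malware]).reshape(-1, 1)  # one feature
y = np.concatenate([np.zeros(len(benign)), np.ones(len(malware))])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# An obviously forged Reagan-era timestamp scores as malicious:
print(clf.predict([[1.0e8]]))
```

With only one feature the forest can do little more than carve the timestamp axis into suspicious and plausible intervals, which is exactly why the obviously forged regions are caught while plausibly timestamped malware slips through.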
While this is a toy approach to detecting malware, it is striking how much bang for our buck we get from this single high-value PE header field. For comparison, Invincea’s deep learning detection engine uses hundreds of millions of features, including the PE timestamp, and achieves a detection rate greater than 99% at a 1% false positive rate.
In conclusion, this blog post has shown how a data-driven look at malware PE timestamps can help address multiple security problems, including understanding malware families, gaining intelligence about when samples may have been compiled, and performing malware detection. Data-science-based security requires that we design large-scale input spaces drawing on many information sources, and the analysis shown here demonstrates how we can go about carefully identifying those inputs.