Machine Learning - Sophos relies on artificial intelligence

Intercept X 2.0 is about to be released. The beta phase has been running for some time and the product is expected to ship later this month (January 2018) as a free update. This article is not about Intercept X 2.0 itself, but instead takes a deeper look at the technology it incorporates.

Artificial intelligence

Machine learning is currently on everyone’s lips, much as the cloud, VR (Virtual Reality) and AR (Augmented Reality) were a few years ago. Marketing departments often rebadge it as A.I. for Artificial Intelligence, or in German KI, short for “künstliche Intelligenz”. Whether it is chatbots, virtual assistants, self-driving cars, translation tools, smartphones or photo software - according to the vendors, everything now “has AI inside” to make our lives easier and their products smarter.

The latest smartphones ship with dedicated AI chips (Neural Processing Units, or NPUs) for pattern recognition and analysis. If you follow this development through, before long everyone will have a small supercomputer in their pocket that is almost always online. Richard Hendricks’ decentralised internet from the TV series “Silicon Valley” is fast approaching reality. Similar blockchain-based projects for decentralised storage and decentralised compute already exist.

At this point, I can strongly recommend the TV series “Silicon Valley”!

Back to the topic at hand. AI is being advertised everywhere, but in many cases there is little genuine AI behind the label. The differences are immense. In practice, AI usually boils down to machine learning - which is also the foundation Sophos uses in Intercept X 2.0.

What exactly is this “machine learning”?

Wikipedia offers a concise and accurate description of machine learning:

“Machine learning is an umbrella term for the ‘artificial’ generation of knowledge from experience: an artificial system learns from examples and can generalise them once the learning phase is complete. That is, it does not simply memorise the examples, but ‘recognises’ patterns and regularities in the training data. The system can thus also assess unknown data (transfer of learning) or fail to learn unknown data (overfitting).” (translated from the German-language Wikipedia)
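The difference the definition draws between memorising examples and recognising patterns can be illustrated with a small Python sketch. Everything in it is invented for illustration: the two numeric features, the labels and the nearest-centroid learner are assumptions and have nothing to do with the models Sophos actually uses.

```python
def memorise(examples):
    """'Learning by heart': remember exact inputs and their labels."""
    table = {tuple(x): label for x, label in examples}
    return lambda x: table.get(tuple(x))  # None for anything never seen before


def nearest_centroid(examples):
    """Generalising learner: keep one average point per label."""
    sums, counts = {}, {}
    for x, label in examples:
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

    def predict(x):
        def squared_distance(centre):
            return sum((a - b) ** 2 for a, b in zip(x, centre))
        # Pick the label whose average point is closest to x.
        return min(centroids, key=lambda lab: squared_distance(centroids[lab]))

    return predict


# Invented training data: two made-up features per file, scaled to [0, 1].
training = [
    ([0.9, 0.8], "malicious"),
    ([0.8, 0.9], "malicious"),
    ([0.2, 0.1], "benign"),
    ([0.1, 0.3], "benign"),
]
lookup = memorise(training)
generalise = nearest_centroid(training)

unseen = [0.85, 0.7]  # similar to the malicious examples, but not identical
print("memorised:", lookup(unseen))
print("generalised:", generalise(unseen))
```

The lookup table has no answer for an input it has never seen, while the centroid-based learner still assigns the new point to the closest group. That is the "generalisation" the definition refers to.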

Strictly speaking, Sophos relies on deep learning, an advanced form of machine learning.

From this explanation, you can already imagine the huge impact machine learning can have on a product like Intercept X. Signature-based antivirus has not been the decisive factor in malware detection since around 2005, because it can only deal with malware that is already known and catalogued. It is a constant arms race between malware authors and signature writers. Malware authors naturally enjoy a small head start, which means their code is always unknown for a certain period. And once a new sample has been identified, even minor modifications to the code are enough to make it appear “unknown” again to signature-based antivirus.

Sophos has already developed many additional methods for detecting malware and no longer relies on signatures alone. Nonetheless, the acquisition of Invincea at the beginning of 2017 was a smart move, adding technology to the portfolio that can protect against future threats and therefore also against unknown malware.

Machine learning itself is not brand new. The algorithms have existed since the 1980s and have not fundamentally changed. What was missing until recently was big data and the required processing power. Machine learning therefore experienced a revival around 2012. The same applies to genetic algorithms, which in my view malware authors will increasingly draw on in future.

How does machine learning work in theory?

In simple terms, you feed the machine with a very large volume of data. The algorithm dissects this data and analyses the features of the files. That might be something as basic as file size, but it can also include far more complex features such as entire code components. After this process, instead of just a hash value as in signature-based detection, you have a wealth of indicators. As a result, small changes to the code are no longer sufficient to disguise it as completely new malware, because other features remain unchanged.
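The contrast between a single hash value and a set of features can be sketched in a few lines of Python. The feature set (size, byte entropy, share of printable bytes) and the toy "samples" are assumptions chosen for illustration. Real products extract far richer features, and nothing here reflects the actual Sophos implementation.

```python
import hashlib
import math
from collections import Counter


def sha256_signature(data: bytes) -> str:
    """Classic signature: one hash over the whole file."""
    return hashlib.sha256(data).hexdigest()


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte."""
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def extract_features(data: bytes) -> dict:
    """Hypothetical feature set: size, entropy and share of printable bytes."""
    printable = sum(1 for b in data if 32 <= b < 127)
    return {
        "size": len(data),
        "entropy": byte_entropy(data),
        "printable_ratio": printable / len(data),
    }


# Toy stand-ins for a known sample and a slightly modified variant.
original = b"MZ" + bytes(range(256)) * 8 + b"connect evil.example; encrypt files"
variant = original.replace(b"evil.example", b"evil.examplf")  # one byte changed

same_hash = sha256_signature(original) == sha256_signature(variant)
f_orig = extract_features(original)
f_var = extract_features(variant)

print("hashes match:", same_hash)
for name in f_orig:
    print(name, f_orig[name], f_var[name])
```

Changing a single byte produces a completely different hash, so a pure hash lookup no longer recognises the file. The extracted features, by contrast, stay identical or almost identical, which is why a feature-based approach is harder to evade with trivial modifications.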

Once the features have been extracted, you start building the so‑called “models”. For this you need a substantial amount of data. Conveniently, more than 390,000 new malware samples appear every day, which is over 16,000 per hour. Products such as Sophos Sandstorm and Intercept X, which send data to Sophos Labs, also contribute data for training the models. Malicious URLs and spam likewise provide valuable training material. You need not only malware but also benign files, so that you do not end up with false positives later on.

Multiple models are evaluated in parallel and the one that delivers the best results is selected. From the model and the features a pattern emerges that describes what malware looks like and how it differs from a benign file. These patterns then allow you to score files and calculate the probability that a file is malicious. All this happens within milliseconds and requires significantly fewer resources (CPU and RAM) than other analysis techniques. During updates, only the pattern recognition is refined, instead of downloading new signatures every few seconds as in classic signature-based detection.
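The idea of training several candidate models, keeping the best one and then scoring a file with a probability can be sketched as follows. This is a toy under stated assumptions: the two features are invented, the data is synthetic, and a small hand-written logistic regression stands in for the deep learning models the article refers to.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(rows, labels, epochs, lr):
    """Fit weights and bias by plain stochastic gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            p = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            err = p - y
            weights = [w - lr * err * v for w, v in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict_proba(model, x):
    """Probability that x is malicious according to the model."""
    weights, bias = model
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


def accuracy(model, rows, labels):
    hits = sum((predict_proba(model, x) >= 0.5) == bool(y) for x, y in zip(rows, labels))
    return hits / len(rows)


def make_sample(rng, malicious):
    """Invented features in [0, 1], e.g. normalised entropy and share of suspicious calls."""
    centre = (0.8, 0.7) if malicious else (0.4, 0.2)
    return [min(1.0, max(0.0, rng.gauss(c, 0.1))) for c in centre]


rng = random.Random(0)
labels = [i % 2 for i in range(400)]  # alternate benign (0) and malicious (1)
rows = [make_sample(rng, bool(y)) for y in labels]
train_x, train_y = rows[:300], labels[:300]
valid_x, valid_y = rows[300:], labels[300:]

# Hypothetical candidates that differ only in their training settings.
candidates = {
    "short": {"epochs": 1, "lr": 0.01},
    "medium": {"epochs": 20, "lr": 0.1},
    "long": {"epochs": 100, "lr": 0.5},
}
models, scores = {}, {}
for name, params in candidates.items():
    models[name] = train_logistic(train_x, train_y, **params)
    scores[name] = accuracy(models[name], valid_x, valid_y)  # held-out data only

best_name = max(scores, key=scores.get)
best_model = models[best_name]

unknown_file = [0.85, 0.75]  # features of a file the model has never seen
print(best_name, scores[best_name])
print("P(malicious) =", round(predict_proba(best_model, unknown_file), 3))
```

Note that the candidates are compared on validation data that was not used for training, and that scoring a new file afterwards is only a handful of multiplications. This is in line with the low resource usage at detection time described above.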

If you would like to dive a bit deeper into the subject, have a look at the technical paper from Sophos: “Sophos Machine Learning: How to Build a Better Threat Detection Model”

The PDF is in English, but can be translated into many other languages using DeepL’s machine learning engine: https://www.deepl.com/translator. The well‑known Google Translate also uses machine learning, of course, but DeepL has been trained with better data and its models are noticeably better tuned.

Machine learning alone is not enough

Machine learning can already deliver impressively high detection rates, and the advantages over signature-based detection are obvious. Sophos, however, does not rely solely on these new patterns, but uses machine learning as one more technology in its stack to achieve the most comprehensive malware detection possible.

Thanks to machine learning, Intercept X 2.0 will be even more effective at detecting ransomware and exploits, complementing other technologies such as Exploit Prevention, Malicious Traffic Detection, CryptoGuard and the Synchronized Security Heartbeat. It is precisely these additional technologies that separate the wheat from the chaff - or, put differently, a standard antivirus from a professional solution.

Is Intercept X sufficient as the only protection?

You might now be wondering whether a traditional antivirus becomes obsolete once you have Intercept X with all these sophisticated technologies and, in future, machine learning as well. If you are using the Sophos Endpoint Client, you should definitely continue to run it alongside Intercept X. The reason is that the Sophos Endpoint Client is far more than just a basic antivirus that detects malware via signatures. The Sophos Endpoint Client can, for example, provide web security, web control / category-based URL filtering, device control and application control, to name just a few. You can find a complete overview of the differences between Sophos Endpoint Protection and Intercept X in this data sheet.

For all other “classic” antivirus programmes, I honestly do not see much of a future. For now, however, there is nothing wrong with running a traditional antivirus in parallel with Intercept X.

More on machine learning

On the Sophos Labs page, you can now find some excellent real‑time statistics on daily spam and malware activity, generated from a large pool of data.

For anyone who likes real‑time data, we have collected a few links. We still find it impressive to see, time and again, how many attacks are actually happening out there. It is astonishing what goes on behind the scenes:

Norse Attack Map

This site shows cyber attacks in real time. To achieve this, 8 million sensors and more than 6,000 applications are deployed on servers in 40 countries - so‑called honeypots, i.e. virtual traps. Altogether this results in over 7 petabytes of collected attack data.

Norse maintains the world’s largest dedicated threat intelligence network. With over eight million sensors that emulate over six thousand applications - from Apple laptops, to ATM machines, to critical infrastructure systems, to closed-circuit TV cameras - the Norse Intelligence Network gathers data on who the attackers are and what they’re after. Norse delivers that data through the Norse Appliance, which pre-emptively blocks attacks and improves your overall security ROI, and the Norse Intelligence Service, which provides professional continuous threat monitoring for large networks.

FireEye - Cyber Threat Map

The FireEye Cyber Threat Map shows a daily summary of all global DDoS attacks.

Kaspersky - Cyber Map

Kaspersky’s cyberthreat real‑time map shows live attacks detected by its various source systems:

  • On‑access scanner
  • On‑demand scanner
  • Web Antivirus
  • Mail Antivirus
  • Intrusion detection system
  • Vulnerability scan
  • Kaspersky Anti‑Spam
  • Botnet activity detection

Akamai - Real‑Time Web Monitor

Akamai monitors global internet conditions around the clock. Using this real‑time data, they identify global regions with the highest volume of web attack traffic, cities with the slowest web connections (latency), and geographic areas with the highest web traffic (traffic density).

Check Point - Live Cyber Attack Threat Map

Check Point’s ThreatCloud also visualises attack data. It includes rankings of the top target countries.

Deutsche Telekom - Sicherheitstacho

The Sicherheitstacho visualises global cyber attacks on DTAG’s and its partners’ honeypot infrastructure.

Digital Attack Map

Visualised live data of global DDoS attacks. Developed jointly by Google Ideas and Arbor Networks, the tool provides anonymised attack data that allows users to explore historical trends and retrieve outage reports for a specific date.

Patrizio