Using systematic code reuse analysis to create robust YARA rules
10-17, 16:30–18:30 (Europe/Luxembourg), Vianden&Wiltz

YARA is a commonly used tool to detect and identify malware. There are roughly two types of YARA rules used on binary files: 1) based on metadata and strings and 2) based on code.
There are certain benefits to basing YARA rules on code. Since code reuse is frequent among binaries of a malware family, it offers plenty of options to base a YARA rule on. If the chosen code is heavily reused among the binaries, it can result in very robust rules.
This approach comes with certain challenges. A key aspect is being able to find heavily reused code among many binaries of a malware family. Unless some sort of automation is at play, this quickly becomes difficult and time-consuming. Once suitable reused code is identified, it needs to be turned into a YARA rule that still works when compiler differences, optimizations or instruction set changes are involved.
In this workshop we will create robust YARA rules for a handful of malware families based on automatically identifying shared code between many binaries of a family.


Required prior knowledge

This workshop is tailored to cybersecurity practitioners who either actively create
malware detection and identification rules with YARA or intend to start doing so.
Participants must have:

  • Basic understanding of YARA rules
  • Basic knowledge of static binary analysis with disassemblers
  • Basic knowledge of the x86/x64 instruction set

Required system setup

  • Recommended OS: Ubuntu
  • CPU architecture: Intel 32/64-bit (ARM not supported)
  • Minimum of 8 GB RAM, 16 GB recommended
  • Minimum of 100 GB free disk space, 150 GB recommended

Background

YARA is a commonly used tool to detect and identify malware. There are roughly two
types of YARA rules used on binary files: 1) based on metadata and strings and 2)
based on code / instruction sequences.
There are benefits to basing YARA rules on code. Since code reuse is frequent
among binaries of a malware family, it offers plenty of options to base a YARA rule
on. If the chosen code is stable across multiple variants of a malware family, it can
result in very robust rules.
This approach comes with certain challenges. A key aspect is being able to find
stable / heavily reused code among many binaries of a malware family. Unless some
sort of automation is at play, this quickly becomes difficult and time-consuming.
Once suitable reused code is identified, it needs to be turned into a YARA rule that
still works when compiler differences, optimizations or instruction set changes are
involved.
Addressing these challenges and adding some automation along the way enables
the creation of robust YARA rules with less manual effort.
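
To illustrate why a code-based pattern has to tolerate build differences, here is a minimal Python sketch. It assumes two hypothetical builds of the same short function that differ only in the relative target of a call instruction (E8 followed by a 4-byte offset). The helper hex_pattern_to_regex and the sample bytes are made up for illustration; the helper only mimics the byte and "??" wildcard tokens of a YARA hex string and is not a replacement for YARA itself.

```python
import re


def hex_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a space-separated hex pattern with '??' wildcards into a bytes regex."""
    parts = []
    for token in pattern.split():
        if token == "??":
            parts.append(b".")  # any single byte
        else:
            parts.append(re.escape(bytes.fromhex(token)))
    # DOTALL so that '.' also matches the byte 0x0A
    return re.compile(b"".join(parts), re.DOTALL)


# Two hypothetical builds: push ebp; mov ebp, esp; call <rel32>; ret
build_a = bytes.fromhex("55 8B EC E8 10 20 30 40 C3")
build_b = bytes.fromhex("55 8B EC E8 AA BB CC 0D C3")

exact = hex_pattern_to_regex("55 8B EC E8 10 20 30 40 C3")
masked = hex_pattern_to_regex("55 8B EC E8 ?? ?? ?? ?? C3")

exact_hits = [bool(exact.search(b)) for b in (build_a, build_b)]
masked_hits = [bool(masked.search(b)) for b in (build_a, build_b)]
```

In this sketch the exact pattern only hits the first build, while the pattern with the call offset masked out hits both.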

Workshop content

The goal of this workshop is to create robust YARA rules for a handful of malware
families based on automatically identifying shared code between many binaries of a
family.

The approach includes the following parts:

  • Study a set of good and bad examples of existing YARA rules to provide
    some background.
  • Pre-process a set of malware binaries, as well as goodware binaries, to make
    their code searchable at function granularity.
  • Automatically identify which functions are frequently reused within a malware
    family.
  • Exclude functions that belong to compilers, libraries or other malware
    families to avoid false positives.
  • Extract instruction sequences from the set of reused functions to create
    YARA rules with.
  • Vet the new rules against the corpus of binaries to check for false
    positives and adjust the rule creation accordingly.
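
The identification and exclusion steps above can be sketched roughly as follows. This is a minimal illustration and not the tooling used in the workshop: it assumes functions have already been extracted as byte strings, and it substitutes an exact SHA-256 hash for a real function similarity algorithm, which would also have to tolerate compiler and optimization differences. The names function_hash, reused_functions and min_share are hypothetical.

```python
import hashlib
from collections import Counter


def function_hash(code: bytes) -> str:
    """Stand-in for a function similarity key: exact hash of the function bytes."""
    return hashlib.sha256(code).hexdigest()


def reused_functions(family_binaries, excluded_binaries, min_share=0.8):
    """Return {hash: count} for functions found in at least min_share of the
    family binaries and in none of the excluded binaries.

    Both arguments map a binary name to a list of function byte strings.
    excluded_binaries holds goodware, library code and other malware families.
    """
    excluded = {
        function_hash(func)
        for funcs in excluded_binaries.values()
        for func in funcs
    }

    counts = Counter()
    for funcs in family_binaries.values():
        # Use a set so a function is counted once per binary
        counts.update({function_hash(func) for func in funcs})

    threshold = min_share * len(family_binaries)
    return {
        h: n for h, n in counts.items()
        if n >= threshold and h not in excluded
    }


# Toy data: b"\x01" is family-specific, b"\x90" also occurs in goodware
family = {
    "sample_a": [b"\x01", b"\x02", b"\x90"],
    "sample_b": [b"\x01", b"\x90", b"\x03"],
    "sample_c": [b"\x01", b"\x90"],
}
goodware = {"runtime_lib": [b"\x90"]}
candidates = reused_functions(family, goodware, min_share=0.8)
```

With the toy data, only the function present in all three samples and absent from the goodware set remains as a candidate for rule creation.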

We will look at the following real-world challenges:

  • All binaries share library code from the compiler or 3rd party libraries. This
    code is not useful for malware identification and will need to be filtered out
    during the process.
  • A YARA rule has to be generated reliably from a set of instruction sequences.
  • Choices have to be made on how many and which instructions of a function,
    and how many functions in total, to consider when building a YARA rule. A
    good balance has to be found.
  • The quality of the function similarity algorithm is crucial in finding the right
    matches, especially since compiler versions, compiler optimization flags and
    instruction set differences have to be considered.
  • The quality of the disassembler in detecting functions and their content
    strongly influences the quality of results.
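
As a rough illustration of generating a rule from instruction sequences, the following Python sketch renders sequences as YARA hex strings and masks bytes that are likely to change between builds. It is a sketch under stated assumptions, not a definitive implementation: the Instruction class, its volatile field (byte positions holding addresses or offsets), and the function names are hypothetical, and a real pipeline would derive this information from a disassembler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    raw: bytes            # encoded instruction bytes
    volatile: tuple = ()  # indices in raw that hold addresses or offsets


def to_hex_pattern(instructions) -> str:
    """Render instructions as YARA hex tokens, masking volatile bytes with '??'."""
    tokens = []
    for ins in instructions:
        for index, byte in enumerate(ins.raw):
            tokens.append("??" if index in ins.volatile else f"{byte:02X}")
    return " ".join(tokens)


def build_rule(name: str, sequences, min_hits: int) -> str:
    """Build the text of a YARA rule that requires min_hits of the sequences."""
    lines = [f"rule {name}", "{", "    strings:"]
    for number, sequence in enumerate(sequences):
        lines.append(f"        $seq_{number} = {{ {to_hex_pattern(sequence)} }}")
    lines += ["    condition:", f"        {min_hits} of them", "}"]
    return "\n".join(lines)


# Hypothetical sequence: push ebp; mov ebp, esp; call <rel32>; ret
sequence = [
    Instruction(b"\x55"),
    Instruction(b"\x8b\xec"),
    Instruction(b"\xe8\x10\x20\x30\x40", volatile=(1, 2, 3, 4)),
    Instruction(b"\xc3"),
]
rule_text = build_rule("family_x_code", [sequence], min_hits=1)
```

The min_hits parameter reflects the balance mentioned above: requiring more sequences to match reduces false positives but makes the rule more sensitive to changes in individual functions.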

During the workshop, we will be exclusively using open source tools and a set of
publicly available binaries in unpacked form.

The takeaways for the participants of this workshop are:

  • Understanding the differences between good and bad YARA rules, be it
    based on code or based on strings/metadata.
  • Understanding the code reuse approach to YARA rule writing, with its
    benefits and challenges.
  • Understanding the tooling required to identify code reuse over many
    binaries.
  • Understanding how to apply this process to real-world malware.

Jonas Wagner is the founder and CTO of Threatray and has built the technological foundation of its code search engine based on years of research and development. He holds a Master's degree in Cybersecurity from the Bern University of Applied Sciences. He has previously spoken at Botconf, FIRST CTI, BSides Zürich, DFRWS and many private events.

Carlos Rubio Ricote is a malware researcher at Threatray, where he is mainly responsible for reverse engineering malware to automate the detection process of new threats. He also researches new applications for code reuse technology in areas such as threat hunting, incident response and tracking the evolution of malware families. He previously worked on reverse-engineering malware at Blueliv, S21sec Counter Threat Intelligence Unit and in the Panda Security Adaptive Defense team. He has previously spoken at Botconf (2022, 2019), BSides Zürich 2022, Virus Bulletin localhost 2020, as well as many closed-door private conferences.