Thursday, December 22, 2016

Hardening Allocators with ADI

Memory allocators play a crucial role in any modern application/operating system: satisfying arbitrary-sized dynamic memory requests. Errors by the consumer in handling such buffers can lead to a variety of vulnerabilities, which have been regularly exploited by attackers over the past 15 years. In this blog entry, we'll look at how the ADI (Application Data Integrity) feature of the new Oracle M7 SPARC processors can help in hardening allocators against most of these attacks.

A quick memory allocator primer

Writing memory allocators is a challenging task. An allocator must be performant, limit fragmentation to a bare minimum, scale up to multi-threaded applications and efficiently handle both small and large allocations and both frequent and infrequent alloc/free cycles. Looking in depth at allocator designs is beyond the scope of this blog entry, so we'll focus here on only the features that are relevant from an exploitation (and defense) perspective.

Modern operating systems deal with memory in page-sized chunks (ranging from 8K up to 16G on M7). Imagine an application that needs to store a 10-character string: handing out an 8K page is certainly doable, but it is hardly an efficient way to satisfy the request. Memory allocators solve the problem by sitting between the OS physical page allocator and the consumer (be it the kernel itself or an application) and efficiently managing arbitrary-sized allocations, dividing pages into smaller chunks when small buffers are needed.

Allocators are composed of three main entities:
  • live buffers: chunks of memory that have been assigned to the application and are guaranteed to hold at least the amount of bytes requested.
  • free buffers: chunks of memory that the allocator can use to satisfy an application request. Depending on the allocator design, these are either fixed-size buffers or buffers that can be sliced into smaller portions.
  • metadata: all the necessary information that the allocator must maintain in order to work efficiently. Again, depending on the allocator design, this information might be stored within the buffer itself (e.g. Oracle Solaris libc malloc stores most of the data along with the buffer) or separately (e.g. the Oracle Solaris umem/kmem SLAB allocator keeps the majority of the metadata in dedicated structures placed outside the objects).
Since allocators divide a page into either fixed-size or arbitrary-sized buffers, it's easy to see that, due to the natural flow of alloc/free requests, live buffers and free buffers end up living side by side in the linear memory space.
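
To make these entities concrete, here is a minimal sketch in C of how an object-based allocator might carve a page into fixed-size buffers while keeping the metadata outside the objects themselves. All names and sizes are illustrative and not taken from any real allocator.

    /* Minimal sketch: one 8K page carved into fixed-size objects, with the
     * metadata (the free/live state of each object) kept outside the page. */
    #include <stddef.h>
    #include <stdint.h>

    #define PAGE_SIZE      8192
    #define OBJ_SIZE       64                      /* fixed-size chunks */
    #define OBJS_PER_PAGE  (PAGE_SIZE / OBJ_SIZE)

    struct slab_meta {                      /* metadata, stored separately */
        void    *page;                      /* the backing page            */
        uint8_t  in_use[OBJS_PER_PAGE];     /* 1 = live buffer, 0 = free   */
        size_t   nfree;
    };

    /* Hand out the first free object: it becomes a live buffer. */
    static void *slab_alloc(struct slab_meta *m)
    {
        for (size_t i = 0; i < OBJS_PER_PAGE; i++) {
            if (!m->in_use[i]) {
                m->in_use[i] = 1;
                m->nfree--;
                return (char *)m->page + i * OBJ_SIZE;
            }
        }
        return NULL;                        /* slab exhausted */
    }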

The period from when a buffer is handed out to an application until it is freed is generally referred to as the buffer lifetime. During this period, the application has full control of the buffer contents. After this period, the allocator regains control and the application is expected to not interfere.

Of course, bugs happen. Bugs can affect either the application's working set of buffers or the allocator's free set and metadata. If we exclude allocator intrinsic design errors (which, for long-existing allocators, due to the amount of exercise they get, are basically zero), bugs always originate from the application's mishandling of a buffer reference, so they always happen during the lifetime of a buffer and originate from a live buffer. It's no surprise that live buffer behavior is what both attackers and defenders start from.

Exploitation vectors and techniques

As we said, bugs originate from the application mishandling of allocated buffers:
  • mishandling of buffer size: the classic case of buffer overflow. The application writes past the buffer boundaries into adjacent memory. Because buffers are intermixed with other live buffers, free buffers and, potentially, metadata, each one of those entities becomes a potential target and attackers will go for the most reliable one.
  • mishandling of buffer references: a buffer is relinquished back to the allocator, but the attacker still holds a reference to it. Traditionally, these attacks are known as use after free (UAF), although, since this is an industry that loves taxonomies, it's not uncommon to see them further qualified as use after realloc (the buffer is reallocated, but the attacker is capable of unexpectedly modifying it through the stale reference) and double free (the same reference is passed twice to the free path). Sometimes an attacker is also capable of constructing a fake object and passing it to a free call, for example if the application erroneously calls free() on a buffer allocated on the stack. The degree of exploitability of these vulnerabilities (if we exclude the use after realloc case, which is application-specific) varies depending on what the allocator does during the free path and how many consistency/hardening checks are present. (Both classes are sketched in the snippet below.)
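
For reference, this is what the two bug classes above look like in their simplest form (deliberately broken, illustrative C; not taken from any real application):

    #include <stdlib.h>
    #include <string.h>

    void bug_examples(void)
    {
        /* 1. Mishandling of buffer size: a classic linear overflow. The write
         *    spills into whatever sits next to the buffer in memory: another
         *    live buffer, a free buffer or allocator metadata. */
        char *buf = malloc(16);
        memset(buf, 'A', 32);       /* writes 16 bytes past the boundary */

        /* 2. Mishandling of buffer references: use after free / double free. */
        char *p = malloc(64);
        free(p);
        p[0] = 'X';                 /* stale reference: use after free */
        free(p);                    /* same reference freed twice      */
    }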
With the notable exception of double free and "vanilla" use after free, both of the above classes are extremely hard to detect at runtime from an allocator perspective, as they originate (and potentially inflict all the necessary damage) during the object lifetime, when the allocator has little to no practical control over the buffer. For this reason, the defense focus has been on the next best thing when bug classes cannot be eradicated: hamper/mitigate exploitation techniques. Over the years (and at various degrees in different allocators) this has taken the form of:
  • entrypoint checks: add consistency checks at the defined free and alloc entry points. As an example, an allocator could record in the buffer's associated metadata (or in the buffer itself, by poisoning it) that the buffer has been freed. It would then be able to check this information whenever the free path is entered, and a double free could be easily detected (a minimal sketch of this idea follows the list). Many of the early techniques to exploit heap overflows (~2000, w00w00, PHRACK57 MaXX's and anonymous' articles) relied on modifying metadata that would then be leveraged during the free path. Over time, some allocators have added checks to detect some of those techniques.
  • design mitigations: attackers crave control of the heap layout: in what sequence buffers are allocated, where they are placed, how a buffer containing sensitive data can be conveniently allocated at a specific location. Allocators can introduce statistical mitigations to hamper some of the techniques used to achieve this level of control. As an example, free object selection can be randomized (although this ends up being pretty ineffective against a variety of heap spraying techniques and/or if the attacker has quite some control over the allocation pattern), free patterns can be modified (Microsoft IE Memory Protector) or sensitive objects can be allocated from a different heap space (dedicated SLAB caches, Windows IE Isolated Heap, Chrome PartitionAlloc, etc.). The bottom-line goal of these (and other) design approaches is to either reduce the predictability of the allocator or increase the amount of precise control that the attacker needs in order to create the heap layout conditions to successfully exploit the bug.
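
As a concrete illustration of the entrypoint-check idea from the first bullet, here is a minimal poison-on-free sketch with a double-free check. The poison value and names are made up for illustration; a real allocator owns the freed buffer, so inspecting it on the free path is safe there.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define FREE_POISON 0xdeadbeefcafef00dULL

    static void checked_free(void *buf, size_t size)
    {
        /* Entrypoint check: a buffer whose first word already carries the
         * poison pattern has (very likely) been freed before. */
        if (size >= sizeof(uint64_t) && *(uint64_t *)buf == FREE_POISON) {
            fprintf(stderr, "possible double free of %p\n", buf);
            abort();
        }
        /* Poison the buffer so a later double free is detectable. */
        if (size >= sizeof(uint64_t))
            *(uint64_t *)buf = FREE_POISON;
        free(buf);
    }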
Of course, more invasive defenses also exist, but they hardly qualify for large-scale deployment, as users tend to (rightfully) be pretty concerned about the performance of their applications/operating systems. This becomes even more evident when we compare the amount of defenses that are today enabled and deployed at kernel level versus the amount of defenses enabled at user level (and in browsers): different components have different (and varying) performance requirements.

The practical result is that slab overflows are today probably the most reliable type of vulnerability at kernel level, with use after free a close second in kernel land and extensively targeted in user land, where only browsers are significantly more hardened than other components. Extensive work is going on towards automating and abstracting the development of exploits for such bugs (as recently presented by argp at Zeronights), which makes the design of efficient defenses even more compelling.

ADI to the rescue

Enter the Oracle SPARC M7 processor and ADI, Application Data Integrity, which were both unveiled at Hot Chips and Oracle OpenWorld 2014. At its core, ADI provides memory tagging. Whenever ADI is enabled on a page entry, dedicated non-privileged load/store instructions provide the ability to assign a 4-bit version to each 64-byte cache line that is part of the page. This version is maintained by the hardware throughout the entire non-persistent memory hierarchy (basically, all the way down to DRAM and back).

The same version can then be mirrored onto the (previously) unused 4 topmost bits of each virtual address. Once this is done, each time a pointer is used to access a memory range, if ADI is enabled (both at the page and per-thread level), the tag stored in the pointer is checked by the hardware against the tag stored in the cache line. If the two match, all is peachy. If they don't, an exception is raised.
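
On the pointer side, mirroring the version into the address is plain bit manipulation. The sketch below assumes the tag occupies the top 4 bits of a 64-bit virtual address, as described above; the memory-side versioning is done with the dedicated ADI store instructions and is not shown here.

    #include <stdint.h>

    #define ADI_TAG_SHIFT 60            /* top 4 bits of the 64-bit VA */
    #define ADI_TAG_MASK  0xfULL

    /* Return `p` with the given 4-bit version mirrored into its top bits. */
    static inline void *adi_tag_pointer(void *p, uint8_t version)
    {
        uintptr_t va = (uintptr_t)p & ~(ADI_TAG_MASK << ADI_TAG_SHIFT);
        return (void *)(va | ((uintptr_t)(version & ADI_TAG_MASK) << ADI_TAG_SHIFT));
    }

    /* Extract the 4-bit version carried by a tagged pointer. */
    static inline uint8_t adi_pointer_tag(const void *p)
    {
        return (uint8_t)(((uintptr_t)p >> ADI_TAG_SHIFT) & ADI_TAG_MASK);
    }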

Since the check is done in hardware, the main burden is at buffer creation, rather than at each access, which means that ADI can be introduced in a memory allocator and its benefit extended to any application consuming it, without the need for extra instrumentation or special instructions in the application itself. This is a significant difference from other hardware-based memory corruption detection options, like Intel MPX, and minimizes the performance impact of ADI while maximizing coverage. More importantly, this means we finally have a reliable way to detect live object mishandling: the hardware does it for us.
[ADI versioning at work. Picture taken from Oracle SPARC M7 presentation]

4 bits allow for 16 possible values. There are two intuitive ways in which an ADI-aware allocator can invariantly detect a linear overflow from a buffer/object into the adjacent one:
  • introduce a redzone with a special tag
  • tag differently any two adjacent buffers
Introducing a redzone means wasting 64 bytes per allocation, since 64 bytes is the minimum granularity with ADI. Wasted memory scales up linearly with the number of allocations and might end up being a substantial amount. Also, the redzone entry must be 64-byte aligned as well, which practically translates into both the buffers and the redzone being 64-byte aligned. The advantage of this approach is that it is fairly simple to implement: simply round up every allocation to 64 bytes and add an extra complementary 64-byte buffer, as sketched below. For this reason, it can be a good candidate for debugging scenarios or for applications that are not particularly performance sensitive and need a simple allocation strategy. For allocators that store metadata within the buffer itself, this redzone space could be used to store the metadata information. Mileage again varies depending on how big the metadata is, and it's worth pointing out that general purpose allocators usually strive to keep it small (e.g. Oracle Solaris libc uses 16 bytes for each allocated buffer) to reduce memory wastage.
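
A minimal sketch of the accounting this approach implies, with illustrative names; the actual per-line versioning would be done with the platform's ADI primitives and is not shown:

    #include <stddef.h>

    #define ADI_LINE        64
    #define REDZONE_VERSION 0xE     /* tag reserved for redzones (arbitrary pick) */

    /* Bytes actually consumed by a request of `size` bytes: the request is
     * rounded up to the 64-byte line and one extra line is spent on the
     * redzone, which gets REDZONE_VERSION while the buffer gets its own tag. */
    static size_t redzone_footprint(size_t size)
    {
        size_t rounded = (size + ADI_LINE - 1) & ~(size_t)(ADI_LINE - 1);
        return rounded + ADI_LINE;
    }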

Tagging two adjacent objects differently, instead, has the advantage of reducing memory wastage. In fact, the only induced wastage is the one from forcing the alignment to a 64-byte boundary. It does require, though, the ability to uniquely pick a correct tag value at allocation time. Object-based allocators are a particularly good fit for this design, because they already take some of the penalty for wasted memory (and larger caches are usually already 64-byte aligned) and their design (fixed-size caches divided into a constant number of fixed-size objects) makes it possible to uniquely identify objects based on their address. This provides the ability to alternate between two different values (or ranges of values, e.g. odd/even tags, smaller/bigger than a median) based on the object position. For other allocators, the ability to properly tag the buffer depends on whether there is enough metadata to learn the previous and next object tags. If there is, this can still be implemented; if there isn't, one might decide to employ a statistical defense by randomizing the tag (note that the same point applies also to object-based allocators when we look at large caches, where effectively only a single object is present per cache).
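
A sketch of how an object-based allocator could derive the tag from the object's position inside its cache, so that any two adjacent objects are guaranteed to carry different versions (tag values and names are arbitrary):

    #include <stdint.h>
    #include <stddef.h>

    #define TAG_EVEN 0x5
    #define TAG_ODD  0xA

    /* Pick the version for an object from its index inside the cache:
     * neighbors always end up with different tags. */
    static uint8_t pick_tag(void *cache_base, void *obj, size_t obj_size)
    {
        size_t index = ((uintptr_t)obj - (uintptr_t)cache_base) / obj_size;
        return (index & 1) ? TAG_ODD : TAG_EVEN;
    }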

A third interesting property of tagging is that it can be used to uniquely identify classes of objects, for example free objects. As we discussed previously, metadata and free objects are never the affector, but only the affectee, of an attack, so one tag each suffices. The nice side effect of devoting a tag to each is that the allocator now has a fairly performant way to identify them, and issues like double frees can be easily detected. In the same way, it's also automatically guaranteed that a live object will never be able to overflow into metadata or free objects, even if a statistical defense (e.g. tag randomization) is employed.
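
Wiring the reserved tags into the free path could look like the sketch below; line_version() is a hypothetical stand-in for the ADI load-version primitive:

    #include <stdint.h>
    #include <stdlib.h>

    #define TAG_FREE     0x1        /* reserved: all free objects   */
    #define TAG_METADATA 0x2        /* reserved: allocator metadata */

    /* Hypothetical stand-in: a real implementation would use the dedicated
     * load instruction to read the version of the line backing `p`. */
    static uint8_t line_version(const void *p)
    {
        (void)p;
        return 0;                   /* placeholder */
    }

    static void tagged_free(void *obj)
    {
        if (line_version(obj) == TAG_FREE)
            abort();    /* object already carries the free tag: double free */
        /* ...retag the object's lines with TAG_FREE and return it to the cache... */
    }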

Use-after-realloc and arbitrary writes

ADI does one thing and does it well: it provides support for implementing an invariant that detects linear overflows. Certainly, this doesn't come without some constraints (64-byte granularity, 64-byte alignment, page-level granularity to enable it, 4-bit versioning range) and it might be a more or less good fit (performance and design-wise) for an existing allocator, but this doesn't detract from its security potential. Heartbleed is just one example of a linear out-of-bound access, and SLAB/heap overflow fixes have been in the commit logs of all major operating systems for years now. Invariantly detecting them is a significant win.

Use-after-realloc and arbitrary writes, instead, can't be invariantly stopped by ADI, although ADI can help in mitigating them. As we discussed, use-after-realloc relies on the attacker's ability to hold a reference to a freed-and-then-reallocated object and then use this reference to modify some potentially sensitive content. ADI can introduce some statistical noise into this exploitation path by looping/randomizing through different values for the same buffer/object, as sketched below. Note that this doesn't affect the invariant portion of, for example, alternate tagging in object-based allocators; it simply takes further advantage of the versioning space. Of course, if the attacker is in a position to perform a bruteforce attack, this mitigation would not hold much ground, but in certain scenarios, bruteforcing might be a limiting factor (kernel level exploitation) or leave some detectable noise.
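
A sketch of this idea, preserving the odd/even adjacency split discussed earlier (all values illustrative): each time a slot is reallocated, the allocator moves to the next version in the slot's pool, so a stale reference kept by the attacker is unlikely to carry the current tag.

    #include <stdint.h>
    #include <stddef.h>

    /* Versions usable for live objects: even slots draw from one pool, odd
     * slots from the other, so adjacent objects still never share a tag. */
    static const uint8_t even_tags[] = { 0x4, 0x6, 0x8, 0xA };
    static const uint8_t odd_tags[]  = { 0x5, 0x7, 0x9, 0xB };

    /* Pick the version for this allocation cycle; a random pick would work
     * as well, a per-slot generation counter is enough to show the idea. */
    static uint8_t next_tag(size_t slot_index, unsigned alloc_generation)
    {
        const uint8_t *pool = (slot_index & 1) ? odd_tags : even_tags;
        return pool[alloc_generation % 4];
    }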

Arbitrary writes, instead, depend on the ability of the attacker to forge an address and are not strictly related to allocator ranges only. Since the focus here is the allocator, the most interesting variant is when the attacker has the ability to write at an arbitrary offset from the current buffer. If metadata and free objects are specially tagged, they are unreachable, but other live objects with the same tag might be reached. Just as in the use-after-realloc case, adding some randomization to the sequence of tags can help, with the very same limitations. In both cases, infoleaks would precisely guide the attacker, but this is basically a given for pretty much any statistical defense.

TL;DR

Oracle SPARC M7 processors come with ADI, Application Data Integrity, a feature that provides memory tagging. Memory allocators can take advantage of it both for debugging and for security, in order to invariantly detect linear buffer overflows and statistically mitigate use-after-free and offset-based arbitrary writes.
