Barely grasping the small picture: ADI vs ROP

A couple of entries ago, I covered how we planned to use ADI to protect against heap attacks. If you've been following the stream of patches for the Solaris userland gate, you may have noticed this commit a few months ago. It added the necessary macros to the userland gate to enable ADIHEAP and ADISTACK, two new security defenses that will show up in the upcoming release of 11.4.

The ADIHEAP extension allows system libraries (e.g. libc:malloc()) to be extended with ADI-based protections, while still retaining the ability to control whether the extension is enabled through binary tagging and sxadm exec (both covered in this previous blog entry). A more detailed discussion of the implementation for specific libraries is deferred to the release of 11.4.

The ADISTACK extension, instead, uses ADI transparently for any arbitrary process/binary to mitigate Return Oriented Programming (ROP) attacks. While I'm deferring a detailed analysis of this one to the release of 11.4 as well, Steve Sistare (one of the folks behind our ADISTACK implementation, along with Anthony Yznaga, myself and Steve Chessin) recently commented on the stack defense on the sparclinux mailing list, so I figured it might be a good time to recap it a bit in a blog entry.

Lookin' Through The (Register) Windows

ROP attacks, as the name suggests, target the return address of a procedure and build a sequence of entry points to gadgets (brief instruction sequences in the process address space that conclude with a return instruction) that compose the attacker payload. ROP attacks are popular today and are the direct consequence of the defensive side taking away the ability to store an attacker-controlled payload and jump to it, through what is traditionally known as W^X/DEP (ensure that no mapping exists in the process address space that is simultaneously writable and executable).

ROP attacks on the SPARC architecture were thoroughly covered back in 2008 by the "When Good Instructions Go Bad: Generalizing Return-Oriented Programming to RISC" paper. In this entry, I'm just going to briefly summarize the characteristics of the SPARC architecture that make both the attacks and the respective mitigation possible.

The whole function call/return model on SPARC is predicated on the concept of register windows. A register window consists of 24 directly accessible registers, divided into three parts: 8 local (private to the function), 8 in and 8 out registers. In and out registers are the heart of parameter passing and value returning across functions. In fact, register windows overlap: when function A calls function B, the out registers of function A become the in registers of function B, which also gets a fresh set of local and out registers. Later, when B returns, it can use its in registers to make the return values pop up in A's register window. Some registers have special meanings: %i6 contains the frame pointer (%fp) and %i7 contains the return address. This document, although centered on the V8 architecture (remember to double the size of registers), nicely recaps the whole register window design and operation, and so does the aforementioned paper.
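The overlap is easier to see in a toy model. The sketch below is just an illustration in Python, not real hardware state: the WindowFile class, its method names and the number of windows are all made up for this entry, and the direction in which the window pointer moves is arbitrary here.

```python
NWINDOWS = 8  # hypothetical number of register windows


class WindowFile:
    """Toy register file: each window owns 8 locals and 8 outs.

    The in registers are not stored separately: they alias the out
    registers of the caller's window, which is the whole point.
    """

    def __init__(self, nwindows=NWINDOWS):
        self.nwindows = nwindows
        self.local_regs = [[0] * 8 for _ in range(nwindows)]
        self.out_regs = [[0] * 8 for _ in range(nwindows)]
        self.cwp = 0  # current window pointer

    def outs(self):
        return self.out_regs[self.cwp]

    def locals_(self):
        return self.local_regs[self.cwp]

    def ins(self):
        # The caller's outs are the callee's ins.
        return self.out_regs[(self.cwp - 1) % self.nwindows]

    def save(self):
        # Function prologue: shift to a fresh window.
        self.cwp = (self.cwp + 1) % self.nwindows

    def restore(self):
        # Function epilogue: shift back to the caller's window.
        self.cwp = (self.cwp - 1) % self.nwindows


rf = WindowFile()
rf.outs()[0] = 41          # A places an argument in %o0
rf.save()                  # B's prologue
argument = rf.ins()[0]     # B sees it as %i0
rf.ins()[0] = argument + 1 # B leaves the return value in %i0
rf.restore()               # B's epilogue
result = rf.outs()[0]      # A finds the return value in %o0
```

Note that nothing is copied on call or return: only the window pointer moves.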

The number of physical registers cannot grow indefinitely, so there is an upper limit to the number of register windows. This means that an application can find itself about to do a call, but with no register window to shift to. This situation will generate a SPILL trap, transferring control to the operating system and asking it to please save the existing register window somewhere and free up a new one for the process. This "somewhere" is, of course, the stack (partially defeating the original purpose of register windows), where enough space was conveniently set aside by the save instruction in the function prologue. If we ignore for a moment local variables and any other use of the stack (and just concentrate on register windows), it's easy to realize that the save instruction needs to reserve at least 16 * register_size (8 in and 8 local) bytes which, on SPARC V9 (register_size == 8 bytes), puts us conveniently at 128 bytes (two cache lines). The sister condition to a SPILL trap happens when the program tries to restore back to a previous register window, but that window is currently invalid. In this case, the OS receives a FILL trap and needs to recover the saved registers from the stack and "provide" them back to the caller.
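To make the arithmetic and the trap sequence concrete, here is a minimal Python sketch. It is a simulation under stated assumptions, not how the trap handlers are written: the Cpu class, the number of hardware windows and the 64-byte cache line constant are assumptions of mine (the latter simply matches the "two cache lines" above).

```python
REGISTER_SIZE = 8  # bytes per register on SPARC V9
SAVED_REGS = 16    # 8 local + 8 in
CACHE_LINE = 64    # bytes, assumed

SAVE_AREA = SAVED_REGS * REGISTER_SIZE  # bytes reserved per frame


class Cpu:
    """Toy CPU with a fixed number of register windows."""

    def __init__(self, nwindows):
        self.nwindows = nwindows
        self.resident = []  # windows held in registers, oldest first
        self.stack = []     # windows spilled to their stack save areas
        self.spills = 0
        self.fills = 0

    def call(self, frame):
        if len(self.resident) == self.nwindows:
            # SPILL trap: the OS writes the oldest window to the stack.
            self.stack.append(self.resident.pop(0))
            self.spills += 1
        self.resident.append(frame)

    def ret(self):
        frame = self.resident.pop()
        if not self.resident and self.stack:
            # FILL trap: the caller's window is invalid, so the OS
            # reloads it from the stack.
            self.resident.append(self.stack.pop())
            self.fills += 1
        return frame


cpu = Cpu(nwindows=4)
for depth in range(6):   # six nested calls on a four-window machine
    cpu.call(depth)
returned = [cpu.ret() for _ in range(6)]
```

The program itself only ever calls and returns: the spills and fills are counted behind its back, which is what "transparently" means in practice.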

The whole stream of FILL and SPILL traps happens transparently to the application.

Leveraging SPILL/FILL traps with ADI

SPILL and FILL traps provide something pretty unique in the OS/arch landscape: the Operating System has a chance to check the saved register values right at the prologue and right after the epilogue of a function. This possibility hasn't really gone unnoticed in the past and back in 2002 a stack protection known as StackGhost was presented for OpenBSD. It proposed two different approaches to more or less minimize the performance impact and more or less effectively defeat stack smashing attacks:

obfuscate the return address: the OS encrypts the return address on SPILL and decrypts it on FILL. XOR is the obvious choice for speed and simplicity, but has some significant limitations (a sufficiently large infoleak allows recovering the process cookie, and it's easily subject to perturbation of the lower bits), while a more sophisticated algorithm may require more instructions (despite the speed improvements brought by the crypto instructions since 2002). In either case, this approach directly affects userland: debuggers and the like need to be extended in order to understand the 'encrypted' return address (an extra cost that is better avoided for the sake of adoption). The frame pointer and the other register contents are not protected at all.
create a shadow stack: the kernel keeps an in-kernel shadow stack and updates it at every SPILL/FILL event. This is a much more robust defense and one that has had a number of unfinished attempts (internally) for Solaris as well. There are two tricky parts with it. The first one is the performance impact: SPILL/FILL traps are frequent and hence hugely optimized, so adding extra code to copy things to a shadow stack can impose a significant penalty. The second one is related to the well-known CFI nightmares of longjmp(), setjmp() and makecontext(), which can create non-linear modifications to the control flow and require further complexity to clean the shadow stack properly. On top of this, the memory management of the shadow stack space can add further food for thought, especially if one wants to protect all the saved registers.
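The two XOR limitations mentioned in the first approach can be shown in a few lines. This is only a sketch of the general idea, not StackGhost's code: the helper names and the addresses are made up, and a single 64-bit per-process cookie is assumed.

```python
import secrets

MASK64 = (1 << 64) - 1


def spill_encode(ret_addr, cookie):
    """What the trap handler would store on SPILL (hypothetical helper)."""
    return (ret_addr ^ cookie) & MASK64


def fill_decode(stored, cookie):
    """What the trap handler would restore on FILL (hypothetical helper)."""
    return (stored ^ cookie) & MASK64


def recover_cookie(leaked_stored, known_ret_addr):
    """Infoleak: one stored value plus a guessable address reveals the cookie."""
    return leaked_stored ^ known_ret_addr


cookie = secrets.randbits(64)      # per-process secret
ret_addr = 0x0000_0001_0000_4A38   # made-up return address
stored = spill_encode(ret_addr, cookie)

# Round trip: FILL gives back what SPILL protected.
assert fill_decode(stored, cookie) == ret_addr

# With the recovered cookie, any return address can be forged.
gadget = 0x0000_0001_0000_9F10     # made-up gadget address
forged = gadget ^ recover_cookie(stored, ret_addr)
assert fill_decode(forged, cookie) == gadget

# Flipping low bits of the stored value flips the same bits after
# decoding, so nearby addresses are reachable without the cookie.
assert fill_decode(stored ^ 0x8, cookie) == ret_addr ^ 0x8
```

The last assertion is the "perturbation of the lower bits" problem: XOR is malleable, so the attacker doesn't even need the infoleak to redirect the return to a nearby instruction.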

Enter ADI. Once again, it acts as a game changer, since it can defer to the architecture the thankless job of evaluating memory accesses. In particular, the kernel can tag the register save area on the stack with a dedicated (randomized) color on SPILL and clear it on FILL. Any attempt to overflow from an existing buffer on the stack into the save area during program execution is then automatically caught and stopped, leading to a SIGSEGV. Similarly, attempts to infoleak the save area contents are detected and stopped as well.
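A small simulation may help picture the behavior. Everything below is a model built for this entry, not the kernel implementation: the TaggedStack class, the VersionMismatch exception (standing in for the SIGSEGV), the 64-byte tagging granularity and the range of colors are all assumptions.

```python
import random

LINE = 64        # tagging granularity, assumed to be one cache line
SAVE_AREA = 128  # bytes of saved registers per frame


class VersionMismatch(Exception):
    """Stand-in for the SIGSEGV raised on a version mismatch."""


class TaggedStack:
    def __init__(self, size):
        self.data = bytearray(size)
        self.tags = [0] * (size // LINE)  # one color per line, 0 = untagged

    def _set_version(self, start, length, version):
        for line in range(start // LINE, (start + length) // LINE):
            self.tags[line] = version

    def spill(self, save_area, rng):
        # On SPILL the kernel colors the save area (range is hypothetical).
        version = rng.randrange(1, 15)
        self._set_version(save_area, SAVE_AREA, version)
        return version

    def fill(self, save_area):
        # On FILL the color is cleared again.
        self._set_version(save_area, SAVE_AREA, 0)

    def store(self, addr, value, ptr_version=0):
        # The hardware compares the pointer's color with the memory's.
        if self.tags[addr // LINE] != ptr_version:
            raise VersionMismatch(hex(addr))
        self.data[addr] = value


def overflow(stack, buf_start, nbytes):
    """Linear overflow through an ordinary, uncolored pointer."""
    written = 0
    try:
        for addr in range(buf_start, buf_start + nbytes):
            stack.store(addr, 0x41)
            written += 1
    except VersionMismatch:
        pass
    return written


stack = TaggedStack(512)
BUF, CALLER_SAVE_AREA = 192, 256  # 64-byte buffer right below the save area
stack.spill(CALLER_SAVE_AREA, random.Random(0))
stopped_at = overflow(stack, BUF, 200)   # stops at the colored line
stack.fill(CALLER_SAVE_AREA)
unprotected = overflow(stack, BUF, 200)  # nothing stops it anymore
```

The overflowing code never checks anything itself: the mismatch between its uncolored pointer and the colored memory is what stops it.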

Adding a color requires significantly fewer instructions than a full shadow stack and, through the ADISTACK security extension, we know whether we have to pre-enable ADI over the stack pages of a process. This further helps reduce the impact of the protection on binaries/processes that have it disabled. On top of that, Steve Sistare and Anthony Yznaga came up with a pretty cool trick to completely eliminate any overhead of ADISTACK-specific instructions from unaffected processes (and speed up ADISTACK itself), but I'm leaving this one to a future entry (perhaps from them). For this blog post, just consider that ADISTACK has basically no impact when disabled, something that is, indeed, pretty nice for a lightweight CFI solution.

longjmp(), setjmp() and makecontext() are augmented with proper version-clearing paths, and similar code is exposed, through APIs, to all those pieces of software (Java, I'm looking at you) that want to do internal stack management. Lastly, through the use of non-faulting loads, trusted paths in userland can be created (e.g. when issuing a system call with a variadic number of parameters, and hence arbitrarily hitting stack pieces, or when inspecting the stack from within the process). The trusted path code doesn't need to know the color value in advance, nor does it need to go through the dance of retrieving it from the pointer, fixing up the accessing pointer and then doing the load: it can instead simply use the ASI_NOFAULT identifier and access the value with a single, direct load. The ability to create a trusted path is an often overlooked property of a security feature and one that can prove crucial when trying to meet acceptable performance results (as an example of this, some time ago I fixed Python to work nicely with ADI through the use of non-faulting loads).

The use of ADI doesn't come without limitations, which prevent this protection from rising to the holy grail of an invariant. First of all, an ADI-protected region can be subject to an arbitrary write attack, whereby the attacker is capable of constructing the target pointer with the proper version. Randomizing the save area version helps a bit, but the randomization range is extremely small and easy to bruteforce/guess. Life would be fantastic if, instead of being able to tag with a color, we were able to temporarily mark the region as read-only, but that's a different story, which I hope to tell one day.
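To put a rough number on "extremely small", here is a back-of-the-envelope calculation. The figure of 14 usable colors is an assumption of mine (a 4-bit version field with a couple of values reserved), and the model assumes each wrong guess kills the process and the next attempt faces a freshly randomized color.

```python
def success_probability(n_versions, attempts):
    """Chance that at least one of `attempts` independent guesses is right."""
    return 1.0 - (1.0 - 1.0 / n_versions) ** attempts


def attempts_for(n_versions, target):
    """Smallest number of attempts reaching the target success probability."""
    attempts = 1
    while success_probability(n_versions, attempts) < target:
        attempts += 1
    return attempts


USABLE_VERSIONS = 14  # assumed, see above

single_guess = success_probability(USABLE_VERSIONS, 1)
coin_flip = attempts_for(USABLE_VERSIONS, 0.5)
near_certain = attempts_for(USABLE_VERSIONS, 0.99)
```

Whatever the exact count of usable colors is, the number of attempts stays in the tens, which is nothing for a service that restarts automatically after a crash.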

Second, the SPARC ABI mandates a 16-byte alignment for the stack, which means that, on existing software, the register save area is not always aligned to 64 bytes. Since the save area is 128 bytes, we're always guaranteed to version at least one cacheline worth of space. This may mean that the two key pieces that we want to protect (%i6 and %i7), which sit at the top of the save area, end up exposed (but with a tagged cacheline in between that catches linear overflows). This limitation can be reduced by imposing a 64-byte alignment: the kernel, the compiler and the userland crt objects can conspire to not create misalignment (or to reduce it to a minimum, depending on performance requirements).
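The effect of the alignment is easy to work out by hand, or with a few lines of Python. The sketch assumes that only cache lines lying entirely inside the save area get a color, that locals are saved first and ins second (so %i6 and %i7 land at offsets 112 and 120), and it uses plain addresses, ignoring the V9 stack bias. The helper names are made up.

```python
LINE = 64
SAVE_AREA = 128
REG = 8
OFF_I6 = (8 + 6) * REG  # 8 locals first, then the ins: %i6 at offset 112
OFF_I7 = (8 + 7) * REG  # %i7 at offset 120


def tagged_range(sp):
    """Offsets [start, end) of the save area covered by whole cache lines."""
    first = -(-sp // LINE) * LINE            # round the start up
    last = (sp + SAVE_AREA) // LINE * LINE   # round the end down
    return first - sp, last - sp


def is_protected(sp, offset):
    start, end = tagged_range(sp)
    return start <= offset and offset + REG <= end


# One stack pointer per possible 16-byte alignment within a cache line.
coverage = {hex(sp): tagged_range(sp)
            for sp in (0x1000, 0x1010, 0x1020, 0x1030)}
```

With a 64-byte aligned stack pointer both lines are colored and every saved register is covered. In the three misaligned cases exactly one line is colored, and it never reaches offset 112, which is why %i6 and %i7 end up exposed.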

Leveraging ADI through the Compiler

ADISTACK has a few characteristics that make me a huge fan of it: it's transparent to the target process (which means that it can be easily applied to third-party software), has a very low performance impact when enabled and has zero performance impact when disabled. ADISTACK focuses on one specific target: the saved registers, in particular %fp and the return address. This is both really good (do one thing, but do it right) and "bad" (as it ignores, by design, stack smashing attacks that target other adjacent local variables).

A more complex, but more comprehensive, protection can be designed to include detection of linear overflows across local variables, by having the compiler either separate each one into dedicated cachelines or leave redzone "gaps" between them, and by adding, at the entry point of each function, a call into a tagging subroutine. This is not dissimilar, impact-wise, from cookie-based stack protection solutions like GCC StackGuard/-fstack-protector, and it can adopt similar tricks to improve performance, for example by identifying the functions that truly need protection and those that don't have any overflowable object in their frame. In the very same way, such compiler-based protections do leave some performance impact even when disabled, by virtue of adding some extra code to some/all functions.
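As a sketch of the first option (one object per dedicated cacheline), this is roughly the bookkeeping a compiler would have to do when laying out a frame. It is purely illustrative: layout_locals, the sample frame and the two-color scheme are assumptions, not an existing compiler feature.

```python
LINE = 64  # tagging granularity, assumed


def layout_locals(local_vars):
    """Give each local its own run of cache lines and its own color.

    Returns ({name: (offset, padded_size, color)}, total_frame_bytes).
    Adjacent objects always get different colors, so a linear overflow
    out of one object hits a mismatching line.
    """
    layout = {}
    cursor = 0
    for index, (name, size) in enumerate(local_vars):
        padded = -(-size // LINE) * LINE   # round up to whole lines
        color = 1 + index % 2              # alternate two colors
        layout[name] = (cursor, padded, color)
        cursor += padded
    return layout, cursor


frame = [("buf", 100), ("flag", 4), ("name", 64)]
layout, total = layout_locals(frame)
wasted = total - sum(size for _, size in frame)
```

The padding is the price to pay: in this made-up frame, 168 bytes of locals occupy 256 bytes of stack, on top of the extra tagging instructions at function entry.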

Tuesday, September 12, 2017
