Monday, September 18, 2017

Getting started with ADI

After a number of entries on different uses of ADI, it's time to get our hands dirty and walk through the C API that lets you experiment with it directly. The API is quite simple and we'll use the Solaris implementation for the walkthrough. Linux is getting ADI support as well and, while the interfaces are going to be a bit different (for example, Linux is fond of mprotect() whereas Solaris prefers memcntl()), most of what we discuss here is going to apply there as well.

If you own or have access to a SPARC T7/M7 system running Solaris 11.3 or later, then, lucky you, you can get there and play. Otherwise, you may try the developer trial account on swisdev.oracle.com.

Enabling ADI on the current thread


Unless you're running in a Kernel Zone and haven't enabled the proper host compatibility (adi or native), if you are on proper hardware, ADI is supported by the kernel and available for every 64-bit process on the system. What this means is that your process can decide to start using ADI, but it still has to state that explicitly. This is provided by:

             int adi_set_enabled(int arg) 

arg is either ADI_ENABLE or ADI_DISABLE, to enable or disable ADI respectively. Under the covers, this has the kernel set/clear the MCDE bit in the PSTATE register. MCDE stands for Memory Corruption Detection Enable, a trace of the original name under which the technology was created.

At any point in time, a userland thread can query whether it has ADI enabled, and the interface is, not surprisingly:

            int adi_get_enabled(void)

which will return either ADI_ENABLE or ADI_DISABLE.

Learning about ADI properties


Magic numbers are the worst and, as much as possible, a well written piece of software should derive dynamically the properties of a running system and adjust accordingly. ADI API is not here to have you write poor software and hence we have:

           int adi_blksz(void)
           int adi_version_max(void)
           int adi_version_nbits(void)

adi_blksz() returns the granularity to which the versioning applies. Today, ADI operates on 64-byte cache lines, so 64 bytes is also the necessary alignment. adi_version_nbits() returns the number of bits, starting from the topmost bit in the 64-bit virtual address, that are used to represent the associated ADI version, and adi_version_max() returns the largest version (color) that is reliably usable on the architecture. A common initialization routine for a piece of software (e.g. a malloc() implementation) would collect these values to tune itself, with code along these lines:

void
initialize_adi(void)
{
        if (adi_set_enabled(ADI_ENABLE) < 0) {
                perror("ADI initialization failed");
                adi_enabled = B_FALSE;
                return;
        }

        adi_enabled = B_TRUE;

        alignment = adi_blksz();
        nbits = adi_version_nbits();
                
        if (alignment < 0 || nbits < 0) {
                adi_enabled = B_FALSE;
                return;
        }

        maxversion = (uint_t)adi_version_max();
}

If we build and run it on a M7 and dump the values we get something along these lines:

darkhelmet$ ./adi_base_test 
Block size and alignment: 64 bytes.
Available version bits: 4
Maximum usable version: 13
[...]
darkhelmet$

64 and 4 match what we know and expect about the ADI implementation. The maximum usable version is, instead, a tiny bit surprising: we know that "all zeroes" or "all ones" (so, 0 and 15) are universal matches which allow a load/store with any arbitrary tag, but that should still leave 14 as a usable version. The reason why it isn't comes from an architecture implementation detail.

On M7, we keep version bits separately protected in the L2$ (all 8 ways of a set in a check-word). If an Unrecoverable Error (UE) happens, they are flushed and if a dirty line is present, it's written back with version 14, regardless of the original version. Since a userland program might decide to rely on ADI for correctness, it would be cumbersome to figure out whether an exception raised on the 14 version was a consequence of a legit condition or a UE, so we simply restrict the versioning space.

The upcoming M8 architecture lifts this limitation by doing the right thing on UE and writing back the correct color. This is a good example of why one should never rely on magic numbers: by gathering the maximum version at runtime, your software is guaranteed to take advantage of all the available versioning space.

Setting/Getting Tags for VA ranges


Now that we have enabled ADI for the running thread (PSTATE.mcde) and have all the necessary information to produce and place a tag on a virtual address (alignment, number of version bits, maximum version), all that is left is to effectively start tagging memory.

The first mandatory step is to enable ADI on the target pages. In fact, ADI is not implicitly enabled for a TTE (Translation Table Entry); instead, a new bit (TTE.mcd) specifies whether it's on or off. The main reason for this is that the processor disables store merging for ADI-enabled pages, which might translate into some (generally minimal) performance impact. Setting TTE.mcd is up to the kernel and Solaris offers two different APIs to gently ask for it:

              mmap(hint_addr, len, prot, MAP_ADI, ...)
              memcntl(addr, len, MC_ENABLE_ADI, ...)

In both cases, assuming that everything is right, TTE.mcd will be set for all the pages that make up the requested len range. The kernel makes no promises on the versions that will be set for those pages initially (you might experience that,  e.g. for anonymous pages, they are zeroed out, but please don't rely on it) and leaves it up to the userland application.

Setting versions is a two-step process. First, we need to set the proper tag on the cache line bits, which are exposed through dedicated Address Space Identifiers (ASI), and then mirror the tag onto the pointer used to access the memory range. ASIs are basically the SPARC architecture's Swiss Army knife: they expose different address spaces (e.g. select between the primary or secondary context with a load/store), I/O ranges and internal registers, or otherwise influence the load/store behavior. Alternate versions of load/store, identified by an 'a' at the end of the mnemonic (e.g. stxa), let you directly specify the ASI to operate on.

Two ASIs are of interest for ADI setting/getting of versions by unprivileged software:

  • ASI_MCDP (0x90): Memory Corruption Detection Primary, takes the virtual address as relative to the primary context and sets/gets the specified version.
  • ASI_MCDSBP (0x92): Memory Corruption Detection Block Init Store Primary, which optimizes the operation of zeroing a block (64 bytes) while also setting the version.

Of course, one doesn't need to do any of this manually, but, instead, proper APIs are provided, that also deal correctly with misaligned addresses:

             caddr_t adi_clr_version(caddr_t addr, size_t size)
             int adi_get_version(caddr_t addr)
             caddr_t adi_set_version(caddr_t addr, size_t size, int version)
             caddr_t adi_memset(caddr_t addr, int c, size_t size, int version)

Their names are quite self-explanatory and so should be their behavior, given the introduction above (which might actually come in handy should you find yourself wanting to optimize some hot path). All the setting functions return an address which is already properly tagged, as we can see in the next example:

        caddr_t buffer;
        buffer = mmap(NULL, 8192, PROT_READ|PROT_WRITE,
            MAP_ANON|MAP_PRIVATE|MAP_ADI, -1, 0);
        if (buffer == (caddr_t)MAP_FAILED) {
                perror("mmap");
                exit(EXIT_FAILURE);
        }

        printf("Buffer address before versioning: %p\n", buffer);
        buffer = adi_set_version(buffer, 128, 7);
        printf("Versioned buffer address: %p\n", buffer);       

which, once run, gives us:

darkhelmet$ ./adi_base_test
[...]
Buffer address before versioning: ffffffff7f5ec000
Versioned buffer address: 7fffffff7f5ec000
[...]
darkhelmet$

with the topmost nbits properly set to reflect the ADI version we specified. If we attach a debugger and disassemble adi_set_version(), being careful to pick the symbol with the proper capability tag, we eventually see:

adi_base_test:580319*> ::nm ! grep adi_set_version
0x00000001001013c0|0x0000000000000000|FUNC |GLOB |0x0  |UNDEF   |adi_set_version
0xffffffff7ef289f0|0x00000000000001a8|FUNC |LOCL |0x0  |19      |adi_set_version%sun4v-adi
0xffffffff7edea9e0|0x000000000000001c|FUNC |GLOB |0x3  |19      |adi_set_version
adi_base_test:580319*> 0xffffffff7ef289f0::dis ! ggrep -B 2 stxa
libc.so.1`adi_set_version%sun4v-adi+0x108:      ldsb      [%l2], %o2
libc.so.1`adi_set_version%sun4v-adi+0x10c:      sra       %i2, 0x0, %o0
libc.so.1`adi_set_version%sun4v-adi+0x110:      stxa      %o0, [%i4] 0x90       
--
libc.so.1`adi_set_version%sun4v-adi+0x16c:      mov       %l5, %o5
libc.so.1`adi_set_version%sun4v-adi+0x170:      sra       %i2, 0x0, %o7
libc.so.1`adi_set_version%sun4v-adi+0x174:      stxa      %o7, [%o5] 0x90       
adi_base_test:580319*> 

with the calls to store-alternate with the expected ASI.

adi_memset() operates basically in the same way, although it is faster than just doing a memset() followed by an adi_set_version().

What happens on an ADI mismatch?


Let's trigger an ADI mismatch from within our code:

       /* Tag the first two cachelines */
        tagged_buffer = adi_set_version(buffer, 128, 7);
        printf("Versioned buffer address: %p\n", tagged_buffer);        

        /* Tag the next two cachelines with a different value */
        nextbuffer = adi_set_version(buffer + 128, 128, 8);
        printf("Versioned next buffer address: %p\n", nextbuffer);

        /* Access with correct version goes through */
        tagged_buffer[0] = 'a';
        printf("first store acces: %c\n", tagged_buffer[0]);
        /* Access with incorrect version traps */
        tagged_buffer[130] = 'b';
        /* segfault...*/        
        printf("second store access: %c\n", nextbuffer[2]);

The second access from tagged_buffer, performed with a pointer with version 7 over a memory line tagged with version 8 is detected by ADI, which nullifies the store (mismatching stores never go through) and raises an exception, whose final outcome is to send a SIGSEGV down to the process. Let's see it in action:

darkhelmet$ ./adi_base_test
[...]
Buffer address before versioning: ffffffff7f5ec000
Versioned buffer address: 7fffffff7f5ec000
Versioned next buffer address: 8fffffff7f5ec080
first store acces: a
Segmentation Fault (core dumped)
darkhelmet$

We got the expected Segmentation Fault and a nice core dump. Let's dig into it a little bit:

darkhelmet$ mdb ./core
Loading modules: [ libc.so.1 ld.so.1 ]
adi_base_test:core> ::status
debugging core file of adi_base_test (64-bit) from darkhelmet
file: /tmp/adi_base_test
initial argv: ./adi_base_test
threading model: raw lwps
status: process terminated by SIGSEGV (Segmentation Fault)
, ADI deferred mismatch, pc=100001188
adi_base_test:core> 100001188::dis
main+0x130:                     call      +0x100320     
[...]
main+0x158:                     stb       %l0, [%i4 + 0x82]
[...]
main+0x17c:                     restore
_init:                          save      %sp, -0xb0, %sp
adi_base_test:core> ::regs
%g0 = 0x0000000000000000                 %l0 = 0x0000000000000062 
%g1 = 0x0000000000000004                 %l1 = 0x0000000100000c80 
%g2 = 0x0000000000000000                 %l2 = 0x7fffffff7f5ec000 
%g3 = 0x0000000000000000                 %l3 = 0x0000000000000000 
%g4 = 0x00000000782f296a                 %l4 = 0x0000000000000061 
%g5 = 0xffffffff7f0486e4 libc.so.1`_sobuf+0x18 %l5 = 0x0000000100000cf0 
%g6 = 0x0000000000000000                 %l6 = 0x0000000000002800 
%g7 = 0xffffffff7f5c2a40                 %l7 = 0xffffffff7f548f08 
%o0 = 0x0000000100000d78                 %i0 = 0x0000000000000001 
%o1 = 0x0000000000000000                 %i1 = 0xffffffff7ffff808 
%o2 = 0x0000000000000000                 %i2 = 0xffffffff7ffff818 
%o3 = 0xffffffff7f030000                 %i3 = 0x8fffffff7f5ec080 
%o4 = 0x0000000000000000                 %i4 = 0x7fffffff7f5ec000 
%o5 = 0xffffffff7f037fdc libc.so.1`_iob+0xa4 %i5 = 0xffffffff7f5ec000 
%o6 = 0xffffffff7fffee71                 %i6 = 0xffffffff7fffef51 
%o7 = 0x0000000100001190      main+0x160 %i7 = 0x0000000100000ea8 _start+0x108

 %ccr = 0x99 xcc=NzvC icc=NzvC
   %y = 0x0000000000000000
  %pc = 0x0000000100101480 PLT=libc.so.1`printf
 %npc = 0x0000000100101484 PLT=libc.so.1`printf
  %sp = 0xffffffff7fffee71
  %fp = 0xffffffff7fffef51

 %asi = 0x82
%fprs = 0x07
adi_base_test:core> 

Two things are immediately evident:

  • ADI tells us that a trap was raised by the execution at pc=100001188 (main+0x158), which is where we expected it to happen, at the store into the mismatching cache line.
  • There is surprisingly scarce information about the trap: on what virtual address? because of what mismatching value? All we get reported is a program counter and that 'deferred' definition, which seems to be intuitively matched by the fact that %pc and %npc are not at the reported faulting instruction, but somewhere further down into the next printf() call.

For a feature that was born to be a debugging tool, this lack of details is fairly unexpected. So, what's going on here? There are two types of traps that can be raised by an ADI mismatch: a precise trap and a deferred trap. Precise traps stop the thread immediately, so the architecture and the kernel can conspire to send out to userland all the relevant information about the exception. Deferred traps happen some time later, based on a number of conditions that we'll see shortly.

Loads always raise a precise trap, stores, instead, a deferred one. This is because on SPARC we have a store buffer that queues (up to 64) stores on commit before they are performed by the L2$ and the version mismatch is not detected until the L2$ performs the store. This can happen on an explicit flush (membar) or when the buffer is full and goes through. There are a number of implicit situations that lead to a membar, e.g. the userland application performing a system call, but by the time the deferred trap is delivered (and the program stopped), some instructions have passed and the hardware has no ability to keep any information on the original condition beyond the PC at which it happened.

Disabling store buffering has a significant performance penalty and is recommended only in debugging scenarios, to further drill down into the details of a bug. We offer an API to control this behavior, which we call precise mode:

          int adi_get_precise(void)
          int adi_set_precise(int mode)

where mode is either ADI_PRECISE_ENABLE or ADI_PRECISE_DISABLE. The simplest way when writing a memory allocator (or some other debugging/protection) with ADI is to provide a tweakable way (e.g. an environment variable or a security extension property) to control the ADI behavior. As mentioned, you'll want to run with precise mode disabled in production.

Let's see an example of the amount of information we can gather when we hit a precise trap. We'll do so by tweaking our code a little and making it trap on a load, rather than a store.

       /*
         * Do a proper store access and modify the printf to read from
         * tagged_buffer, which has an incorrect version.
         */ 
        nextbuffer[2] = 'b';
        printf("second store access (correct): %c\n", nextbuffer[2]);
        printf("second store access (mismatch): %c\n", tagged_buffer[130]);

And run it, along with evaluating the expected core dump:

darkhelmet$ ./adi_base_test
[...]
first store acces: a
second store access (correct): b
Segmentation Fault (core dumped)
darkhelmet$ mdb ./core
Loading modules: [ libc.so.1 ld.so.1 ]
adi_base_test:core> ::status
debugging core file of adi_base_test (64-bit) from darkhelmet
file: /tmp/adi_base_test
initial argv: ./adi_base_test
threading model: raw lwps
status: process terminated by SIGSEGV (Segmentation Fault), pc=1000011e4
, ADI version 8 mismatch for VA ffffffff7f5ec082
adi_base_test:core> ::regs
%g0 = 0x0000000000000000                 %l0 = 0x0000000100101000 
%g1 = 0x0000000000000004                 %l1 = 0x0000000100000cf0 
%g2 = 0x0000000000000000                 %l2 = 0x8fffffff7f5ec082 
%g3 = 0x0000000000000000                 %l3 = 0x0000000000000062 
%g4 = 0x00000000222f2967                 %l4 = 0x0000000000000061 
%g5 = 0xffffffff7f0486f0 libc.so.1`_sobuf+0x24 %l5 = 0x0000000000002880 
%g6 = 0x0000000000000000                 %l6 = 0x0000000000002800 
%g7 = 0xffffffff7f5c2a40                 %l7 = 0xffffffff7f548f08 
%o0 = 0x0000000100000da0                 %i0 = 0x0000000000000001 
%o1 = 0x0000000000000021                 %i1 = 0xffffffff7ffff7f8 
%o2 = 0x0000000000000000                 %i2 = 0xffffffff7ffff808 
%o3 = 0xffffffff7f030000                 %i3 = 0x8fffffff7f5ec080 
%o4 = 0x0000000000000000                 %i4 = 0x7fffffff7f5ec000 
%o5 = 0xffffffff7f037fdc libc.so.1`_iob+0xa4 %i5 = 0xffffffff7f5ec000 
%o6 = 0xffffffff7fffee61                 %i6 = 0xffffffff7fffef41 
%o7 = 0x00000001000011e0      main+0x170 %i7 = 0x0000000100000ee8 _start+0x108

 %ccr = 0x99 xcc=NzvC icc=NzvC
   %y = 0x0000000000000000
  %pc = 0x00000001000011e4 main+0x174
 %npc = 0x0000000100101480 PLT=libc.so.1`printf
  %sp = 0xffffffff7fffee61
  %fp = 0xffffffff7fffef41

 %asi = 0x82
%fprs = 0x07
adi_base_test:core> 

We get all the nice detailed debugging information that we expected: the instruction pointer, the ADI version mismatch that happened and for what virtual address. The virtual address is always reported normalized, although we can query the ADI version that was associated with it through the ::adiver command in mdb:

adi_base_test:core> ffffffff7f5ec082::adiver
addr: ffffffff7f5ec082 cache ver: 8
adi_base_test:core> ffffffff7f5ec070::adiver
addr: ffffffff7f5ec070 cache ver: 7
adi_base_test:core> 

Since we have an exact picture of the status of registers at the time of the trap, we can further look at the instruction pointer and registers to see from what virtual address and with what version the access was performed:

%o4 = 0x0000000000000000                 %i4 = 0x7fffffff7f5ec000 

adi_base_test:core> 0x00000001000011e4::dis
[...]
main+0x174:                     ldsb      [%i4 + 0x82], %o1

And, as expected, a version 7 address starting two cache lines before shows up.

Source Code Example


As a general reference, here is the full source code from the last example we discussed, with the mismatching trap on a load access.
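#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <adi.h>        /* adi_*() prototypes; header name assumed, check the ADI man pages */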

/*
 * Global variables that control ADI behavior for the application.
 */
boolean_t       adi_enabled;
int             alignment;
int             nbits;
uint_t          maxversion;
uint_t          mask;

/*
 * Gather information at runtime about ADI
 */
void
initialize_adi(void)
{
        if (adi_set_enabled(ADI_ENABLE) < 0) {
                perror("ADI initialization failed");
                adi_enabled = B_FALSE;
                return;
        }

        adi_enabled = B_TRUE;

        alignment = adi_blksz();
        nbits = adi_version_nbits();
                
        if (alignment < 0 || nbits < 0) {
                adi_enabled = B_FALSE;
                return;
        }

        maxversion = (uint_t)adi_version_max();
        mask = (1 << nbits) - 1;
}

int main(int argc, char **argv)
{
        initialize_adi();

        if (adi_enabled) {
                printf("Block size and alignment: %d bytes.\n", alignment);
                printf("Available version bits: %d\n", nbits);
                printf("Maximum usable version: %d\n", maxversion);
        }

        caddr_t buffer;
        caddr_t nextbuffer;
        caddr_t tagged_buffer;

        buffer = mmap(NULL, 8192, PROT_READ|PROT_WRITE,
            MAP_ANON|MAP_PRIVATE|MAP_ADI, -1, 0);
        if (buffer == (caddr_t)MAP_FAILED) {
                perror("mmap");
                exit(EXIT_FAILURE);
        }

        printf("Buffer address before versioning: %p\n", buffer);
        /* Tag the first two cachelines */
        tagged_buffer = adi_set_version(buffer, 128, 7);
        printf("Versioned buffer address: %p\n", tagged_buffer);        

        /* Tag the next two cachelines with a different value */
        nextbuffer = adi_set_version(buffer + 128, 128, 8);
        printf("Versioned next buffer address: %p\n", nextbuffer);

        /* Access with correct version goes through */
        tagged_buffer[0] = 'a';
        printf("first store acces: %c\n", tagged_buffer[0]);
        
        /*
         * Do a proper store access and modify the printf to read from
         * tagged_buffer, which has an incorrect version.
         */ 
        nextbuffer[2] = 'b';
        printf("second store access (correct): %c\n", nextbuffer[2]);
        printf("second store access (mismatch): %c\n", tagged_buffer[130]);
}

Tuesday, September 12, 2017

ADI vs ROP

A couple of entries ago, I covered how we planned to use ADI to protect against heap attacks. If you've been following the stream of patches for the Solaris userland gate, you may have noticed this commit a few months ago. This commit added the necessary macros to the userland gate to enable ADIHEAP and ADISTACK, two new security defenses that will show up in the upcoming release of 11.4.

The ADIHEAP extension lets us extend system libraries (e.g. libc:malloc()) with ADI-based protections, while still retaining the ability to control whether the extension is enabled or not through binary tagging and sxadm exec (both covered in this previous blog entry). A more detailed discussion of the implementation for specific libraries is deferred to the release of 11.4.

The ADISTACK extension, instead, uses ADI transparently to any arbitrary process/binary to mitigate against Return Oriented Programming (ROP) attacks. While I'm deferring a detailed analysis of this one to the release of 11.4 as well, recently Steve Sistare (one of the folks behind our ADISTACK implementation, along with Anthony Yznaga, myself and Steve Chessin) commented around the stack defense on the sparclinux mailing list, so I figured it might be a good time to recap it a bit in a blog entry.

Lookin' Through The (Register) Windows


ROP attacks, as the name suggests, target the return address of a procedure and build a sequence of entry points to gadgets (brief instruction sequences in the process address space that conclude with a return instruction) that compose the attacker payload. ROP attacks are popular today and are the direct consequence of the defensive side taking away the ability to store an attacker controlled payload and jump to it, through what is traditionally known as W^X/DEP (ensure that no mapping that is simultaneously writeable and executable exists in the process address space).

ROP attacks on the SPARC architecture were thoroughly covered back in 2008 by the "When Good Instructions Go Bad: Generalizing Return-Oriented Programming to RISC" paper. In this entry, I'm just going to briefly summarize the characteristics of the SPARC architecture that make them, and the respective mitigation, possible.

The whole function call/return model on SPARC is predicated around the concept of register windows. A register window consists of 24 directly accessible registers, divided into three parts: 8 local (private to the function), 8 in and 8 out registers. In and out registers are the heart of parameter passing and value returning across functions. In fact, register windows overlap: when function A calls function B, the out registers from function A become the in registers for function B, which also gets a fresh set of local and out registers. Later, when B returns, it can use its in registers to make the return values pop into the A register window. Some registers have special meanings: %i6 contains the frame pointer (%fp) and %i7 contains the return address. This document, although centered on the V8 architecture (remember to double the size of registers), nicely recaps the whole register window design and operation, and so does the aforementioned paper.

The number of physical registers cannot grow indefinitely, so there is an upper limit to the number of register windows. This means that an application can find itself about to do a call, but with no register window to shift to. This situation will generate a SPILL trap, transferring control to the operating system and asking to please save the existing register window somewhere and buy a new one for the process. This "somewhere" is, of course, the stack (partially defeating the original purpose of register windows), whereby enough space was conveniently left aside by the save instruction in the function prologue. If we ignore for a moment local variables and any other use of the stack (and just concentrate on register windows), it's easy to realize that the save instruction needs to reserve at least 16 * register_size (8 in and 8 local) bytes which, on SPARC V9 (register_size == 8 bytes), puts us conveniently to 128 bytes (two cache lines). The sister condition to a SPILL trap happens when the program tries to restore back to a previous register window, but such window is currently invalid. In this case, the OS receives a FILL trap and needs to recover the saved registers from the stack and "provide" them back to the caller.

The whole stream of FILL and SPILL traps happens transparently to the application.

Leveraging SPILL/FILL traps with ADI


SPILL and FILL traps provide something pretty unique in the OS/arch landscape: the Operating System has a chance to check the saved register values right at the prologue and right after the epilogue of a function. This possibility hasn't really gone unnoticed in the past and back in 2002 a stack protection known as StackGhost was presented for OpenBSD. It proposed two different approaches to more or less minimize the performance impact and more or less effectively defeat stack smashing attacks: 
  • obfuscate the return address: the OS encrypts the return address on SPILL and decrypts it on FILL. XOR is the obvious choice for speed and simplicity, but has some significant limitations (a sufficiently large infoleak makes it possible to recover the process cookie, and it is easily subject to perturbation of the lower bits), while a more sophisticated algorithm may require more instructions (despite the speed improvement with the crypto instructions since 2002). In either case, this approach directly affects userland: debuggers and the like need to be extended in order to understand the 'encrypted' return address (an extra cost that is better avoided for adoption). The frame pointer and the other register contents are not protected at all.
  • create a shadow stack: the kernel keeps an in-kernel shadow stack and updates it at every SPILL/FILL event. This is a much more robust defense and one that has had a number of unfinished attempts (internally) for Solaris as well. There are two tricky parts with it. The first one is the performance impact: SPILL/FILL traps are frequent and hence hugely optimized, so adding extra code to copy things to a shadow stack can impose a significant penalty. The second one is related to the well known CFI nightmares of longjmp(), setjmp() and makecontext(), which can create non-linear modifications to the control flow and require further complexity to clean the shadow stack properly. On top of this, the memory management of the shadow stack space can add further food for thought, especially if one wants to protect all the saved registers.
Enter ADI. Once again, it acts as a game changer, since it can defer to the architecture the thankless job of evaluating memory accesses. In particular, the kernel can tag the register save area on the stack with a dedicated (randomized) color on SPILL and clear it on FILL. Any attempt to overflow from an existing buffer on the stack into the save area during program execution is then automatically caught and stopped, leading to a SIGSEGV. Similarly, attempts to infoleak the save area contents are detected and stopped as well.

Adding a color requires a significantly smaller number of instructions than a full shadow stack and, through the ADISTACK security extension, we know whether we have to pre-enable ADI over the stack pages of a process. This further helps reduce the impact of the protection on binaries/processes that have it disabled. On top of that, Steve Sistare and Anthony Yznaga came up with a pretty cool trick to completely eliminate any overhead of ADISTACK-specific instructions from unaffected processes (and speed up ADISTACK itself), but I'm leaving this one to a future (perhaps from them) entry. For this blog post, just consider that ADISTACK has basically no impact when disabled, something that is, indeed, pretty nice for a lightweight CFI solution.

longjmp(), setjmp() and makecontext() are augmented with proper version-clearing paths and similar code is exposed, through APIs, to all those pieces of software (Java, I'm looking at you) that want to do internal stack management. Lastly, through the use of non-faulting loads, trusted paths in userland (e.g. when issuing a system call with a variadic number of parameters, and hence arbitrarily hitting stack pieces, or when inspecting the stack from within the process) can be created. The trusted path code doesn't need to know the color value in advance, nor does it need to go through the dance of retrieving it from the pointer, fixing the accessing pointer and doing the load; it can instead simply use the ASI_NOFAULT identifier when accessing the value with a single, direct load. The ability to create a trusted path is an often overlooked property of a security feature and one that can prove crucial when trying to meet acceptable performance results (as an example of this, some time ago I fixed Python to work nicely with ADI through the use of non-faulting loads).

The use of ADI doesn't come without some limitations, which prevent this protection from being elevated to the holy grail of an invariant. First of all, an ADI-protected region can be subject to an arbitrary write attack, whereby the attacker is capable of constructing the target pointer with the proper version. Randomizing the save area version helps a bit, but the randomization range is extremely small and easy to bruteforce/guess. Life would be fantastic if, instead of being able to tag with a color, we were able to temporarily mark the region as read-only, but that's a different story, which I hope to tell one day.

Second, the SPARC ABI mandates a 16-byte alignment for the stack, which translates to the register save area not always being aligned to 64 bytes on existing software. Since the save area is 128 bytes, we're always guaranteed to version at least one cache line worth of space. This may mean that the two key pieces that we want to protect (%i6 and %i7), which sit at the top of the save area, end up exposed (but with a tagged cache line in between that catches linear overflows). This limitation can be reduced by imposing a 64-byte alignment: the kernel, the compiler and the userland crt objects can conspire to not create (or reduce to a minimum, depending on performance requirements) misalignment.

Leveraging ADI through the Compiler


ADISTACK has a few characteristics that make me a huge fan of it: it's transparent to the target process (which means that it can be easily applied to third party software), has a very low performance impact when enabled and has zero performance impact when disabled. ADISTACK focuses on one specific target: the saved registers, in particular %fp and the return address. This is both really good (do one thing, but do it right) and "bad" (as it ignores, by design, stack smashing attacks that target other adjacent local variables).

A more complex, but more comprehensive, protection can be designed to include detection of linear overflows across local variables, by having the compiler either separate each one into dedicated cache lines or leave redzone "gaps" between them, and providing an entry point in each function to call into a tagging subroutine. This is not dissimilar, impact-wise, from other cookie-based stack protection solutions, like GCC StackGuard/-fstack-protector, and it can adopt similar techniques to improve performance, for example by identifying which functions truly need protection and which don't have any overflowable object in their frame. In the very same way, such compiler-based protections do leave some performance impact even when disabled, by virtue of adding some extra code to some/all functions.