Monday, September 18, 2017

Getting started with ADI

After a number of entries on different uses of ADI, it's time to get our hands dirty and walk through the C API that lets us experiment with it directly. The whole API set is quite simple and we'll use the Solaris implementation as a walkthrough. Linux is getting ADI support as well and, while the interfaces are going to be a bit different (as an example, Linux is fond of mprotect(), whereas Solaris prefers memcntl()), most of what we discuss here is going to apply there as well.

If you own or have access to a SPARC T7/M7 system running Solaris 11.3 or later, then, lucky you, you can jump right in and play. Otherwise, you may try the developer trial account on swisdev.oracle.com.

Enabling ADI on the current thread


If you are on proper hardware, ADI is supported by the kernel and available to every 64-bit process on the system (unless you're running in a Kernel Zone without the proper host compatibility level set, i.e. adi or native). What this means is that your process can decide to start using ADI, but it still has to state so explicitly. This is provided by:

             int adi_set_enabled(int arg) 

arg is either ADI_ENABLE or ADI_DISABLE, to, respectively, enable and disable ADI. Under the covers, this has the kernel set/clear the MCDE bit in the PSTATE register, which stands for Memory Corruption Detection Enable, a trace of the name under which the technology was originally created.

At any point in time, a userland thread can inquire whether it has ADI enabled, and the interface is, not surprisingly:

            int adi_get_enabled(void)

which will return either ADI_ENABLE or ADI_DISABLE.
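
For example, here's a minimal sketch of enabling ADI and double-checking the thread state (assuming the interfaces are declared in <adi.h>; error details come back through errno):

        /* Ask the kernel to set PSTATE.mcde for the current thread. */
        if (adi_set_enabled(ADI_ENABLE) < 0) {
                perror("adi_set_enabled");
                exit(EXIT_FAILURE);
        }

        /* Paranoid double check: the thread should now report ADI on. */
        if (adi_get_enabled() != ADI_ENABLE)
                fprintf(stderr, "ADI not enabled?\n");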

Learning about ADI properties


Magic numbers are the worst and, as much as possible, a well-written piece of software should dynamically derive the properties of the running system and adjust accordingly. The ADI API is not here to have you write poor software, and hence we have:

           int adi_blksz(void)
           int adi_version_max(void)
           int adi_version_nbits(void)

adi_blksz() returns the granularity at which the versioning applies. Today, ADI operates on 64-byte cache lines, so 64 bytes is also the necessary alignment. adi_version_nbits() returns the number of bits, starting from the topmost bit of the 64-bit virtual address, that are used to represent the associated ADI version, and adi_version_max() the largest color that is reliably usable on the architecture. A common initialization routine for a piece of software (e.g. a malloc() implementation) would collect these values to tune itself, with code along these lines:

void
initialize_adi(void)
{
        if (adi_set_enabled(ADI_ENABLE) < 0) {
                perror("ADI initialization failed");
                adi_enabled = B_FALSE;
                return;
        }

        adi_enabled = B_TRUE;

        alignment = adi_blksz();
        nbits = adi_version_nbits();
                
        if (alignment < 0 || nbits < 0) {
                adi_enabled = B_FALSE;
                return;
        }

        maxversion = (uint_t)adi_version_max();
}

If we build and run it on an M7 and dump the values, we get something along these lines:

darkhelmet$ ./adi_base_test 
Block size and alignment: 64 bytes.
Available version bits: 4
Maximum usable version: 13
[...]
darkhelmet$

64 and 4 match what we know and expect about the ADI implementation. The maximum usable version is, instead, a tiny bit surprising: we know that "all zeroes" and "all ones" (so, 0 and 15) are universal matches which allow a load/store with any arbitrary tag, but that should still leave 14 as a usable version. The reason why it isn't comes from an architectural implementation detail.

On M7, we keep version bits separately protected in the L2$ (all 8 ways of a set in a check-word). If an Unrecoverable Error (UE) happens, they are flushed and, if a dirty line is present, it's written back with version 14, regardless of the original version. Since a userland program might decide to rely on ADI for correctness, it would be cumbersome to figure out whether an exception raised on version 14 was a consequence of a legit condition or of a UE, so we simply restrict the versioning space.

The upcoming M8 architecture lifts this limitation by doing the right thing on a UE and writing back the correct color. This is a good example of why one should never rely on magic numbers: by gathering the maximum version at runtime, your software is guaranteed to take advantage of all the available versioning space.
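
For instance, a hypothetical allocator initialized as above could pick tags at runtime along these lines (pick_version() is made up for the example and a real allocator would likely want a faster/stronger source than rand()):

        /*
         * Return a usable version in [1, maxversion]: 0 is a universal
         * match and anything above maxversion is reserved/unreliable,
         * so both are skipped.
         */
        int
        pick_version(uint_t maxversion)
        {
                return ((int)(rand() % maxversion) + 1);
        }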

Setting/Getting Tags for VA ranges


Now that we have enabled ADI for the running thread (PSTATE.mcde) and have all the necessary information to produce and place a tag on a virtual address (alignment, number of version bits, maximum version), all that is left is to effectively start tagging memory.

The first mandatory step is to enable ADI on the target pages. In fact, ADI is not implicitly enabled for a TTE (Translation Table Entry); instead, a new bit (TTE.mcd) is introduced that specifies whether it's on or off. The main reason for this is that the processor disables store merging for ADI-enabled pages, which might translate into some (generally minimal) performance impact. Setting TTE.mcd is up to the kernel and Solaris offers two different APIs to gently ask for it:

              mmap(hint_addr, len, prot, MAP_ADI, ...)
              memcntl(addr, len, MC_ENABLE_ADI, ...)

In both cases, assuming that everything is right, TTE.mcd will be set for all the pages that make up the requested len range. The kernel makes no promises on the versions initially set for those pages (you might observe that, e.g. for anonymous pages, they are zeroed out, but please don't rely on it) and leaves them up to the userland application.
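
As a sketch of the memcntl() path, enabling ADI after the fact on an existing anonymous mapping would look along these lines (I'm passing 0 for the arg, attr and mask parameters on the assumption that MC_ENABLE_ADI ignores them; check memcntl(2) on your system):

        caddr_t buffer;

        buffer = mmap(NULL, 8192, PROT_READ|PROT_WRITE,
            MAP_ANON|MAP_PRIVATE, -1, 0);
        if (buffer == (caddr_t)MAP_FAILED) {
                perror("mmap");
                exit(EXIT_FAILURE);
        }

        /* Ask the kernel to set TTE.mcd on the pages backing the range. */
        if (memcntl(buffer, 8192, MC_ENABLE_ADI, 0, 0, 0) < 0) {
                perror("memcntl");
                exit(EXIT_FAILURE);
        }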

Setting versions is a two-step process. First of all, we need to set the proper tag on the cache line bits, which are exposed through dedicated Address Space Identifiers (ASIs), and then mirror the tag onto the pointer used to access the memory range. ASIs are basically the SPARC architecture's Swiss Army knife: they expose different address spaces (e.g. selecting between the primary or secondary context with a load/store), I/O ranges and internal registers, or otherwise influence the load/store behavior. Alternate versions of load/store instructions, identified by an 'a' at the end of the mnemonic (e.g. stxa), let you directly specify the ASI to operate on.

Two ASIs are of interest for ADI setting/getting of versions by unprivileged software:

  • ASI_MCDP (0x90): Memory Corruption Detection Primary, takes the virtual address as relative to the primary context and sets/gets the specified version.
  • ASI_MCDSBP (0x92): Memory Corruption Detection Block Init Store Primary, which optimizes the operation of zeroing a block (64 bytes) while also setting the version. 
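
Just to illustrate what this looks like at the bare metal, a version store through ASI_MCDP boils down to a single stxa. Here is a hypothetical GCC-style inline assembly sketch (addr must be 64-byte aligned and, needless to say, this is for illustration only):

        /*
         * Sketch: tag one 64-byte cache line through ASI_MCDP (0x90).
         * The value stored is the version itself.
         */
        static inline void
        set_line_version(void *addr, long version)
        {
                __asm__ __volatile__("stxa %0, [%1] 0x90"
                    : /* no outputs */
                    : "r" (version), "r" (addr)
                    : "memory");
        }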

Of course, one doesn't need to do any of this manually; instead, proper APIs are provided, which also deal correctly with misaligned addresses:

             caddr_t adi_clr_version(caddr_t addr, size_t size)
             int adi_get_version(caddr_t addr)
             caddr_t adi_set_version(caddr_t addr, size_t size, int version)
             caddr_t adi_memset(caddr_t addr, int c, size_t size, int version)

Their names are quite self-explanatory and so should be their behavior, given the introduction above (which might actually come in handy should you find yourself needing to optimize some hot path). All the setting functions return an address which is already properly tagged, as we can see in the next example:

        caddr_t buffer;
        buffer = mmap(NULL, 8192, PROT_READ|PROT_WRITE,
            MAP_ANON|MAP_PRIVATE|MAP_ADI, -1, 0);
        if (buffer == (caddr_t)MAP_FAILED) {
                perror("mmap");
                exit(EXIT_FAILURE);
        }

        printf("Buffer address before versioning: %p\n", buffer);
        buffer = adi_set_version(buffer, 128, 7);
        printf("Versioned buffer address: %p\n", buffer);       

which, once run, gives us:

darkhelmet$ ./adi_base_test
[...]
Buffer address before versioning: ffffffff7f5ec000
Versioned buffer address: 7fffffff7f5ec000
[...]
darkhelmet$

with the topmost nbits properly set to reflect the ADI version we specified. If we attach and disassemble the call to adi_set_version(), being careful to pick the symbol with the proper capability tag, eventually we see:

adi_base_test:580319*> ::nm ! grep adi_set_version
0x00000001001013c0|0x0000000000000000|FUNC |GLOB |0x0  |UNDEF   |adi_set_version
0xffffffff7ef289f0|0x00000000000001a8|FUNC |LOCL |0x0  |19      |adi_set_version%sun4v-adi
0xffffffff7edea9e0|0x000000000000001c|FUNC |GLOB |0x3  |19      |adi_set_version
adi_base_test:580319*> 0xffffffff7ef289f0::dis ! ggrep -B 2 stxa
libc.so.1`adi_set_version%sun4v-adi+0x108:      ldsb      [%l2], %o2
libc.so.1`adi_set_version%sun4v-adi+0x10c:      sra       %i2, 0x0, %o0
libc.so.1`adi_set_version%sun4v-adi+0x110:      stxa      %o0, [%i4] 0x90       
--
libc.so.1`adi_set_version%sun4v-adi+0x16c:      mov       %l5, %o5
libc.so.1`adi_set_version%sun4v-adi+0x170:      sra       %i2, 0x0, %o7
libc.so.1`adi_set_version%sun4v-adi+0x174:      stxa      %o7, [%o5] 0x90       
adi_base_test:580319*> 

with the calls to store-alternate with the expected ASI.
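
As a side note, the pointer transformation we just observed is easy to reproduce by hand. Here is a sketch of how a version lands in the topmost nbits of an address (make_versioned() is made up for the example; adi_set_version() already returns a properly tagged pointer):

        /*
         * Mirror a version into the topmost nbits of a virtual address,
         * mimicking what adi_set_version() does for its return value.
         */
        caddr_t
        make_versioned(caddr_t addr, int version, int nbits)
        {
                uintptr_t va = (uintptr_t)addr;

                va &= ~(uintptr_t)0 >> nbits;             /* clear old tag */
                va |= (uintptr_t)version << (64 - nbits); /* place new tag */
                return ((caddr_t)va);
        }

With nbits equal to 4 and version 7, 0xffffffff7f5ec000 becomes 0x7fffffff7f5ec000, matching the output above.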

adi_memset() operates basically in the same way, although it is faster than just doing a memset() followed by an adi_set_version().
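
So, if we wanted the first two cache lines of our earlier buffer zero-filled and tagged in one pass (this is where ASI_MCDSBP comes into play), a sketch would be:

        /* Zero the first two cache lines and tag them with version 7. */
        buffer = adi_memset(buffer, 0, 128, 7);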

What happens on an ADI mismatch?


Let's trigger an ADI mismatch from within our code:

        /* Tag the first two cachelines */
        tagged_buffer = adi_set_version(buffer, 128, 7);
        printf("Versioned buffer address: %p\n", tagged_buffer);        

        /* Tag the next two cachelines with a different value */
        nextbuffer = adi_set_version(buffer + 128, 128, 8);
        printf("Versioned next buffer address: %p\n", nextbuffer);

        /* Access with correct version goes through */
        tagged_buffer[0] = 'a';
        printf("first store acces: %c\n", tagged_buffer[0]);
        /* Access with incorrect version traps */
        tagged_buffer[130] = 'b';
        /* segfault...*/        
        printf("second store access: %c\n", nextbuffer[2]);

The second access through tagged_buffer, performed with a pointer carrying version 7 over a memory line tagged with version 8, is detected by ADI, which nullifies the store (mismatching stores never go through) and raises an exception, whose final outcome is a SIGSEGV sent down to the process. Let's see it in action:

darkhelmet$ ./adi_base_test
[...]
Buffer address before versioning: ffffffff7f5ec000
Versioned buffer address: 7fffffff7f5ec000
Versioned next buffer address: 8fffffff7f5ec080
first store access: a
Segmentation Fault (core dumped)
darkhelmet$

We got the expected Segmentation Fault and a nice core dump. Let's dig into it a little bit:

darkhelmet$ mdb ./core
Loading modules: [ libc.so.1 ld.so.1 ]
adi_base_test:core> ::status
debugging core file of adi_base_test (64-bit) from darkhelmet
file: /tmp/adi_base_test
initial argv: ./adi_base_test
threading model: raw lwps
status: process terminated by SIGSEGV (Segmentation Fault)
, ADI deferred mismatch, pc=100001188
adi_base_test:core> 100001188::dis
main+0x130:                     call      +0x100320     
[...]
main+0x158:                     stb       %l0, [%i4 + 0x82]
[...]
main+0x17c:                     restore
_init:                          save      %sp, -0xb0, %sp
adi_base_test:core> ::regs
%g0 = 0x0000000000000000                 %l0 = 0x0000000000000062 
%g1 = 0x0000000000000004                 %l1 = 0x0000000100000c80 
%g2 = 0x0000000000000000                 %l2 = 0x7fffffff7f5ec000 
%g3 = 0x0000000000000000                 %l3 = 0x0000000000000000 
%g4 = 0x00000000782f296a                 %l4 = 0x0000000000000061 
%g5 = 0xffffffff7f0486e4 libc.so.1`_sobuf+0x18 %l5 = 0x0000000100000cf0 
%g6 = 0x0000000000000000                 %l6 = 0x0000000000002800 
%g7 = 0xffffffff7f5c2a40                 %l7 = 0xffffffff7f548f08 
%o0 = 0x0000000100000d78                 %i0 = 0x0000000000000001 
%o1 = 0x0000000000000000                 %i1 = 0xffffffff7ffff808 
%o2 = 0x0000000000000000                 %i2 = 0xffffffff7ffff818 
%o3 = 0xffffffff7f030000                 %i3 = 0x8fffffff7f5ec080 
%o4 = 0x0000000000000000                 %i4 = 0x7fffffff7f5ec000 
%o5 = 0xffffffff7f037fdc libc.so.1`_iob+0xa4 %i5 = 0xffffffff7f5ec000 
%o6 = 0xffffffff7fffee71                 %i6 = 0xffffffff7fffef51 
%o7 = 0x0000000100001190      main+0x160 %i7 = 0x0000000100000ea8 _start+0x108

 %ccr = 0x99 xcc=NzvC icc=NzvC
   %y = 0x0000000000000000
  %pc = 0x0000000100101480 PLT=libc.so.1`printf
 %npc = 0x0000000100101484 PLT=libc.so.1`printf
  %sp = 0xffffffff7fffee71
  %fp = 0xffffffff7fffef51

 %asi = 0x82
%fprs = 0x07
adi_base_test:core> 

Two things are immediately evident:

  • ADI tells us that a trap was raised by the execution at pc=100001188 (main+0x158), which is where we expected it to happen: at the store into the mismatching cache line.
  • There is surprisingly scarce information about the trap: on what virtual address? Because of what mismatching value? All we get reported is a program counter and that 'deferred' definition, which seems to be intuitively matched by the fact that %pc and %npc are not at the reported faulting instruction, but somewhere further down, into the next printf() call.

For a feature that was born as a debugging tool, this lack of detail is fairly unexpected. So, what's going on here? There are two types of traps that can be raised by an ADI mismatch: a precise trap and a deferred trap. Precise traps stop the thread immediately, so the architecture and the kernel can conspire to send out to userland all the relevant information about the exception. Deferred traps happen some time later, based on a number of conditions that we'll see shortly.

Loads always raise a precise trap; stores, instead, a deferred one. This is because on SPARC we have a store buffer that queues (up to 64) stores on commit before they are performed by the L2$, and the version mismatch is not detected until the L2$ performs the store. This can happen on an explicit flush (membar) or when the buffer fills up and drains. There are a number of implicit situations that lead to a membar, e.g. the userland application performing a system call, but by the time the deferred trap is delivered (and the program stopped), some instructions have passed and the hardware has no ability to keep any information on the original condition beyond the PC at which it happened.
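
As an illustration, while chasing a bug one can narrow down where a deferred trap is delivered by draining the store buffer right after a suspicious store (a GCC-style inline membar sketch; note that this only moves the delivery point closer, it doesn't add any detail to the report):

        tagged_buffer[130] = 'b';       /* mismatching store, queued */

        /*
         * Force the store buffer to drain: the deferred trap, if any,
         * is delivered around here rather than at some later implicit
         * membar (e.g. the next system call).
         */
        __asm__ __volatile__("membar #Sync" : : : "memory");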

Disabling store buffering has a significant performance penalty and is recommended only in debugging scenarios, to further drill down into the details of a bug. We offer an API, through what we called precise mode, to control it:

          int adi_get_precise(void)
          int adi_set_precise(int mode)

where mode is either ADI_PRECISE_ENABLE or ADI_PRECISE_DISABLE. The simplest approach when writing a memory allocator (or some other debugging/protection tool) with ADI is to provide a tweakable knob (e.g. an environment variable or a security extension property) to control the ADI behavior. As mentioned, you'll want to run with precise mode disabled in production.
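
A sketch of such a knob (the ADI_PRECISE environment variable is made up for the example):

        /*
         * Hypothetical tunable: run with deferred traps by default and
         * turn on precise mode only when ADI_PRECISE is set.
         */
        if (getenv("ADI_PRECISE") != NULL) {
                if (adi_set_precise(ADI_PRECISE_ENABLE) < 0)
                        perror("adi_set_precise");
        }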

Let's see an example of the amount of information we can gather when we hit a precise trap. We'll do so by tweaking our code a little and making it trap on a load, rather than a store.

        /*
         * Do a proper store access and modify the printf to read from
         * tagged_buffer, which has an incorrect version.
         */
        nextbuffer[2] = 'b';
        printf("second store access (correct): %c\n", nextbuffer[2]);
        printf("second store access (mismatch): %c\n", tagged_buffer[130]);

And run it, then examine the expected core dump:

darkhelmet$ ./adi_base_test
[...]
first store access: a
second store access (correct): b
Segmentation Fault (core dumped)
darkhelmet$ mdb ./core
Loading modules: [ libc.so.1 ld.so.1 ]
adi_base_test:core> ::status
debugging core file of adi_base_test (64-bit) from darkhelmet
file: /tmp/adi_base_test
initial argv: ./adi_base_test
threading model: raw lwps
status: process terminated by SIGSEGV (Segmentation Fault), pc=1000011e4
, ADI version 8 mismatch for VA ffffffff7f5ec082
adi_base_test:core> ::regs
%g0 = 0x0000000000000000                 %l0 = 0x0000000100101000 
%g1 = 0x0000000000000004                 %l1 = 0x0000000100000cf0 
%g2 = 0x0000000000000000                 %l2 = 0x8fffffff7f5ec082 
%g3 = 0x0000000000000000                 %l3 = 0x0000000000000062 
%g4 = 0x00000000222f2967                 %l4 = 0x0000000000000061 
%g5 = 0xffffffff7f0486f0 libc.so.1`_sobuf+0x24 %l5 = 0x0000000000002880 
%g6 = 0x0000000000000000                 %l6 = 0x0000000000002800 
%g7 = 0xffffffff7f5c2a40                 %l7 = 0xffffffff7f548f08 
%o0 = 0x0000000100000da0                 %i0 = 0x0000000000000001 
%o1 = 0x0000000000000021                 %i1 = 0xffffffff7ffff7f8 
%o2 = 0x0000000000000000                 %i2 = 0xffffffff7ffff808 
%o3 = 0xffffffff7f030000                 %i3 = 0x8fffffff7f5ec080 
%o4 = 0x0000000000000000                 %i4 = 0x7fffffff7f5ec000 
%o5 = 0xffffffff7f037fdc libc.so.1`_iob+0xa4 %i5 = 0xffffffff7f5ec000 
%o6 = 0xffffffff7fffee61                 %i6 = 0xffffffff7fffef41 
%o7 = 0x00000001000011e0      main+0x170 %i7 = 0x0000000100000ee8 _start+0x108

 %ccr = 0x99 xcc=NzvC icc=NzvC
   %y = 0x0000000000000000
  %pc = 0x00000001000011e4 main+0x174
 %npc = 0x0000000100101480 PLT=libc.so.1`printf
  %sp = 0xffffffff7fffee61
  %fp = 0xffffffff7fffef41

 %asi = 0x82
%fprs = 0x07
adi_base_test:core> 

We get all the nice, detailed debugging information we expected: the instruction pointer, the ADI version mismatch that happened and the virtual address it happened at. The virtual address is always reported normalized, although we can inquire about the associated ADI version through the ::adiver command in mdb:

adi_base_test:core> ffffffff7f5ec082::adiver
addr: ffffffff7f5ec082 cache ver: 8
adi_base_test:core> ffffffff7f5ec070::adiver
addr: ffffffff7f5ec070 cache ver: 7
adi_base_test:core> 

Since we have an exact picture of the status of registers at the time of the trap, we can further look at the instruction pointer and registers to see from what virtual address and with what version the access was performed:

%o4 = 0x0000000000000000                 %i4 = 0x7fffffff7f5ec000 

adi_base_test:core> 0x00000001000011e4::dis
[...]
main+0x174:                     ldsb      [%i4 + 0x82], %o1

And, as expected, a version 7 address pointing two cache lines earlier shows up.

Source Code Example


As a general reference, here is the full source code from the last example we discussed, with the mismatching trap on a load access.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <adi.h>

/*
 * Global variables that control ADI behavior for the application.
 */
boolean_t       adi_enabled;
int             alignment;
int             nbits;
uint_t          maxversion;
uint_t          mask;

/*
 * Gather information at runtime about ADI
 */
void
initialize_adi(void)
{
        if (adi_set_enabled(ADI_ENABLE) < 0) {
                perror("ADI initialization failed");
                adi_enabled = B_FALSE;
                return;
        }

        adi_enabled = B_TRUE;

        alignment = adi_blksz();
        nbits = adi_version_nbits();
                
        if (alignment < 0 || nbits < 0) {
                adi_enabled = B_FALSE;
                return;
        }

        maxversion = (uint_t)adi_version_max();
        mask = (1 << nbits) - 1;
}

int main(int argc, char **argv)
{
        initialize_adi();

        if (adi_enabled) {
                printf("Block size and alignment: %d bytes.\n", alignment);
                printf("Available version bits: %d\n", nbits);
                printf("Maximum usable version: %d\n", maxversion);
        }

        caddr_t buffer;
        caddr_t nextbuffer;
        caddr_t tagged_buffer;

        buffer = mmap(NULL, 8192, PROT_READ|PROT_WRITE,
            MAP_ANON|MAP_PRIVATE|MAP_ADI, -1, 0);
        if (buffer == (caddr_t)MAP_FAILED) {
                perror("mmap");
                exit(EXIT_FAILURE);
        }

        printf("Buffer address before versioning: %p\n", buffer);
        /* Tag the first two cachelines */
        tagged_buffer = adi_set_version(buffer, 128, 7);
        printf("Versioned buffer address: %p\n", tagged_buffer);        

        /* Tag the next two cachelines with a different value */
        nextbuffer = adi_set_version(buffer + 128, 128, 8);
        printf("Versioned next buffer address: %p\n", nextbuffer);

        /* Access with correct version goes through */
        tagged_buffer[0] = 'a';
        printf("first store acces: %c\n", tagged_buffer[0]);
        
        /*
         * Do a proper store access and modify the printf to read from
         * tagged_buffer, which has an incorrect version.
         */ 
        nextbuffer[2] = 'b';
        printf("second store access (correct): %c\n", nextbuffer[2]);
        printf("second store access (mismatch): %c\n", tagged_buffer[130]);
}

2 comments:

  1. Many :)

    But it depends on what you're looking for. I have numbers (which are easy to reproduce) on the speed of alternate stores with ASI_MCDP vs ASI_MCDSBP, and on the (pretty significant) effect of turning on precise mode over disrupting traps, i.e. the upper limit of the cost of disabling store buffering.

    IMHO, though, those micro benchmarks don't tell much about ADI. As an example, the worst case scenario for disabling store merging (single strand, byte-sized stores) is not very representative of any existing workload and, in fact, when a more representative one is used, the impact is basically negligible. Similarly, the cost of tagging a memory region is surely important, but every real-life application using an ADI-aware allocator will see more or less impact from ADI side effects (64-byte alignment, different buffer sizes, etc.) and the extra code paths, more than from the ADI instructions themselves.

    I'm happy to help you through whatever performance analysis you have in mind. Feel free to reach out directly ( lazytyped at gmail )

    Unfortunately, until 11.4 is out, I cannot show more practical results on the defenses that we've worked on.
