Barely grasping the small picture: December 2016

Saturday, December 31, 2016

Hardware-based buffer overflow defenses compared: SSM/ADI vs MPX

One of the most common questions when discussing SPARC M7 SSM/ADI (Silicon Secured Memory/Application Data Integrity, from here on only referred to as ADI) is "how does it compare to XYZ?", where XYZ is some other architecture security feature. In this blog entry we'll see how ADI stacks up against another buffer overflow protection security feature, Intel Memory Protection eXtensions (MPX).

Before starting, a somewhat obvious, but necessary, observation: these features do not provide security "by themselves", instead, they provide building blocks to develop protections on top. In other words, they are meant to, and need to, be (extensively) leveraged by software implementations to be in any way effective.

ADI

I've already covered ADI extensively in another blog entry, so I won't dwell too much on the details here and will just provide a quick recap. ADI implements a form of memory tagging and checking: individual cache lines can be assigned a color (a numeric value ranging from 0 to 15) and at load/store time that color is checked against the color saved in the 4 topmost unused bits of the target virtual address. If the colors match, the operation goes through; if they don't, an exception is raised and the instruction is ignored. 0 and 15 are universal matches: a cache line colored with either of those will not fault regardless of the color used in the virtual address. ADI only works on the data path; it has no effect on the instruction fetch and execute process.
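To make the recap concrete, here is a minimal Python model of the check. It is only a sketch of the logic described above, not of the hardware: the names (tag_pointer, check_access, ColorMismatch) are made up for illustration, and it assumes a 64-bit address whose 4 topmost bits carry the color, with the lowest and highest 4-bit values (0 and 15) acting as universal matches.

```python
LINE_SIZE = 64                    # ADI granularity: one cache line
TAG_SHIFT = 60                    # color sits in the 4 topmost bits of a 64-bit address
ADDR_MASK = (1 << TAG_SHIFT) - 1
UNIVERSAL = (0, 15)               # line colors that match any pointer color


class ColorMismatch(Exception):
    """Stands in for the exception the hardware raises on a mismatch."""


def tag_pointer(addr, color):
    """Return addr with color stored in its 4 topmost bits."""
    return (addr & ADDR_MASK) | (color << TAG_SHIFT)


def check_access(line_colors, ptr):
    """Model the check done on a load/store through ptr.

    line_colors maps a cache line number to its assigned color;
    lines that were never colored default to 0.
    """
    color = ptr >> TAG_SHIFT
    line = (ptr & ADDR_MASK) // LINE_SIZE
    stored = line_colors.get(line, 0)
    if stored not in UNIVERSAL and stored != color:
        raise ColorMismatch("pointer color %d, line color %d" % (color, stored))
    return True


lines = {0: 3, 1: 5}              # two adjacent lines with different colors
p = tag_pointer(0, 3)
check_access(lines, p)            # colors match, the access goes through
# check_access(lines, p + LINE_SIZE) would raise ColorMismatch
```

Note how walking the pointer forward by one cache line keeps the color in the top bits, which is exactly why the access to the neighbouring line faults.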

Memory tagging is not a revolutionary concept: it has actually floated a lot across academia for a while. What really makes ADI stand out, though, is that for the first time tagging is both granular enough (64 bytes -- a cache line) and seamlessly integrated in the architecture to be used at large for general purpose scenarios (e.g. a memory allocator). Oh, and it's fairly performant, too. In this respect, SPARC is the first "mainstream" architecture to achieve that.

From a security perspective, ADI does one thing and does it great: detect linear overflows. A linear overflow happens when a program reads or writes past the boundaries of a buffer. If we can guarantee that two adjacent buffers never share the same color, we can guarantee that an access starting from one will never successfully hit the other. If this doesn't seem like much, look at it as a way to put a definitive end to heap overflow exploitation.

ADI does reasonably well also for non malicious stray pointers: by randomizing the assigned colors at runtime, there is a fair chance that a stray pointer will not have a matching color. For malicious (read: attacker forged) stray pointers, the music changes. In fact ADI isn't particularly suited to protect against arbitrary read/write patterns. While it does add a deterrence (the attacker still needs to leak or guess the target color), the range is super small, just fourteen values, and easy to brute force/guess.
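To put a number on "easy to brute force", here is a back-of-the-envelope sketch in Python. It assumes the fourteen usable colors mentioned above and a uniformly random guess; the function name and the two attack models (colors re-randomized after every failed attempt, or left unchanged) are my own simplification.

```python
from fractions import Fraction


def guess_success(attempts, colors=14, rerandomized=True):
    """Chance that at least one of `attempts` color guesses matches.

    rerandomized=True:  the target color changes after each failed guess
                        (e.g. the process restarts), so guesses are independent.
    rerandomized=False: the color stays put, so the attacker can simply
                        enumerate the values one by one.
    """
    if rerandomized:
        miss = Fraction(colors - 1, colors)
        return 1 - miss ** attempts
    return Fraction(min(attempts, colors), colors)


print(guess_success(1))                        # 1/14
print(float(guess_success(10)))                # roughly 0.52
print(guess_success(14, rerandomized=False))   # 1
```

Ten attempts already give the attacker better than even odds, which is why the color alone is a deterrent rather than a barrier.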

When used for system level defenses, e.g. inside memory allocators, ADI works well with lots of software, but still changes two main assumptions: (1) math can be exercised directly onto pointers and (2) the ability to access one byte into a page implies the ability to access any other byte in the page in the same way. (1) can be solved by normalizing pointers before any math operation, while (2) can be solved by either loading the version and adjusting the pointer accordingly or, only for reading, by using a non faulting load. Non faulting loads are particularly interesting because they provide an efficient way to implement trusted paths (portions of the program that have a legitimate reason to bypass ADI checking when accessing a given memory location). As an example of this, see this recent fix for Python to make it work well with an ADI enhanced memory allocator.
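The two fixes can be sketched in a few lines of Python. This is an illustrative model only: normalize, retag and distance are hypothetical helper names, and the pointer layout (color in the 4 topmost bits of a 64-bit value) follows the recap above.

```python
TAG_SHIFT = 60
ADDR_MASK = (1 << TAG_SHIFT) - 1


def normalize(ptr):
    """Strip the color so the value can be used for plain pointer math."""
    return ptr & ADDR_MASK


def retag(ptr, color):
    """Adjust a pointer to carry the color loaded from the target location."""
    return normalize(ptr) | (color << TAG_SHIFT)


def distance(a, b):
    """Pointer difference that ignores the colors (assumption 1)."""
    return normalize(b) - normalize(a)


a = retag(0x1000, 3)
b = retag(0x1040, 5)
print(distance(a, b))      # 64
print(b - a == 64)         # False: the raw difference includes the colors
```

For assumption (2), code that legitimately needs to touch a neighbouring buffer would first load that buffer's version and then call something like retag before dereferencing.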

MPX

Another approach to detect buffer overflows is to add instrumentation where the buffer is accessed, to verify whether the read/write operation stays within the boundaries. This approach has two drawbacks that make it less suitable for production scenarios: (1) it adds some (non marginal) performance penalty, since extra instructions need to be executed and (2) it cannot be switched on/off when deploying a system or when relief is needed after hitting a false positive (if binary recompilation is used, then a problem arises if the package manager or some other entity relies on the hash of the binary). Intel's proposed solution to the above two problems is the Memory Protection eXtensions technology.

MPX introduces a few new registers and a handful of instructions that operate on them. The new registers BND[0..3] are 128 bits long, with 64 bits used to store the upper bound and 64 bits used to store the lower bound of a buffer. Three new instructions check a pointer against said bounds: BNDCL (Bound Check Lower Bound), BNDCU (Bound Check Upper Bound) and BNDCN (Bound Check Upper Bound not in 1's Complement). For example, bndcl (%rax), %bnd0 compares the contents of RAX against the lower bound set in BND0. If the check fails, a new #BR exception is raised. BNDC* instructions are very fast, to reduce the performance penalty.

Of course, 4 bound registers aren't enough for every buffer used in a program, so MPX supports a number of ways to swap back and forth the necessary upper/lower bound values: BNDMK (Bound Make) stores a pair of addresses into one of the BNDx registers, BNDMOV (Bound Move) loads a pair from a location in memory and BNDLDX/BNDSTX manage the Bound Table, which stores information about a pointer and its bounds. Bound Tables are arranged in a two-level directory in memory and the root address is stored in BNDCFGU (user land, CPL=3) or BNDCFGS (kernel land, CPL=0). BND*X and BNDMOV instructions simplify bounds management, but do logically introduce a larger performance hit.
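A small Python model helps to see what the instrumentation amounts to. The functions below only borrow the instruction names (bndmk, bndcl, bndcu); they are a sketch of the semantics described above and deliberately ignore how the real registers encode the bounds. BoundRegister, BoundRangeExceeded and checked_store are hypothetical names.

```python
class BoundRangeExceeded(Exception):
    """Stands in for the #BR exception."""


class BoundRegister:
    """One BNDx register: a lower and an upper bound."""

    def __init__(self):
        self.lower = 0
        self.upper = (1 << 64) - 1    # initial state: everything is in bounds


def bndmk(reg, base, size):
    """Bound Make: record the bounds of a buffer of `size` bytes at `base`."""
    reg.lower = base
    reg.upper = base + size - 1


def bndcl(reg, addr):
    """Bound Check Lower: fault if addr is below the lower bound."""
    if addr < reg.lower:
        raise BoundRangeExceeded("0x%x below lower bound" % addr)


def bndcu(reg, addr):
    """Bound Check Upper: fault if addr is above the upper bound."""
    if addr > reg.upper:
        raise BoundRangeExceeded("0x%x above upper bound" % addr)


def checked_store(memory, reg, addr, value):
    """What a compiler-instrumented one-byte store looks like."""
    bndcl(reg, addr)
    bndcu(reg, addr)
    memory[addr] = value


bnd0 = BoundRegister()
bndmk(bnd0, 0x1000, 16)                  # 16-byte buffer at 0x1000
mem = {}
checked_store(mem, bnd0, 0x100f, 0x41)   # last byte of the buffer: fine
# checked_store(mem, bnd0, 0x1010, 0x41) would raise BoundRangeExceeded
```

The key point is that every access site has to carry its own pair of checks, which is where the dependency on the compiler comes from.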

MPX relies heavily on the compiler/instrumentation to be effective: while the programmer can add manual checks, it's the compiler that needs to identify the places where a check is necessary and introduce the proper instruction sequences there. The smarter this logic is, the better the performance is going to be. A quick analysis of MPX performance (and more) is available on the AddressSanitizer wiki.

MPX is fully backward compatible, as its instructions use prefixes that are treated as NOPs on older architectures. This allows building one single binary and distributing it around. The same also happens when MPX is disabled, which allows admins to toggle the protection on/off on a per-binary basis. MPX interoperates well with existing code, allowing instrumented and non instrumented components to be mixed in the same process (with some caveats). The idea there is to allow MPX to be introduced gradually in large applications, starting with the sensitive modules.

MPX vs ADI

Both MPX and ADI aim at detecting buffer overflows, although they look at the buffer from two different angles. Intel MPX aims at providing instructions so that an about to be performed operation can check whether what it wants to do over/underflows the buffer boundaries, while ADI aims at letting the operation go, but detecting, through color violations, whether it does something wrong as it happens. From this perspective, ADI scales better for long lived buffers that might be accessed in various places within an application, as you only have to tag the buffer once at creation time and the hardware will do the rest. The chance of missing one access point, due to a particularly odd construct, is a non issue with ADI. This is a particularly nice property when looking at building an invariant on top of it.

Similarly, the above also means that ADI is simpler to retrofit into legacy applications. Where MPX would require a recompilation/translation, with ADI one can design system level defenses (e.g. at the allocator level, which is the only entity responsible for the tagging) that can be enabled on legacy applications. Of course, if an application uses its own memory allocator or if one is looking at protecting other data regions, code changes (or compiler support, e.g. to track the .data segment or automatic variables) are necessary, just like with MPX.

ADI imposes more constraints on the programmer. Minimum granularity is 64 bytes and pointers need to be aligned to - or contained within - the cache line boundaries. MPX has much better precision, as it can detect up to 1-byte overflows in just about any scenario (e.g. buffer overflow across structure members). More generally, any type of erroneous pointer access can be detected, provided that the proper bounds are set. This makes MPX simpler to apply to an arbitrary part of an application, regardless of what kind of data it is operating on.

Both technologies are meant to mix well within an application (ADI use is for the most part implicit, as demonstrated by the ability to retrofit into legacy applications) and do not require a special binary to run on older architectures. When disabled, MPX still imposes some performance penalty, both directly (very minimal, the NOPs take space and need to be executed) and indirectly (side effect of disabling certain optimizations that collide with the instrumentation engine). With ADI, at least for the cases where no compiler support is needed, there is no impact when it is disabled.

Both technologies are meant to be fast, but interesting side effects show up with large applications. MPX memory consumption (for Bound Tables or BNDMOV backup storage) grows significantly if the application manipulates a large number of pointers and similarly performance drops down due to the large amount of swapping necessary to set BND[0..3] registers. Also, while it's true that MPX can be gradually introduced into an application (by limiting the places where it is used), it is still pretty much black and white, as all the introduced instructions are there in the .text segment. In other words, it's hard to create different defenses based on MPX and only selectively enable some of them on a target application (unless, of course, this is done at build time).

On the contrary, ADI makes the above much simpler, since only the producer of the memory region is involved in the tagging. This also makes it generally better performant for a system level defense, although certain applications (especially those that make a substantial use of small buffers) might experience slow downs, due to the larger memory footprint and the smaller amount of objects fitting into a single cache line. A tradeoff is possible there, by packing more objects under the same color, losing the ability to detect a small overflow between them.

Summing up (and feel free to look for bias here and call me out on it), ADI is a superior solution for system level defenses for production scenarios, while MPX is a much stronger debugging tool, showing all its Pointer Checker heritage. Don't get me wrong: both can be used in both scenarios -- I can certainly envision adding MPX protections to binaries (and a security extension to control it) and Oracle Studio Discover does a great job of leveraging ADI to find bugs, but the trade offs in MPX (mandatory recompilation, significant compiler support, higher performance impact, higher precision) tilt the balance towards the debugging scenario, while ADI trade offs (more significant offload to hardware, possibility to design system level defenses that can be applied to legacy binaries, smaller performance impact, smaller precision geared towards more common cases) make it a better tool for production environments.

The Security Extensions Framework

Security defenses usually come with a cost. Such cost can be in terms of performance (added instrumentation, extra memory usage, different memory layout, etc.) and/or compatibility (the defense constrains some border line behavior the application relies on and the application breaks). If either of these costs is not marginal, the system level defense cannot be enabled at large. In particular, we try really hard to never break user systems and we know that many of those run legacy applications.

For these reasons, system level security protections need to be integrated gradually. There is a large number of security sensitive binaries, such as privileged/setuid binaries (e.g. /bin/su) and networking daemons (e.g. sshd), that tend not to be performance sensitive and that would definitely benefit from a hardening of their perimeter. Holding those hostage to some other breakage would just not be smart.

The Security Extensions Framework solves this problem by providing three different models for any userland security feature:

"all" : the extension is enabled for all processes
"tagged-files" : the extension is enabled only for binaries that explicitly opt-in
"disabled" : the extension is disabled system wide

This is used in conjunction with binary tagging. At build time, we support the ability to mark the binary to explicitly state whether a given extension should be enabled or not, through the ld -z<extension>=[enabled|disabled] linker directive. Such settings end up into the ELF .dynamic section, where each security extension has a dedicated entry.

markuse$ elfdump -d /bin/su
Dynamic Section:  .dynamic
[...]
     [35]  SUNW_ASLR       0x2        ENABLE
     [36]  SUNW_NXHEAP     0x2        ENABLE
     [37]  SUNW_NXSTACK    0x2        ENABLE

This leads to three main scenarios:

Opt-in (enabled, model=tagged-files): only the binaries that are explicitly tagged as wanting the protection get it. This is by far the most common deployment model. Even for well established protections (e.g. ASLR), there are still certain applications that don't work well with it and prevent us from force enabling the protection system wide. For more recent defenses, the number of tagged binaries is usually smallish (in contrast, with ASLR it is well over 90%).
Opt-out (enabled, model=all): every process on the system gets the defense enabled, except those that are explicitly tagged to disable it. Only the NXSTACK (non-executable stack) security defense is delivered as such. We do run a number of systems with all the extensions set to 'all', to proactively identify stuff that breaks. This effort also usually brings up a number of existing nasty bugs in various applications (the newer the defense, the more likely we are to find some). We recently let loose internally a new defense, based on SSM/ADI, and found over 70 bugs in different (opensource and internal) applications.
Disabled: the extension is fully disabled and binary tagging has no effect. This configuration is a last resort large hammer for administrators in case they hit some pathological issue that we didn't anticipate. It also lets admins concerned about performance side effects rule the defense out completely.
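The interaction between the system-wide model and the binary tag boils down to a small decision table. The Python function below is just my reading of the three scenarios above, not the kernel's actual logic; extension_enabled and the tag values "enable"/"disable" (with None for an untagged binary) are names chosen for the illustration.

```python
def extension_enabled(model, tag=None):
    """Decide whether a process gets the extension.

    model: "all", "tagged-files" or "disabled" (system-wide setting)
    tag:   "enable", "disable" or None (binary not tagged)
    """
    if model == "disabled":
        return False                  # binary tagging has no effect
    if model == "all":
        return tag != "disable"       # opt-out
    if model == "tagged-files":
        return tag == "enable"        # opt-in
    raise ValueError("unknown model: %r" % (model,))


for model in ("all", "tagged-files", "disabled"):
    for tag in ("enable", "disable", None):
        print(model, tag, extension_enabled(model, tag))
```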

Along with system level and binary level controls, the Security Extensions Framework offers process level control. Through sxadm exec -s <extension>=[enable|disable], one can run a process overriding any extension setting. The configuration is inherited by all child processes and, of course, standard security rules apply (a less privileged process cannot affect a more privileged process). This ends up being fairly useful during debugging, in both ways:

sxadm exec -s aslr=disable /bin/bash creates a debugging environment where launching arbitrary processes gives repeatable results
sxadm exec -s <somenewext>=enable /bin/bash creates a testing environment while developing/evaluating the effects of a new extension

sxadm exec can also be wrapped around a third party application to run it with the extensions enabled (e.g. from within an SMF method script).

Administration and Deployment

From an administrative point of view, the Security Extensions Framework is managed through sxadm(1m). The command lets admins configure extensions (sxadm enable/disable) and their properties (sxadm get/set), check the current configuration (sxadm status) and execute programs with arbitrary settings (sxadm exec). This offers a common and centralized interface for all the extensions. Learn once, use every time.

Under the cover, the configuration is stored into the svc://system/security/security-extensions SMF service, to which sxadm acts both as the frontend and as the start method. The start method is implemented through the private command sxadm apply, which consumes the secsys() system call. This call acts as the gate between the user and kernel portion of the framework and is used to make the kernel aware of the requested configuration.

Storing the configuration into SMF has a number of key advantages in terms of ease of deployment and integration.

From a deployment point of view, admins can customize the framework configuration through a site-profile and deploy it to an arbitrary number of systems (the SMF profile gets delivered by IPS and is applied on the first boot after installation). Admins do not need to learn a new special configuration lingo.

RAD SMF bindings can be used for remote administration. E.g. the following snippet of code gathers the model for the ASLR security extensions through RAD:

markuse$ cat rad_secext.py 
import rad.client as radcli
import rad.connect as radcon
import rad.bindings.com.oracle.solaris.rad.smf_1 as sbind

# Create a connection
con = radcon.connect_unix()

# Retrieve a particular service
service = con.get_object(sbind.Service(), radcli.ADRGlobPattern({"service" : "system/security/security-extensions"}))
model = service.readProperty("aslr/model")

print("ASLR model is %s" % model.values)

markuse$ python rad_secext.py
ASLR model is ['all']
markuse$

By connecting remotely (connect_tls()) rather than locally and using, for example, writeProperty() one could change the property setting on a number of different systems remotely.

In Solaris 11.3, the security-extensions service SMF properties haven't been committed yet. The framework is still somewhat young and so we felt like retaining the flexibility to make a few changes in light of upcoming improvements. Committing the properties is a mandatory step to provide full, mature reliability beyond the sxadm interface (AI site profiles, RAD administration, etc.). We are working on it for the upcoming releases.

In terms of integration with the rest of the system, SMF is a central part of Solaris so every system wide feature that we offer is well integrated with it. Keeping a security angle on the discussion, Immutable Zones and RBAC authorizations are fully supported by, and implemented in, SMF. For example, out of the box, just by virtue of having its configuration stored in SMF, the configuration of security extensions is prevented inside an Immutable Global Zone, unless the admin is on the Trusted Path.

On top of that, the Compliance Framework can easily parse SMF properties and hence gather information and report about any extension configuration or status.

TL;DR

Security defenses need an administrative interface in order to be successfully introduced in the operating system and really be taken advantage of. The Security Extensions Framework provides such interface, through process level, binary level (tagging) and system level support to manage security extensions.

Thursday, December 22, 2016

Hardening Allocators with ADI

Memory allocators play a crucial role in any modern application/operating system: satisfying arbitrary-sized dynamic memory requests. Errors by the consumer in handling such buffers can lead to a variety of vulnerabilities, which have been regularly exploited by attackers in the past 15 years. In this blog entry, we'll look at how the ADI (Application Data Integrity) feature of the new Oracle M7 SPARC processors can help in hardening allocators against most of these attacks.

A quick memory allocator primer

Writing memory allocators is a challenging task. An allocator must be performant, limit fragmentation to a bare minimum, scale up to multi-threaded applications and efficiently handle both small and large allocations and frequent/infrequent alloc/free cycles. Looking in depth at allocator designs is beyond the scope of this blog entry, so we'll focus here only on the features that are relevant from an exploitation (and defense) perspective.

Modern operating systems deal with memory in page-sized chunks (ranging from 8K up to 16G on M7). Imagine an application that needs to store a 10-character string: handing out an 8K page is certainly doable, but is hardly an efficient way to satisfy the request. Memory allocators solve the problem by sitting between the OS physical page allocator and the consumer (be it the kernel itself or an application) and efficiently managing arbitrary sized allocations, dividing pages into smaller chunks when small buffers are needed.

Allocators are composed of three main entities:

live buffers: a chunk of memory that has been assigned to the application and is guaranteed to hold at least the amount of bytes requested.
free buffers: chunks of memory that the allocator can use to satisfy an application request. Depending on the allocation design, these are either fixed-size buffers or buffers that can be sliced in smaller portions.
metadata: all the necessary information that the allocator must maintain in order to efficiently work. Again, depending on the allocator design, this information might be stored within the buffer itself (e.g. Oracle Solaris libc malloc stores most of the data along with the buffer) or separately (e.g. Oracle Solaris umem/kmem SLAB allocator keeps the majority of the metadata into dedicated structures placed outside the objects)

Since allocators divide a page in either fixed-size or arbitrary-size buffers, it's easy to see that, due to the natural flow of alloc/free requests, live buffers and free buffers end up living side by side in the linear memory space.

The period that goes from when a buffer is handed out to an application until it is freed is generally referred to as the buffer lifetime. During this period, the application has full control of the buffer contents. After this period, the allocator regains control and the application is expected to not interfere.

Of course, bugs happen. Bugs can affect either the application working set of buffers or the allocator free set and metadata. If we exclude allocator intrinsic design errors (which, for long existing allocators, due to the amount of exercise they get, are basically zero), bugs always originate from the application mishandling a buffer reference, so they always happen during the lifetime of a buffer and originate from a live buffer. It's no surprise that live buffer behavior is what both attackers and defenders start from.

Exploitation vectors and techniques

As we said, bugs originate from the application mishandling of allocated buffers:

mishandling of buffer size: the classic case of buffer overflow. The application writes past the buffer boundaries into adjacent memory. Because buffers are intermixed with other live buffers, free buffers and, potentially, metadata, each one of those entities becomes a potential target and attackers will go for the most reliable one.
mishandling of buffer references: a buffer is relinquished back to the allocator, but the attacker still holds a reference to it. Traditionally, these attacks are known as use after free (UAF), although, since this is an industry that loves taxonomies, it's not uncommon to see them further qualified as use after realloc (the buffer is reallocated, but the attacker is capable of unexpectedly modifying it through the stale reference) and double free (the same reference is passed twice to the free path). Sometimes an attacker is also capable of constructing a fake object and passing it to a free call, for example if the application erroneously calls free of a buffer allocated onto the stack. The degree of exploitability of these vulnerabilities (if we exclude the use after realloc case, which is application-specific) varies depending on what the allocator does during the free path and how many consistency/hardening checks are present.

With the notable exception of double free and "vanilla" use after free, both the above classes are extremely hard to detect at runtime from an allocator perspective, as they originate (and potentially inflict all the necessary damage) during the object lifetime, when the allocator has little to no practical control over the buffer. For this reason, the defense focus has been on the next best thing when bug classes cannot be eradicated: hamper/mitigate exploitation techniques. Over the years (and to various degrees in different allocators) this has taken the form of:

entrypoint checks: add consistency checks at the defined free and alloc entrypoints. As an example, an allocator could mark in the buffer's associated metadata (or poison the buffer itself) that the buffer has been freed. It would then be able to check this information whenever the free path is entered and a double free could be easily detected. Many of the early days techniques to exploit heap overflows (~2000, w00w00, PHRACK57 MaXX's and anonymous' articles) relied on modifying metadata that would then be leveraged during the free path. Over time, some allocators have added checks to detect some of those techniques.
design mitigations: attackers crave control of the heap layout: in what sequence are buffers allocated, where are they placed, how can a buffer containing sensitive data be conveniently allocated in a specific location. Allocators can introduce statistical mitigations to hamper some of the techniques to achieve this level of control. As an example, free object selection can be randomized (although it ends up being pretty ineffective against a variety of heap spraying techniques and/or if the attacker has quite some control over the allocation pattern), free patterns can be modified (Microsoft IE Memory Protector) or sensitive objects can be allocated from a different heap space (dedicated SLAB caches, Windows IE Isolated Heap, Chrome PartitionAlloc, etc). The bottom line goal of these (and other) design approaches is to either reduce the predictability of the allocator or increase the amount of precise control that the attacker needs to have in order to create the successful heap layout conditions to exploit the bug.
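As a toy illustration of an entrypoint check, the Python sketch below keeps a "freed" flag in per-buffer metadata and validates it on the free path. It is not modelled on any specific allocator; ToyAllocator and DoubleFree are hypothetical names and buffers are just integer indexes.

```python
class DoubleFree(Exception):
    """Raised by the free path when a buffer is already free."""


class ToyAllocator:
    """Fixed pool of buffers, identified by index, with a freed flag each."""

    def __init__(self, nbuffers):
        self.free_list = list(range(nbuffers))
        self.is_free = [True] * nbuffers      # the per-buffer metadata

    def alloc(self):
        buf = self.free_list.pop()
        self.is_free[buf] = False
        return buf

    def free(self, buf):
        # Entrypoint check: refuse to free something that is already free.
        if self.is_free[buf]:
            raise DoubleFree("buffer %d freed twice" % buf)
        self.is_free[buf] = True
        self.free_list.append(buf)


heap = ToyAllocator(4)
b = heap.alloc()
heap.free(b)
# heap.free(b) would raise DoubleFree
```

Without the check, the second free would put the same buffer on the free list twice and two later allocations would hand out the same memory.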

Of course, more invasive defenses also exist, but they hardly qualify for large scale application, as users tend to (rightfully) be pretty concerned about the performance of their applications/operating systems. This becomes even more evident when we compare the amount of defenses that are today enabled and deployed at kernel level versus the amount of defenses enabled at user level (and in browsers): different components have different (and varying) performance requirements.

The practical result is that slab overflows are today probably the most reliable type of vulnerability at kernel level and use after free are a close second in kernel land, while extensively targeted in user land, with only the browsers being significantly more hardened than other components. Extensive work is going on towards automating and abstracting the development of exploits for such bugs (as recently presented by argp at Zeronights), which makes the design of efficient defenses even more compelling.

ADI to the rescue

Enter the Oracle SPARC M7 processor and ADI, Application Data Integrity, that were both unveiled at HotChips and Oracle OpenWorld 2014. At its core, ADI provides memory tagging. Whenever ADI is enabled on a page entry, dedicated non-privileged load/store instructions provide the ability to assign a 4-bit version to each 64-byte cacheline that is part of the page. This version is maintained by the hardware throughout the entire non-persistent memory hierarchy (basically, all the way down to DRAM and back).

The same version can then be mirrored onto the (previously) unused 4 topmost bits of each virtual address. Once this is done, each time a pointer is used to access a memory range, if ADI is enabled (both at the page and per-thread level), the tag stored in the pointer is checked by the hardware against the tag stored in the cache line. If the two match, all is peachy. If they don't, an exception is raised.

Since the check is done in hardware, the main burden is at buffer creation, rather than at each access, which means that ADI can be introduced in a memory allocator and its benefit extended to any application consuming it without the need of extra instrumentation or special instructions into the application itself. This is a significant difference from other hardware-based memory corruption detection options, like Intel MPX, and minimizes the performance impact of ADI while maximizing coverage. More importantly, this means we finally have a reliable way to detect live object mishandling: the hardware does it for us.

[ADI versioning at work. Picture taken from Oracle SPARC M7 presentation]

4 bits allow for a handful of possible values. There are two intuitive ways in which an ADI-aware allocator can invariantly detect a linear overflow from a buffer/object to the adjacent one:

introduce a redzone with a special tag
tag differently any two adjacent buffers

Introducing a redzone means wasting 64 bytes per allocation, since 64 bytes is the minimum granularity with ADI. Wasted memory scales up linearly with the number of allocations and might end up being a substantial amount. Also, the redzone entry must be 64-byte aligned as well, which practically translates into both the buffers and the redzone being 64-byte aligned. The advantage of this approach is that it is fairly simple to implement: simply round up every allocation to 64 bytes and add an extra, complementary 64-byte buffer. For this reason, it can be a good candidate for debugging scenarios or for applications that are not particularly performance sensitive and need a simple allocation strategy. For allocators that store metadata within the buffer itself, this redzone space could be used to store the metadata information. Mileage again varies depending on how big the metadata is and it's worth pointing out that general purpose allocators usually strive to keep it small (e.g. Oracle Solaris libc uses 16 bytes for each allocated buffer) to reduce memory wastage.
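The cost of the redzone approach is easy to quantify. The Python helpers below are a quick sketch of the "round up and add 64 bytes" rule described above; the function names are made up for the example.

```python
LINE = 64    # ADI granularity, in bytes


def round_up(size, align=LINE):
    """Round size up to the next multiple of align."""
    return (size + align - 1) // align * align


def redzone_footprint(size):
    """Bytes consumed by one allocation: aligned buffer plus one redzone line."""
    return round_up(size) + LINE


def redzone_waste(size):
    """Bytes that the application asked for nothing of."""
    return redzone_footprint(size) - size


for request in (10, 64, 65, 200):
    print(request, redzone_footprint(request), redzone_waste(request))
```

A 10-byte request ends up consuming 128 bytes, which shows why small-object-heavy workloads are the ones paying the most for this scheme.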

Tagging two adjacent objects differently, instead, has the advantage of reducing memory wastage. In fact, the only induced wastage is the one from forcing the alignment to a 64-byte boundary. It also requires, though, the ability to uniquely pick a correct tag value at allocation time. Object-based allocators are a particularly good fit for this design because they already take some of the penalty for wasted memory (and larger caches are usually already 64-byte aligned) and their design (fixed size caches divided in a constant number of fixed size objects) allows objects to be uniquely identified based on their address. This provides the ability to alternate between two different values (or ranges of values, e.g. odd/even tags, smaller/bigger than a median) based on the object position. For other allocators, the ability to properly tag the buffer depends on whether there is enough metadata to learn about the previous and next object tag. If there is, then this can still be implemented; if there isn't, one might decide to employ a statistical defense by randomizing the tag (note that the same point also applies to object-based allocators when we look at large caches, where effectively only a single object is present per cache).
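Here is what position-based tag selection could look like for an object-based cache, sketched in Python. The odd/even tag pools are hypothetical values picked for the example (they merely avoid the two universal values and leave a couple of tags spare), and object_index/pick_tag are illustrative names, not the umem/kmem implementation.

```python
import random

EVEN_TAGS = (2, 4, 6, 8, 10, 12)    # hypothetical pool for even-positioned objects
ODD_TAGS = (3, 5, 7, 9, 11, 13)     # hypothetical pool for odd-positioned objects


def object_index(cache_base, object_size, addr):
    """Position of the object containing addr inside a fixed-size cache."""
    return (addr - cache_base) // object_size


def pick_tag(cache_base, object_size, addr, rng=random):
    """Pick a tag for the object at addr.

    Neighbours always draw from disjoint pools, so adjacent objects can
    never share a tag; the random choice inside the pool adds some noise
    on top of that invariant.
    """
    index = object_index(cache_base, object_size, addr)
    pool = ODD_TAGS if index % 2 else EVEN_TAGS
    return rng.choice(pool)


base, size = 0x10000, 128
rng = random.Random(0)
tags = [pick_tag(base, size, base + i * size, rng) for i in range(8)]
print(tags)
```

Whatever the random draw, the parity of the tag follows the parity of the object position, so a linear overflow into the next object always hits a different tag.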

A third interesting property of tagging is that it can be used to uniquely identify classes of objects, for example free objects. As we discussed previously, metadata and free objects are never the affector, but only the affectee of an attack, so one tag each suffices. The good side effect of devoting a tag to each is that the allocator now has a fairly performant way to identify them and issues like double-frees can be easily detected. In the same way, it's also automatically guaranteed that a live object will never be able to overflow into metadata or free objects, even if a statistical defense (e.g. tag randomization) is employed.
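A dedicated tag for free objects can be modelled in a few lines of Python. Everything here is a sketch under assumptions: FREE_TAG and LIVE_TAGS are arbitrary values chosen for the example, a "pointer" is simply an (object index, tag) pair, and TaggedCache is a hypothetical class rather than a real allocator.

```python
FREE_TAG = 1                 # hypothetical tag reserved for free objects
LIVE_TAGS = range(2, 14)     # hypothetical tags handed out to live objects


class TagMismatch(Exception):
    """Stands in for the hardware exception on a tag mismatch."""


class DoubleFree(Exception):
    """Raised when the free path sees an object that is already free."""


class TaggedCache:
    def __init__(self, nobjects):
        self.tags = [FREE_TAG] * nobjects       # every object starts free

    def alloc(self, index, tag):
        if tag not in LIVE_TAGS:
            raise ValueError("tag %d is reserved" % tag)
        self.tags[index] = tag
        return (index, tag)                     # the tag travels with the pointer

    def free(self, ptr):
        index, _ = ptr
        if self.tags[index] == FREE_TAG:        # cheap check: just read the tag
            raise DoubleFree("object %d is already free" % index)
        self.tags[index] = FREE_TAG

    def access(self, ptr, offset=0):
        """Access the object `offset` positions away through ptr."""
        index, tag = ptr
        if self.tags[index + offset] != tag:
            raise TagMismatch("object %d" % (index + offset))
        return True


cache = TaggedCache(3)
p = cache.alloc(0, 2)
cache.access(p)              # fine: tags match
# cache.access(p, 1) would raise TagMismatch: object 1 carries FREE_TAG
cache.free(p)
# cache.free(p) would raise DoubleFree; cache.access(p) would raise TagMismatch
```

Since no live pointer ever carries FREE_TAG, both the overflow into a free neighbour and the "vanilla" use after free are caught regardless of which live tag was chosen.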

Use-after-realloc and arbitrary writes

ADI does one thing and does it great: provides support to implement an invariant to detect linear overflows. Surely, this doesn't come without some constraints (64-byte granularity, 64-byte alignment, page-level granularity to enable it, 4-bit versioning range) and might be a more or less good fit (performance and design-wise) for an existing allocator, but this doesn't detract from its security potential. Heartbleed is just one example of a linear out-of-bound access and SLAB/heap overflow fixes have been in the commit logs of all major operating systems for years now. Invariantly detecting them is a significant win.

Use-after-realloc and arbitrary writes, instead, can't be invariantly stopped by ADI, although ADI can help in mitigating them. As we discussed, use-after-realloc relies on the ability of the attacker to hold a reference to a free-and-then-realloced object and then use this reference to modify some potentially sensitive content. ADI can introduce some statistical noise in this exploitation path, by looping/randomizing through different values for the same buffer/object. Note that this doesn't affect the invariant portion of, for example, alternate tagging in object-based allocators; it simply takes further advantage of the versioning space. Of course, if the attacker is in the position of performing a bruteforce attack, this mitigation would not hold much ground, but in certain scenarios, bruteforcing might be a limiting factor (kernel level exploitation) or leave some detectable noise.

Arbitrary writes, instead, depend on the ability of the attacker to forge an address and are not strictly related to allocator ranges only. Since the focus here is the allocator, the most interesting variant is when the attacker has the ability to write to an arbitrary offset from the current buffer. If metadata and free objects are specially tagged, they are unreachable, but other live objects with the same tag might be reached. Just as in the use-after-realloc case, adding some randomization to the sequence of tags can help, with the very same limitations. In both cases, infoleaks would precisely guide the attacker, but this is basically a given for pretty much any statistical defense.

TL;DR

Oracle SPARC M7 processors come with ADI, Application Data Integrity, a feature that provides memory tagging. Memory allocators can take advantage of it both for debugging and security, in order to invariantly detect linear buffer overflows and statistically mitigate against use-after-free and offset-based arbitrary writes.