Wednesday, March 21, 2018

Splice up your life

While I'm no longer an Oracle employee, there are still a few projects that landed in Solaris 11.4 that I'd like to talk about. The one that has occupied most of my last few years is definitely Ksplice on Solaris. Back in 2011, Oracle bought Ksplice, a company that provided runtime patching to the Linux kernel. Ksplice on Linux, today, is many things:

  • for customers, it's a service, that provides updates to both kernel and userland libraries for all Linux CVEs.
  • for administrators, a set of interfaces to install, revert and manage such splices, rooted in the uptrack tool.
  • for patch developers, a set of tools that allow (semi)automatic extraction and generation of splices from code changes.
  • for Ksplice developers, the code that makes all this possible, shared between the kernel framework that handles splices, the tools that generate them and the userland infrastructure.
The last two bullets happen behind Oracle walls and generate what customers and administrators ultimately see. Ksplice on Solaris is, today, in a different situation. If you harvest through 11.4 packages you'll find a kernel module (/kernel/drv/$arch/ksplice), a userland tool (spliceadm, for which there is a publicly accessible man page) and, if you look a bit further under the cover, a new SMF service (svc:/system/splice:default). On top of that, platinum customers have had a chance to experience the framework at play, receiving a couple of test splices for bugs encountered during the program. In a nutshell, the strictly technical fundaments to generate runtime kernel patches in Solaris are there, but nothing is set in stone yet (and I personally don't know) about how the service will look like.

In this blog post, I'll walk through the technical side of Ksplice on Solaris and the evolution it had from the initial "hey, we should probably have this, too" conversation with Jan, through the legal evaluation to make sure that we were doing all the right things (necessary disclaimer: we did!), to what is in the repository today.

Why Ksplice

Before dwelling into the technical details, a small digression on why we embarked into the whole effort. Patching is a key step of every deployment/security strategy and one of those that rank higher in the risk analysis scale. Many are the horror stories of systems that do not come back successfully after patching, of legacy software that just breaks down or critical, unexpected, security fixes that need to be rolled out quickly across an organization.

Solving patching pain and providing seamless updates is one of the greatest things that modern operating systems can do for users. At the same time, customers needs also have to be captured: you can't expect someone to disrupt its operations every week for a patching window, just as much as you don't want another one sitting on outdated software for too long.

With Solaris 11, we've done a tremendous amount of work to modernize and improve the patching experience and you can see it touching pretty much any area of the system. We have a new packaging system, IPS, which ensures that things move forward coherently, we leverage ZFS copy on write to provide lightweight boot environments that allow for easy rollback/fallback. We have SMF, handling the job of restarting services on updates, so that you never end up running stale code and fast reboot to quickly move across environments saving long firmware POSTs.

Ksplice was just a great fit in this overall story, opening up the possibility of both improving the IDR experience (one-off patches that fix a specific customer issue) and offering to customers a minimal reboot train with security and critical fixes. As I've previously mentioned, at the time of writing there is no commitment by Oracle that any of the above will be eventually provided.

Basic Blocks

Ksplice is composed of four key parts: the generation tools, that compare and extract differences between compilation units, creating the necessary metadata to build the splices, which are the fundamental patching blobs. The kernel framework, which loads and applies splices in memory and the administrative tools, which allow to configure the system for splice application/reversal and also manually inspect their state.

On the surface, Ksplice on Linux and Ksplice on Solaris look very similar: they both use a two pass build process, to create compilation units pre and post the patch that are later compared, and the splice contents have corresponding metadata names (if you dump the ELF sections you'll see the familiar .ksplice_relocs, .ksplice_symbols, etc sections). Also the splice format is similar, with the so called new_code and old_code pairs for each module target. But the similarities kind of stop there.

The ON build infrastructure is fundamentally different from the Linux one and is controlled by lullaby. The work that Tim, Mark and James did there is a tremendous improvement over the old nightly world and is the foundation of our extracting process. The generation tools have also been, for the most part, rewritten and are based on our libelf implementation. libelf  is basically the assembly of ELF files: it gives you useful primitives to manipulate, read and generate ELF files, but doesn't do anything fancy on top of that (if you're used to the GNU libbfd way, you know what I mean). The kernel core is of course different and even the compilers are, since we use Oracle Developer Studio rather than GCC. We also have our own delivery mechanism, through IPS/pkg and our configuration (SMF) and reporting (FMA) interfaces, that spliceadm and the kernel framework consume.

In a nutshell, this was not much a port, but rather, as Scott Michael put it, "a technology transplant". Notwithstanding this, the help we got from the Ksplice team was huge. I've lost count of the number of chats/mails/random pings that I've sent up to Jamie and others while working on this and in retrospect, maintaining some of the building blocks (metadata, patch generation, validation and application steps, etc) hugely helped.

As we were busy playing catch up with the kernel world, the Ksplice folks have also introduced userland splicing, which is a great addition towards a rebootless world, as you can now fix at runtime your behemoth applications when the next blockbuster library bug comes out. At the time of writing, this is not available in Solaris.

Preparing the Kernel

To simplify patch extraction and application, and for good measure, we want to reduce to a minimum the changes from a software fix. In particular, the waterfall effect of relative offsets changing can be particularly nasty. To avoid that, we follow the Ksplice on Linux steps of building with fragmentation, separating each function or variable into its own section and so transforming relative jumps/memory accesses into relocations (much easier to process and compare). The Studio idiom to enable fragmentation is -xF=func -xF=gbldata -xF=lcldata. 

Running elfdump -c over a so built unit shows it in action, as highlighted by the section names:
Section Header[4]:  sh_name: .text%splicetest_unused_func
 sh_addr:      0                   sh_flags:   [ SHF_ALLOC SHF_EXECINSTR ]
 sh_size:      0x1f                sh_type:    [ SHT_PROGBITS ]
 sh_offset:    0xce0               sh_entsize: 0
 sh_link:      0                   sh_info:    0
 sh_addralign: 0x20              

Section Header[5]:  sh_name: .text%splicetest_attach
 sh_addr:      0                   sh_flags:   [ SHF_ALLOC SHF_EXECINSTR ]
 sh_size:      0x68                sh_type:    [ SHT_PROGBITS ]
 sh_offset:    0xd00               sh_entsize: 0
 sh_link:      0                   sh_info:    0
Section Header[27]:  sh_name: .rodata%splicetest_string
 sh_addr:      0                   sh_flags:   [ SHF_ALLOC ]
 sh_size:      0x8                 sh_type:    [ SHT_PROGBITS ]
 sh_offset:    0x1318              sh_entsize: 0
 sh_link:      0                   sh_info:    0
 sh_addralign: 0x8               
Section Header[32]:  sh_name: .data%splicetest_dev_ops
 sh_addr:      0                   sh_flags:   [ SHF_WRITE SHF_ALLOC ]
 sh_size:      0x58                sh_type:    [ SHT_PROGBITS ]
 sh_offset:    0x1620              sh_entsize: 0
 sh_link:      0                   sh_info:    0
 sh_addralign: 0x10              
The above output is from an internal testing module, which we call splicetest to demonstrate that programmers shine thanks to their originality.

Fun story about fragmentation, the first time we enabled it for the SPARC kernel, we got greeted with an early boot panic. Turns out that SPARC uses a very simple boot allocator that has an imposed limit on the number - not total size - of allocations. In krtld (the kernel runtime linker) we use the boot allocator when parsing genuinx, since better memory management will come from genuinx itself later on. Parsing genuinx means parsing an ELF file and allocating space for its sections: the driven up number of them, especially .rela sections, just exceeded the total number of available memory slots.

Luckily, we didn't have to modify the boot allocator, but just collapse the sections back together again, as krtld would end up doing that anyway. We did this first through a linker script and later the linker aliens promoted it as a linking feature for -ztype=kmod objects.

Fun story number two about reducing the footprint of changes: we build ON in two different ways, debug and non-debug. Normally you'd run the non-debug bits, but you can get the others through pkg change-variant debug.osnet=true. Internally, developers tend to run on the slower, but mdb friendly, debug bits. In any case, we wanted splices for both, but for a long time only worked with non-debug bits. At some point, we started testing our preliminary splice tools on debug units and the number of detected changes just exploded. Thank you very much, ASSERT() and VERIFY().

These developer loved macros include in the output the line number, via __LINE__, which of course changes at each source patch, waterfalling into all the functions that use either ASSERT() or VERIFY() and that follow the fixed one. There are a number of cumbersome ways to reduce noise, from playing games with the blank lines to coding things up in funny ways, but we didn't really like that. Kuriakose and Jonathan came to the rescue by stealing a page from DTrace SDT probes and the special relocations that we use to signal them to the kernel runtime linker.

In practice, instead of placing directly the line number in the macro, we create a global variable with a reserved name that contains the line number. This creates a relocation to a symbol that has, in the name, enough information for krtld to do a clever patching of the generated assembly code so that the number is directly returned. Similarly, this allows the Ksplice tools and core framework to properly identify the relocation to the special symbol and just skip it during comparison, bringing us back to a sane number of detected changes.

A central part of this implementation is visible in sys/debug.h, which is a publicly delivered file. Go take a look for some pure engineering joy.


The fundamental unit of patching are splices. Splices are identified by a monotonically increasing, eight digits, id number. We do this for a very specific reason: prevent dim sum. We don't want customers to create unicorn configurations that we haven't tested in house, and so we look at Ksplice fixes as a stream of changes, one on top of the other, rather than a collection that you can pick from. The idea is that this should also simplify your workflow. If a previous splice doesn't successfully apply for whatever reason, the framework won't allow the next to go in.

Splices are regular kernel module that get in the system through modload. We produce a pair of modules for each target module that we want to fix, a new_code, that contains the updated stuff, and a old_code, that contains the expected contents to be found on the running system, which we verify before attempting any splice operation. new_code and old_code need to be loaded in a specific order, but instead of stuffing this logic into a script or a tool, we use module dependencies to link them and to link the whole splice together, thanks to an extra, Solaris specific, module that we call the dependency module. If a splice is delivered to your system, you can find the dependency module in /kernel/ksplice/$arch/splice-$id.

Recursively dumping this module dependencies shows the interconnections and the targets (as outlined by our fresh new kldd tool in action in all its glory):
root@kzx42-01-z15:/kernel/ksplice/amd64# kldd ./splice-90000001 
        drv/splicetest_90000001_old =>  /kernel/drv/amd64/splicetest_90000001_old
        genunix_90000001_old => /kernel/amd64/genunix_90000001_old
        drv/splicetest_90000001_new =>  /kernel/drv/amd64/splicetest_90000001_new
        genunix_90000001_new => /kernel/amd64/genunix_90000001_new
        unix  (parent) =>       /platform/i86pc/kernel/amd64/unix
        genunix  (parent dependency) => /kernel/amd64/genunix

By virtue of modloading splice-9000001, splicetest_90000001_old and genunix_90000001_old get brought in as a dependency and each one brings in the _new counterpart. Later on, this chain allows to leave in memory only the new_code modules and get rid of the old_code and dependency module to save some space.

Splices also come with one extra module, known as module.kid or, whether you talk with a Linux or Solaris person. This module is an updated copy of the target module that contains the fix. The Ksplice framework interposes into the module loading code so that if you try to load a module that wasn't in memory at the time of splicing, we pick up the updated copy. can be a bit annoying in a reverse situation, because if the module joins in as a dependency or is otherwise locked (e.g. a userspace application holding a descriptor to the device that the module provides), we can't unload it and, hence, can't reverse the splice. Reversing splices is something customers expressed fondness for, so we try to limit as much as possible this situation by loading any target module before running a splice application, de facto forcing a memory patch every time.

Could have we gotten rid of, then? Unfortunately not, as it is still necessary for edge cases where we deliver a splice that fixes a module that isn't installed. If, later on, the module gets installed and loaded, we'd have no chance to splice it 'at runtime' (just imagine the can of worms that opens up if this operation fails for whatever reason) and so we let the interposing code pick the right copy.

Kernel Framework

The kernel framework is the heart and soul of Ksplice on Solaris. Splice operations start from an ioctl to the /dev/ksplice device, which is provided by the ksplice kernel module. This module contains the Solaris implementation of the run-pre algorithm, the preflight safety checks and the patching support. Along with the kernel module, a small portion of the framework is provided by genuinx, mostly to maintain metadata and state information about the loaded splices. This split allows for the ksplice module to be loaded/unloaded at will, so that we can update it at runtime.

Function patching is performed by placing a trampoline from the original function to the patched one. The trampoline is 5 bytes on x86 (jmp offset) and 12 bytes on SPARC (sethi, jmpl, nop) and so, by the sacred rules of self-modifying code, cannot be placed safely without stopping all the cpus except the one running the patching code. While the world is stopped, the framework also takes the chance to walk through all the existing thread stacks, looking for any target pointer stored there, as that might lead to inconsistencies or crashes after the patching. This operation, internally referred to as stack-check, needs to run fast, to prevent any network or cluster timeout/heartbeat from hitting.

Fun story about stack-check. For a while we have just not paid attention to how long the operation was taking, because testing machines tend to not have too much traffic or network sensitive applications on them (the operation time grows linearly with the number of processes). The original stack-check algorithm was kind of simplistic, starting from the top of the stack and comparing 8 bytes at the time all the way down, but effective. It also felt fast enough.

Later on, reality kicked in, especially on SPARC where stacks are significantly larger compared to x86. Our clustering code started panic'ing here and there with heartbeat timeouts and that became very fast a P1 bug. We worked out a quicker, but slightly riskier algorithm, in which we walked the stack frame by frame and only evaluated function linkage data (e.g. return addresses or passed in parameters). That relieved the problem, but was still somewhat close to the time limit when testing with a very large number of processes. On top of that, for splices removing a symbol, we still had to make sure somehow that no local variable contained a reference to it, or fully embrace the yolo mentality. Basically, we had duck-taped the issue, but not really solved it.

Turns out that there is a third, much better way: instead of performing the whole stack check while cpus are stopped, we perform an initial pass while the world is running. If we hit a conflict we back off for a bit and try again. Rinse and repeat for three times before definitively bailing out. If we pass this step, then we stop the world and re-perform the stack-check, but this time we skip all the threads that haven't had any cpu time since the last check, as they haven't had any chance to make progress. This takes away a huge chunk of stack walking and makes things fast, so fast that we default to the full stack check again (but keep frame checking around for good measure and even compare the two on debug kernels).

Fun story about stack-check and SPARC, take two. At some point, all splice applications on SPARC started failing with a stack-check violation. Every single one of them had the issuing process (spliceadm) hitting a false positive in its call chain. We hadn't made any recent significant change to the algorithm, just some code reordering, so this was even more puzzling. First came the frustration-induced, draconian idea: always ignore stack-check failures that come from the thread that is running within the ksplice code path. Basically functioning, but really not pretty - so we kept debugging.

Oh beloved register windows, we meet again, Turns out that our code reordering led to the compiler leaving some of the to-be-checked pointers into registers that survived across a register window and ended up in the next, happily saved onto the stack right before the full stack-check. We solved this by adding a clear_window() routine that basically exhausted all the register windows and repeatedly set all registers to 0, so that we could start from a clean state. Small, cute and elegant - this worked for a while, until at some point false positives started popping up again.

On SPARC there is extra stack space that is saved for aggregate return values and an extra area for the callee to store register arguments. If this extra space ends up unused and unluckily aligned over some dead stack that contains the pointers we played with in the framework to prepare the check, a false positive arises again. As much as we had ways to solve this by rearranging the code, this felt fragile over time, so on top of the register window clearing, we now also zero out all the dead stack before walking down the stack checking algorithm, ensuring to do that from a call site that is higher than the shortest depth that the algorithm can hit.

Ksplice and DTrace

Along with stack-check, the most interesting safety check that we run is the one that guarantees interoperability between Ksplice and DTrace. Actually, this is more than just a safety check, as these two guys really like to fiddle with the .text segments and have to communicate to avoid stepping on each others toes.

The story of DTrace support is fairly tortuous and spans over a few years before we got to its final form, with various people alternating and, occasionally, walking down deep and dark alleys. If there is one thing that I've learned from this is that failure is, indeed, progress. We had to prove ourselves that some of the ideas were batshit crazy to really reach the final state we're now happy with.

Let's start with the problem to solve. DTrace has two unstable providers that interact with the text/symbols layout: FBT (Function Boundary Tracing) and SDT (Statically Defined Tracing). The former places probes at each function entry and return point, while the latter needs to be explicitly written into the source code and allows the programmer to collect data at arbitrary points within a function. They are both "unstable" as they are intimately tied with the kernel implementation, which we reserve the right to change at will.

One of the key ideas behind Ksplice is that things get updated, but you really don't notice that. As an example, we take care to not change user/kernel interfaces with it. When it comes to DTrace scripts, ideally we'd want something written prior to a splice to keep working even if the splice has detoured execution of one of the traced points. Defining working is the big deal. The unstability of the SDT and FBT providers gives us a bit of a leeway, but we have internal products that we want to splice, and that rely on SDT/FBT behavior (e.g. ZFSSA). Also, it would be silly to not strive for the best possible experience with one of Solaris finest tools, of course always factoring in the complexity.

Here is what we came up with. First of all, we need to distinguish between two macro scenarios: a script is running or a script has been written, but will be started later. In the first case, if it is currently enabling SDT or FBT probes within units that we need to evaluate or consume (e.g. run-pre/splice framework), we abort the splice operation and return the list of such scripts/pids to the admin. Trying to do anything on our own only leads to too much complexity. Say that we termporarily stop the script, do the patching and the logic of the function changes - would the script still make sense? What if the script tries to access a parameter that we no longer pass? What if the function was even deleted? Better have the admin relaunch the script and DTrace catch all these situations. This also solves the problem of DTrace modifying the .text segment of functions that we need to compare, as we ensure that no DTrace script will ever interfere during a splice operation.

For the second scenario, whereby a script exists but it will be (re)launched after the splice operation, there are a couple of troublesome situations:

  • Every patched function is inside a new module (the new_code) and part of the 4-touple that identifies a DTrace probe point (provider:module:function:name) relies on the module name. A script may think it's enabling the right SDT point, but it might be the "old" one and never fire.
  • DTrace providers are loadable kernel modules and build the list of probe points when loaded, by parsing all the already loaded modules. On top of that, there are hooks at every modload/modunload. Building the list means, for FBT, walking the symbol table and finding entry/exit points by pattern matching on known prologue/epilogues. Ksplice patches the prologue, so the view, pre and post a splice for a module, has a different number of entries and can lead to stale contents. Stale contents with DTrace are a panic waiting to happen.
  • Users might be confused if all of a sudden more than a single probe is enabled for a touple that doesn't specify the module name (new_code functions maintain the same name as the target ones).

We solve these problems differently for SDT and FBT. For SDT we implement what we call probe translation, so that the new_code SDT probe, if present and identical, overwrites the one from the patched function. The opposite operation happens during reverse, restoring the old SDT chain.

For FBT, we bite the bullet of letting the touple change with respect to the module definition. Say you have a script that hooks on fbt:splicetest:splicetest_math:entry and we patch splicetest_math; that script won't work anymore, because after the splice, splicetest:splicetest_math no longer has an expected prologue and is not recognized by DTrace as a valid point. Similarly, also splicetest_math:return goes away, solving the problem of an FBT return probe that never fires. Scripts in the form fbt::splicetest_math:{entry|return} instead just work seamlessly, as the last new_code module in the chain will be the only one providing the symbol. This form is by far the most common and the one that we use internally, so we "optimize" for it.

The above sort of works on x86 with the existing code, just by calling into DTrace modload/modunload callbacks, but is a total mess on SPARC. This is because on SPARC probes are set up through a two-pass algorithm, whereby in the first run we count the number of probes and allocate the necessary handling structures and on the second run, populate them. The simplistic calls into the modload/modunload routines would find a pre-allocated table and things would go south from there. It's also a bit gross, reflecting the attempt of a Ksplice person doing DTrace-y things, which is a classic sentinel of bad.

Thankfully, Tomas Jedlicka and Tomas Kotal came to the rescue by designing and implementing a much better interface in DTrace, that invents a new probe state, HIDDEN, that behaves like DISABLED, but cannot be enabled, ever. Its whole point is to stay around keeping metadata information. The only transition allowed is from HIDDEN to DISABLED and vice versa.

This HIDDEN state captures all the splice interaction scenarios: the target module is spliced and later parsed by FBT? All the spliced points get included in the list of probes, but marked HIDDEN. The splice is lifted? The probe points become DISABLED. The list has already been built, but we apply a splice? No problem, just get the list of targets from Ksplice and make the associated probes HIDDEN.

The HIDDEN concept is at the framework level and same goes for the new refresh callback, introduced to not overload modload/modunload and now consumed by Ksplice. By making these changes at the framework level, any future provider that might need to do something reacting to splice operations already has all the necessary entry points in place.  On top of that we also provide a couple of helper functions to request the original function contents (in case one wants to walk the .text segment as if the splice wasn't there) or the list of targets/symbols of a splice operation.

As of today, FBT and SDT are the only two consumers of the above.

User Experience

All the architecture, code, cute designs and long debugging sessions are pointless if you don't make your stuff usable. Staying with the idea that things get updated, but you really don't notice that, applying a splice to the system is as simple as installing/updating any other package, which, not to brag, is so damn cool (I might be biased by the amount of manual loading that I've done during development). This is achieved through the SMF svc://system/ksplice:default service, which coordinates automatic splice operations.

This service is responsible of four main things:
  • apply splices on delivery, by getting refreshed by pkg
  • control freezing and unfreezing of splices
  • on a reboot, apply all the splices at boot time
  • collect and store splice logs
Freezing is a Ksplice on Solaris specific concept, which roots on the fact that splices have a monotonically incremental id. At any point in time, an admin can specify a maximum ID value that the system can be at. If there are splices with a bigger ID currently applied, they get reversed, if new splices with a bigger ID get delivered, they are not loaded. The idea of freezing is to capture scenarios where admins want to download the splices, but still apply them during a quiet period (to maximize chances of success) or a potential downtime window (for a new technology such as ours, some testing of the water has to be expected). It also provides a very simple instrument to temporarily blacklist a problematic splice, while we frantically work on fixing it. Of course, we never release problematic splices, so you will never need that - right? If that was ever to happen, though, we also leverage the freezing concept to prevent reboot loops, by leaving a grace period before when the freshly applied splice will also be applied on reboot.

Freezing is controlled by spliceadm(1M), through the freeze <id> and unfreeze commands and highlighted by the status command. These three commands, along with log, are the only ones you should have ever to interact with for regular administration of Ksplice on Solaris, but we also provide a few more for our support folks to troubleshoot issues and manually interact with splices (apply/reverse/cleanup).

Lastly, there is spliceadm sync, which is what the SMF method calls. Its job is to walk the list of existing splices on the system and compare it with the freeze configuration to establish the list of splices to apply or reverse.

spliceadm man page describes the command in details and you can bet that, whenever the first splice will be out, a lot more documentation with examples and screenshots will be available. Since I'm now a user and no longer a developer, I'm really looking forward to that.

Closing Words/Shootouts

This project was huge and a number of people joined in at various stages to help along since the early days when Jan Setje-Eilers dragged me into this under Scott Michael's managerial supervision. Kuriakose Kuruvilla and Pete Dennis have been stably part of the "Solaris Ksplice Team", Rod Evans and Ali Bahrami (the linker aliens) have joined mid-way and made the tooling and krtld so much better, Mark J Nelson is one of the three people in the organization that understand everything lullaby does and that can express desires in Makefile form; if the infrastructure has gotten this efficient and anywhere sustainable, it's mostly thanks to his magic-fu. Xinliang Li and Raja Tummalapalli have both tolerated our occasional "what if we do that?" and turned it into code. Testing infrastructure was Albert White's work and the gate autografting and management was Adam Paul's and Gabriel Carrillo's bread and butter.

Bottom line, I mostly just got to tell the story :-)

No comments:

Post a Comment