The 5-year journey to bring restartable sequences to Linux
The EfficiOS team on 08 February 2019
Concurrency control algorithms, paired with per-CPU data, are integral to ensuring low-level libraries and high-performance applications scale properly on today's hardware. These algorithms guarantee that user-space data structures are kept consistent in the face of simultaneous accesses, and that modifications are done wholly so that threads see either the before or after state, but not in between.
There are a number of ways to design these algorithms. The most common, though least scalable, is with mutual exclusion which works by allowing only one thread to modify the data structure at any one time — and that's the thread holding the mutex.
But mutual exclusion doesn't scale well, particularly with per-CPU data. Since only one thread can be within the critical section bookended by the mutex lock and unlock sequence, it's possible to have many threads waiting to acquire the lock. And time waiting for locks is time not spent doing useful work.
The next step up on the list of scalable algorithms is atomic compare and swap.
In this model, a single instruction is used such as cmpxchg
or
the lock
prefix in the case of x86.
The problem here though is that atomic instructions are relatively expensive on
modern processors. The x86 lock
prefix can easily add many cycles to the cost
of the same instruction without the prefix. To make matters worse, regardless
of whether there's only one thread executing the instruction or not the lock
prefix will be executed.
And not all architectures provide a single instruction for atomically updating data. For instance, ARM uses link-load/store-conditional instructions to read the old data, modify it, and write the new data. If two threads try to write the same data simultaneously only one will succeed and the other will fail and restart the read-modify-write sequence.
The Linux kernel uses other methods of protecting per-CPU data, such as disabling preemption or interrupts, or using “percpu” local atomic operations. Unfortunately, these mechanisms are either not readily available to user-space, or they're comparatively slow (as is the case with the atomic instructions). Instead, what's needed is a lightweight mechanism for user-space to protect per-CPU data. This is the motivation for restartable sequences.
How Restartable Sequences Work
Restartable sequences are built with a new system call, rseq(2)
,
that tells the kernel where the restartable sequence's thread-local storage
ABI (a struct rseq
object) is located for the
current thread. This object contains a rseq_cs
field which is a
pointer to the currently active restartable sequence critical section
descriptor (a struct rseq_cs
object). Only one critical section
can be active at any time. This ABI has two purposes: user-space restartable
sequences and a way to quickly read the current CPU number.
The critical section is subdivided into preparatory and commit stages, where the commit step is a single CPU instruction. The critical section is said to have been interrupted if any of the following occur before the commit step:
- The thread is migrated to another CPU
- A signal is delivered to the thread
- The thread is preempted
Additionally, because a fallback mechanism is required when the thread is interrupted, the kernel sets the instruction pointer to the start of an abort handler which can take some corrective action, for example retrying the preparatory step.
With this design, the optimistic case (where the thread isn't interrupted) is extremely fast because expensive locks and atomic instructions can be entirely avoided.
Getting down into the details, the current restartable sequence critical
section is described by a struct rseq_cs
object which is
referenced by a struct rseq
object. Here's a diagram to illustrate
the relationship and the structure members.
The start and end of the restartable sequence are denoted by the
start_ip
and post_commit_ip
(which points to the
instruction after the commit instruction), and the abort_ip
points
to the first instruction of the abort handler.
There are restrictions on the implementation of both the critical section and the abort handler. For one, the abort handler must be outside the critical section. Secondly, system calls are not permitted within a critical section, and attempting to execute a system call will cause the process to be terminated via a segmentation fault.
The cpu_id
field is updated by the kernel whenever a thread
migrates to a new CPU and user-space programs are expected to read
cpu_id_start
at the beginning of the critical section and compare
the two values. If they differ then the running thread was interrupted and the
restartable sequence needs to be retried.
The rseq_cs
field is modified by both the kernel and user-space.
When beginning a restartable sequence, user-space code needs to set the pointer
to the current critical section descriptor. The kernel sets this pointer to
NULL
whenever it preempts or delivers a signal while executing a region of code
that lays outside of the critical section range described by the current
rseq_cs
descriptor.
A Brief History of rseq
Support for restartable sequences was merged in Linux 4.18. The concept of restartable sequences, or rseq, was originally proposed by Paul Turner and Andrew Hunter in 2013 as a way to safely access per-CPU data from user-space data without locks or expensive atomic instructions. But at the time no patches were available.
After two years — and in an effort to cajole them into posting their patches to the Linux kernel mailing list — Mathieu Desnoyers submitted his own patches for per-CPU critical sections in May 2015. A month later, the very first patch series for restartable sequences was posted by Paul. After momentum died down on Paul's version, Mathieu picked it up in 2016 and submitted a new series and covered them in a talk at Linux Plumbers Conference 2016. He had hoped to get the patches merged for version 4.15 of the Linux kernel but there was a snag…
While benchmark data was available with pretty much every version of the patch series, Linus made it clear that hypothetical use cases were not reason enough to merge the restartable sequences feature, and that concrete performance numbers were necessary.
Facebook had already provided results for using the patches with the jemalloc memory allocator. So, Mathieu set about collecting more benchmark results and getting rseq support ready for other projects such as LTTng-UST, Userspace RCU, and glibc.
Finally, after five years the series was merged into the Linux kernel, and Mathieu gave a talk at the Open Source Summit Europe 2018 entitled Improve Linux User-Space Core Libraries with Restartable Sequences which covered the multi-year effort of bringing restartable sequences to Linux.
How to Use rseq in your Library or Application
The preferred approach to using restartable sequences is to use the
librseq library which provides all the
per-CPU operations you're likely to need such as making rseq(2)
available to the current thread (rseq_register_current_thread()
),
looking up the CPU number of the current thread
(rseq_current_cpu()
), and updating per-CPU data
(rseq_cmpeqv_storev()
).
But if you want to roll your own operations, read on for a more detailed explanation.
Using rseq(2)
requires two steps. First, you need to enable the
functionality for the current thread with the rseq(2)
system call.
The system call has the following prototype:
sys_rseq(struct rseq *rseq, uint32_t rseq_len, int flags, uint32_t sig)
The purpose of the system call is to register a struct rseq
object with the
kernel. The flags
argument is 0
for registration and RSEQ_FLAG_UNREGISTER
for
unregistration. The sig
argument is a signature that can be used to validate
the rseq context, in other words the signature used for registration must be
the same one used for unregistration.
Let's assume you want to increment a per-CPU counter using
rseq(2)
. To do that, you need to get the CPU number of the current
thread (stored in the cpu_id_start
field of struct
rseq
) and modify the per-CPU counter using a restartable sequence. This
is done with a mixture of C and assembly. Here's the code to do that.
#define _GNU_SOURCE
#include <linux/rseq.h>
#include <stdio.h>
#include <stdlib.h>
#include <syscall.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
static __thread volatile struct rseq __rseq_abi;
#define RSEQ_SIG 0x53053053
static int sys_rseq(volatile struct rseq *rseq_abi, uint32_t rseq_len,
int flags, uint32_t sig)
{
return syscall(__NR_rseq, rseq_abi, rseq_len, flags, sig);
}
static void register_thread(void)
{
int rc;
rc = sys_rseq(&__rseq_abi, sizeof(struct rseq), 0, RSEQ_SIG);
if (rc) {
fprintf(stderr, "Failed to register rseq\n");
exit(1);
}
}
#define RSEQ_ACCESS_ONCE(x) (*(__volatile__ __typeof__(x) *)&(x))
static int rseq_addv(intptr_t *v, intptr_t count, int cpu)
{
__asm__ __volatile__ goto (
".pushsection __rseq_cs, \"aw\"\n\t"
".balign 32\n\t"
"1:\n\t"
".long 0, 0\n\t"
/* start_ip, post_commit_offset, abort_ip */
".quad 2f, (3f-2f), 4f\n\t"
".popsection\n\t"
".pushsection __rseq_cs_ptr_array, \"aw\"\n\t"
".quad 1b\n\t"
".popsection\n\t"
"2:\n\t"
"leaq 1b(%%rip), %%rax\n\t"
"movq %%rax, %[rseq_cs]\n\t"
"cmpl %[cpu_id], %[current_cpu_id]\n\t"
"jnz 4f\n\t"
"addq %[count], %[v]\n\t" /* final store */
"3:\n\t"
".pushsection __rseq_failure, \"ax\"\n\t"
/* Disassembler-friendly signature: ud1 <sig>(%rip),%edi. */
".byte 0x0f, 0xb9, 0x3d\n\t"
".long 0x53053053\n\t" /* RSEQ_SIG */
"4:\n\t"
"jmp %l[abort]\n\t"
".popsection\n\t"
: /* gcc asm goto does not allow outputs */
: [cpu_id] "r" (cpu),
[current_cpu_id] "m" (__rseq_abi.cpu_id),
[rseq_cs] "m" (__rseq_abi.rseq_cs),
/* final store input */
[v] "m" (*v),
[count] "er" (count)
: "memory", "cc", "rax"
: abort
);
return 0;
abort:
return -1;
}
int main(int argc, char **argv)
{
int cpu, ret;
intptr_t *cpu_data;
long nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
cpu_data = calloc(nr_cpus, sizeof(*cpu_data));
if (!cpu_data) {
perror("calloc");
exit(EXIT_FAILURE);
}
register_thread();
cpu = RSEQ_ACCESS_ONCE(__rseq_abi.cpu_id_start);
ret = rseq_addv(&cpu_data[cpu], 1, cpu);
if (ret)
fprintf(stderr, "Failed to increment per-cpu counter\n");
else
printf("cpu_data[%d] == %ld\n", cpu, cpu_data[cpu]);
return 0;
}
The code in rseq_addv()
begins by filling out a struct rseq_cs
object that describes the
segment of the restartable sequence denoted by the start symbol 2
, the
post-commit offset (3
- 2
), and the abort handler 4
. If the thread does not
complete the sequence between labels 2
and 3
, control jumps to the 4
label
and then to the abort
label in C.
A word of caution: you must ensure that the CPU number is only read once.
Compilers need to be coerced into guaranteeing that with the volatile keyword.
The RSEQ_ACCESS_ONCE()
macro above guarantees this.
Exactly How Much Faster is rseq?
One of the main use cases for restartable sequence is getting the CPU number
that the current thread is executing on — usually to then be used as an index
into a per-CPU data structure. The existing method of using sched_getcpu()
to
discover the CPU number requires a system call on ARM and invokes the VDSO on
x86. But rseq(2)
allows you to read a cached CPU number value in the struct
rseq
object shared between kernel and user-space.
The rseq approach results in a 20x speedup on x86 and 35x speedup on ARM.
Here's a graph demonstrating the relative speed improvements of the alternative methods of getting the CPU number for the current thread. In all the following graphs, smaller values are better and reflect a speed increase.
As discussed above, there are a number of use-cases that rseq is suitable for.
One of the most common uses of per-CPU data is for storing counters. The
following graph shows the speed improvements when incrementing a per-CPU
counter with rseq(2)
versus using sched_getcpu()
and
atomic instructions. ARM shows an 11x speedup and x86 shows 7.7x.
LTTng-UST uses per-CPU buffers to store events. The graph below shows the
performance improvement when storing 32-bit header and 32-bit event payload
into a per-CPU buffer using sched_getcpu()
and atomic
instructions or rseq. ARM received a 1.1x speedup and x86 saw a 1.2x improvement.
And finally, using rseq in liburcu from the Userspace RCU project results in a 5.8x speedup on ARM and a 1.9x speedup on x86.
What's Next?
While patches to use rseq(2)
exist for LTTng, Userspace RCU, and
glibc, however they're only at the proof of concept stage. The next phase is to get
them merged into their respective projects. For glibc, that means patches to
automatically register with rseq(2)
at thread start time,
unregister on thread exit, and at NPTL initialization for the main thread.
LTTng-UST is in a slightly different situation because of a shortcoming with
the current design: it's not possible to move data between per-CPU data
structures without also needing to change the thread's affinity mask. To solve
this problem Mathieu has proposed a new cpu_opv system
call
that executes a vector of fixed operations (comparison, memcpy, and add) on a
specific CPU, similar to the struct iovec
concept with readv(2)
and
writev(2)
. Another issue with rseq(2)
that is solved with cpu_opv is that if
you single-step through a critical section the debugger will loop forever. The
new cpu_opv system call will allow debuggers to work with existing applications
even if libraries make use of rseq(2)
.
Mathieu had originally hoped to get the new cpu_opv system call merged in time
for the Linux kernel 4.21 release, but Linus Torvalds has made it clear that he
wants to see users of
rseq(2)
implemented first. Meaning those work in progress rseq(2)
patches for glibc need to be merged.
If you want to use rseq(2)
in your application or library, we'd love to know
about it. You can contact Mathieu at mathieu.desnoyers@efficios.com.