Today, we're going to look at the many ways of implementing user-to-kernel
transitions on x86, i.e. system calls. Let's first quickly review what system
calls actually need to accomplish.
In modern operating systems there is a distinction between user mode
(executing normal application code) and kernel mode[1] (being able
to touch system configuration and devices). System calls are the way for
applications to request services from the operating system kernel and bridge the
gap. To facilitate that, the CPU needs to provide a mechanism for applications
to securely transition from user to kernel mode.
Secure in this context means that the application cannot just jump to arbitrary
kernel code, because that would effectively allow the application to do what it
wants on the system. The kernel must be able to configure defined entry points
and the system call mechanism of the processor must enforce these. After the
system call is handled, the operating system also needs to know where to return
to in the application, so the system call mechanism also has to provide this
information.
I came up with four mechanisms that match this description and work in 64-bit
environments. I'm going to save the weirder ones that only work on 32-bit for
another post. So we have:

  1. Software Interrupts using the int instruction
  2. Call Gates
  3. Fast system calls using sysenter/sysexit
  4. Fast system calls using syscall/sysret

Software interrupts are the oldest mechanism. The key idea is to use the same
method to enter the kernel as hardware interrupts do. In essence, it is still
the mechanism that was introduced with Protected Mode in 1982 on the 286, but
even the earlier CPUs already had cruder versions of this.
Because interrupt vector 0x80 can still be used to invoke system calls[2] on
64-bit Linux, we are going to stick with this example:
The processor finds the kernel entry address by taking the interrupt vector
number from the int instruction and looking up the corresponding descriptor in
the Interrupt Descriptor Table (IDT). This descriptor will be an Interrupt or
Trap Gate[3] to kernel mode and it contains the pointer to the handling function
in the kernel.
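
To make the lookup concrete, here is a minimal sketch of what one IDT entry looks like in 64-bit mode and how the handler pointer is scattered across it. The field layout follows the vendor manuals; the helper names (make_gate, gate_handler), the handler address and the selector value are made up for illustration and are not taken from any particular kernel.

```c
#include <assert.h>
#include <stdint.h>

/* One 16-byte IDT entry in 64-bit mode. */
struct idt_gate {
    uint16_t offset_low;   /* handler address bits 0..15 */
    uint16_t selector;     /* code segment selector the handler runs in */
    uint8_t  ist;          /* bits 0..2: Interrupt Stack Table index, 0 = unused */
    uint8_t  type_attr;    /* P (bit 7), DPL (bits 5..6), gate type (bits 0..3) */
    uint16_t offset_mid;   /* handler address bits 16..31 */
    uint32_t offset_high;  /* handler address bits 32..63 */
    uint32_t reserved;
};

static_assert(sizeof(struct idt_gate) == 16, "long mode IDT gates are 16 bytes");

enum { GATE_TYPE_INTERRUPT = 0xE, GATE_TYPE_TRAP = 0xF };

/* Hypothetical helper: build a present gate of the given type and DPL. */
static struct idt_gate make_gate(uint64_t handler, uint16_t selector,
                                 unsigned type, unsigned dpl)
{
    struct idt_gate g = {0};
    g.offset_low  = (uint16_t)(handler & 0xFFFF);
    g.selector    = selector;
    g.type_attr   = (uint8_t)(0x80 | ((dpl & 3) << 5) | (type & 0xF));
    g.offset_mid  = (uint16_t)((handler >> 16) & 0xFFFF);
    g.offset_high = (uint32_t)(handler >> 32);
    return g;
}

/* Hypothetical helper: reassemble the entry point the CPU would jump to. */
static uint64_t gate_handler(const struct idt_gate *g)
{
    return (uint64_t)g->offset_low |
           ((uint64_t)g->offset_mid << 16) |
           ((uint64_t)g->offset_high << 32);
}
```

A gate that user code is supposed to reach with int needs a DPL of 3, otherwise the processor raises a fault instead of entering the handler. With the helpers above, that would be make_gate(handler, 0x10, GATE_TYPE_INTERRUPT, 3), where 0x10 is just an example kernel code selector.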
These kinds of transitions between different privilege levels using gates cause
the processor to switch the stack as well. The stack pointer for the kernel
privilege level is kept in the Task State Segment[4]. After switching to the new
stack, the processor pushes (among other information) the return address and the
user's stack pointer onto the kernel stack. A typical handler routine in
the kernel would then continue with pushing general purpose registers on the
stack as well to preserve them. The data structure that is created on the stack
during this process is called the interrupt frame.
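
As a rough picture of that interrupt frame, the sketch below models the part the hardware pushes in 64-bit mode, plus a software-saved register area below it. The hardware part (RIP, CS, RFLAGS, RSP, SS, lowest address first) is fixed by the architecture; the order and number of general purpose registers is the kernel's choice, so the gpr array and all struct and function names here are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* What the CPU pushes on the kernel stack for int in 64-bit mode,
 * listed from the lowest address (last pushed) upwards. */
struct hw_iret_frame {
    uint64_t rip;     /* return address: instruction after int */
    uint64_t cs;      /* user code segment selector */
    uint64_t rflags;  /* user flags */
    uint64_t rsp;     /* user stack pointer */
    uint64_t ss;      /* user stack segment selector */
};

/* Hypothetical full frame: the handler pushes 15 general purpose
 * registers (everything except RSP) below the hardware part. */
struct interrupt_frame {
    uint64_t gpr[15];
    struct hw_iret_frame hw;
};

static_assert(sizeof(struct hw_iret_frame) == 40, "five 8-byte slots");

/* Where iret will resume execution. */
static uint64_t frame_return_address(const struct interrupt_frame *f)
{
    return f->hw.rip;
}

/* The low two bits of the saved CS hold the privilege level of the
 * interrupted code; 3 means the entry came from user mode. */
static int frame_from_user(const struct interrupt_frame *f)
{
    return (f->hw.cs & 3) == 3;
}
```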
To return to userspace[5], the kernel executes an iret (interrupt return)
instruction after restoring the general purpose registers. iret restores the
user's stack and execution continues after the int instruction that entered the
kernel in the first place. Despite the short description here, iret is one of
the most complicated instructions in the x86 instruction set.
Our second mechanism, the Call Gate, is very similar to using software
interrupts. Although Call Gates are the somewhat official way of implementing
system calls in the absence of the more modern alternatives discussed below, I'm
aware of no use except by malware.
I've highlighted the differences between the software interrupt flow and Call
Gate traversal here:
Instead of int, the user initiates the system call by executing a far call.
Far calls are left-overs from the x86 segmented memory model where a call
instruction doesn't only specify the instruction pointer to go to, but also
refers to the memory segment the instruction pointer is relative to using a
selector (0x18 in the example).
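
A selector is not just a table index. As a small sketch, this is how the 16 bits split into a descriptor index, a table indicator and a requested privilege level; the struct and function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

struct selector_fields {
    unsigned index;  /* which descriptor slot in the table */
    unsigned table;  /* 0 = GDT, 1 = LDT */
    unsigned rpl;    /* requested privilege level, 0..3 */
};

/* Split a segment selector into its three architectural fields. */
static struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;
    f.index = sel >> 3;
    f.table = (sel >> 2) & 1;
    f.rpl   = sel & 3;
    return f;
}
```

For the selector from the example, decode_selector(0x18) yields index 3 in the GDT with a requested privilege level of 0.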
The processor looks up the corresponding segment in the Global Descriptor Table
and in our case finds a Call Gate instead of an ordinary segment. The Call Gate
specifies
the instruction pointer in the kernel just as the Interrupt Gate does. The
processor ignores the instruction pointer provided by the call instruction in
this case. The rest works similarly to the software interrupt case, except that
the kernel has to use a different instruction for the return path because of a
different stack frame layout created by the hardware.
As you'll see below, both of these kernel entry methods are dog slow. For both
the interrupt and call gate traversals, the processor re-loads code and stack
segment registers from the GDT. Each descriptor load is very expensive, because
the processor has to decipher a fairly messed up data structure and perform many
checks.
Many of these checks are pointless in modern operating systems. The features
provided by the segmented memory model and the hierarchical protection domains
they enable are not used. Instead of disjoint memory segments, every segment has
a base of zero and
the maximum size. Protection is realized via paging and the protection rings are
only used to implement kernel and user mode.
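
To see both the "fairly messed up data structure" and the flat setup at once, here is a sketch that pulls base, limit and privilege level out of a classic 8-byte segment descriptor. The bit positions follow the architecture manuals; the function names and the two descriptor values in the note after the code are illustrative and not taken from a specific kernel.

```c
#include <assert.h>
#include <stdint.h>

/* The base is split into a 24-bit and an 8-bit piece. */
static uint32_t desc_base(uint64_t d)
{
    return (uint32_t)(((d >> 16) & 0xFFFFFF) | (((d >> 56) & 0xFF) << 24));
}

/* The 20-bit limit is split as well, and the G flag (bit 55) decides
 * whether it counts bytes or 4 KiB pages. Returns the highest valid
 * offset in bytes. */
static uint32_t desc_limit_bytes(uint64_t d)
{
    uint32_t limit = (uint32_t)((d & 0xFFFF) | (((d >> 48) & 0xF) << 16));
    if ((d >> 55) & 1)
        limit = (limit << 12) | 0xFFF;
    return limit;
}

/* Descriptor privilege level: bits 45..46. */
static unsigned desc_dpl(uint64_t d)
{
    return (unsigned)((d >> 45) & 3);
}
```

A flat ring 0 code segment such as 0x00CF9A000000FFFF decodes to base 0 and limit 0xFFFFFFFF, and its ring 3 counterpart 0x00CFFA000000FFFF differs only in the privilege level. All the other fields carry no information in this setup, yet the processor still has to parse and check them on every load.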
The solution to this problem comes in an Intel and an AMD flavor:
sysenter/sysexit is the instruction pair for fast system calls that Intel
introduced with the Pentium II in 1997. AMD came up with a similar but
incompatible instruction pair, syscall/sysret, with the K6-2 in 1998[6].
Both of these instruction pairs work pretty much the same. Instead of having to
consult descriptor tables in memory for what to do, most of the functionality is
hardcoded and the unused flexibility is lost: sysenter and syscall assume a
flat memory model and load segment descriptors with fixed values. They are also
incompatible with any non-standard use of the privilege levels.
A kernel-accessible model-specific register (MSR) points to the kernel's system
call entry point. sysenter also switches to the kernel
stack in this way. The processor does not have to interpret data structures in
memory. The return address is left in a general-purpose register for the kernel
to save wherever it wants to.
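
For syscall/sysret, the "fixed values" come from a single MSR. The sketch below models how the STAR register encodes the selectors that syscall and a 64-bit sysret load without touching any descriptor table. The MSR numbers and the derivation rules are from the vendor manuals, not from this post; the helper names and the selector values used in the note after the code are assumptions for illustration. Actually writing these MSRs requires wrmsr in kernel mode, so this only computes the values.

```c
#include <assert.h>
#include <stdint.h>

#define MSR_STAR  0xC0000081u  /* selector bases for syscall and sysret */
#define MSR_LSTAR 0xC0000082u  /* 64-bit syscall entry point */
#define MSR_FMASK 0xC0000084u  /* RFLAGS bits cleared on syscall */

struct fixed_selectors {
    uint16_t kernel_cs, kernel_ss;  /* loaded by syscall */
    uint16_t user_cs, user_ss;      /* loaded by 64-bit sysret */
};

/* Hypothetical helper: compose the STAR value from the two bases. */
static uint64_t make_star(uint16_t syscall_base, uint16_t sysret_base)
{
    return ((uint64_t)sysret_base << 48) | ((uint64_t)syscall_base << 32);
}

/* Derive the selectors the CPU uses, following the documented rules. */
static struct fixed_selectors selectors_from_star(uint64_t star)
{
    uint16_t entry = (uint16_t)((star >> 32) & 0xFFFF);
    uint16_t ret   = (uint16_t)((star >> 48) & 0xFFFF);
    struct fixed_selectors s;
    s.kernel_cs = (uint16_t)(entry & 0xFFFC);    /* privilege bits forced to 0 */
    s.kernel_ss = (uint16_t)(entry + 8);         /* next GDT slot */
    s.user_cs   = (uint16_t)((ret + 16) | 3);    /* privilege bits forced to 3 */
    s.user_ss   = (uint16_t)((ret + 8) | 3);
    return s;
}
```

With make_star(0x10, 0x23), the kernel side ends up with CS 0x10 and SS 0x18, and the user side with CS 0x33 and SS 0x2b. The consequence is that the GDT has to be laid out to match these fixed offsets, which is the flexibility that was traded away for speed.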
I've done microbenchmarks (code is on GitHub) of the cost of these
mechanisms[7]. I've measured the cost of entering and exiting the kernel
with an empty system call handler in the kernel. We are just looking at the cost
the hardware imposes on the system call path. The difference in performance is
striking:
Using either syscall/sysret or sysenter/sysexit to perform a system call
is an order of magnitude faster than using traditional methods. Both modern
methods cost
around 70 cycles per roundtrip from user to kernel and back. This is less than
a single 64-bit integer division!
No description of sysenter would be complete without mentioning the sharp
edges of using it. Check out these two issues that can ruin your day if you
implement system call paths.