SuSE Kernel Developer
Copyright (C) 2001 Andrea Arcangeli SuSE
UKUUG Manchester, 29 June - 1 July 2001
- why 64bit?
- need for large amounts of RAM
- higher performance
- other code that works with 64bit data
- why extend x86 to 64bit?
- proven and familiar technology
- x86 is the most popular instruction set in the world
- delivers 64bit while providing full x86 compatibility at nearly no cost
- doesn't require a completely new toolchain
- backwards compatible with existing x86 modes
- architectural support for 64bit virtual and physical address spaces
- implementations may support less
- 64bit integer operations and registers
- double the number of integer and SSE registers
- 16 64bit integer general purpose registers
- 16 SSE registers
- eliminate unused/underutilized arcane x86 features within the context of 64bit mode
- 64bit mode doesn't use segments
Cost of a system call
System calls have a significant cost: before the kernel code implemented by the syscall can run, the CPU has to do a significant amount of work, both in hardware and in software.
This overhead is not negligible. It is easy to measure using the getpid() system call, which costs basically nothing beyond the overhead of entering/exiting the kernel.
system call cost numbers
I measured the enter/exit kernel overhead in Linux on a few machines (the kernel version isn't relevant, as there are no interesting changes in the entry/exit points between 2.2 and 2.4).
machine                        syscall latency   linux kernel running
-------                        ---------------   --------------------
Old Pentium (no-MMX) 166MHz    125 cycles        2.2.18pre9aa1
PII 450MHz                     292 cycles        modified 2.4.0-test8-pre5
Athlon 1GHz                    260 cycles        2.2.18pre9aa1
Alpha 21264 ev6 500MHz         146 cycles        modified 2.4.0-test8-pre5
call vs system call
Although Linux system call latency is extremely low compared to other operating systems, the more frequently a system call is invoked and the fewer cycles it takes to run, the less we want to pay the overhead of entering/exiting the kernel.
Incidentally, gettimeofday is heavily called by lots of applications: first of all real-time and multimedia applications that somehow need to react to the current time; secondly, databases and many other apps that issue a huge number of gettimeofday calls.
Gettimeofday is also extremely fast, so it's exactly the kind of functionality we don't want to have to enter the kernel for.
gettimeofday into userspace?
Moving gettimeofday into userspace is what we would like to do, but it isn't entirely feasible.
gettimeofday needs information from the hardware, from the timer interrupt and from the time stamp counter, so we cannot move gettimeofday entirely into userspace: we don't have access to all that information from userspace.
Since we can't implement gettimeofday in userspace, the solution(tm) we implemented was to move the kernel into userspace :-).
In short, gettimeofday is still kernel code, but we don't need to enter/exit the kernel anymore in order to run it, because it lives in a _kernel_ page that is executable with userspace privileges.
We call this kind of kernel syscalls "virtual syscalls" and the gettimeofday vsyscall in particular is called "vgettimeofday".
The program will only need to run a `call 0xfff...` (to a certain fixed virtual address in kernel space) to execute vgettimeofday, as if it were a glibc function or, more generally, a normal userspace function (but remember it is really kernel code running with userspace privileges!).
This drops the system call overhead to zero, significantly improving the performance of applications that use gettimeofday heavily in tight loops.
Interaction with glibc
In the long term the vgettimeofday kernel virtual address will be prelinked into the binary at compile time, like a weak glibc symbol.
glibc will take care of patching the binary .text code if somebody overrides it with a userspace library.
The other issue to solve is the handling of the dwarf2 information of the vsyscalls: we must somehow be able to unwind the stack through a vsyscall, to allow exceptions and stack traces to pass through it.
For live debugging and exceptions, the simpler solution is to somehow provide the dwarf2 information of the vsyscalls to userspace, so that userspace will be able to unwind the stack through the whole vsyscall.
For core dumps, with the dwarf2 information stored in the coredump, the debugger will be able to unwind the frame of the dumped vsyscall, discard the vsyscall execution, and finally set the right retval into %rax before artificially returning from the vsyscall.
In all Linux ports, when an invalid userspace pointer is passed to a syscall, the syscall gracefully returns -EFAULT.
Since the vsyscall runs in userspace, right now dereferencing an invalid pointer inside the vsyscall will just generate a SIGSEGV.
It would be possible to teach the page fault handler when it should not raise a SIGSEGV, possibly with a mechanism similar to the one used by copy_user(). On the other hand, POSIX also allows a syscall to throw SIGSEGV at the user when userspace passes invalid pointers to the kernel, so this is at least theoretically not required.
vsyscalls must make sure to get coherent input from the privileged kernel before computing the time of day (the output).
The locking algorithm we implemented uses two sequence numbers. Userspace is only a reader; the kernel is only a writer.
Privileged kernel increases sequence number #1, then makes the changes to the vsyscall input, and ultimately increases sequence number #2.
The vsyscall instead reads sequence number #2, then reads the input, and finally reads sequence number #1. If sequence number #1 and #2 match, it means the input is coherent and we go ahead with the computation, otherwise we try again.