Valgrind, an open-source memory debugger for x86-GNU/Linux

Julian Seward, jseward@acm.org

http://developer.kde.org/~sewardj

May 2002


1  Introduction

Valgrind is a tool to help you find memory-management problems in your programs. When a program is run under Valgrind's supervision, all reads and writes of memory are checked, and calls to malloc/new/free/delete are intercepted. As a result, Valgrind can detect problems such as use of uninitialised memory, reading or writing memory after it has been freed, reading or writing off the ends of malloc'd blocks, passing uninitialised or unaddressable memory to system calls, mismatched use of malloc/new/new[] versus free/delete/delete[], and memory leaks. Problems like these can be difficult to find by other means, often lying undetected for long periods, and then causing occasional, difficult-to-diagnose crashes.

Since Valgrind's core machinery intercepts every memory access your program makes, other interesting possibilities emerge. Support was recently added to simulate an I1/D1/L2 cache hierarchy, giving detailed insights into the causes of cache misses in programs.

Valgrind is designed to be as non-intrusive as possible. It works directly with existing executables. You don't need to recompile, relink, or otherwise modify the program to be checked. Simply place the word valgrind at the start of the command line normally used to run the program. So, for example, if you want to run the command ls -l on Valgrind, simply issue the command: valgrind ls -l.

Valgrind takes control of your program before it starts. Debugging information is read from the executable and associated libraries, so that error messages can be phrased in terms of source code locations. Your program is then run on a synthetic x86 CPU which checks every memory access. All detected errors are written to a log. When the program finishes, Valgrind searches for and reports on leaked memory.

If run in cache-profiling mode, at program exit the accumulated statistics are dumped into a log file. An auxiliary program reads this, and will print source code annotated with access and miss counts, down at the level of individual x86 instructions if you want. Summaries for (C/C++) source lines, procedures and the whole program are also generated.

You can run pretty much any dynamically linked ELF x86 executable using Valgrind. Programs run 25 to 50 times slower, and take a lot more memory than they usually would. It works well enough to run large programs. For example, the Konqueror web browser from the KDE Desktop Environment, version 3.0, runs slowly but usably on Valgrind.

Valgrind simulates every single instruction your program executes. Because of this, it finds errors not only in your application but also in all supporting dynamically-linked libraries, including the GNU C library, X client libraries, Qt, the KDE libraries, and so on. That often includes libraries, for example the X client libraries, which contain memory access violations, but which you cannot or do not want to fix.

Rather than swamping you with errors in which you are not interested, Valgrind allows you to selectively suppress errors by recording them in a suppressions file which is read when Valgrind starts up. The build mechanism attempts to select suppressions which give reasonable behaviour for the libc and XFree86 versions detected on your machine.

Valgrind is not a toy system. It can run huge applications: major open-source projects such as OpenOffice 1.0, pretty much all of KDE 3.0, and Mozilla 1.0rc3 give an idea of the scale of programs it can run.

2  Errors that Valgrind can find

Despite considerable sophistication under the hood, Valgrind can only really detect two kinds of errors: use of illegal addresses, and use of undefined values. Nevertheless, this is enough to help you discover all sorts of memory-management nasties in your code.

2.1  Illegal address errors

For example:
  Invalid read of size 4
     at 0x40F6BBCC: (within /usr/lib/libpng.so.2.1.0.9)
     by 0x40F6B804: (within /usr/lib/libpng.so.2.1.0.9)
     by 0x40B07FF4: read_png_image(QImageIO *) (kernel/qpngio.cpp:326)
     by 0x40AC751B: QImageIO::read() (kernel/qimage.cpp:3621)
     Address 0xBFFFF0E0 is not stack'd, malloc'd or free'd

This happens when your program reads or writes memory at a place which Valgrind reckons it shouldn't. In this example, the program did a 4-byte read at address 0xBFFFF0E0, somewhere within libpng.so.2.1.0.9, which was called from somewhere else in the same library, called from line 326 of qpngio.cpp, and so on.

Valgrind tries to establish what the illegal address might relate to, since that's often useful. So, if it points into a block of memory which has already been freed, you'll be informed of this, and also where the block was freed. Likewise, if it should turn out to be just off the end of a malloc'd block, a common result of off-by-one errors in array subscripting, you'll be informed of this fact, and also where the block was malloc'd.

In this example, Valgrind can't identify the address. Actually the address is on the stack, but, for some reason, this is not a valid stack address -- it is below the stack pointer, %esp, and that isn't allowed. In this particular case it's probably caused by gcc generating invalid code, a known bug in various flavours of gcc.

Note that Valgrind only tells you that your program is about to access memory at an illegal address. It can't stop the access from happening. So, if your program makes an access which normally would result in a segmentation fault, your program will still suffer the same fate -- but you will get a message immediately prior to this. In this particular example, reading junk on the stack is non-fatal, and the program stays alive.

2.2  Use of uninitialised values

For example:
  Conditional jump or move depends on uninitialised value(s)
     at 0x402DFA94: _IO_vfprintf (_itoa.h:49)
     by 0x402E8476: _IO_printf (printf.c:36)
     by 0x8048472: main (tests/manuel1.c:8)
     by 0x402A6E5E: __libc_start_main (libc-start.c:129)

An uninitialised-value use error is reported when your program uses a value which hasn't been initialised -- in other words, is undefined. Here, the undefined value is used somewhere inside the printf() machinery of the C library. This error was reported when running the following small program:

  #include <stdio.h>

  int main()
  {
    int x;
    printf ("x = %d\n", x);
  }

It is important to understand that your program can copy around junk (uninitialised) data to its heart's content. Valgrind observes this and keeps track of the data, but does not complain. A complaint is issued only when your program attempts to make use of uninitialised data. In this example, x is uninitialised. Valgrind observes the value being passed to _IO_printf and thence to _IO_vfprintf, but makes no comment. However, _IO_vfprintf has to examine the value of x so it can turn it into the corresponding ASCII string, and it is at this point that Valgrind complains.

The two main sources of uninitialised data are local variables which have not been initialised, and the contents of malloced blocks, before you write something into them.

2.3  Illegal frees

For example:
  Invalid free()
     at 0x4004FFDF: free (vg_clientmalloc.c:577)
     by 0x80484C7: main (tests/doublefree.c:10)
     by 0x402A6E5E: __libc_start_main (libc-start.c:129)
     by 0x80483B1: (within tests/doublefree)
     Address 0x3807F7B4 is 0 bytes inside a block of size 177 free'd
     at 0x4004FFDF: free (vg_clientmalloc.c:577)
     by 0x80484C7: main (tests/doublefree.c:10)
     by 0x402A6E5E: __libc_start_main (libc-start.c:129)
     by 0x80483B1: (within tests/doublefree)

Valgrind keeps track of the blocks allocated by your program with malloc/new et al, so it knows exactly whether or not the argument to free/delete is legitimate. Here, this test program has freed the same block twice. As with the illegal address errors, Valgrind attempts to make sense of the address freed. If, as here, the address is one which has previously been freed, you will be told that -- making duplicate frees of the same block easy to spot.

2.4  When a block is freed with an inappropriate deallocation function

In the following example, a block allocated with new[] has wrongly been deallocated with free:
  Mismatched free() / delete / delete []
     at 0x40043249: free (vg_clientfuncs.c:171)
     by 0x4102BB4E: QGArray::~QGArray(void) (tools/qgarray.cpp:149)
     by 0x4C261C41: PptDoc::~PptDoc(void) (include/qmemarray.h:60)
     by 0x4C261F0E: PptXml::~PptXml(void) (pptxml.cc:44)
     Address 0x4BB292A8 is 0 bytes inside a block of size 64 alloc'd
     at 0x4004318C: __builtin_vec_new (vg_clientfuncs.c:152)
     by 0x4C21BC15: KLaola::readSBStream(int) const (klaola.cc:314)
     by 0x4C21C155: KLaola::stream(KLaola::OLENode const *) (klaola.cc:416)
     by 0x4C21788F: OLEFilter::convert(QCString const &) (olefilter.cc:272)
The following was told to me by the KDE 3 developers. I didn't know any of it myself. They also implemented the check itself.

In C++ it's important to deallocate memory in a way compatible with how it was allocated. The deal is: memory allocated with malloc, calloc or realloc must be deallocated with free; memory allocated with new must be deallocated with delete; and memory allocated with new[] must be deallocated with delete[].

The worst thing is that on Linux it apparently doesn't matter if you muddle these up: everything seems to work fine, but the same program may then crash on a different platform, Solaris for example. So it's best to fix it properly.

Pascal Massimino adds the following clarification: delete[] must be paired with new[] because the compiler stores the size of the array, and the pointer to the destructor of the array's contents, just before the pointer actually returned. This implies a variable-sized overhead in what is returned by new or new[]. It is rather surprising how robust compilers [runtime-support libraries?] are to mismatches of new/delete and new[]/delete[].

2.5  Passing system call parameters with inadequate read/write permissions

Valgrind checks all parameters to system calls. If a system call needs to read from a buffer provided by your program, Valgrind checks that the entire buffer is addressable and has valid data, i.e., that it is readable. And if the system call needs to write to a user-supplied buffer, Valgrind checks that the buffer is addressable. After the system call, Valgrind updates its administrative information to precisely reflect any changes in memory permissions caused by the system call.

Here's an example of a system call with an invalid parameter:

  #include <stdlib.h>
  #include <unistd.h>
  int main( void )
  {
    char* arr = malloc(10);
    (void) write( 1 /* stdout */, arr, 10 );
    return 0;
  }

You get this complaint ...

  Syscall param write(buf) contains uninitialised or unaddressable byte(s)
     at 0x4035E072: __libc_write
     by 0x402A6E5E: __libc_start_main (libc-start.c:129)
     by 0x80483B1: (within tests/badwrite)
     by <bogus frame pointer> ???
     Address 0x3807E6D0 is 0 bytes inside a block of size 10 alloc'd
     at 0x4004FEE6: malloc (ut_clientmalloc.c:539)
     by 0x80484A0: main (tests/badwrite.c:6)
     by 0x402A6E5E: __libc_start_main (libc-start.c:129)
     by 0x80483B1: (within tests/badwrite)

... because the program has tried to write uninitialised junk from the malloc'd block to the standard output.

3  Implementation overview

What follows is a short tour of Valgrind's internals. Those interested in going further will find extensive documentation in the (huge) file docs/techdocs.html in any Valgrind source distribution.

3.1  Getting started

Valgrind is compiled into a shared object, valgrind.so. The shell script valgrind sets the LD_PRELOAD environment variable to point to valgrind.so. This causes the .so to be loaded as an extra library to any subsequently executed dynamically-linked ELF binary, viz, the program you want to debug.

The dynamic linker allows each .so in the process image to have an initialisation function which is run before main(). It also allows each .so to have a finalisation function run after main() exits.

When valgrind.so's initialisation function is called by the dynamic linker, the synthetic CPU starts up. The real CPU remains locked in valgrind.so for the rest of the program's run, but the synthetic CPU returns from the initialisation function. Startup of the program now continues as usual -- the dynamic linker calls all the other shared object initialisation routines, and eventually runs main(). This all runs on the synthetic CPU, not the real one, but the client program cannot tell the difference.

Eventually the client program will terminate by calling the exit system call -- there is no other way out. Valgrind detects this, and uses it as its cue to exit. It prints summaries of all errors detected, possibly checks for memory leaks, and then calls exit itself.

On entry, Valgrind switches stacks, so it runs on its own stack. On exit, it switches back. This means that the client program continues to run on its own stack, so we can switch back and forth between running it on the simulated and real CPUs without difficulty. This was an important design decision, because it makes it easy (well, significantly less difficult) to debug the synthetic CPU.

3.2  The translation/instrumentation engine

None of the original program's code is run directly. Only instrumented translations are run. Valgrind maintains a translation table, which allows it to find the translation quickly for any branch target (code address). If no translation has yet been made, the translator - a just-in-time translator - is summoned. This makes an instrumented translation, which is added to the collection of translations. Subsequent jumps to that address will use this translation.

The JITter translates basic blocks -- blocks of straight-line code -- as single entities. To minimise the considerable difficulties of dealing with the x86 instruction set, x86 instructions are first translated to a RISC-like intermediate code, similar to sparc code, but with an infinite number of virtual integer registers. Initially each instruction is translated separately, and there is no attempt at instrumentation.

The intermediate code is improved, mostly so as to try and cache the simulated machine's registers in the real machine's registers over several simulated instructions. This is often very effective. Also, we try to remove redundant updates of the simulated machine's condition-code register.

The intermediate code is then instrumented, giving more intermediate code. There are a few extra intermediate-code operations to support instrumentation; it is all refreshingly simple. After instrumentation there is a cleanup pass to remove redundant value checks.

This gives instrumented intermediate code which mentions arbitrary numbers of virtual registers. A linear-scan register allocator is used to assign real registers and possibly generate spill code. All of this is still phrased in terms of the intermediate code.

Then, and only then, is the final x86 code emitted. The intermediate code is carefully designed so that x86 code can be generated from it without need for spare registers or other inconveniences.

The translations are managed using a traditional LRU-based caching scheme. The translation cache has a default size of about 28MB. Instrumented translations are 12-14 times larger than the originals.

3.3  Tracking the status of memory

Each byte in the process' address space has nine bits associated with it: one "A" bit, which records whether that byte is addressable or not, and eight "V" bits, which record the definedness or otherwise of the bits within that byte. If a byte is not addressable, its eight V bits are irrelevant, so it is fair to say that Valgrind regards each byte of memory as being in one of 257 different states: unaddressable, or addressable with 256 definedness possibilities.

The A and V bits for each byte are stored using a sparse array, which flexibly and efficiently covers arbitrary parts of the 32-bit address space without imposing significant space or performance overheads for the parts of the address space never visited. The scheme used, and speedup hacks, are described in detail at the top of the source file vg_memory.c, so you should read that for the gory details.

3.4  System calls

All system calls are intercepted. The memory status map is consulted before and updated after each call. It's all rather tiresome and not at all exciting. See vg_syscall_mem.c for details.

3.5  Signals

All system calls to sigaction and sigprocmask are intercepted. If the client program is trying to set a signal handler, Valgrind makes a note of the handler address and which signal it is for. Valgrind then arranges for the same signal to be delivered to its own handler.

When such a signal arrives, Valgrind's own handler catches it, and notes the fact. At a convenient safe point in execution, Valgrind builds a signal delivery frame on the client's stack and runs its handler. If the handler longjmps, there is nothing more to be said. If the handler returns, Valgrind notices this, zaps the delivery frame, and carries on where it left off before delivering the signal.

The purpose of this nonsense is that setting signal handlers amounts to giving callback addresses to the kernel. We can't allow this to happen, because if it did, signal handlers would run on the real CPU, not the simulated one. This means the checking machinery would not operate during the handler run, and, worse, memory permissions maps would not be updated, which could cause spurious error reports once the handler had returned.

An even worse thing would happen if the signal handler longjmp'd rather than returned: Valgrind would completely lose control of the client program.

Upshot: we can't allow the client to install signal handlers directly. Instead, Valgrind must catch, on behalf of the client, any signal the client asks to catch, and must deliver it to the client on the simulated CPU, not the real one. This involves considerable gruesome fakery; see vg_signals.c for details.

3.6  PThreads

Linux's PThreads implementation, libpthread.so, uses the clone system call to generate new kernel-level threads. One possibility would be to allow Valgrind to persist across calls to clone, and insert the required locking/mutual exclusion for the memory permissions maps and other key data structures.

However, this was considered difficult to get right, and the additional locking requirements would have added significant overhead to what is already a slow system.

Instead, Valgrind supplies its own user-space PThreads implementation. This intercepts calls to the POSIX pthread_* functions, which are handled by a combination of simulated code communicating with supporting primitives and a scheduler inside Valgrind itself. The interception is done by supplying a fake libpthread.so and causing programs to link it instead of the system-supplied one. The fake library passes requests to Valgrind's core using a special trapdoor mechanism designed to allow client code to communicate with the underlying Valgrind core.

As ever, this trickery is invisible to both the end-user and the program being debugged. It has been a significant amount of work, but now works well enough to run large threaded apps (eg Mozilla, OpenOffice/StarOffice), and gives complete control of thread scheduling to Valgrind, which may be useful for future error-checking extensions designed to automatically find data races.

3.7  Cache profiling

The cache profiler was developed by Nick Nethercote, njn25@cam.ac.uk, at the University of Cambridge Computing Laboratory.

When in cache profiling mode, the normal error-checking instrumentation is not done. Instead, instrumentation code is added to call the cache simulator, which lives in vg_cachesim_D1.c, vg_cachesim_I1.c and vg_cachesim_L2.c. These track enough of the state of the caches being simulated to determine whether or not each memory reference will cause a miss in the D1, I1 or L2 cache. If so, a miss count for that reference is incremented.

The counters for each basic block are stored along with the translations themselves. This means there is essentially zero cost in finding the counter(s) associated with a given memory reference point, which speeds the simulation. When old translations are ejected from the translation cache, their counter arrays are retained. Should those translations later be re-made, they are reunited with the counter arrays they had previously.

At the end of the run the counters are dumped in a text file, cachegrind.out. Its format has been designed to keep its size manageable. Even so, running a web browser for a couple of minutes generates a file on the order of several megabytes.

The vg_annotate Perl script reads this file and generates summaries of the memory access and miss statistics, for the program as a whole and for each function. It will also print source code with counts annotated on each line, so that code generating cache misses can be identified exactly. The script has various flags which give considerable flexibility about which source fragments are displayed and which not.

The default simulated cache arrangement is for an AMD Athlon. The supplied vg_cachegen Perl script will generate vg_cachesim_D1.c, vg_cachesim_I1.c and vg_cachesim_L2.c suited to a cache arrangement of your specification, with a range of cache sizes, associativities and line sizes which should cover most arrangements in use today.

This is a little cumbersome. One possibility is for Valgrind to make enquiries with the x86 CPUID instruction at startup, so the correct cache parameters are automatically obtained.

4  Valgrind in practice

4.1  Feedback from users

The purpose of writing Valgrind was to create the best possible open source memory-debugging tool for x86-Linux. Although it has been under development for about two years, snapshots were only made publicly available towards the end of February 2002.

Feedback from users has ranged from lukewarm to very positive. Many people report that Valgrind was helpful in quickly finding obscure bugs which would otherwise have taken longer to find. Many people suffered the bugs, trials and tribulations of the frequent development snapshots, and their feedback, bug reports and patches rapidly improved Valgrind's usability and quality. If you are one of those people, I thank you for your persistence.

It is difficult to directly assess whether Valgrind helps programmers create code which is more reliable and stable than it otherwise would be. Valgrind's availability in Feb/March 2002 coincided with the KDE developers stabilising KDE 3.0 for its release in early April. My impression -- and it is only an impression -- is that KDE 3 was extensively Valgrinded. Certainly KDE 3 elicits far fewer complaints from Valgrind than KDE 2 does. Some KDE 3 applications, most notably KMail 1.4.1 in KDE 3.0.1, run with almost no complaints from Valgrind, which suggests that they, and the underlying KDE support libraries, are largely free of the memory management errors that Valgrind detects. To what extent Valgrind contributed to this is unclear to me.

Dirk Mueller, KDE 3.0 release coordinator, comments:

[regarding KDE 3 being Valgrinded prior to release] That's certainly true. While we always tried to verify the code quality before releases, we lacked the means to do this effectively, even with the time-limited licenses that were donated for available commercial tools, because those required instrumenting the code, which was often difficult in practice. You had to use the right combination of libs, the right binutils version, the linux kernel it was written for, and still had to fiddle a lot to find the bugs in our code.

Valgrind made all this very easy - no instrumenting required. Runs with any recent distribution. Works reliably. Runs fast and is easy to use.

Various developers spent weeks valgrinding almost all applications of KDE 3 release. Numerous smaller leaks and uninitialized variable reads were found quickly with valgrind.

Valgrind was very helpful in spotting memory leaks in even the XFree86 client code.

[...] we were able to spot bugs in KDE or Qt easily that would maybe never be found otherwise, because they only cause real problems (crashes or malfunctions) maybe once in 100 cases.

4.2  What it's like to use

Tracking down addressing errors is relatively simple. When an illegal read or write occurs, Valgrind indicates the precise source location where this is happening. It also tries to offer some useful hints as to the nature of the offending address, and is usually successful. Common causes which it can identify are reading/writing freed memory, and reading/writing just off the end or before the beginning of an allocated block. With this information, the cause of the problem can usually be found with little difficulty.

Dealing with undefined-value errors is harder. Typically Valgrind complains that a conditional jump depends on uninitialised value(s). Valgrind's definedness-tracking mechanism allows undefined values to propagate through arbitrary computations but only complains when they are used to determine the outcome of a conditional jump, or when a memory address depends on them.

The resulting effect is that the point where the complaint occurred may be far from where the undefined value originated. One has to work backwards through the chain of computations leading up to the error point. This can be difficult, since it may lead through arbitrary procedure calls, moving values around in memory, etc.

A good general debugging strategy seems to be to try and make the target program or library "Valgrind-clean", that is, run without any errors from Valgrind. This has the same rationale as making code compile silently with -Wall, for example: it is worth getting rid of all errors, even apparently harmless ones, since that makes it a lot easier to spot new ones when they appear. In practice there are some difficulties with this, the most obvious being errors in third-party libraries, such as the X client libraries, which you cannot or do not want to fix and must instead suppress.

How reliable is Valgrind? One factor critical to acceptance is that it has a very low rate of false alarms, since if programmers don't believe what it says, they won't use it.

Valgrind is almost always correct (I would say, simply, "always correct", but I know better than that) about invalid addresses. That's because there is no ambiguity about whether or not an address is valid -- establishing this is usually straightforward.

Whether or not a value is undefined, and whether this has any effect on the client program, is more difficult to tell. Valgrind's value tracking system was designed carefully to minimise the chances of false error reports. Two prototype schemes were tried and discarded prior to the current one. Valgrind tracks definedness at the bit level, and even simulates the definedness of the CPU's condition codes, in an attempt to avoid making mistakes.

I think it's fair to say that Valgrind is almost always correct when it complains about uninitialised value uses. Experience shows it is certainly far more likely to be right than wrong. There are known cases where obscure code sequences can trigger a false report, but these are extremely rare. Only one was observed, as far as I know, whilst KDE 3 was being valgrindified.

Speed: instrumented programs run 25-50 times slower than natively. In practice this is just about fast enough to be useful. Interactive programs like text editors and web browsers run at usable, if somewhat sluggish, speeds on typical development machines (PIIIs/Athlons in the 1 GHz region). Slowdown is close to 25x for purely compute-bound code, but increases substantially in code which uses malloc/free intensively, since that necessitates a lot of messing with memory permissions. Execution also slows considerably when a lot of translation is happening -- the translator is not fast.

4.3  Future directions

The past couple of months have seen extensive work to support POSIX pthreads. Many large systems use pthreads, so supporting them is important. This work is now largely complete.

The current emphasis is to freeze functionality and stabilise the code base, to create a reliable candidate for release as valgrind-1.0 later this year.

Some vague ideas for future functionality are:

Finally it would be nice to make Valgrind faster. Some preliminary investigations have already been made. Three avenues look interesting: