Since valgrind's core machinery intercepts every memory access your program makes, other interesting possibilities emerge. Support was recently added to simulate a I1/D1/L2 cache hierarchy, giving detailed insights into the causes of cache misses in programs.
Valgrind is designed to be as non-intrusive as possible. It works
directly with existing executables. You don't need to recompile,
relink, or otherwise modify the program to be checked. Simply place
the word valgrind at the start of the command line
normally used to run the program. So, for example, if you want to run
the command ls -l on Valgrind, simply issue the command:
valgrind ls -l.
Valgrind takes control of your program before it starts. Debugging information is read from the executable and associated libraries, so that error messages can be phrased in terms of source code locations. Your program is then run on a synthetic x86 CPU which checks every memory access. All detected errors are written to a log. When the program finishes, Valgrind searches for and reports on leaked memory.
If run in cache-profiling mode, at program exit the accumulated statistics are dumped into a log file. An auxiliary program reads this, and will print source code annotated with access and miss counts, down at the level of individual x86 instructions if you want. Summaries for (C/C++) source lines, procedures and the whole program are also generated.
You can run pretty much any dynamically linked ELF x86 executable using Valgrind. Programs run 25 to 50 times slower, and take a lot more memory than they usually would. It works well enough to run large programs. For example, the Konqueror web browser from the KDE Desktop Environment, version 3.0, runs slowly but usably on Valgrind.
Valgrind simulates every single instruction your program executes. Because of this, it finds errors not only in your application but also in all supporting dynamically-linked libraries, including the GNU C library, X client libraries, Qt, the KDE libraries, and so on. That often includes libraries, for example the X client libraries, which contain memory access violations, but which you cannot or do not want to fix.
Rather than swamping you with errors in which you are not interested, Valgrind allows you to selectively suppress errors by recording them in a suppressions file which is read when Valgrind starts up. The build mechanism attempts to select suppressions which give reasonable behaviour for the libc and XFree86 versions detected on your machine.
Valgrind is not a toy system. It can run huge applications, including major open-source projects: OpenOffice 1.0, pretty much all of KDE 3.0, and Mozilla 1.0rc3, give an idea of the scale of programs it can run.
Invalid read of size 4
at 0x40F6BBCC: (within /usr/lib/libpng.so.2.1.0.9)
by 0x40F6B804: (within /usr/lib/libpng.so.2.1.0.9)
by 0x40B07FF4: read_png_image(QImageIO *) (kernel/qpngio.cpp:326)
by 0x40AC751B: QImageIO::read() (kernel/qimage.cpp:3621)
Address 0xBFFFF0E0 is not stack'd, malloc'd or free'd
This happens when your program reads or writes memory at a place
which Valgrind reckons it shouldn't. In this example, the program did
a 4-byte read at address 0xBFFFF0E0, somewhere within
libpng.so.2.1.0.9, which was called from
somewhere else in the same library, called from line 326 of
qpngio.cpp, and so on.
Valgrind tries to establish what the illegal address might relate to, since that's often useful. So, if it points into a block of memory which has already been freed, you'll be informed of this, and also where the block was freed at. Likewise, if it should turn out to be just off the end of a malloced block, a common result of off-by-one-errors in array subscripting, you'll be informed of this fact, and also where the block was malloced.
In this example, Valgrind can't identify the address. Actually the
address is on the stack, but, for some reason, this is not a valid
stack address -- it is below the stack pointer, %esp, and
that isn't allowed. In this particular case it's probably caused by
gcc generating invalid code, a known bug in various flavours of gcc.
Note that Valgrind only tells you that your program is about to access memory at an illegal address. It can't stop the access from happening. So, if your program makes an access which normally would result in a segmentation fault, your program will still suffer the same fate -- but you will get a message immediately prior to this. In this particular example, reading junk on the stack is non-fatal, and the program stays alive.
Conditional jump or move depends on uninitialised value(s)
at 0x402DFA94: _IO_vfprintf (_itoa.h:49)
by 0x402E8476: _IO_printf (printf.c:36)
by 0x8048472: main (tests/manuel1.c:8)
by 0x402A6E5E: __libc_start_main (libc-start.c:129)
An uninitialised-value use error is reported when your program uses
a value which hasn't been initialised -- in other words, is undefined.
Here, the undefined value is used somewhere inside the
printf() machinery of the C library. This error was
reported when running the following small program:
int main()
{
int x;
printf ("x = %d\n", x);
}
It is important to understand that your program can copy around
junk (uninitialised) data to its heart's content. Valgrind observes
this and keeps track of the data, but does not complain. A complaint
is issued only when your program attempts to make use of uninitialised
data. In this example, x is uninitialised. Valgrind
observes the value being passed to _IO_printf and thence
to _IO_vfprintf, but makes no comment. However,
_IO_vfprintf has to examine the value of x
so it can turn it into the corresponding ASCII string, and it is at
this point that Valgrind complains.
The two main sources of uninitialised data are local variables which have not been initialised, and the contents of malloced blocks, before you write something into them.
Invalid free()
at 0x4004FFDF: free (vg_clientmalloc.c:577)
by 0x80484C7: main (tests/doublefree.c:10)
by 0x402A6E5E: __libc_start_main (libc-start.c:129)
by 0x80483B1: (within tests/doublefree)
Address 0x3807F7B4 is 0 bytes inside a block of size 177 free'd
at 0x4004FFDF: free (vg_clientmalloc.c:577)
by 0x80484C7: main (tests/doublefree.c:10)
by 0x402A6E5E: __libc_start_main (libc-start.c:129)
by 0x80483B1: (within tests/doublefree)
Valgrind keeps track of the blocks allocated by your program with
malloc/new et al, so it can know exactly
whether or not the argument to free/delete
is legitimate or not. Here, this test program has freed the same
block twice. As with the illegal address errors, Valgrind attempts
to make sense of the address freed. If, as here, the address is one
which has previously been freed, you will be told that -- making
duplicate frees of the same block easy to spot.
new[]
has wrongly been deallocated with free:
Mismatched free() / delete / delete []
at 0x40043249: free (vg_clientfuncs.c:171)
by 0x4102BB4E: QGArray::~QGArray(void) (tools/qgarray.cpp:149)
by 0x4C261C41: PptDoc::~PptDoc(void) (include/qmemarray.h:60)
by 0x4C261F0E: PptXml::~PptXml(void) (pptxml.cc:44)
Address 0x4BB292A8 is 0 bytes inside a block of size 64 alloc'd
at 0x4004318C: __builtin_vec_new (vg_clientfuncs.c:152)
by 0x4C21BC15: KLaola::readSBStream(int) const (klaola.cc:314)
by 0x4C21C155: KLaola::stream(KLaola::OLENode const *) (klaola.cc:416)
by 0x4C21788F: OLEFilter::convert(QCString const &) (olefilter.cc:272)
The following was told to me by the KDE 3 developers. I didn't know
any of it myself. They also implemented the check itself.
In C++ it's important to deallocate memory in a way compatible with how it was allocated. The deal is:
malloc, calloc,
realloc, valloc or
memalign, you must deallocate with free.
new[], you must deallocate with
delete[].
new, you must deallocate with
delete.
Pascal Massimino adds the following clarification:
delete[] must be associated with a
new[] because the compiler stores the size of the array
and the pointer-to-member to the destructor of the array's content
just before the pointer actually returned. This implies a
variable-sized overhead in what's returned by new or
new[]. It is rather surprising how compilers
[runtime-support libraries?] are robust to mismatch in
new/delete
new[]/delete[].
Here's an example of a system call with an invalid parameter:
#include <stdlib.h>
#include <unistd.h>
int main( void )
{
char* arr = malloc(10);
(void) write( 1 /* stdout */, arr, 10 );
return 0;
}
You get this complaint ...
Syscall param write(buf) contains uninitialised or unaddressable byte(s)
at 0x4035E072: __libc_write
by 0x402A6E5E: __libc_start_main (libc-start.c:129)
by 0x80483B1: (within tests/badwrite)
by <bogus frame pointer> ???
Address 0x3807E6D0 is 0 bytes inside a block of size 10 alloc'd
at 0x4004FEE6: malloc (ut_clientmalloc.c:539)
by 0x80484A0: main (tests/badwrite.c:6)
by 0x402A6E5E: __libc_start_main (libc-start.c:129)
by 0x80483B1: (within tests/badwrite)
... because the program has tried to write uninitialised junk from the malloc'd block to the standard output.
docs/techdocs.html in any Valgrind
source distribution.
valgrind.so.
The shell script valgrind sets the
LD_PRELOAD environment variable to point to
valgrind.so. This causes the .so to be
loaded as an extra library to any subsequently executed
dynamically-linked ELF binary, viz, the program you want to debug.
The dynamic linker allows each .so in the process
image to have an initialisation function which is run before
main(). It also allows each .so to have a
finalisation function run after main() exits.
When valgrind.so's initialisation function is called by
the dynamic linker, the synthetic CPU to starts up. The real CPU
remains locked in valgrind.so for the entire rest of the
program, but the synthetic CPU returns from the initialisation
function. Startup of the program now continues as usual -- the
dynamic linker calls all the other shared object initialisation
routines, and eventually runs main(). This all runs on
the synthetic CPU, not the real one, but the client program cannot
tell the difference.
Eventually the client program will terminate by calling the
exit system call -- there is no other way out.
Valgrind detects this, and uses it as its cue
to exit. It prints summaries of all errors detected, possibly checks
for memory leaks, and then calls exit itself.
On entry, Valgrind switches stacks, so it runs on its own stack. On exit, it switches back. This means that the client program continues to run on its own stack, so we can switch back and forth between running it on the simulated and real CPUs without difficulty. This was an important design decision, because it makes it easy (well, significantly less difficult) to debug the synthetic CPU.
The JITter translates basic blocks -- blocks of straight-line-code -- as single entities. To minimise the considerable difficulties of dealing with the x86 instruction set, x86 instructions are first translated to a RISC-like intermediate code, similar to sparc code, but with an infinite number of virtual integer registers. Initially each instruction is translated separately, and there is no attempt at instrumentation.
The intermediate code is improved, mostly so as to try and cache the simulated machine's registers in the real machine's registers over several simulated instructions. This is often very effective. Also, we try to remove redundant updates of the simulated machine's condition-code register.
The intermediate code is then instrumented, giving more intermediate code. There are a few extra intermediate-code operations to support instrumentation; it is all refreshingly simple. After instrumentation there is a cleanup pass to remove redundant value checks.
This gives instrumented intermediate code which mentions arbitrary numbers of virtual registers. A linear-scan register allocator is used to assign real registers and possibly generate spill code. All of this is still phrased in terms of the intermediate code.
Then, and only then, is the final x86 code emitted. The intermediate code is carefully designed so that x86 code can be generated from it without need for spare registers or other inconveniences.
The translations are managed using a traditional LRU-based caching scheme. The translation cache has a default size of about 28MB. Instrumented translations are 12-14 times larger than the originals.
The A and V bits for each byte are stored using a sparse array, which
flexibly and efficiently covers arbitrary parts of the 32-bit address
space without imposing significant space or performance overheads for
the parts of the address space never visited. The scheme used, and
speedup hacks, are described in detail at the top of the source file
vg_memory.c, so you should read that for the gory
details.
vg_syscall_mem.c for details.
sigaction and
sigprocmask are intercepted. If the client program is
trying to set a signal handler, Valgrind makes a note of the handler
address and which signal it is for. Valgrind then arranges for the
same signal to be delivered to its own handler.
When such a signal arrives, Valgrind's own handler catches it, and
notes the fact. At a convenient safe point in execution, Valgrind
builds a signal delivery frame on the client's stack and runs its
handler. If the handler longjmps, there is nothing more
to be said. If the handler returns, Valgrind notices this, zaps the
delivery frame, and carries on where it left off before delivering the
signal.
The purpose of this nonsense is that setting signal handlers amounts to giving callback addresses to the kernel. We can't allow this to happen, because if it did, signal handlers would run on the real CPU, not the simulated one. This means the checking machinery would not operate during the handler run, and, worse, memory permissions maps would not be updated, which could cause spurious error reports once the handler had returned.
An even worse thing would happen if the signal handler
longjmp'd rather than returned: Valgrind would completely
lose control of the client program.
Upshot: we can't allow the client to install signal handlers
directly. Instead, Valgrind must catch, on behalf of the client, any
signal the client asks to catch, and must deliver it to the client on
the simulated CPU, not the real one. This involves considerable
gruesome fakery; see vg_signals.c for details.
libpthread.so, uses the
clone system call to generate new kernel-level threads.
One possibility would be to allow Valgrind to persist across calls to
clone, and insert the required locking/mutual exclusion
for the memory permissions maps and other key data structures.
However, this was considered difficult to get right, and the additional locking requirements would have added significant overhead to what is already a slow system.
Instead, Valgrind supplies its own user-space PThreads
implementation. This intercepts calls to the POSIX
pthread_* functions, which are handled by a combination
of simulated code communicating with supporting primitives and a
scheduler inside Valgrind itself. The interception is done by
supplying a fake libpthread.so and causing programs to
link it instead of the system-supplied one. The fake library passes
requests to Valgrind's core using a special trapdoor mechanism
designed to allow client code to communicate with the underlying
Valgrind core.
As ever, this trickery is invisible to both the end-user and the program being debugged. It has been a significant amount of work, but now works well enough to run large threaded apps (eg Mozilla, OpenOffice/StarOffice), and gives complete control of thread scheduling to Valgrind, which may be useful for future error-checking extensions designed to automatically find data races.
When in cache profiling mode, the normal error-checking
instrumentation is not done. Instead, instrumentation code is added
to call the cache simulator, which lives in
vg_cachesim_D1.c, vg_cachesim_I1.c and
vg_cachesim_L2.c. These track enough of the state
of the caches being simulated to determine whether or not each memory
reference will cause a miss in the D1, I1 or L2 cache. If so, a miss
count for that reference is incremented.
The counters for each basic block are stored along with the translations themselves. This means there is essentially zero cost in finding the counter(s) associated with a given memory reference point, which speeds the simulation. When old translations are ejected from the translation cache, their counter arrays are retained. Should those translations later be re-made, they are reunited with the counter arrays they had previously.
At the end of the run the counters are dumped in a text
file, cachegrind.out. The format of it has been
designed to keep its size manageable. Even so, running a web
browser for a couple of minutes generates a file on the order
of several megabytes long.
The vg_annotate Perl script reads this file and
generates summaries of the memory access and miss statistics, for the
program as a whole and for each function. It will also print source
code with counts annotated on each line, so that code generating cache
misses can be identified exactly. The script has various flags which
give considerable flexibility about which source fragments are
displayed and which not.
The default simulated cache arrangement is for an AMD Athlon. The
supplied vg_cachegen Perl script will generate
vg_cachesim_D1.c, vg_cachesim_I1.c and
vg_cachesim_L2.c suited to a cache arrangement of your
specification, with a range of cache sizes, associativities and line
sizes which should cover most arrangements in use today.
This is a little cumbersome. One possibility is for Valgrind to
make enquiries with the x86 CPUID instruction at startup,
so the correct cache parameters are automatically obtained.
Feedback from users has ranged from lukewarm to very positive. Many people report that Valgrind was helpful in quickly finding obscure bugs which would otherwise have taken longer to find. Many people suffered the bugs, trials and tribulations of the frequent development snapshots made available, and the feedback, bug reports and patches from them rapidly improved the usability and quality of valgrind. If you are one of those people, I thank you for your persistence.
It is difficult to directly assess whether valgrind helps programmers create code which is more reliable and stable than it otherwise would be. Valgrind's availability in Feb/March 2002 coincided with the KDE developers stabilising KDE 3.0 for its release in early April. My impression -- and it is only an impression -- is that KDE 3 was extensively Valgrinded. Certainly KDE 3 elicits far fewer complaints from Valgrind than KDE 2 does. Some KDE 3 applications, most notably KMail 1.4.1 in KDE 3.0.1, run almost with no complaints from Valgrind, which suggests that they, and the underlying KDE support libraries, are largely free of the memory management errors that Valgrind detects. To what extent Valgrind contributed to this is unclear to me.
Dirk Mueller, KDE 3.0 release coordinator, comments:
[regarding KDE 3 being Valgrinded prior to release] That's certainly true. While we always tried to verify the code quality before releases, we lacked the means to do this effectively, even with the time-limited licenses that were donated for available commercial tools, because those required instrumenting the code, which was often difficult in practice. You had to use the right combination of libs, the right binutils version, the linux kernel it was written for, and still had to fiddle a lot to find the bugs in our code.Valgrind made all this very easy - no instrumenting required. Runs with any recent distribution. Works reliably. Runs fast and is easy to use.
Various developers spent weeks valgrinding almost all applications of KDE 3 release. Numerous smaller leaks and uninitialized variable reads were found quickly with valgrind.
Valgrind was very helpful in spotting memory leaks in even the XFree86 client code.
[...] we were able to spot bugs in KDE or Qt easily that would maybe never be found otherwise, because they only cause real problems (crashes or malfunctions) maybe once in 100 cases.
Dealing with undefined-value errors is harder. Typically Valgrind complains that a conditional jump depends on uninitialised value(s). Valgrind's definedness-tracking mechanism allows undefined values to propagate through arbitrary computations but only complains when they are used to determine the outcome of a conditional jump, or when a memory address depends on them.
The resulting effect is that the point where the complaint occurred may be far from where the undefined value originated. One has to work backwards through the chain of computations leading up to the error point. This can be difficult, since it may lead through arbitrary procedure calls, moving values around in memory, etc.
A good general debugging strategy seems to be to try and make the
target program or library "Valgrind-clean", that is, run without any
errors from Valgrind. This has the same rationale as making code
compile silently with -Wall, for example: it is worth
getting rid of all errors, even apparently harmless ones, since that
makes it a lot easier to spot new ones when they appear. In practice
there are some difficulties with this:
How reliable is Valgrind? One factor critical to acceptance is that it has a very low rate of false alarms, since if programmers don't believe what it says, they won't use it.
Valgrind is almost always correct (I would say, simply, "always correct", but I know better than that) about invalid addresses. That's because there is no ambiguity about whether or not an address is valid -- establishing this is usually straightforward.
Whether or not a value is undefined, and whether this has any effect on the client program, is more difficult to tell. Valgrind's value tracking system was designed carefully to minimise the chances of false error reports. Two prototype schemes were tried and discarded prior to the current one. Valgrind tracks definedness at the bit level, and even simulates the definedness of the CPU's condition codes, in an attempt to avoid making mistakes.
I think it's fair to say that valgrind is almost always correct when it complains about uninitialised value uses. It is certainly far more likely to be correct than not correct, experience shows. There are known cases where obscure code sequences can trigger a false report, but these are extremely rare. Only one was observed, as far as I know, whilst KDE 3 was being valgrindified.
Speed: instrumented programs run 25-50 times slower than natively. In practice this is just about fast enough to be useful. Interactive programs like text editors and web browsers run at usable, if somewhat sluggish, speeds on typical development machines (PIIIs/Athlons in the 1 GHz region). Slowdown is close to 25 x for pure compute bound code, but increases substantially in code which uses malloc/free intensively, since that necessitates a lot of messing with memory permissions. Execution also slows considerably when a lot of translation is happening -- the translator is not fast.
The current emphasis is to freeze functionality and stabilise the code base, to create a reliable candidate for release as valgrind-1.0 later this year.
Some vague ideas for future functionality are:
Finally it would be nice to make Valgrind faster. Some preliminary investigations have already been made. Three avenues look interesting:
A standard solution to this is translation chaining: jumping directly from one translation to another, with no intervening address lookup or indirect jump. Translation chaining improves performance but brings new complications in the storage and management of translations, since translations can no longer be handled in isolation from each other. These issues are discussed in detail in an excellent paper: "Shade: A Fast Instruction-Set Simulator for Execution Profiling" (Bob Cmelik, David Keppel), ACM SIGMETRICS, 1994.
The analysis and optimisation phases could be more effective if groups of basic blocks, and particularly loops, were translated together. This might tie in with translation chaining.
One way the tests can be made short enough to do in-line in the generated code is as follows: the two-level array is used as backing store for a simulated direct-mapped cache. This cache is operated in the usual manner for a write-back, direct mapped cache, but with a subtlety: a line may be present in the cache only if the corresponding address range is completely valid (addressable). So the generated code does a tag check, which fails if the line is not present. A failure (a miss) could be due to the usual reasons (capacity, conflict or compulsory), or it could happen because there is an addressability error.
In addition, the bottom two bits of the address are incorporated into the tags in such a way as to cause tag comparisons to fail if either of the bits are nonzero. The effect is that misaligned accesses also cause tag check failures.
The result is that a quick tag lookup is followed by a single test, which detects misaligned addresses, invalid addresses and genuine cache misses. In the miss (slow) case, a helper function establishes the reason for the miss and acts accordingly. In the hit (fast) case the relevant definedness bits can simply be read from the cache. For 4-byte loads this reduces the expected-case path length for a memory check and load from about 20 x86 instructions to about 7, a marked improvement.
A trial implementation showed about a 10% performance improvement for a memory-intensive program (bzip2). Not all programs show improvements. It proved unexpectedly difficult to reduce the miss rate to an acceptable level, given that misses are quite expensive. Nevertheless this avenue seems promising, and warrants further investigation.