sean cassidy : A Difficult Bug

in: programming

I recently had a tough bug to deal with, and I think it makes for a good story.

Background

As part of my work on DiNet, I needed to write a Radix trie for reasons that should be obvious if you understand how important prefixes are to DiNet and its HTTP API.

I wrote a first draft of a radix trie and it was awful. It was slow, difficult to reason about, and almost certainly buggy in surprising ways. But as this was a pet project and I was having fun, I just soldered on.

Until my program started crashing, that is.

After about a day of running, processing only little bit of data - one packet every ten seconds - the server would crash. The stacktrace was always in the radix trie code.

This buggy, poorly written code had to go. I rewrote the radix trie, added tests, and ran it through Valgrind. The new version was fast and correct. It had no memory leaks, invalid reads or writes, or other memory issues.

However, when running the same test, it segfaulted in the same place! With a similar stacktrace! How could this be possible?

Debugging with GDB

The problem was that after a while, nodes went missing. A parent node kept track of its children with a list of pointers, which was apparently being overwritten by something.

Further, while I stepped through the code, nothing was wrong! The logic was there, and actually relatively well tested. Static analysis was also failing me.

One thing jumped out at me though as a glaring red flag: if I tracked which node had the issue, it was always at the same address. Curious, I set a hardware watchpoint for writes on that address. For those who are unfamiliar, a watchpoint will stop program execution whenever the address is accessed, and can be configured to stop for read or write access.

So, my program would break on that address. Since this address was a pointer to a node in the radix trie, I expected that only the radix trie code would access it. I was wrong.

Something was calling free on that very address! But what? It looked like it was a completely different section of code, the LRU cache:

int add_to_cache(struct lru_cache *cache, void *data, void *cache_key, int key_len, 
        void **deleted_ptr, bool free_deleted) {
    struct hash_struct *tmp, *tmp_entry;
    int deleted = 0;

    // Find any entry in the LRU with key cache_key of length
    // key_len, if any, and store it in tmp 
    HASH_FIND(hh, cache->cache, cache_key, key_len, tmp);

    if (tmp == NULL) {
        // If there was no entry, see if it will fit
        if (HASH_COUNT(cache->cache) >= cache->max_items) {
            // Cache is full, so remove the eldest node
            HASH_ITER(hh, cache->cache, tmp, tmp_entry) {
                // Delete the uthash data
                HASH_DELETE(hh, cache->cache, tmp);

                // Set the pointer we will delete here so we can do
                // other stuff with it later
                if (deleted_ptr != NULL) {
                    *deleted_ptr = tmp->data;
                    deleted = 1;
                }

                // If we're freeing the data for the caller, do so now
                if (free_deleted)
                    free(tmp->data);

                free(tmp);
                break;
            }
        }

        // Insert the new node into the LRU
        struct hash_struct *hs = dmalloc(sizeof(struct hash_struct));
        memcpy(hs->key, cache_key, key_len);
        HASH_ADD(hh, cache->cache, key, key_len, hs);
    }
    // if we deleted anything, return non-zero
    return deleted;
}

I added some extra comments so that it's easier to read for someone unfamiliar to the code. But let's take it step by step.

int add_to_cache(struct lru_cache *cache, void *data, void *cache_key, int key_len, 
        void **deleted_ptr, bool free_deleted) {
    struct hash_struct *tmp, *tmp_entry;
    int deleted = 0;

    // Find any entry in the LRU with key cache_key of length
    // key_len, if any, and store it in tmp 
    HASH_FIND(hh, cache->cache, cache_key, key_len, tmp);

So, the add_to_cache method does is it takes an lru_cache, some data, the lookup key and how long that is is and places it in the LRU if it's not already there. The other two parameters are optional features: when it removes something from the cache, it can place the data pointer that was there in deleted_ptr, and to optionally free that pointer if free_deleted was true.

I used the excellent uthash library to implement a hash table that was size limited and would remove the eldest member, which is always a struct hash_struct. The HASH_FIND call here will return the value stored at cache_key, if any and NULL otherwise.

    if (tmp == NULL) {
        // If there was no entry, see if it will fit
        if (HASH_COUNT(cache->cache) >= cache->max_items) {
            // Cache is full, so remove the eldest node
            HASH_ITER(hh, cache->cache, tmp, tmp_entry) {
                // Delete the uthash data
                HASH_DELETE(hh, cache->cache, tmp);

If there was no key at this location, we need to put it into the LRU. So we count how many items are currently in the cache and compare that against the precomputed max. If we're full, iterate through the cache with HASH_ITER, which iterates them in eldest-first order. Then, remove it from the cache with HASH_DELETE.

                // Set the pointer we will delete here so we can do
                // other stuff with it later
                if (deleted_ptr != NULL) {
                    *deleted_ptr = tmp->data;
                    deleted = 1;
                }

                // If we're freeing the data for the caller, do so now
                if (free_deleted)
                    free(tmp->data);

                free(tmp);
                break;
            }
        }

Using HASH_DELETE removes it from the cache, but it doesn't handle the rest of the memory allocation. So, if the user of add_to_cache asked to be told which data was deleted, store it. If we were told to free the data, free it here. Then free the metadata associated with the cache and break from the loop. We only ever want to remove one node, so we don't need to loop.

And to finish up the function, we add the data to the cache.

        // Insert the new node into the LRU
        struct hash_struct *hs = dmalloc(sizeof(struct hash_struct));
        memcpy(hs->key, cache_key, key_len);
        HASH_ADD(hh, cache->cache, key, key_len, hs);
    }
    // if we deleted anything, return non-zero
    return deleted;
}

But why was the LRU code trying to delete a radix trie node? Something it should never do, as the data pointer in the code didn't have anything to do with radix trie nodes.

Did you spot the bug? Try and find it before continuing on to the answer.

The Fix

To solve this bug, I went back to where it was crashing: in the LRU, on a free of a data pointer supplied by the consumer of the LRU and radix trie.

// If we're freeing the data for the caller, do so now
if (free_deleted)
    free(tmp->data);

And how was this data pointer set? It was passed into the add_to_cache method as we saw. And then completely forgotten about. We were freeing uninitialized memory. This memory just so happened to always point to a radix trie node.

So, the fix was straightforward. Just keep track of the data pointer. Here is the diff:

     // Insert the new node into the LRU
     struct hash_struct *hs = malloc(sizeof(struct hash_struct));
     memcpy(hs->key, cache_key, key_len);
+    hs->data = data;
     HASH_ADD(hh, cache->cache, key, key_len, hs);

Such a simple mistake.

Preventing this bug

This bug had multiple causes, and debugging it would have been easier if I had used best practices.

Some of the causes were:

  1. Making the code do too much (why does add_to_cache do so much for the user, such as freeing memory?)
  2. Weak separation of concern (whose job is it to free the memory?)
  3. Poorly documented code contracts
  4. Not enough unit testing

To make the debugging faster, two things should have been done: more, better unit tests, and using Valgrind (specifically track-origins) earlier. I had used Valgrind to verify my radix trie, but then I was convinced I could find it without it. Valgrind would have pointed me to the invalid free much more quickly than static analysis or GDB did.

It's important to try and understand why the bugs we had manifested themselves. Understanding how they occurred is an important step in preventing them. Especially when using a language that lets you get away with a lot, such as C in this case, it's paramount to have good testing and to use all of your debugging tools.

Update: Other practices that were suggested to me that would have helped to avoid and debug this scenario:

Setting a pointer to NULL after using it like so:

#define dfree(ptr) do { free(ptr); ptr = NULL; } while(0)

This may have made the problem easier to debug.

Another idea that would have made it obvious that I was freeing uninitialized memory would have been to use calloc instead of malloc. I used calloc elsewhere and I'm not sure why I decided to use malloc in this case.

Sean is the Head of Security at Asana, a work management platform for teams.

Follow @sean_a_cassidy