You might have heard about the critical GnuTLS bug that was recently fixed recently. What's the deal with it? Why is it a big deal? What happened?

Here's the bug, in essence, in lib/x509/verify.c:

/* Checks if the issuer of a certificate is a
 * Certificate Authority, or if the certificate is the same
 * as the issuer (and therefore it doesn't need to be a CA).
 *
 * Returns true or false, if the issuer is a CA,
 * or not.
 */
static int
check_if_ca (gnutls_x509_crt_t cert, gnutls_x509_crt_t issuer,
            unsigned int flags)
{
    int result;
    result =
        _gnutls_x509_get_signed_data (issuer->cert, "tbsCertificate",
                                    &issuer_signed_data);
    if (result < 0)
        {
        gnutls_assert ();
        goto cleanup;
        }

    result =
        _gnutls_x509_get_signed_data (cert->cert, "tbsCertificate",
                                    &cert_signed_data);
    if (result < 0)
        {
        gnutls_assert ();
        goto cleanup;
        }

    // snip

    result = 0;

cleanup:
    // cleanup type stuff
    return result;
}

Can you spot the bug?

Here's the (abridged) fixed version:

    int result;
    result =
        _gnutls_x509_get_signed_data (issuer->cert, "tbsCertificate",
                                    &issuer_signed_data);
    if (result < 0)
        {
        gnutls_assert ();
        goto fail; // CHANGED
        }

    // snip

fail:  // ADDED
    result = 0;

cleanup:
    // cleanup type stuff
    return result;
}

The bug was a disagreement between return value meanings. The function check_if_ca returns "true" or rather 1, when the certificate is a CA, and zero otherwise. However the other functions used return negative when they fail. In C, any integer value other than zero is regarded as a true value. So if the certificate is invalid, it's actually marked as a CA certificate.

So what's the implication of this? This function is used by gnutls_x509_crt_verify, which verifies x509 certificates. Invalid certificates can be passed off as genuine, even though they're invalid.

Based on this and the previous Apple bug, I don't think we've seen the last serious TLS stack bug. Testing TLS is notoriously difficult, as it's pretty complicated. Of course, GnuTLS doesn't have a great track record for correctness in general, so this specifically isn't that surprising.

Return values in C

The bug is the disagreement about return values and true and false. In C, the situation about what to return for success verses failure is sort of complicated.

Let's take socket(2) and connect(2) as examples. To get an IPv4 socket in C, you need to do this:

#include <sys/types.h>
#include <sys/socket.h>

// in some function
int s = socket(AF_INET, SOCK_STREAM, 0);

How do you know if this socket is valid? That is, you successfully got a valid socket that you can call connect with?

You test s. The man page says that on error, -1 is returned. So a common way to test this would be:

if (s < 0) {
    perror("Couldn't get socket");
    exit(EXIT_FAILURE); // or whatever you want to do
}

Easy enough. Let's connect with our socket now:

struct sockaddr_in addr; // set this up somehow
if (connect(s, &addr, sizeof(addr)) {
    perror("Couldn't connect"); 
    exit(EXIT_FAILURE);
}

Because connect returns zero on success, the error check for connect looks backwards. If we've connected, connect will return 0, which evaluates to false. So to avoid this confusion, most C programmers would add an explicit check for less than zero.

Due to this historical standard, you have essentially two options:

Follow the C tradition and return zero for success and non-zero (or less than zero, it depends) for failure.
Return explicit error codes that should be checked.

GnuTLS used a third option, which is the opposite of the first one, returning 1 for success and 0 for failure, and then mixed that with code that used the C traditional method.

If you think this is all just nonsense and that you should use exceptions instead, it's not really clear that that's better in all cases. Martin Sústrik, the author of ZeroMQ, wishes he wrote ZeroMQ in C rather than C++ with exceptions.

git blame

Let's find out how this bug got in here in the first place.

$ git clone https://git.gitorious.org/gnutls/gnutls.git
$ git checkout 6aa26f78150ccbdf^

Now we're at the commit before the bug fix. Let's run git blame and see who edited what when:

$ git blame lib/x509/verify.c

The problem lines around 141-145 were edited by Simon Josefsson, one of the two maintainers of GnuTLS, in 2005. Wow this bug is old!

If you keep going, to a5891d7^, then to 802e1ed^ you'll finally arrive at 0fba2d9^ the first version without the bug.

So it was 0fba2d9 that caused the issue. Why?

This commit was a large refactor of several parts of the certificate code. Many new functions were written which follow the old C-style error handling (return less than zero for failure, zero for success), such as _gnutls_x509_get_signature.

In the same commit, Nikos refactors check_if_ca and it looks remarkably similar to the traditional C-style error handling the other methods he was adding and refactoring. It looks like he just forgot that this wasn't a C-style error handling method at all, but a true-means-true one.

Lessons

What can we learn from this?

If you're writing in C, you need to pick one of the two C error handling model and stick with it. Aggressively refactor any code that doesn't match your error handling model choice.

Use smaller commits. It's more likely that Nikos or someone reviewing his commit would have seen the error in his diff if it was smaller. He might even have not made the mistake in the first place, as the refactor would have been different for check_if_ca.

Perhaps most important of all, test your crypto code. It's the biggest win for quality. The public facing method, gnutls_x509_crt_verify actually still doesn't have any unit tests for it. This should change.

C can be hard to get right, so it's important to be strict in how you write it and it's important to test it well. We haven't seen the last critical TLS stack bug, so don't be surprised when the next one comes. Maybe we should pool our money and pay some security auditors to audit common TLS implementations that we all use daily.