sean cassidy : Wiggle the mouse to fix the test

in: programming

At my current job, we needed to move an aging web service into our job management system for reliability reasons.

After I implemented the code and we merged it into our master branch, the unit tests started failing for some people but not others. In particular, if you wiggled your mouse while these non-interactive backend tests were running, they would pass!

The Tests

I added a new workflow to our extensive unit test suite, using the third-party JARs that the old web service relied on, and it worked great. I kept my change in a small branch because my coworkers were working on a bigger feature in master. Once they had run the tests and done a code review, they pulled my change into master.

And then the tests broke.

Alright, we thought, there is some sort of test conflict. But the tests always worked on my machine. And worked most of the time on my coworkers' machines. And never worked on Jenkins. But always worked when actually deployed to our dev environment.

We tried putting load on our machines to see if there was a CPU or disk contention issue, but that wasn't it. Running tests individually worked, but running them all together with Maven did not.

After extensive analysis and debugging, my coworker found that if you were actively using your computer while Maven was running, the test would work. Otherwise it would time out.

How is that possible?

A short detour to explain our workflow system

To understand how this bug broke the tests, I figured I would explain how our workflow system works at a very high level. We have tasks that look something like this:

public class DownloadSomething extends Task {
    @Override
    public Result execute(TaskInfo info) {
        Downloader downloader = new Downloader(info.getInput());

        while (downloader.notDone()) {
            Chunk chunk = downloader.getNextChunk();
            appendToFile(chunk);

            // tell the workflow engine this task is still making progress
            heartbeat();
        }
        return success("Downloaded");
    }
}

A workflow is made up of tasks like this one, which take input and produce output, such as a file. However, each task might fail, whether because the download URL is incorrect or because the network connection is down.

If your implementation of execute() were merely

while(true) {
    Thread.sleep(1);
}

we would interrupt the task and the test would fail.

So we have logic that will time out a task if it does not heartbeat early and often. We have special InputStream classes which wrap other InputStreams and heartbeat as the consumer is reading them.

This is a relatively common pattern.
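
As a rough sketch of that pattern, a wrapper along the following lines would heartbeat on every read. The class name and the Runnable heartbeat callback are hypothetical stand-ins for whatever the real framework provides:

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch, not the actual framework class.
public class HeartbeatingInputStream extends FilterInputStream {
    private final Runnable heartbeat;

    public HeartbeatingInputStream(InputStream in, Runnable heartbeat) {
        super(in);
        this.heartbeat = heartbeat;
    }

    @Override
    public int read() throws IOException {
        heartbeat.run();   // tell the workflow engine we are still alive
        return super.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        heartbeat.run();   // heartbeat for every chunk the consumer reads
        return super.read(b, off, len);
    }
}

Note that a wrapper like this only heartbeats while the consumer is actually pulling bytes from the stream; if the time is being spent somewhere else entirely, it can't help.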

What was making this task time out?

This particular task read some X.509 certificates from disk and then called the proprietary JARs with their proprietary logic. Reading the certificates was taking minutes. Literally five or more minutes on occasion. Why?

Our special heartbeating InputStream wasn't even helping, so the code that was allegedly reading certificates was doing something else entirely.

Something that only took a while when you weren't using your computer.

The Cause

If you wanted some secure random numbers for encryption purposes, how would you get them? In Java, the answer is SecureRandom. How does SecureRandom work?

If you look at the source code for SecureRandom, the default constructor calls getDefaultPRNG, which gets a provider implementation, such as sun.security.provider.SecureRandom, to provide the actual secure random numbers.

To get the random data, the user calls SecureRandom.nextBytes:

synchronized public void nextBytes(byte[] bytes) {
    secureRandomSpi.engineNextBytes(bytes);
}

which calls the specific implementation's random number generation, such as the one in sun.security.provider.SecureRandom. If no seed was specified (as is often the case), the implementation will call a seed generator class such as sun.security.provider.SeedGenerator.

SeedGenerator has a few implementations, one of which is an interesting generator that derives randomness from thread timing. The other is URLSeedGenerator, which uses /dev/random:

final static String URL_DEV_RANDOM = SunEntries.URL_DEV_RANDOM; // file:/dev/random
// snip
static class URLSeedGenerator extends SeedGenerator {
    URLSeedGenerator() throws IOException {
        this(SeedGenerator.URL_DEV_RANDOM);
    }

    @Override
    void getSeedBytes(byte[] result) {
        int len = result.length;
        int read = 0;
        try {
            while (read < len) {
                int count = devRandom.read(result, read, len - read);
                // /dev/random blocks - should never have EOF
                if (count < 0)
                    throw new InternalError("URLSeedGenerator " + deviceName +
                                    " reached end of file");
                read += count;
            }
        }
        // snip
    }
}

So it reads from /dev/random where appropriate. What's an important difference between /dev/random and its counterpart /dev/urandom? Well, /dev/random can and will block on Linux when there isn't enough randomness to go around.
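
As a rough way to see the effect, a sketch like the one below times the first request for random bytes from the pure-Java SHA1PRNG implementation described above. Whether it actually blocks depends on your JDK version, the securerandom.source setting, and how much entropy the machine has:

import java.security.SecureRandom;

public class EntropyBlockDemo {
    public static void main(String[] args) throws Exception {
        // Ask for the pure-Java SHA1PRNG; with no seed supplied, its first
        // nextBytes() call self-seeds via SeedGenerator, which may read the
        // blocking /dev/random on a default Linux configuration.
        SecureRandom sr = SecureRandom.getInstance("SHA1PRNG");

        long start = System.nanoTime();
        sr.nextBytes(new byte[16]);   // can stall here when the entropy pool is empty
        long elapsedMs = (System.nanoTime() - start) / 1000000;

        System.out.println("First nextBytes() took " + elapsedMs + " ms");
    }
}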

To test this theory (that the certificate loading code was, for whatever reason, consuming a lot of entropy), we did this:

# mv /dev/random /dev/random.bkup
# ln -s /dev/urandom /dev/random

And the problem went away! Tests were completely consistent now.

Why did wiggling the mouse fix the tests?

Linux uses multiple sources to generate entropy for /dev/random. On the Jenkins build server, and when we weren't using our computers, /dev/random would quickly run out of entropy and block.

Using the computer (which I almost always do while the build is running) kept it working, which is why I didn't notice any failures.
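
A quick way to confirm this kind of starvation on Linux is to watch the kernel's entropy estimate while the tests run; values near zero mean reads from /dev/random are about to block:

$ cat /proc/sys/kernel/random/entropy_avail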

Since SecureRandom is a pseudo-random number generator that uses a cryptographically secure seed, it should use /dev/urandom instead of /dev/random, in my opinion.
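
If you agree and don't want to patch code or symlink device files, JDKs of that era could also be pointed at /dev/urandom for seeding, either by editing the securerandom.source property in the JRE's java.security file or by passing the java.security.egd system property. The extra /./ below works around JDK versions that special-case the literal string file:/dev/urandom; the jar name is just a placeholder:

$ java -Djava.security.egd=file:/dev/./urandom -jar your-app.jar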

Why was loading an X.509 certificate using random numbers?

Ah, the magic of mystery third-party libraries. Using the awesome Java decompiler JD, I decompiled the JARs we were given and poked around a little. In the load certificate method, I found this:

public static X509Certificate loadCert(InputStream paramInputStream) throws CertificateException
{
    CertificateFactory localCertificateFactory = CertificateFactory.getInstance("X.509", ProviderInit.getProvider());
    X509Certificate localX509Certificate = (X509Certificate)localCertificateFactory.generateCertificate(paramInputStream);
    // snip

Isn't Java wonderfully concise? But wait, what's that ProviderInit business? More JD magic gives us:

public class ProviderInit {
    public static Provider getProvider()
    {
        if (!initDone)
            init();
        return provider;
    }

    private static void init()
    {
        if (initDone)
            return;

        if (CryptoJ.isFIPS140Compliant()) {
            CryptoJ.setMode(0);

            if (!CryptoJ.selfTestPassed()) {
                throw new RuntimeException("Crypto-J is disabled");
            }
        }
        // snip
    }
}

So, when ProviderInit is first run, it does some self tests using CryptoJ, which comes from JSAFE, an ancient Java crypto library made by RSA. These CryptoJ self tests lean heavily on SecureRandom, and thus take a long time when there is little entropy.

The Fix

All we need to do to fix our tests is to call ProviderInit.init() when the JVM loads, rather than when our time-sensitive code is being run. Easy enough!
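
Since init() is private in the decompiled source above, one way to sketch that warm-up is a small hook that goes through the public getProvider() as soon as it is loaded; the class and method names here are hypothetical:

// Hypothetical warm-up hook; ProviderInit is the decompiled third-party class above.
public final class CryptoWarmup {
    static {
        // getProvider() runs the private init(), including the slow Crypto-J self
        // tests, on first use; forcing it here moves that cost to JVM start-up
        // instead of inside a heartbeat-monitored task.
        ProviderInit.getProvider();
    }

    private CryptoWarmup() {}

    // Call this from application or test bootstrap to force class loading.
    public static void ensureInitialized() {
        // no-op: loading this class already triggered the static initializer
    }
}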

The lesson: use /dev/urandom when appropriate, and don't run self tests in non-debug code.

Sean is the Head of Security at Asana, a work management platform for teams.
