sean cassidy : Sherlock Holmes Debugging

How do you debug?

Most people, if you describe a train of events to them, will tell you what the result would be. They can put those events together in their minds, and argue from them that something will come to pass. There are few people, however, who, if you told them a result, would be able to evolve from their own inner consciousness what the steps were which led up to that result. This power is what I mean when I talk of reasoning backwards, or analytically. Sherlock Holmes A Study in Scarlet, by Sir Arthur Conan Doyle

I use the Sherlock Holmes method of debugging software. This means working backwards from a symptom to the series of causes which tell a story of the entire bug. It works for production outages and during development. Something that recently happened at my startup should be instructive.

There is nothing more deceptive than an obvious fact.

A few months ago, my coworker was using Mandrill to send email from one of our AWS EC2 boxes. We had been using Mandrill for a few months for most of our email sending needs. However, he was having trouble getting his code to send the email. It would fail without reason when it had been working the week before.

First, we examined the Mandrill API client that we were using. It was somewhat confusingly written and sparsely tested, so we assumed the bug was in this library. We spent an hour or two going through it, and we found a major problem: it was swallowing exceptions and making it look like it succeeded.

We needed more information. We needed to remove variables from the equation and gather data. "Try cURL directly," I said. So we set up a cURL command:

$ curl -H 'application/json' -X POST -v -d '{
    "key": "ASJFSDKfkekjjsdkfiEJKFJ",
    "message": {
        "text":"test",
        "subject":"test",
        "to":[{
            "email":"my@email.story",
        }]
    }
}' https://mandrillapp.com/api/1.0/messages/send.json

And it worked! No problem at all. The email was sitting in my inbox, and we were confused.

"So the problem must lay in my software," my coworker reasoned. We had spent all morning looking at his code, and we had found no problems. I didn’t want to do that again.

It is a capital mistake to theorize before you have all the evidence. It biases the judgment.

We needed still more data. "Try sending your exact email payload to Mandrill," I suggested.

There were many differences between the simple test email we tried and the email he needed to send. His email was more complicated with tables, headers, a few people in the "to" and "cc" lists. It was probably ten times larger. So we copied it exactly:

$ curl -H 'application/json' -X POST -v -d '{
    "key": "ASJFSDKfkekjjsdkfiEJKFJ",
    "message": {
        // SNIP, large message here
    }
}' https://mandrillapp.com/api/1.0/messages/send.json

And it didn’t work. No email.

How is this possible? A Mandrill bug? Or maybe we have our message format wrong. Maybe we’re using cURL wrong. So many possible theories, but we still didn’t have enough data. We booted up my favorite tool, tcpdump. First, we checked the small message that worked:

23:26:12 IP ip-172.us-west-2.compute.internal.49493 > ec2-54.us-west-2.compute.amazonaws.com.https: Flags [S], length 0  // SYN
23:26:12 IP ec2-54.us-west-2.compute.amazonaws.com.https > ip-172.us-west-2.compute.internal.49493: Flags [S.], length 0 // SYN+ACK
23:26:12 IP ip-172.us-west-2.compute.internal.49493 > ec2-54.us-west-2.compute.amazonaws.com.https: Flags [.], length 0  // ACK
23:26:12 IP ip-172.us-west-2.compute.internal.49493 > ec2-54.us-west-2.compute.amazonaws.com.https: Flags [P.], length 212 // Data
23:26:12 IP ec2-54.us-west-2.compute.amazonaws.com.https > ip-172.us-west-2.compute.internal.49493: Flags [.], length 0  // ACK Data
23:26:12 IP ec2-54.us-west-2.compute.amazonaws.com.https > ip-172.us-west-2.compute.internal.49493: Flags [P.], length 390 // Data2
//..
23:27:12 IP ip-172.us-west-2.compute.internal.49493 > ec2-54.us-west-2.compute.amazonaws.com.https: Flags [FP.], length 37 // FIN
23:27:12 IP ec2-54.us-west-2.compute.amazonaws.com.https > ip-172.us-west-2.compute.internal.49493: Flags [R], length 0  // Connection closed

Everything looks fine. Connection opened and acknowledged. Data is sent after the TLS setup and then acknowledged just like you’d expect.

We then sent the larger message to Mandrill and observed its results:

23:32:05 IP ip-172.us-west-2.compute.internal.58542 > ec2-54.us-west-2.compute.amazonaws.com.https: Flags [S], length 0  // SYN
23:32:05 IP ec2-54.us-west-2.compute.amazonaws.com.https > ip-172.us-west-2.compute.internal.58542: Flags [S.], length 0 // SYN+ACK
23:32:05 IP ip-172.us-west-2.compute.internal.58542 > ec2-54.us-west-2.compute.amazonaws.com.https: Flags [.], length 0  // ACK
23:32:06 IP ip-172.us-west-2.compute.internal.58542 > ec2-54.us-west-2.compute.amazonaws.com.https: Flags [P.], length 212 // Data
23:32:06 IP ec2-54.us-west-2.compute.amazonaws.com.https > ip-172.us-west-2.compute.internal.58542: Flags [.], length 0  // ACK Data
23:32:06 IP ec2-54.us-west-2.compute.amazonaws.com.https > ip-172.us-west-2.compute.internal.58542: Flags [P.], length 390 // Data2
//..
23:32:06 IP ip-172.us-west-2.compute.internal.58542 > ec2-54.us-west-2.compute.amazonaws.com.https: Flags [P.], length 1707 // Data3
23:32:06 IP ip-172.us-west-2.compute.internal.58542 > ec2-54.us-west-2.compute.amazonaws.com.https: Flags [P.], length 1707 // Data3 retransmit
23:32:07 IP ip-172.us-west-2.compute.internal.58542 > ec2-54.us-west-2.compute.amazonaws.com.https: Flags [P.], length 1707 // Data3 retransmit
23:32:07 IP ip-172.us-west-2.compute.internal.58542 > ec2-54.us-west-2.compute.amazonaws.com.https: Flags [P.], length 1707 // Data3 retransmit

So, the packets were just getting lost. To recap the facts:

  1. It wasn’t our software, we were testing with cURL
  2. Small messages sent to Mandrill always worked
  3. Large messages never worked
  4. Large messages had worked the previous week

We then came up with a few theories which fit the facts:

  1. Mandrill’s API was overloaded and dropping data
  2. There was a bad route between our AWS and Mandrill, dropping larger packets
  3. There was something wrong in our VPC’s layer 2 that was dropping jumbo frames

The first theory was plausible, but it didn’t match any patterns of API problems I had ever experienced. Why would the size of the packets determine which were dropped on an overloaded REST service? That theory was unfulfilling.

I was a fan of the last theory, as it seemed to fit all of the facts, but we needed to verify that it was just our VPC. I logged into one of our Linode machines and tried to send the larger email command. It worked! The message was there, sitting in his inbox.

Now we thought we could solve our issue. If we set our MTU to be below the jumbo Ethernet frame boundary, 1500 bytes, it should work. First, we double checked our settings:

$ ifconfig eth0
    eth0    Link encap:Ethernet  HWaddr 0a:02:7b:a2:78:AF
            inet addr:172.32.2.151  Bcast:172.32.2.255  Mask:255.255.240.0
            inet6 addr: fe80::832:780f:fba2:7005/64 Scope:Link
            UP BROADCAST RUNNING MULTICAST  MTU:9001  Metric:1
            RX packets:220362102 errors:0 dropped:0 overruns:0 frame:0
            TX packets:34530020 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:1000
            RX bytes:315440534468 (315.4 GB)  TX bytes:3097949824 (3.0 GB)

The MTU was 9001, typical for gigabit hosts. We reduced it to 1500:

$ ifconfig eth0 mtu 1500

And then sending Mandrill a larger message worked. Linux would then automatically fragment our packets into 1500 byte chunks and then the bad router in our VPC would pass the packets just fine. We double checked with our other VPC and found that it didn’t experience this jumbo frame issue. This meant that it was almost certainly AWS’s problem.

We summarized the issue to AWS with good examples they could replicate and sent a support ticket. They responded:

A regression in our instance firewall meant that Path-MTU discovery messages were not correctly relayed back to the instance, meaning that affected instances (in a single AZ) communicating with >1500MTU instances/hosts across the EC2 border (including when you contact region-local instances on their public IP) would have connectivity issues any time the affected instance's OS attempted to send packets larger than 1500 bytes. This regression has been rolled back, and we have updated our testing procedures to ensure this regression does not recur.

Validation! We got a service credit for our trouble.

Jeremy Brett as Sherlock Holmes

There is no sin in software engineering more serious than thinking some behavior of a computer system is magical or beyond our understanding. It may be difficult to understand given the time or resource constraints we face, but given enough time and persistence, all phenomena are reducible to logic. We are engineers and it’s our job to understand the issues we face. Too often we abstract away the details and either assume its someone else’s fault or just take it for granted. Knowing enough about your entire software stack (including networking and operating systems) is invaluable even if you work in high-level programming languages.

Good engineers debug like Sherlock Holmes investigates: we gather facts, generate theories from those facts, and then test them. Coincidence is to be met with the fiercest skepticism. You changed the production database yesterday and now customers are reporting data loss issues? If your knee jerk reaction is to blame the customers you need to reevaluate how you approach engineering. That should be the absolute last theory you consider and it should be backed by mountains of evidence.

The Holmes method of debugging is superior, I think, to the scientific method of debugging because debugging isn’t just a science. There’s an art to knowing where to look and what data is needed. This comes from experience and is as much intuition as it is logic. Practice debugging and you will be a better debugger. If a production outage happens, first do no harm. Instead, consider it from another perspective:

What would Sherlock Holmes do?

Nothing clears up a case so much as stating it to another person.

Articles

  1. Plural gTLDs are evil
  2. Your Interface is what Matters
  3. Write in the Margins
  4. Meditations
  5. Better Java
  6. When names outlive their usefulness
  7. Diagnosis of the OpenSSL Heartbleed Bug
  8. The Intuition Trap
  9. Ambition
  10. The Story of the GnuTLS Bug
  11. Wrong Solutions
  12. Host an infodump session
  13. So, you want to crypto
  14. Hackers and Engineering School
  15. Strings are untyped
  16. Don't Pipe to your Shell
  17. How to Organize Your Brain with Bookmark Tags
  18. You are not a 10x Developer
  19. Windows ruins everything
  20. Don't Give Up and Die
  21. On Being Nice
  22. Bus Factors and Walk Score
  23. Wiggle the mouse to fix the test
  24. A Difficult Bug
  25. The Origins of the Diluvian Network
  26. Zipf your variable names
  27. H.264 and VP8, compared
  28. On Accepting Interview Question Answers
  29. Rate Limiting per User
  30. Write your own Data Structures