Root cause analysis? You're doing it wrong

(entropicthoughts.com)

68 points | by davedx 2 days ago

12 comments

CobrastanJorji 2 hours ago
Many years ago, I worked at Amazon, and it was at the time quite fond of the "five whys" approach to root cause analysis: say what happened, ask why that happened, ask why that in turn happened, and keep going until you get to some very fundamental problem.
I was asked to write up such a document for an incident where our team had written a new feature which, upon launch, did absolutely nothing. Our team had accidentally mistyped a flag name on the last day before we handed it to a test team, the test team examined the (nonfunctional) tool for a few weeks and blessed it, and then upon turning it on, it failed to do anything. My five whys document was mostly about "what part of our process led to a multiweek test effort that would greenlight a tool that does nothing that it is required to do."
I recall my manager handing the doc back to me and saying that I needed to completely redo it because it was unacceptable for us to blame another team for our team's bug, which is how I learned that you can make a five whys process blame any team you find convenient by choosing the question. I quit not too long after that.
- BeetleB 1 hour ago
  My litmus test for these types of processes: If root causes like "Inflexible with timelines", or "Incentives are misaligned (e.g. prioritizing career over quality)" are not permitted, the whole process is a waste of time.
  Edit: You can see others commenting on precisely this. Examples:
  https://news.ycombinator.com/item?id=45573027
  https://news.ycombinator.com/item?id=45573101
  https://news.ycombinator.com/item?id=45572561
- grogers 1 hour ago
  Usually another team's failure is covered by their own independent report. That simplifies creating the report since you don't need to collaborate closely, but also prevents shifting the blame on to anyone else (because really, both teams had failures they should have caught independently). E.g. as the last why:
  Why did the testing team not catch that the feature was not functional?
  This is covered by LINK
  - CobrastanJorji 57 minutes ago
    If a root cause analysis is not cross team, how deep can the analysis possibly be? "Whoops, that question leads to this other process that our team doesn't directly control, guess we stop thinking about that!"
    - taeric 19 minutes ago
      If your root cause is cross team, then you wind up having to make some implicit assumptions on what the other team could have done. It is akin to ending with "because the gods got angry." Not really actionable.
      This is a classic "limit the scope of the feature." You want the document to be written and constrained to someone that is in a position to impact everything they talk about. If you think there was something more holistic, push for that, as well.
      Note you can discuss what other teams are doing. But do that in a way that is strictly factual. And then ask why that led your team to the failure that your team owns.
    - lijok 31 minutes ago
      Pretty deep. It forces you to account for failures in other domains
- hshdhdhehd 1 hour ago
  Interesting one.
  My first thought is: why is rolling out a new system to prod that is not used yet an incident? I don't think "being in prod" is sufficient. There should be tiers of service, and a brand new service should not be on a tier where it having teething issues is an incident.
  > what part of our process led to a multiweek test effort that would greenlight a tool that does nothing that it is required to do
  I would be interested to see the doc, but I imagine you'd branch off the causes. One branch of the tree is: UAT didn't pick up the bug. Why didn't UAT pick up the bug? ... (you'd need that team's help).
  I think that team would have something that is a contributing cause. You shouldn't rely on UAT to pick up a bug in a released product. However, just because it is not a root cause doesn't mean it shouldn't be addressed. Today's contributing cause can be tomorrow's root cause!
  So yeah, you don't blame another team, but you also don't shield another team from one of their systems needing attention! The wording matters a lot, though.
  The way you worded the question seems a little loaded. But you may be paraphrasing? 5 whys are usually more like "Why did the papaya team not detect the bug before deployment?"
  Whereas
  > what part of our process led to a multiweek test effort that would greenlight a tool that does nothing that it is required to do
  Is more emotive. Sounds like a cross-examiner's question, which isn't the vibe you'd want to go for. 5 whys should be 5 drys. Nothing spicy!
  - NikolaNovak 1 hour ago
    That's how we do it - there are "branches" to most of our RCAs, and in fact, we have separate sections for root cause analysis (things which directly or indirectly contribute to incident, which are a branched / fractal 5 whys) and lessons learned (things which did not necessarily contribute to incident but which upon reflection we can do better - frequently incident management or communication or reporting or escalation etc).
    It took a while for all the teams to embrace the RCA process without fear and finger pointing, but now that it's trusted and accepted, the problem management stream / RCA process is probably the healthiest / best viewed of our streams and processes :-)
  - CobrastanJorji 1 hour ago
    It was an incident because it was important to leadership. It was a marketing targeting feature that was advertised to the local executive with some excitement by the management, so they were excited to share the results of it, and when there weren't results on the anticipated launch date, they wanted answers, which meant the manager treated it as an incident.
- sanman8119 1 hour ago
  A very relatable experience: a lot of pressure to stop the Whys at the dev team and not question larger leadership or organizational moves.
- nobrains 56 minutes ago
  The way I handle this with my teams: any bugs caught by the QA team go against the developers. Any bugs caught after QA green-lights the go-live go against the QA team. (Of course, discounting any bugs that are deemed acceptable for go-live by the PM.)
- vivalahn 47 minutes ago
  The next org you went to, did they also use the Five Whys or did they get by with Four True Colors instead?
- tayo42 1 hour ago
  5 whys can be very political. You can make it take whatever direction you want to tell whatever story you want. I don't get why it's cargo culted the way it is.
  - stonemetal12 9 minutes ago
    No, people can be very political. It doesn't matter what the process is.
    Hell, people even legislated the value of PI that one time.
captainkrtek 2 hours ago
At a large cloud provider I held a role for a bit in the “safety” organization that was tasked with developing better understanding of our incidents, working on tooling to protect systems, and so on.
A few problems I faced:
- Culturally, a lack of deeper understanding or care around “safety” topics. The powers that be are inherently motivated by launching features and increasing sales, so more often than not you could write an awesome incident retro doc and just get people who are laser-focused on the bare minimum of action items.
- Security folks co-opting the safety things, because removing access to things can be misconstrued to mean making things safer. While somewhat true, it also makes doing jobs more difficult if not replaced with adequate tooling. What this meant was taking away access and replacing everything with “break glass” mechanisms. If your team is breaking glass every day and ticketing security, you’re probably failing to address both security and safety.
- Related to the last point, a lack of introspection as to the means of making changes which led to the incident. For example: a user uses ssh to run a command and runs the wrong command -> we should eliminate ssh. Rather than asking: why was ssh the best / only way the user could effect change to the system? Could we build an API for this with tooling and safeguards before cutting off ssh?
AstroJetson 15 minutes ago
I did a very long RCA on a problem. My management at the time was really BIG into looking at ALL THE CAUSES. They wanted HUGE fishbone diagrams to show that we had looked at everything. This was in the days of having huge drum plotters, so the diagrams could be 36" and many feet long.
So I did what they wanted and the root cause was:
On December 11 1963 Mr and Mrs Stanley Smith had sexual intercourse.
I got asked what that had to do with anything and I said, "If you look up a few lines you'll see that the issue was a human error caused by Bob Smith, if he hadn't been born we wouldn't have had this problem and I just went back to the actual conception date."
I got asked how I was able to pin it to that date and said, "I asked Bob what his father's birthday was and extrapolated that info."
I was never asked to do an RCA again.
hshdhdhehd 1 hour ago
Please keep working on that piece; I think it will be very useful for incident reviewers.
Someone said the quiet part out loud!
"""
Common circumstances missing from accident reports are:
Pressures to cut costs or work quicker,
Competing requests for colleagues,
Unnecessarily complicated systems,
Broken tools,
Biological needs (e.g. sleep or hunger),
Cumbersome enforced processes,
Fear of the consequences of doing something out of the ordinary, and
Shame of feeling in over one’s head.
"""
- VirusNewbie 1 hour ago
  > Pressures to cut costs or work quicker,
  > Unnecessarily complicated systems,
  > Broken tools,
  > cumbersome enforced processes,
  I have seen all of these specifically called out in Post Mortems at Google, so that's a plus in my book.
tptacek 3 hours ago
Some of the same thoughts in Richard Cook, which was a brain-altering read for me:
https://how.complexsystems.fail/
- Waterluvian 2 hours ago
  Hah #7 really hits home. Every RCA I’ve been a part of always ends up pointing to systemic failures in the org at the top level, because walking the tree always leads there. You can’t blame any one person or system for a failure in isolation. It’s usually some form of, “this is ultimately a consequence of miscalibrating risk associated to business/financial decisions.”
  I forget where I heard this, but: “you manage risk, but risk cannot be managed.” I.e., there is no terminal state where it’s been “solved.” It’s much like “culture of safety.”
  - anonymars 1 hour ago
    I was pondering this a bit recently while going through The Wire.
    Though unsatisfying, it feels like a lot boils down to "shit rolls downhill" or "fish rot from the head down".
- wrs 2 hours ago
  For a deeper dive there’s a somewhat old but excellent book on most of these points called Normal Accidents. [0]
  [0] https://en.wikipedia.org/wiki/Normal_Accidents
- growse 1 hour ago
  This paper so affected me that I scrapped a talk I was writing three days ahead of an (internal) conference and wrote a talk about this paper instead!
- bombcar 2 hours ago
  I think #7 strikes (but barely misses) the point: root cause analysis is not root blame analysis, but we often combine them in our minds.
jph 2 hours ago
Thanks for the article and shoutout - CAST is great and I use it extensively with tech teams.
Causal Analysis based on Systems Theory - my notes - https://github.com/joelparkerhenderson/causal-analysis-based...
The full handbook by Nancy G. Leveson at MIT is free here: http://sunnyday.mit.edu/CAST-Handbook.pdf
kqr 2 days ago
Author here. Please note this is an early draft/stream-of-consciousness. Feel free to read and share anyway but my actual published articles hold a higher standard!
- maybelsyrup 3 hours ago
  I caught your related comments and eventual link to this post in another HN thread earlier this week and really liked them / it. I'm glad you posted it by itself!
gtirloni 2 hours ago
I recommend attending the next STAMP Workshop offered by MIT if you have a chance: https://psas.scripts.mit.edu/home/stamp-workshops
gmuslera 2 hours ago
Not all problems (and systems) are alike. Simple approaches like Occam's Razor will probably work well enough for most of them, but the remaining 10% will need deeper digging into more data and correlations.
exmadscientist 1 hour ago
I agree with a lot of the statements at the top of the article, but some of them are just nonsense. This one, in particular:
> If we analyse accidents more deeply, we can get by analysing fewer accidents and still learn more.
Yeah, that's not how it works. The failure modes of your system might be concentrated in one particularly brittle area, but you really need as much breadth as you can get: the bullets are always fired at the entire plane.
> An accident happens when a system in a hazardous state encounters unfavourable environmental conditions. We cannot control environmental conditions, so we need to prevent hazards.
I mean, I'm an R&D guy, so my experience is biased, but... sometimes the system is just broke and no amount of saying "the system is in a hazardous state" can paper over the fact that you shipped (or, best-case, stress-tested) trash. You absolutely have to run these cases through the failure analysis pipeline, there's no out there, but the analysis flow looks a bit different for things that should-have worked versus things that could-never-have worked. And, yes, it will roll up on management, but... still.
opwieurposiu 2 hours ago
I feel like half the time issues are caused by adding some stupid feature that nobody really wants, but makes it in anyways because the incentive is to add features, not make good software.
People rarely react well if you tell them "Hey this feature ticket you made is poorly conceived and will cause problems, can we just not do it?" It is easier just to implement whatever it is and deal with the fallout later.
bluGill 2 hours ago
Root cause works better if you can come back the next time the same thing happens and find a different root cause to fix. Keep repeating until the problem doesn't happen often enough to care about anymore.
If the result/accident is too bad, though, you need to find all the different faults and mitigate as many as possible the first time.