Reducing Uncertainty in Software Delivery
I recently attended a half-day online event that InfoQ held on Reducing Uncertainty in Software Delivery. The thing that made this half-day event different was the underlying focus on testing, but without a single tester present in the talks or panel discussions. The majority of speakers were developers, and there were even a few Engineering Managers, Product people and a CEO or two. It also appeared to me that none of them had come from a traditional testing background. However, they all made the points that a good tester would, and then some. The advantage they appear to have over testers is that they were able to incorporate their knowledge of their own discipline to give a much broader view than just focusing on the testing itself.
A key theme that I’m seeing from these talks is that these organisations are spending a lot of effort on learning from failure, either by analysing failures that have happened in production or by actively encouraging teams to cause them. It was only the more advanced organisations that were taking this approach, but the others were not far behind. Why? To make their systems even more resilient. Their approach appears to be using Site Reliability Engineers (SREs) to work alongside their engineering teams, both to help them do the work and to enable the teams to extract the learnings from it too. This isn’t simply having chaos testing to cause failures, or postmortems for production failure analysis, but also helping teams with the people side of working with and handling failure productively.
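To make "actively causing failures" a little more concrete, here is a minimal fault-injection sketch of my own (not from any of the talks); the service name, URL and the kill/restore helpers are all hypothetical.

```python
import requests

CHECKOUT_URL = "https://staging.example.com/checkout"  # hypothetical endpoint
DEPENDENCY = "recommendation-service"                   # hypothetical dependency


def dependency_failure_experiment(kill_dependency, restore_dependency):
    """Deliberately take out a dependency, check the system degrades
    gracefully rather than failing outright, then always restore it."""
    kill_dependency(DEPENDENCY)
    try:
        response = requests.post(CHECKOUT_URL, json={"basket_id": "test-123"}, timeout=5)
        # Hypothesis: checkout still works (in a degraded mode) without recommendations.
        assert response.status_code == 200, "checkout should survive the outage"
    finally:
        restore_dependency(DEPENDENCY)
```

The interesting part isn’t the code itself but what the team does afterwards: a failed hypothesis becomes a learning exercise rather than a 3 AM surprise.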
The talks that caught my interest were Building in reliability (SRE at Gremlin), User Simulation for Rapid Outage Mitigation (SRE at Uber), and a panel discussion on Testing in production (with two CEOs, a Product person and an Engineering Manager).
Now, this is a small sample: the speakers are very experienced, work or have worked at some of the best-known web-based organisations (Google, Uber etc.), and are US focused too. But I’m seeing a lot of things that testers could advocate for being pursued and implemented by Site Reliability Engineers (SREs). For example:
- testing in production,
- building in observability,
- pushing testing earlier in the process,
- encouraging developers to test their own work.
The advantage SREs have is that they already have the technical ability and are now starting to build out the socio-technical skills that they were lacking previously. These organisations have another advantage in that they are heavily focused on learning from their failures. So when they do get things wrong, they work hard to make sure they extract as much value from that failure as possible. On top of that, some of these organisations are actively causing failures within their systems to further limit the catastrophic failures that could occur. Some of these organisations have never had a tester and, from the looks of things, never will. If you’re pursuing a true continuous improvement strategy, testers could look like a bottleneck in the process, slowing down the flow of information. How can testers enable the flow of information, and what can they add that makes this information even more valuable?
I’ve pulled together my summaries of the talks I found interesting below.
Talk: User Simulation for Rapid Outage Mitigation
Uber uses an alternative approach to end-to-end testing because their system is so big that no one person can ever fully understand it. Instead they use composable tests: each team creates tests for their own part of the system, but mixes in pre and post steps built by the teams they depend on. These are then run in a simulation environment that lets them see how the system will perform when that change is deployed. To incentivise teams to build the tests they use a mixture of pain (being woken up at 3 AM by a production failure) and a mitigation support team (who hold their hands at 3 AM): if you had these tests, you wouldn’t be awake at 3 AM trying to mitigate the issue. They also don’t try to solve the issues at 3 AM but mitigate them, so that others can also learn from the outages that affect their system.
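To show what I mean by composable tests, here is a minimal sketch of my own interpretation (not Uber’s actual tooling); all the names and steps are hypothetical.

```python
# Each team publishes reusable pre/post steps for its part of the system;
# another team composes those steps around its own check. Hypothetical names.

class TestStep:
    """A reusable step that another team can mix into its own test."""
    def __init__(self, name, run):
        self.name = name
        self.run = run  # callable taking a shared context dict


def compose_test(pre_steps, team_check, post_steps):
    """Run dependent teams' pre steps, this team's own check, then post steps."""
    def run():
        context = {}
        for step in pre_steps:
            step.run(context)
        try:
            team_check(context)
        finally:
            for step in post_steps:
                step.run(context)
    return run


# Example: a payments team composes steps owned by the rider and driver teams
# around its own fare check (all hypothetical).
request_trip = TestStep("rider: request trip", lambda ctx: ctx.update(trip_id="t-1"))
complete_trip = TestStep("driver: complete trip", lambda ctx: ctx.pop("trip_id", None))

def check_fare_charged(ctx):
    assert ctx.get("trip_id") is not None  # stand-in for a real fare assertion

payments_end_to_end = compose_test([request_trip], check_fare_charged, [complete_trip])
payments_end_to_end()
```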
Talk: Building in reliability
An interesting talk focusing on the availability of systems within organisations. The speaker walked through how you could go from 99% availability to 99.99%, and how it is a learning journey. They used a simple analogy of crawling, walking and running to get your availability towards whatever makes sense for your organisation: essentially, can you do it manually, can you script it, and can you automate it? I find their slide a great way to help others understand what the outcomes are at each stage going from 99% to 99.99%.
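To make those numbers concrete, here is my own back-of-the-envelope arithmetic (not from the talk) for the downtime each availability level allows per year:

```python
# Rough yearly downtime budgets: 99% allows roughly 3.65 days of downtime a
# year, while 99.99% allows only about 53 minutes.
MINUTES_PER_YEAR = 365 * 24 * 60

for availability in (0.99, 0.999, 0.9999):
    downtime_minutes = MINUTES_PER_YEAR * (1 - availability)
    print(f"{availability:.2%} availability -> {downtime_minutes:,.0f} minutes of downtime per year")
```

Going from a few days of tolerated downtime to under an hour is what forces the move from manual fixes, to scripts, to automation.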

Panel: Measuring Value Realisation Through Testing in Production
I usually only see these types of conversation on tester-focused panels, but none of this panel were testers. Tester-focused panels typically focus on testers testing in production, but this was very much focused on learning from real users in production. The interesting thing from my perspective was that they made all the points that I would expect a reasonably experienced tester to bring. In some cases, because their roles sit outside of testing, they focused on costs and benefits beyond simply testing in production, e.g. the downsides of A/B testing, or the product management mindset shifts that need to happen to embrace learning from users rather than whatever the roadmap they have decided says.
In some ways, testers testing in production almost act like middlemen for the learning that happens during testing. Could it be that in some cases testers are getting in the way of teams learning effectively from testing in production?