Tuesday, July 07, 2020

SAFe: Synchronization vs. Decoupling in PI Planning

Recently, I put my foot in my mouth while tweeting. This turned out to be a good thing, not only because it was an amazing feat of dexterity considering my age and body mass, but also because it led to an interesting conversation, and thus, an opportunity to think things through, and to learn.

I won’t recapitulate the whole conversation in this blog post, because you can easily look it up on Twitter. I’ll provide the tweet that kicked the discussion off though, and give you the gist of the conversation. I have invited everyone who was involved to read and review this blog post, so if I screw anything up, they can jump in and unscrew it again.

It started with me tweeting:

That started up a conversation with Henrik Berglund and Beatric During. It did not take many tweets until I wrote:

“There are plenty of things where SAFe has a very shallow implementation of important ideas. For example, they borrowed the idea of PI planning from Reinertsen, but ignored the limitations he pointed out.”

Beatric During immediately spotted something a bit off, namely the statement “they borrowed the idea of PI planning from Reinertsen, but ignored the limitations he pointed out”. I slipped up bigly there, because that was not a statement I could back up with facts.

A couple of years earlier, I had noticed similarities between a part of Donald Reinertsen’s excellent book The Principles of Product Development Flow: Second Generation Lean Product Development and the SAFe practice of PI Planning. That, and the fact that SAFe borrows several other ideas from Reinertsen’s book, had led me to form a hypothesis that the idea of PI Planning was also derived from what Reinertsen writes in Flow.

Just to be perfectly clear: I really like Flow. Borrowing from it is a very good thing to do. The purpose of writing a book like this is to spread knowledge around, so lots of people can use it.

I have used ideas from that book to help clients, and it has worked out great. I wish more people would read it, and use it.

The mistake I made was not forming the hypothesis. The mistake was presenting it as if it were a fact.

Eric Schön asked me to be more specific, so I tweeted the numbers of the specific sections I was referring to:

On the off chance that you haven’t memorized all the section titles in the book, I’ll provide them here:

  • F11: The Principle of Multiproject Synchronization: Exploit scale economies by synchronizing work from multiple projects.
  • F12: The Principle of Cross-Functional Synchronization: Use synchronized events to facilitate cross function trade-offs.
  • F13: The Synchronization Queueing Principle: To reduce queues, synchronize the batch size and timing of adjacent processes.

Section F11 uses project reviews as an example: by reviewing multiple projects at the same time, you can hold reviews more frequently. In SAFe, this is done through biweekly System Demos, and, every three months, through a PI System Demo.

Section F12 uses reviews as an example of an activity that can benefit from synchronization between functional departments. If you have functional teams, working on functional requirements, you would get the same kind of advantages as Reinertsen describes for departments. SAFe uses (usually) weekly Scrum of Scrums meetings, and during PI Planning, a Management Review and Problem-solving meeting to resolve dependencies and make cross-function trade-offs.

Note that I used the word “functional” several times in the previous paragraph, because I will get back to that later.

In section F13 Reinertsen states:
“If we synchronize both the batch size and the timing of adjacent processes, we can make capacity available at the moment the demand arrives. This leads to a dramatic reduction in queues.”
This is also congruent with how SAFe uses synchronized Sprints, demos, Scrum of Scrums, Product Owner Synchronization meetings, and other artifacts, to synchronize batch sizes and timing of teams working in the same Agile Release Train (ART).

The mistake I made when I tweeted was to assume that my unproven hypothesis, that SAFe got its ideas from Reinertsen’s book, was a fact. All the evidence is circumstantial. The practices in SAFe may have been derived from other sources.

At this point, Donald Reinertsen joined the conversation, and pointed out that the similarities do not in any way prove causality. He did agree though, that the basic principles are the same.

While I agree completely with the principles listed above, I also believe that there is a bit more to it than that. Let’s check out what Reinertsen wrote about in those other sections I mentioned in my tweet:

  • B13: The Principle of Batch Size Diseconomies: Batch size reduction saves much more than you think.
  • B14: The Batch Size Packing Principle: Small batches allow finer tuning of capacity utilization.
  • B15: The Fluidity Principle: Loose coupling between product subsystems enables small batches.

Section B13 explains non-linear effects that make transaction and holding costs skyrocket as batch sizes grow. Reinertsen writes:
“Our calculated optimum batch sizes are likely to be too large. In the presence of the hard-to-quantify large batch diseconomies, it is important to aggressively test our assumptions about optimum batch size.”
Section B14 says that if we want to use people and resources effectively, we need small batches. Reinertsen writes:
“Smaller batches inherently help us get better resource utilization with smaller queues.”
Section B15 is particularly interesting. It says that to work efficiently with small batches, we need to reduce dependencies between them. To quote Reinertsen again:
“If we want to start testing subsystems before the entire system is designed, we need to create independently testable subsystems. As we reduce dependencies, we gain flexibility in routing and sequencing.”
Note that this has implications for the SAFe program organization: Teams cannot work independently of each other if they work on subsystems that have dependencies. At the risk of seriously overquoting, here is Reinertsen again:
“This illustrates the extremely important relationship between product architecture and development process design. Once a product developer realizes that small batches are desirable, they start adopting product architectures that permit work to flow in small, decoupled batches. These loosely coupled architectures, with stable interfaces, enable us to work in parallel on many subsystems. We can work in parallel with low risk, if we are confident that the work we do will ultimately integrate well at a system level.” 
Before putting everything together, I’ll add one more piece to the puzzle. This is a reference I ought to have included in my tweet above, but in my tweeting frenzy, I missed it:

  • F10: The Synchronization Capacity Margin Principle: To enable synchronization, provide sufficient capacity margin.

We need a capacity margin to synchronize. That means the benefits of synchronization come at a cost. This should not be a surprise.

Imagine you have two people running a long-distance race. There are five checkpoints along the way, and you want the runners to arrive at the same time at each checkpoint, and at the goal line.

You may think this sounds a little bit weird, but that is what the synchronization mechanisms in SAFe do. Synchronization, in this case, would enable you to have each checkpoint open for a shorter time, and it would enable the runners to exchange information, and plan the best route to the next checkpoint. It would also slow down the runners. Before each checkpoint, the faster runner would have to wait for the other runner to catch up.

Suppose one runner is the slowest on stretches one, three and five. The other is slowest on two and four. Over the course of the race, both runners will have to slow down, waiting for the other.

Imagine you have twenty runners, and twenty checkpoints. It’ll be a slow race. Replace the runners with software development teams, and the race with a SAFe program.
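
To make the cost concrete, here is a minimal Python sketch of the two-runner race described above. The stretch times (10 and 12 minutes) are made up for illustration, and the function names are hypothetical. The only thing taken from the example is which runner is slowest on which stretch.

```python
# Hypothetical stretch times in minutes; the numbers are made up.
# Runner A is slowest on stretches one, three and five,
# runner B is slowest on stretches two and four.
RUNNER_A = [12, 10, 12, 10, 12]
RUNNER_B = [10, 12, 10, 12, 10]


def free_running_times(runners):
    """Finish time per runner when nobody waits at the checkpoints."""
    return [sum(stretches) for stretches in runners]


def synchronized_time(runners):
    """Shared finish time when everyone waits for the slowest runner
    at the end of every stretch."""
    return sum(max(stretch) for stretch in zip(*runners))


runners = [RUNNER_A, RUNNER_B]
print(free_running_times(runners))  # [56, 54]
print(synchronized_time(runners))   # 60
```

With these numbers, the synchronized race takes 60 minutes, which is longer than either runner would need alone (56 and 54 minutes), because every stretch is paced by whoever happens to be slowest on it.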

I’ve been measuring team velocity, and variation in team velocity, in almost every software project or program I have worked in for the past fifteen years.

Team velocity varies a lot from sprint to sprint. Study a team over a significant period of time, and you will see that the difference between the lowest and highest velocity in a sprint may be a factor of twenty, or even more, for a normally functioning team.

If you have ten teams in an ART, it is a good bet that at least one team will be really slow each sprint. If you have a lot of dependencies, it is likely that other teams will have to wait for that slow team, every sprint.
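
A back-of-the-envelope calculation shows why this is a good bet. The sketch below assumes that every team has the same chance of a slow sprint, independently of the other teams. Both the independence and the 20% figure are assumptions picked for illustration, not measurements, and the function name is hypothetical.

```python
def chance_of_at_least_one_slow_team(teams, p_slow):
    """Probability that at least one of `teams` teams has a slow sprint,
    assuming each team is slow independently with probability `p_slow`."""
    return 1 - (1 - p_slow) ** teams


# Hypothetical: ten teams, each slow in one sprint out of five.
print(round(chance_of_at_least_one_slow_team(10, 0.2), 2))  # 0.89
```

Under these assumptions, roughly nine sprints out of ten would have at least one slow team somewhere in the ART.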

What do you call five slow sprints, and a two-week capacity buffer? A Program Increment.

A Program Increment is three months long. That means you get pretty large batches. If B13 holds true, and reducing batch size saves much more than you think, then large batches will cost you much more than you think. Those large batches will also prevent you from doing the fine tuning mentioned in B14.

How can you see that you have a problem with too many couplings?

For starters, look at the Program Board. If you have many pieces of string connecting work items done by different teams, then you have a problem.

You should be aware that the board does not tell the whole story. Some years ago I created a system for real-time detection of dependencies, and delays due to dependencies, and found, among other things, that what you see on the Program Board is typically less than 25% of the dependencies you have in a single PI.

For a very rough estimate of how many dependencies you have between subsystems, count the number of strings, multiply by four, then multiply by the number of PIs you have had in the program.
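
As arithmetic, the rule of thumb looks like the sketch below. The function name and the sample numbers are made up. The factor of four comes from the observation that the board typically shows less than 25% of the dependencies in a PI, so the result is very rough.

```python
def estimate_dependencies(strings_on_board, number_of_pis, hidden_factor=4):
    """Very rough estimate of dependencies between subsystems.

    strings_on_board: strings counted on the current Program Board.
    number_of_pis:    PIs the program has had so far.
    hidden_factor:    assumes the board shows about a quarter of the
                      dependencies in a single PI.
    """
    return strings_on_board * hidden_factor * number_of_pis


# Hypothetical program: 30 strings on the board, 6 PIs so far.
print(estimate_dependencies(30, 6))  # 720
```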

I created a graph of such an estimate in a program where I worked. It was a red mass of strings, where you could not even see individual strings, except in the periphery. I have come to understand that this is normal for SAFe programs. Few people are interested in reducing the number of dependencies, because few people are aware of the impact.

How do you solve the problem with slow progress?
Reduce batch size!

How do you do that?
Reduce couplings between teams, so they do not have to wait for each other.

How do you do that?
Reduce couplings between software subsystems.

How do you do that?
That depends.

Sorry, but my experience is that the answer to that last question is different for different organizations. I have seen organizations where legacy IT infrastructure causes hard couplings, where inability to grasp object-oriented programming principles causes hard couplings, where the organizational structure causes hard couplings, where the requirements model causes hard couplings, where the organization’s own process experts cause hard couplings, where the architects cause hard couplings, where KPIs and OKRs cause hard couplings… Usually, there is more than one problem at the same time.

Everywhere I have been, the problems have been of known types, with known solutions, just not known in the organizations beleaguered by the problems.

While I cannot give you a solution tailor-made to fit your organization in a blog post, I can give you a very general guideline:
Decouple when you can, synchronize when necessary! 
SAFe does tell you to decouple; it is just that the emphasis is almost exclusively on synchronization. What little information there is on decoupling gets lost.

I believe there ought to be way more emphasis on decoupling, and I believe people must be trained to understand the trade-offs involved.

Unfortunately, in many organizations, that means you need to retrain managers, developers, and architects…and many of your SAFe consultants.

What do you think? Comments are welcome.
