Monday, January 09, 2023

Are You Still Using the Wrong Control Levers in your Agile projects? Part One: Cost and Capacity - The Levers of Death

Which levers should you use? When should you use them? Which levers should you avoid using? There is a subtle hint in the illustration.

Agile methods brought us new ways of developing software, and new ways of managing software projects, programs, and product development. Unfortunately, I have seen very few, if any, organizations that make good use of the powerful new management tools they have at their disposal. Instead, they continue to use the same tools they used before agile, often with predictably bad results.

In this series of articles I’ll provide a walk-through of high-level controls, their pros and cons, and how they relate to each other.

The Levers of Death: Capacity and Cost

Let’s start with the Levers of Death, Capacity and Cost. These levers are the ones I see used most often. They are not necessarily bad in and of themselves (well, firing people is bad), but they are easy to misuse, and often poorly understood.

In most organizations I have worked in, it is assumed Capacity and Cost controls are rather straightforward:

Capacity - Increase capacity, i.e. hire more people, when you want a project to speed up. This of course also increases cost.

Cost - Cut cost, i.e. fire people, when the project gets too expensive. Whether it is too expensive or not is usually measured against a predetermined budget, usually a yearly one. Cutting cost is expected to reduce capacity, but the implications of that are often quietly ignored, because, what else can you do, right?

As we will see, the problems with the hiring and firing approach outweigh the benefits. Fortunately, there are better alternatives, controls that both provide more effective economic control, and are more humane. I’ll explore both bad and good control levers in this article series, but we will start with two baddies: Cost and Capacity.

Problem 1: Vital Information is Missing

The critical path (or paths) is the longest path (in time) from Start to Finish; it indicates the minimum time necessary to complete the entire project.

— The ABCs of the Critical Path Method, by F. K. Levy, G. L. Thompson, and J. D. Wiest, Harvard Business Review, September 1963

When agile methods became popular, they were intended to replace older methods of management. That led to chucking older practices overboard, because they were not needed anymore. The capacity and cost levers were supposed to be, if not chucked overboard altogether, at least relegated to third-rate status. Unfortunately, managers in many companies held on to them as primary project controls.

To make it worse, when organizations use them now, they do it without the benefit of information they had 25 years ago. The reason is that some of the decision support needed to use Capacity and Cost effectively actually was chucked overboard.

One such missing piece of information is the Critical Path. The critical path is the longest path, in time, from start to finish in a project. The critical path is extremely important in old style waterfall projects, because it determines the duration of the entire project.
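To make the definition concrete, here is a minimal Python sketch that finds the longest path through a small task network. The task names, durations, and dependencies are made up for illustration, and the helper `critical_path` is my own naming, not part of any scheduling tool.

```python
from functools import lru_cache

# Hypothetical tasks: name -> (duration in days, list of prerequisite tasks)
TASKS = {
    "design": (3, []),
    "backend": (5, ["design"]),
    "frontend": (4, ["design"]),
    "test": (2, ["backend", "frontend"]),
}


def critical_path(tasks):
    """Return (total duration, task names) for the longest path in time."""

    @lru_cache(maxsize=None)
    def longest_to(name):
        # Longest chain of work that ends with this task.
        duration, prerequisites = tasks[name]
        if not prerequisites:
            return duration, (name,)
        best_length, best_path = max(longest_to(p) for p in prerequisites)
        return best_length + duration, best_path + (name,)

    total, path = max(longest_to(name) for name in tasks)
    return total, list(path)


total_days, path = critical_path(TASKS)
print(total_days, path)
```

In this toy network the path through "backend" is the critical one. Shortening "frontend" would not shorten the project by a single day.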

If you know the critical path, then you also know that along the critical path you have a capacity bottleneck.

When you wanted to add capacity, the trick was to locate the capacity bottleneck, and add capacity there, and nowhere else. This will sound very familiar to anyone who uses the Theory of Constraints (TOC) management paradigm, and the TOC project management method, Critical Chain.

Conversely, if you wanted to reduce cost, you made very sure to reduce it in places that were not the bottleneck, and preferably not on the critical path.

“Dr. Livingstone, I presume?”

Before we look closer at why we need to know the critical path and the project bottleneck before we should even think about hiring and firing, we need to look at a problem the critical path idea was not designed to handle:

In projects with a lot of variation, like software and product development, the critical path and the main bottleneck move around, a lot!

The reason the critical path and the project bottleneck move around is simple: Random variation!

When you build something new, which is what software development is all about, you do not have well defined lists of activities. Instead, you are doing exploratory work, with very limited ability to predict the future. To make it worse, the more detailed your predictions, the more wrong they will be.

Imagine, for a moment, that you are Henry Morton Stanley, on March 21, 1871. It’s the first day of your attempt to find the explorer David Livingstone, who had vanished in central Africa several years earlier. You have prepared carefully for the rescue expedition, but it would be folly to make a predetermined plan of exactly where to go to find Livingstone, and exactly how long it will take.

Stanley’s expedition faced enormous dangers, or impediments, as Stanley would have called them if he had been a Scrummaster: Crocodiles ate the pack animals, and tse-tse flies gave them deadly diseases. Dozens of porters abandoned the expedition, or died from dysentery, smallpox, malaria, and other diseases. Livingstone had been spotted near Lake Tanganyika, so Stanley had a general idea of where to go, but he had to pick up more information along the way. He heard a rumor about a white man in the town Ujiji, and went there, not knowing whether he would find Livingstone, or not. By luck, he did!

Software development is like that. You can’t make detailed plans and schedules, but you can prepare.

Unfortunately, the whole critical-path-and-bottleneck idea requires that you can plan and schedule with a great deal of accuracy. If you can’t plan in detail, you can’t identify the critical path. If you can’t identify the critical path, it’s difficult to identify the bottleneck in the process. If the bottleneck and the critical path keep moving around, a good decision about where to hire and fire today will become a bad decision tomorrow.
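A small simulation can illustrate how random variation moves the critical path. This is only a sketch under assumptions I have invented for the purpose: two parallel paths of three tasks each, and actual task durations drawn uniformly between 50% and 200% of the estimate. Real projects have far messier distributions.

```python
import random


def critical_path_share(trials=10_000, seed=1):
    """Estimate how often each of two parallel paths ends up critical."""
    rng = random.Random(seed)
    # Hypothetical plan: path A is estimated at 30 days, path B at 27 days,
    # so on paper A is the critical path.
    planned = {"A": [10, 10, 10], "B": [9, 9, 9]}
    wins = {"A": 0, "B": 0}
    for _ in range(trials):
        # Assumption: each task takes between 50% and 200% of its estimate.
        actual = {
            name: sum(days * rng.uniform(0.5, 2.0) for days in durations)
            for name, durations in planned.items()
        }
        # The path that takes longest in this run is the critical one.
        wins[max(actual, key=actual.get)] += 1
    return {name: count / trials for name, count in wins.items()}


print(critical_path_share())
```

On paper, path A is always the critical path. In the simulation, path B turns out to be the longer one in a substantial share of the runs, so anyone who hired for path A would often have hired in the wrong place.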

There are things you can do to mitigate the problems with critical path, but I’ll leave that for another article series. In the more than 40 years I have worked in software development, I have yet to see a software project implement anything close to a useful solution.

Today, when companies have scrapped the whole idea of critical path management, fixing the problems with it is of little relevance.

Instead, we will look at what happens when an organization uses the Capacity and Cost levers without knowing where the critical path and the project bottleneck are.

Problem 2: Adding Capacity Adds Work-In-Process

When you do not know where the critical path bottleneck is, and you add more people to a project, you are more likely to add them somewhere other than the bottleneck than to hit the bottleneck itself. That means most of the people you add won’t contribute to speeding up the project. Instead, they will add to Work-In-Process (WIP), queues of unfinished work in the process.

The larger the queues you have, the longer your lead times will be. Unfinished work in queues also adds risk, because you won’t know whether the stuff will actually work with all the other stuff you build until you test it end-to-end. There are plenty of techniques for mitigating that risk, but you can’t eliminate it. Besides, most companies I have worked with are rather poor at this kind of risk mitigation.
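The link between queue size and lead time can be sketched with Little's Law, which says that average lead time equals average WIP divided by average throughput. The throughput of 5 items per week below is a hypothetical figure, and the law only holds for long-run averages in a reasonably stable process.

```python
def average_lead_time(wip_items, throughput_per_week):
    """Little's Law: average lead time = average WIP / average throughput."""
    if throughput_per_week <= 0:
        raise ValueError("throughput must be positive")
    return wip_items / throughput_per_week


# Hypothetical team that finishes 5 items per week, with growing queues.
for wip in (10, 20, 40):
    weeks = average_lead_time(wip, 5)
    print(f"{wip} items in process -> {weeks} weeks average lead time")
```

If throughput stays the same, doubling the WIP doubles the average lead time.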

Build up enough WIP, and your critical path will shift to the path where most of the added WIP is, which will increase project duration.

Thus, adding people won’t buy you the added capacity you think it will.

Communications Overhead

The illustration shows how adding more nodes to a network, i.e. adding team members to a team, or adding teams to a project, causes quadratic growth in communications overhead.

On top of the WIP problem, adding more people will add communications overhead. The communications overhead can start out small for a small project, but it will grow quadratically while you add people linearly. This means when you add more people, productivity per person will go down.

I have 30 people in the project. The problem is I need only 5 people.

— Project manager, in a project I worked in around 2005

Worst case, you can actually reduce capacity when you add people! I have worked in 200-person projects that could have moved a lot faster if there had been only 20 people in the project.
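The quadratic growth is easy to check. The sketch below assumes the worst case, where every person may need to talk to every other person, which gives n(n-1)/2 possible communication paths for n people. Real projects rarely have every path in use, so treat the numbers as an upper bound.

```python
def communication_paths(people):
    """Number of distinct pairs among `people` nodes: n * (n - 1) / 2."""
    return people * (people - 1) // 2


# Team sizes taken from the examples in the text.
for n in (5, 20, 30, 200):
    print(f"{n} people -> {communication_paths(n)} possible paths")
```

Going from 5 to 30 people takes you from 10 to 435 possible paths. Going from 20 to 200 people takes you from 190 to 19,900.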

The short of it is that adding more people will add cost, that we know for certain, but whether it will actually shorten project duration is a bit hit and miss. The larger your project, the higher the probability it will be a miss.

Problem 3: Cutting Cost Reduces Bottleneck Capacity

When management, often belatedly[*], discovers that adding people added a lot of cost, but did not shorten project duration as much as they had hoped, or at all, the natural reaction is to cut costs, in order to make the budget targets.

Unfortunately, if you have 100 people in your project, add 100 more, and then cut 100, you won’t be back where you started. You will be worse off than before!

Why is that? Because we do not know where the bottleneck is, and because the bottleneck often jumps around, it is highly likely that cost cuts affect the bottleneck, either permanently, or intermittently. When that happens, the entire project is delayed.

The illustration shows how reducing capacity at a bottleneck can have much greater effect on duration and cost than expected.

Here is an example from a project I worked in:

Several years ago I was the Scrummaster for a development team where higher level management from time to time pulled one person from the team to work outside the team.

The team had 7 members, so both the team and management expected the remaining capacity to be 86% (6/7 = 0.857 ≈ 86%). However, this is true only if all team members are full stack developers, and if they all are equally productive.

The team had 5 developers and 2 testers. The testers were the bottleneck. Unfortunately, the person pulled from the team was one of the testers. That reduced the total team capacity to 50% (1/2 = 0.5 = 50%).

If that team had been the bottleneck in the entire software development project, removing the tester for 1 week would mean adding 1 week to the duration of the entire project. That also adds 1 week of cost for the entire project, way more money than management had expected.
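The arithmetic in this example can be sketched by treating the team as two steps in sequence, where the slower step sets the pace. The rates below are assumptions chosen only so that the testers are the bottleneck, as they were in the real team: each developer finishes 1 item per week, and each tester can verify 2 items per week.

```python
def team_throughput(developers, testers, dev_rate=1.0, test_rate=2.0):
    """Items finished per week: the slower of the two steps sets the pace.

    dev_rate and test_rate are hypothetical items per person per week.
    """
    return min(developers * dev_rate, testers * test_rate)


baseline = team_throughput(developers=5, testers=2)
without_tester = team_throughput(developers=5, testers=1)
without_developer = team_throughput(developers=4, testers=2)

print("Pull one tester:   ", without_tester / baseline)
print("Pull one developer:", without_developer / baseline)
```

Under these assumptions, pulling one tester halves the output of the team, while pulling one developer does not reduce it at all. Neither result is anywhere near the 86% that the headcount calculation predicts.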

Note that if you have, for example, a large SAFe program, with several Agile Release Trains (ARTs), and 7-10 teams in each ART, you could double the duration of the project by firing that single tester…unless you figure the problem out and hire a new tester. If you do that, and the critical path and the bottleneck then jump to somewhere else, then the new hire will just contribute to creating more WIP, and you are back to Problem 2: Adding Capacity Adds Work-In-Process.

The point is that cutting just a few people from a project may have a disproportionate effect on project duration and cost, and you do not know where it is safe to cut cost! Sometimes it works, sometimes it makes the situation worse.

Problem 4: The Hire and Fire Death Spiral

Using the Capacity and Cost levers can easily drag a software development project into a kind of economic death spiral:

It starts with WIP going up due to statistical variation. More WIP means work will have to wait in queues, which means delays, which means cycle times go up for many teams. This means the project is delayed.

Because of the delays, deadlines are broken. Management tries to fix this by adding more people. This increases communications overhead. It also adds capacity off the critical path, which leads to more WIP accumulating. The net result is that the project does not speed up as much as expected, but it now costs more.

It is common that management keeps trying to speed the project up by adding even more people. This is less and less effective each time. This is partly because the communications overhead goes up quadratically when people are added. Another part is that the larger the project is, the higher the probability of missing the critical path altogether when adding new people.

Eventually management will notice that not only does the project not move forwards as expected, it also burns through money at an alarming rate. That is when the cost cuts come. Some of those cost cuts are likely to hit the critical path. When that happens, project duration goes up. Cost per day does go down, but the increased duration means there are many more days, so total cost goes up.
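Here is a rough sketch of how a lower cost per day can still mean a higher total cost. All the numbers are hypothetical: 100 people at 1,000 per person per day, a 10% headcount cut, and a duration that grows from 200 to 250 days because some of the cuts hit the critical path.

```python
def total_cost(people, cost_per_person_day, duration_days):
    """Total project cost = people * daily cost per person * number of days."""
    return people * cost_per_person_day * duration_days


# Hypothetical project before and after a 10% cost cut.
before = total_cost(people=100, cost_per_person_day=1_000, duration_days=200)
after = total_cost(people=90, cost_per_person_day=1_000, duration_days=250)

print(f"Before the cut: {before:,}")
print(f"After the cut:  {after:,}")
```

In this made-up case, the cost per day drops by 10%, but the total cost rises from 20,000,000 to 22,500,000. With these numbers, the cut only pays off if the duration grows by less than about 11%.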

Cost cuts continue until management notices that we now have even more delays, so management starts hiring more people, and the Hire and Fire cycle starts over again.

The whole thing continues until the project either stumbles over the finishing line, or the organization gives up and pulls the plug on the whole mess.

Very, very rarely, management stops, decides the whole depressing cycle is daft, and decides to find a better way. When that happens, management often goes for the sweet promise of increased Business Value.

Next: Part 2: Business Value, its Use and Abuse.

[*] Agile methods have a built-in early warning system, monitoring queues. Unfortunately, organizations that rely on the Capacity and Cost levers usually do not use queue monitoring, at least not very well.
