Wednesday, December 19, 2007

Back in Time


Argh, I'm still busy. I am writing this at 02:23 A.M., having just finished working for the day.

I upgraded my Mac OS to 10.5 (Leopard) recently. I had several reasons for doing this (being a geek is one), but there were two in particular that might be of interest:

I wanted a better backup system than the one I have used previously. Leopard comes with the much touted Time Machine.

I have done a lot of work on how the choice of requirements analysis methodology affects the final product lately. This has revived my interest in an interaction design methodology, Goal-Directed Design. I don't know if Time Machine was designed using GDD, but it was certainly designed using a very similar philosophy.

Time Machine is extremely simple to use. The only thing you need to do to get going is to plug in an external hard drive (I use a 160GB Iomega dedicated to backups only) and start the program. Time Machine backs up the internal hard drive. From then on, it makes hourly backups.

Time Machine is defined by what it doesn't do as much as by what it does. There is no way to configure the time and frequency of backups. No choice about when to do full or incremental backups.

The interesting thing is that the lack of features increases the value of Time Machine enormously. The backup system I used previously had a lot of features. The developers had tried to avoid using technical jargon in the GUI. Unfortunately, they just managed to confuse everyone. It was difficult to tell whether you had configured the system to make full or incremental backups, for example. Making backups with that system was a chore.

With Time Machine, I no longer have to care. Backups are taken care of invisibly, without me even noticing. Restoring files is easy. As you can see in the picture above, the interface for restoring backed up files is a bit spaced out. (The background star field is animated.) Still, it is extremely easy to use.

Someone put a lot of thought into this. For example, most people restore files rarely, so they will be perpetual beginners. Nobody but a system administrator wants to learn how to use backup software. It should just be there and fix the problem. Time Machine does that.

I also find it interesting that there is no way to come up with Time Machine if you use functional requirements. For all the talk about use cases and user interaction design, most development teams I see still use functional requirements, or something very close to it. They often dress the functional requirements up a bit by calling them "use cases", but that is as far as it goes.

Interaction design is a different beast entirely. The focus is on identifying user goals, and designing software so that users can achieve them. Some goals are directly related to tasks, such as "write a report that looks good", or "make sure everyone in the company gets paid on payday".

Other goals are quite different, for example "feel good about the job I'm doing", "keep in touch with my family", or "be valued by my boss". Designing software that helps users achieve those goals is a challenge, but it is possible.

Enterprise systems have many users, and these users can have conflicting goals. A user may for example have "be in control of my own work" as a goal, while a supervisor may have "keep close tabs on what everyone is doing" as a goal. An Evaporating Cloud, a conflict resolution diagram, can help resolve such conflicts.

If you have read my old posts, you may recall that I was working on a book quite some time ago. I haven't had time to work on it for some time, but the project isn't dead, just comatose. I might be able to resuscitate it in January. If I manage that, I'll incorporate material on how to use interaction design and TOC business analysis techniques to build a clear picture of what users really need.

Saturday, October 06, 2007

Busy

In case you were wondering, I haven't gotten tired of blogging. I have been working a lot. I am working on some stuff that will appear in the blog, but my family and my work take precedence, so it is going forward very slowly.

Sunday, September 09, 2007

The Kanban Juice War

I posed what I believed to be an innocuous problem for the kanbandev group at Yahoo and inadvertently started a bit of a row. The problem was as follows:

In a hotel restaurant there is a table with a juice dispenser. Next to the juice dispenser is a stack of trays with glasses. Sometimes the glasses run out, and a queue of customers builds up. What is the best way to prevent glass outages?

I had a solution, but I wasn't happy with it. It involved using a photocell to detect when the glasses were in danger of running out. One complication was that the serving personnel could not easily see whether the glasses were running out. The table with the juice dispenser stands in an alcove, and the serving personnel, who are usually very busy, move along paths that make it difficult to see the trays. If there is a queue of people waiting for juice, they obscure the trays, and any simple kanban mechanism. Of course, if the arrival rate of thirsty restaurant guests is great enough, a queue may form even if there is plenty of juice and glasses.

I quickly got several suggestions on how to implement various kanban mechanisms to replenish the glasses. All of the suggestions were good, but only a few will work in the environment at the restaurant. (Or so I believe. I've been there only once, and it was a while ago.) This was what I had expected, but one member of the group told me that my problem was a distribution problem, not a process scheduling problem, and therefore it would be more appropriate to ask the question in another group.

This sparked some heated debate. For once, I didn't wade into the thick of it. There were some big guns on both sides of the debate. One thing that became clear to me is that there was a difference of opinion about some of the fundamentals: what a pull system is, when to apply kanban or DBR mechanisms, and so on.

I like diagrams. Mailing lists are not diagram friendly, so I thought I'd blog about my point of view. It allows me to describe the system under discussion with a diagram like this:


As you can see, the system has two legs, a glass leg, and a juice leg. (Some of it is conjecture. For example, I do not know if the restaurant dilutes concentrated juice, but it seems a reasonable supposition.)

The basic conflict, as I understand it, is described in the following evaporating cloud:


I asked about the reason for kanban not being applicable, and got the answer that the system is not a pull system because the operator of the system - the restaurant management - is not able to control the arrival rate of the work - the customers who want a drink of juice.

This seemed strange to me on two counts:
  • Pull systems control the flow of resources in a production process by replacing only what has been consumed. This has nothing to do with controlling the arrival rate of customers. (Controlling the arrival rate of work orders is important, but it is not something that defines a pull system.)
  • There was an assumption that the system was not a pull system, but suitable for a TOC distribution system approach. TOC distribution systems are pull systems.
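The first point, that replenishment in a pull system is driven by consumption and is independent of the arrival rate of customers, can be sketched in a few lines of Python. The buffer sizes and the arrival distribution below are invented for illustration only:

```python
import random

random.seed(7)

# A pull system replaces only what has been consumed.  The arrival
# rate is outside the operator's control here, and it does not
# matter: replenishment is triggered by consumption, not a forecast.
capacity, reorder_point, level = 50, 10, 50
stockouts = replenishments = 0

for hour in range(100):
    arrivals = random.randint(0, 15)    # uncontrolled customer arrivals
    for _ in range(arrivals):
        if level == 0:
            stockouts += 1
            continue
        level -= 1
        if level <= reorder_point:
            level = capacity            # pull signal -> refill the buffer
            replenishments += 1

print(stockouts)  # 0
```

Because the refill is triggered by consumption itself, the buffer never empties, no matter how erratic the arrivals are.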
When cracking a cloud, it is important to be able to prove assertions like the two I made above. I checked several references, but these will be sufficient to prove my point:

For definitions of the term pull system, see

For explanations of the TOC distribution system approach, see
  • Gerald I. Kendall, Viable Vision, Ch. 7, "Distribution: From Push to Pull"
So the claim that the problem isn't a kanban problem because the system under discussion isn't a pull system does not hold. The system is a pull system, because the act of using glasses should trigger replenishment of the trays. The problem is that the current mechanism for doing so, visual inspection of the trays, is too unreliable.

Maybe there is some other reason to consider this a distribution problem? Well, distribution problems are primarily about sizing, composing, and placing buffers. The TOC distribution system is based on two ideas:
  1. Pool the inventory where there is the greatest predictability
  2. Use a pull system to replenish exactly what has been consumed at short intervals
Would such a solution be of help? No. The primary problem here is not where to put buffers. Increasing the glass buffer would not help, unless the buffer is big enough to last through breakfast. Even then, the restaurant might run out of glasses by dinner time. It is the mechanism for replenishing the buffer of glasses that is faulty. The second step does tell us to use a pull system, but it does not, of course, specify the exact mechanism to use.

On the other hand, proving that this is not primarily a distribution problem, does not prove that it is a problem that can be solved with kanban (or DBR). What does kanban provide? A visual aid that tells when to replenish a buffer in a pull system. Sounds rather abstract. How about translating it into something specific:

A visual aid that tells when to place a new tray of glasses next to the juice machine.

Much better. This is exactly what a kanban solution is. It also satisfies step 2 in the TOC distribution mechanism. Could it be that kanban systems can extend throughout the entire supply chain? Yes they can.
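As a sketch only (the tray size and the number of trays in circulation are my guesses, not the restaurant's), the translated rule fits in a few lines:

```python
# Kanban rule for the juice table: each tray of glasses carries a
# card; an emptied tray's card is the signal to bring a new tray.
TRAY_SIZE = 20              # glasses per tray (assumed)
KANBAN_CARDS = 3            # trays in circulation at the table (assumed)

glasses_left = TRAY_SIZE * KANBAN_CARDS
cards_returned = 0          # empty-tray cards waiting for the staff

def take_glass():
    global glasses_left, cards_returned
    glasses_left -= 1
    if glasses_left % TRAY_SIZE == 0:   # a tray just emptied
        cards_returned += 1             # its card goes back to the kitchen

for _ in range(25):         # twenty-five guests each take a glass
    take_glass()

print(glasses_left, cards_returned)  # 35 1
```

The point of the card is that it replaces visual inspection of the trays themselves: it can be posted where the staff actually pass, which is exactly what the alcove prevents for the trays.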

The kanban idea originated in studies of how goods are replenished on the shelves of supermarkets. This is a classic kanban application, but it also ties into distribution problems. Supermarkets don't make anything, they are retail outlets, so they are fed by distributors.

A kanban system is just a signaling mechanism for replenishing resources used by a process stage in a pull system. Whether that pull system is a manufacturing system, or a distribution system does not really matter. That is not to say a manufacturing system and a distribution system are necessarily alike in all other ways. They are not.

Thus, the Kanban Juice War was caused by misunderstandings:
  • It was not understood that the juice dispenser system under discussion is a pull system
  • It was not understood that the TOC distribution system is a pull system
  • It was not understood that it is irrelevant whether the system under discussion is a production system or a distribution system, because kanban is applicable to all pull systems
The evaporating cloud was broken in two places:
Focusing on buffering won't work, because the problem is the faulty triggering mechanism. Choosing between kanban and distribution solutions is also wrong, because kanban solutions may be a part of distribution solutions that are based on pull systems.

When someone misunderstands, the responsibility may lie with the sender of the message as well as with the receiver. As I am the original sender, I may be guilty of garbling the problem description. Hard for me to say. The message seemed clear to me, but that proves nothing. If you understand the problem description here, that does not prove anything either: I have restated the problem, and I am using illustrations, which may make a lot of difference. On the other hand, if you don't get the problem explanation in this post, then I have probably goofed in my original posting to the mailing list too.

The whole thing may seem like a storm in a glass of juice, but I find the Juice War interesting. If we "experts" can't agree on how to define a very simple problem like this, how are we going to sell TOC and Lean solutions to the people who need them so desperately? Why do TOC experts themselves use the Evaporating Cloud technique to resolve conflicts so rarely? Is it because the technique does not work? Or, is it because it does work very well? (I suspect the latter, probably because I find clouds very useful.)

Saturday, September 01, 2007

A Visit to BNI


A little over a week ago, I visited BNI Spar Hotell in Gårda, Gothenburg. I was invited as a guest by Martin Richards. Last Thursday, I visited again.

BNI, Business Network International, is an international business referral organization. The idea is simple. A BNI group consists of a number of representatives for various companies. To avoid internal competition, only one company from each trade is allowed. The members have a breakfast meeting once a week, where they exchange business referrals. The members also carry each other's business cards, and look out for opportunities to help each other.

At my first visit, I was impressed by the well structured and tightly focused meeting. Of course, from my Theory Of Constraints perspective, Throughput is the game, and the group does deliver.

I won't mention any figures here, but on average the membership is well worth the cost. I do not know how the Throughput is distributed over the group members, but everyone I met was very happy with how their system works.

My second visit, last Thursday, was even more fun. Martin is a business communications coach. He held a ten minute talk about his work, and I shot it on video. Comparing his performance with mine reminds me of how woefully out of practice I am at speaking English. (Writing in English uses different parts of the brain.)

Business referral networks are interesting to me, and my clients, so I am going to delve deeper into how they work, their pros and cons, and how to optimize their performance.

Wednesday, August 15, 2007

Kevin on Blame

Kevin Rutherford made a few comments on my posting on risk aversion. He also posted a link to a posting of his own, where he writes about several articles on related topics, including mine.

Kevin's posting is well worth reading, and so are the articles he references.

Sunday, August 05, 2007

Risk Aversion

If you happen to be a CEO, have you noticed a certain sluggishness in your department managers? A reluctance to take decisive action? Perhaps not. The problem is pervasive, but it is more noticeable from the bottom of the organization than from the top. The reason is that even though problem causes can often be found high up, the effects are usually felt further down.

Most organizations have a built-in resistance to taking action. It is generally safer for a person, especially a manager, to take no action at all. The illustration below shows how it works:

The Current Reality Tree (CRT) shows that there are two root causes contributing to exaggerated risk aversion:

  • Detected mistakes are punished
  • Decisions resulting in no action are not recorded
Because of this, when faced with a problem, it is a much safer strategy for an individual to take no action than to take action. This discourages managers from acting. Employees at the lower levels of the corporate hierarchy hesitate to bring problems to their managers, because they know the manager won't like it. As a result, the organization becomes risk averse.
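The individual's incentive can be made concrete with a back-of-the-envelope expected-value calculation. All probabilities and payoffs below are invented for illustration:

```python
# Payoffs to the individual manager, not to the organization.
# Assumed: detected mistakes are punished, while decisions that
# result in no action are never recorded, and so never punished.
p_action_fails = 0.3
penalty_if_detected = -10   # career cost of a punished mistake (assumed)
reward_if_succeeds = 2      # modest credit for a successful action (assumed)

ev_action = (1 - p_action_fails) * reward_if_succeeds \
            + p_action_fails * penalty_if_detected
ev_inaction = 0.0           # unrecorded, therefore never punished

print(round(ev_action, 2), ev_inaction)  # inaction has the higher payoff
```

Even with a 70% success rate, the rational individual chooses inaction. The organization, which bears the cost of the missed opportunities, would choose otherwise.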

Because information about mistakes of inaction is not recorded, upper management will probably not even be aware of the problem. (There is an old saying that managers are like mushrooms: keep 'em in the dark and feed 'em horseshit. For an individual, this is certainly a viable tactic in a risk averse organization.)

The problem, from the organization's point of view, is that being overly risk averse can hurt it, or even kill it. Most business organizations that get into serious trouble do so because of what they did not do, rather than what they did. Even when they get into trouble because of something they did do, the root cause is usually a failure to do something right earlier.

I once worked in an organization where the team I belonged to uncovered some serious problems in the way we developed software. We also worked out how to fix the problems so that they would never occur again. Just when we were set to go, our company merged with a very risk averse organization, and we got a new manager. His first directive was "don't change anything, for any reason". Partly because of this, the department collapsed. Most people in it left the company, and profits from software development dropped like a rock.

Still, it was a viable tactic for the departmental manager. He remained with the company, and the last I heard, he had gotten promoted.

How does one go about improving the way an organization handles risk? By removing the causes that make managers avoid taking action. The Future Reality Tree (FRT) below shows one way to do it:
To begin with, the organization must teach its members that making mistakes is OK. It is OK because it is from mistakes that we learn. Success brings no new knowledge; it only confirms what we already know how to do. Failure increases knowledge (assuming we are willing to learn from our mistakes), and so has value. This value may often be greater than the cost incurred by making the mistake in the first place, provided that the organization, not just the individual, learns from it, and adapts to avoid such mistakes in the future.

An organization can do this by rewarding mistakes that lead to learning something new. It is important to ensure that mistakes are not repeated. One way to do this would be by punishing repeated mistakes, but this does not work very well. A better way is to ferret out the root causes, and fix those. By the way, root causes are almost always systemic, so it is very rare that an individual really is to blame.

It is also necessary to start recording decisions not to take action. Inaction probably causes more problems than action in your organization. If mistakes due to inaction are not recorded, the organization cannot learn from them.

There is one important thing about evaluating mistakes: a common error is to evaluate a decision based solely on the result. This does not work very well. Most decisions have a wide range of possible outcomes. Randomness plays a big part, much larger than we are usually comfortable admitting. Rather, decisions should be evaluated on whether they are strategically and tactically sound, using the currently approved management model. (The most famous example of a company that consistently does this is Toyota.)
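A tiny simulation illustrates how large a part randomness plays. The outcome distribution below is an arbitrary assumption; only the spread matters:

```python
import random

random.seed(1)

# One strategically sound decision, repeated many times.  The
# expected value is positive, yet individual outcomes vary wildly.
outcomes = [random.gauss(mu=5, sigma=20) for _ in range(1000)]

mean = sum(outcomes) / len(outcomes)
losses = sum(1 for x in outcomes if x < 0)
print(round(mean, 1), losses)  # mean near +5, yet hundreds of losses
```

Judging each repetition by its result alone would punish the same sound decision hundreds of times; judging it against the management model evaluates what the decision maker actually controlled.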

When the outcome of a decision differs from the outcome predicted by the management model, then the cause must be evaluated separately. Was it a random result (i.e. we don't know why, but we do not believe the model was at fault), was the model misapplied, or, horror, is the management model wrong? If it is, the model must be updated or replaced.

Be conservative. You should not adopt a management model without a lot of evidence of it being a good one. Neither should you throw one out without a lot of evidence it is a bad one.

Friday, August 03, 2007

www.henrikmartensson.org is Back in Business

I have just relaunched www.henrikmartensson.org. The site has been rewritten from the ground up. Right now, the material you'll find there is mostly about my TOC consulting business. Over time I will add material on TOC, Lean, and agile.

In addition to the kallokain.blogspot.com address, the Kallokain blog can also be reached at blog.henrikmartensson.org.

Please allow up to 48 hours (that would be Sunday) for the address information to propagate over the Internet. (Depending on where you are, the address may work right now. Try it!)

It's a Policy Constraint!

There was some interest in my process improvement story. Both Tobias Fors and Torbjörn Kalin have commented.

I talked to the manager at the café. She had noted the head-bang problem long ago, and wanted to fix it by replacing the low hanging lamps. Upper management won't let her. The lamps match other lamps in the bookstore.

A possible solution would be to shorten the power cords.

Monday, July 30, 2007

Lucy in the Chocolate Factory

This little gem explains why push systems are a bad idea:

What do you think happens when an organization based on push systems tries to go agile without changing the rest of the organization?

Friday, July 27, 2007

The Trap Has Been Reset

In my previous post I wrote about improving systems vs. improving processes. Yesterday, I went back to the book café in my little anecdote, and found that someone had moved the tables back to their original position. The insidious lamp trap has been reset for the next customer.

As I pointed out, improving systems instead of just processes is worthwhile, but hard to do. It does take a lot of effort, usually over an extended period of time.

Maybe I'll move the tables back under the lamps again this afternoon.

Wednesday, July 25, 2007

Process vs. System Improvement


A couple of days ago I sat in my favorite book café and worked on a Current Reality Tree. Opposite from where I was sitting, a couple with a small child sat down. The man sat at the right table. The woman sat at the left table. (Right and left are from the position of the viewer, i.e. me, throughout this article.) After a few minutes, the man rose up, tried to go between the tables, and hit his head on a lamp hanging from the ceiling.

The figure above shows the tables where the couple was sitting. If you look closely at the picture, you won't be surprised that a couple of minutes later, when the man rose up again, he hit his head, just like he did before. The lamps are not centered over the tables, so rising up to the left of either table is likely to result in a bump on the head.

The third time the man rose up, he instigated a process change. Instead of trying to go between the two tables, he went to the right of his table. This time, he didn't hit his head.

Most people would be satisfied with this. Indeed, we do such process changes all the time, little accommodations to work around the imperfections in the systems we are part of.

However, changing just the process left the couple with unsolved problems. One problem is that if there is a small process variation, for example, if the man forgets to move to the right when he rises, he is likely to bump his head again. Another problem is that the woman might bump her head too. As you can see in the picture, the lamp above her table was also off center.

At this time I pointed out the problem with the off center lamps to the couple, and suggested that they should move the tables a little bit to the left. The man grinned (a bit sheepishly) at me, and moved his table, but not the other one. Nor did the woman.

This is very interesting behavior. The man recognized a possible systems improvement when it was pointed out to him, and moved the table. Both he and his partner failed to apply the same solution to the table standing next to it. This is a failure to generalize a solution. They could see the problem with the rightmost table, yet did not recognize that there was an identical problem with the table to the left.

Eventually the couple left, without either one getting bumped on the head. The woman, though she did not implement the systems improvement, i.e. moving the table, did rise carefully, so as not to bump into the lamp above her table.

When they had left, I moved the leftmost table, like this:
In general, we are much better at adapting processes than we are at improving systems. On the other hand, improving systems tends to yield much better results. In this case, all customers sitting down at either of the two tables will enjoy a head-bump free stay at the café. Maybe the café will attract a little bit more business as a result. Customers receiving head-bumps might be less likely to return.

To me, the story illustrates one of the core properties shared by TOC and Lean. Both aim at improving systems. That is why they are so effective. It is also one of the reasons why they are so hard to implement.

Monday, July 02, 2007

Review: Throughput Accounting vs. Throughput Accounting

Been too busy to blog again, but I have been reading. For some time now, I have been studying whatever material I can lay my hands on about Throughput Accounting (TA). TA is a management accounting system based on the Theory Of Constraints (TOC).

Who needs another accounting model? Just about everyone, it turns out, because the standard model, GAAP Accounting (Cost Accounting), is so fraught with problems that it is positively dangerous. Though GAAP Accounting works sometimes, it does not always come up with the right answers. TA is a more reliable alternative.

This review covers not one, but two TA books, the recently published Throughput Accounting by Stephen Bragg, and Thomas Corbett's Throughput Accounting, from 1999. (Yes, they have the same title.)

In the book Throughput Accounting: A Guide to Constraint Management, Stephen Bragg explains how TA works, and he does it very well. The focus is on accounting for manufacturing companies. Thus, some of the specifics are not directly applicable to the software industry. However, all the principles are.

The book discusses not only the basics of TA, but also how to use TA for performance measurement and reporting. Bragg also discusses the differences between TA and GAAP Accounting financial statements, and how to construct an accounting system that uses TA for reporting to management, and GAAP Accounting for external reports.

If there is one weakness in the book, it is that Bragg asks the reader to just accept the TA view of a company. He does not prove GAAP Accounting wrong, at least not from the start.

For me, this was not a problem. I have read up on TA before, and have used it in my work when doing financial analysis, and when I have modeled project value streams. For a reader with a background in GAAP Accounting, it might be harder to make the switch to the TA system.

Nevertheless, this is my favorite TA book, though I would advise anyone without previous TOC or TA experience to also read Throughput Accounting, by Thomas Corbett. Corbett spends a lot more energy describing what is wrong with standard GAAP Accounting before offering TA as an alternative.

Corbett's book does not cover quite the range that Bragg's book does, so I'm glad I have read both. Even if you have read one, the other will contribute something of value. When working through the examples in Corbett's book, I found one minor error, but as the end result was correct, I believe it is a typo, not an error in the calculations.

Both books are worth longer reviews, but as both are fairly short, there is an alternative to reading a comprehensive review: read the books!

Monday, June 04, 2007

By The Book 3: Throughput Accounting

This is the third part in a series on management accounting for software development. The material is from a book I am working on. To make sense out of it, read part 1 and part 2 first.

Let's look at the diagram modeling team A and B again:



The developers in team A are already working at full capacity. Thus, they are the constraint of the process.

Adding 8 hours of work for a developer means adding 8 hours to the constraint. Adding 8 hours to the constraint is the same as adding 8 hours to the whole project. The cost is €175 times the number of developers and testers, plus an additional €200 (8 × 4,000 ÷ 160 = 200) for the team leader. This works out to €1,425 (7 × 175 + 200 = 1,425) in increased project costs. In addition, there may be a considerable cost associated with delaying delivery. I'll leave that can of worms alone. (Read about it in Lean Software Development by Tom and Mary Poppendieck, or check out a blog posting by Gustaf Brandberg.)

Team B has an easier time dealing with defects. The developers are idle about 16 hours/month. 8 of those hours have been wasted in the current iteration, but the remaining 8 hours per developer are just enough to fix the defect. Project throughput won't be affected at all. There will be no project delay. There is no change in Operating Expenses because of the defect. Thus, the defect can be fixed at zero cost. This of course assumes the cost of reporting a defect is negligible. A complex defect reporting procedure would add to the cost of the project every time a defect is found.
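The throughput arithmetic for the two teams can be collected in a few lines, using the figures from the text:

```python
HOURS = 160                 # working hours per month
DEV_SALARY = 3500           # EUR/month, developers and testers
LEAD_SALARY = 4000          # EUR/month, team leader

def delay_cost(hours, team_size):
    """Cost of extending the whole project when the constraint slips."""
    member_rate = DEV_SALARY / HOURS
    lead_rate = LEAD_SALARY / HOURS
    return hours * (team_size * member_rate + lead_rate)

# Team A: the developers are the constraint, so an 8-hour fix
# delays the whole project (7 team members plus the team leader).
cost_team_a = delay_cost(8, team_size=7)

# Team B: the fix fits inside the developers' slack, so throughput,
# the delivery date, and Operating Expenses are all unchanged.
cost_team_b = 0

print(cost_team_a, cost_team_b)  # 1425.0 0
```

The same defect, the same 8 hours of work, yet two entirely different costs, because only one of the fixes consumes constraint time.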

Note that up to now we have dealt with one particular problem with Cost Accounting: that Cost Accounting treats all parts of a system (project, company, conglomerate, whatever) as separate and equally valuable. In reality, different parts make contributions of different value (at different times, just to make it interesting). I hope I have shown that Throughput Accounting offers an alternative that makes more sense.

There is another problem with Cost Accounting: do you remember the overhead costs mentioned earlier? Even in the rare cases where managers do understand that the overhead allocations distort figures, they tend to dismiss the distortion as unimportant. In the book Throughput Accounting, Thomas Corbett contends that it can be quite serious. It can make the most profitable product, project, or company seem the least profitable, and vice versa.

This is bad enough, but there is another area where overhead allocation can gum things up: when you use ROI calculations for decision support. Agile methodologies do this. Scrum, for example, has a very tight focus on maximizing ROI. During the projects, this is done in a pretty freeform manner, and Cost Accounting never enters the picture. (Unless the team has to ask for money to do or buy something).

Before a project begins is another matter. For example, Scrum uses ROI calculations to provide decision support for go/no go decisions. (Scrum does not tell how to do it, just that it should be done.) Using ROI calculations is a good thing, but when a company uses a dysfunctional management accounting system, the effects can be devastating.

Of course, you'll have to wait for the next post to get the details.

Thursday, May 31, 2007

By The Book 2: The Way Of the Cost Accountant

Let's explore the problem in my previous post using Cost Accounting (CA). CA is the generally accepted management accounting method. It is used by companies all over the world, and it is the accounting method mandated by law in most countries (that I know of):

CA makes the calculation like this: A developer has to work 8 hours to fix the defect. The developer makes €3,500 per 160-hour month. This works out to €175 (8 × 3,500 ÷ 160 = 175). However, CA allocates overhead costs from management, rent, electricity, and other things. These costs are shared by all workers. The overhead costs can be quite large. Let's be conservative and say they are €1,000 per developer and month.

A developer's total cost is now €4,500/month (3,500 + 1,000). Cost Accounting then gives us 8 × 4,500 ÷ 160 = 225. That is, it would cost €225 to fix the defect. The cost is the same for teams A and B.
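The Cost Accounting calculation, written out in code with the same figures as above:

```python
HOURS = 160                 # working hours per month
SALARY = 3500               # EUR/month for a developer
OVERHEAD = 1000             # EUR/month allocated overhead (assumed figure)

raw_cost = 8 * SALARY / HOURS               # salary cost of the 8-hour fix
loaded_cost = 8 * (SALARY + OVERHEAD) / HOURS  # with overhead allocation

# The same loaded cost comes out for team A and team B alike:
# Cost Accounting cannot see that their constraints differ.
print(raw_cost, loaded_cost)  # 175.0 225.0
```

Note that the formula has no input for where the team's constraint is, which is exactly why it gives the same answer for both teams.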



Let's do a thought experiment. Have a look at the figure above. The teams spend working time in order to produce software. Both teams are equally productive, but their limitations are different. When the developers in team A need to fix a defect, it directly impacts their productivity, which is the productivity of the entire team. The developers in team B, on the other hand, have time to spare. They can fix the defect with less impact, perhaps no impact at all, on overall production capacity. Even without getting into details, it is clear that the impact on the two teams will be entirely different.

Cost Accounting told us the cost of fixing the defect is the same in both cases. We have just seen that it can't be. Something fishy is going on. It could be me, I am not an accountant, so I could be misunderstanding how Cost Accounting works. On the other hand, I have checked Cost Accounting out fairly thoroughly, so that is probably not it.

Could it be Cost Accounting itself? Some people believe so. Check out what Wikipedia says about Cost Accounting. Here is a quote:

...using standard cost accounting to analyze management decisions can distort the unit cost figures in ways that can lead managers to make decisions that do not reduce costs or maximize profits.
Why should software developers care? Because software development projects are economic engines. If the theory governing project economics is flawed, then project management will be flawed. This should be of special concern for anyone working with agile, because agile will work only if management does not make mistakes like the one above.

In my book, I show a connection between the economic mistake you have just seen and crappy code. The somewhat abbreviated version is that if you treat each part of a software development project as independent of every other part, then it becomes important to focus on task completion time. If you focus on task completion time, then you won't waste time on trifles like refactoring, writing unit tests, and doing domain design. Even if you want to, management will keep pushing you to start a new task.

Next time I'll work through the example using Throughput Accounting. Throughput Accounting is based on the Theory Of Constraints. As we shall see, the results will be quite different. This will affect management behavior, which will create a different environment for developers.

Sunday, May 27, 2007

By The Book

I am actually getting somewhere with the book I am working on. Here is a small excerpt:

Consider two project teams, A and B. Each team has one project manager, four developers and three testers. Each team member makes €3,500/month and works 160 hours/month. The project managers make €4,000/month. Both teams produce 80 Story Points worth of functionality in a week.

In team A, the developers work very hard, but the testers have time to surf the Web now and then. In team B, it is the other way around. The testers are pressed to keep up, so the developers slow down a bit to avoid deluging them with more work than they can handle. On the same day, a defect is found in each project. Both defects will take eight hours for a developer to fix. What is the cost to team A? What is the cost to team B?

Give it a try. I'll publish the solution in a couple of days. (Hint: the answer is slightly less obvious than it looks.)

Thursday, March 15, 2007

Fixes That Fail

Many companies use a standard response in troubled times: they appoint a new CEO. The new CEO takes measures. If the CEO is a former sales person, he focuses on improving sales. If the CEO is a former accountant, there will be a savings program. A CEO with a technical background will focus on developing new products. The outcome is usually one of the following:
  1. The company takes a nosedive, crashes and burns.
  2. Nothing happens. The gradual decline continues. Eventually
    yet another CEO is appointed.
  3. There is improvement. Sometimes the improvement is radical.
    After a while, the rate of improvement abates. Then the
    company begins to backslide. Eventually another CEO
    is appointed.
  4. There is a sustained improvement in the company's financial
    health. This is rare though.
The third alternative, initial success followed by backsliding, is in many ways the most interesting outcome. First of all, it is a common outcome. Second, it often seems inexplicable. The CEO proved his management genius with the initial success, so why can't the improvement be sustained? Third, the explanation model for this alternative can teach us a lot about the other alternatives too.

Though the particulars may vary, the underlying causes are usually the same. Figure 1 shows what happens in a common case. A new CEO is appointed. In this example, the CEO is an experienced manager, with a strong background in sales.



Figure 1: How Success Begets Failure.

The Capacity Constrained Resource (CCR) of most companies (about 70%, according to Gerald I. Kendall in Viable Vision: Transforming Total Sales Into Net Profits) is in sales. To improve sales, you do not necessarily have to be a good corporate manager. All you need to be is a good sales person.

If the CEO is a former sales person, he will know what to do as long as sales is the CCR. Most likely, he will do much of the grunt work himself. This is why so many managers, CEOs or not, are so busy with sales. Sales may be the CCR, but the real reason the manager spends so much time and energy on it, is that it is the one thing he knows how to do.

Of course, if the manager has some other background, he is just as likely to continue with activities in his area of expertise. For example, bosses who are former programmers often continue to make software design decisions, or write code. (The difference is that because the initial CCR is most likely to be in sales, it is less likely that a CEO with some other background hits the CCR with his "improvement" measures. Therefore, the decline phase is likely to set in immediately.)

As long as sales really is the CCR, there will be improvement. The manager is considered brilliant, a winner. Unfortunately, if sales improves enough, a new area in the company will become the CCR. Even worse, because the CEO spent so much of his personal energy on improving sales, the rest of the company is in decline, speeding the emergence of a new CCR.

When the new CCR emerges, the CEO lacks the management knowledge to identify and correct the problem. Thus, the initial success is followed by decline.


Figure 2: Going Down.

Most managers are extroverts, and have a great deal of confidence. This is not bad per se, but it may contribute to trapping a CEO in the reinforcing loop shown in Figure 2. Well grounded confidence in one area of expertise may easily turn into overconfidence in another. Most CEOs I have met are not given to introspection, and this makes it hard for them to discover when their own actions are becoming part of the problem. This is especially true if those same actions have led to success before.

With a sound foundation in management theory, a problem like this can be dealt with, or avoided entirely.

A CEO who knows a little bit of Systems Thinking will recognize the problem above as a specific case of the Fixes That Fail systems archetype, and counter the problem using the recommended Systems Thinking tactics. (I'll leave those for you to discover if you are interested. Google a bit, or buy and read The Fifth Discipline by Peter M. Senge.)

Lean managers do not fall into the trap as easily, because a Value Stream Map will tell them where the problem areas in the value stream are. Once diagnosed, Lean offers plenty of simple, reliable tools to deal with almost any process problem.

Theory Of Constraints managers use buffer monitoring to find the problem areas, and the Focusing Steps to deal with them wherever they emerge.

Statistical Process Control (SPC) can alert managers to emerging problems before they become too serious to deal with, but does not offer generic solution techniques.

Early on I stated that the explanation model above offers some hints about what happens in the other common cases:

Alternative 1: Crash and burn is most likely to happen when the situation is really bad from the beginning, and the management measures are totally inappropriate. Why a CEO would do something wildly off base is explained in Figure 1.

Alternative 2: Nothing the CEO does affects the CCR. The decline continues, and the CEO is fired before a new CCR emerges. (If a new CCR does emerge, there may be a sudden acceleration in the rate of decline.)

Alternative 4: The CEO is able to use generic management principles to deal with problems as they emerge. This is the steadiness of purpose and consistent drive to improve seen in successful Lean companies like Toyota.

Friday, March 02, 2007

Truth, Damned Truth, and Statistics, Part 2

This is part 2 in a series of articles. You can read the first part if you click here.

In the previous article in this series, I discussed how to measure and visualize the Throughput part of the Return On Investment (ROI) equation. This time I'll focus on Inventory (I).

As you may recall, Inventory is defined as "money tied up in the system". Inventory includes all the equipment you use in a development project: computers, software, chairs, tables, etc. In a manufacturing process, Inventory would also include the partially finished material being worked on in the process itself, the Work-In-Progress, or WIP.

In a software development process there is no WIP, but there is an analog: partially finished software. This includes requirements, design specifications, and any code that is not fully finished, tested, and ready for release. Partially finished software is sometimes called Design-In-Progress, or DIP. The DIP has a monetary value. The value of the DIP is the amount of money lost producing DIP that is never used because of requirements changes, plus the cost of removing partially finished software from the system. This cost can be quite high.

The less DIP there is, the less risk we take that requirements changes will cause the project to waste effort.

We cannot have zero DIP, because then the project team would have nothing to work on, but it is obvious we want to keep the DIP as low as possible.

There are two basic approaches to entering material into a production process: push systems, and pull systems. As it turns out, these two models have vastly different effects on the DIP.

A push model, as used in RUP and other traditional software development methodologies, means that each step in the production chain pushes material to the next step: a project leader assigns work to team members, analysts work as fast as possible to push work to designers, designers push to programmers, programmers push to testers.

Push systems are designed to be cost efficient, that is, they are an attempt to maximize the number of work items per hour, per person. Unfortunately, the production capacity of the different parts of a production process is never balanced. Some parts will have higher capacity than others. In addition, the capacity varies over time. As a result, queues of DIP build up in the process. Perhaps the testers can't keep up, or the DBA can't keep up with the programmers, or the analysts finish a truckload of requirements before the designers even get started.

In a pull system, each step in the production chain signals the step before when it is ready to take on more material. Work is pulled into the system no faster than it can be handled. This keeps the DIP to a minimum. The best-known pull system technique is called Kanban. It is the technique used by Extreme Programming, though the name "Kanban" is rarely used there. There is a slightly different pull system model, Drum-Buffer-Rope, which is used by Feature Driven Development. All agile methodologies use pull systems. It is part of what makes them agile. (It is also one of the most misunderstood parts of agile.)
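To make the pull idea concrete, here is a minimal Ruby sketch of a WIP-limited process step. The class and method names are mine, for illustration only; a real Kanban system uses physical or visual signals, not code:

```ruby
# A process step that only accepts new work while its WIP
# (work-in-progress) is below a fixed kanban limit.
class Step
  attr_reader :wip

  def initialize(limit)
    @limit = limit
    @wip = 0
  end

  # The pull signal: upstream may hand over work only when true.
  def ready?
    @wip < @limit
  end

  def pull_work
    return false unless ready?
    @wip += 1
    true
  end

  def finish_work
    @wip -= 1 if @wip > 0
  end
end

testing = Step.new(2)  # testers can handle two stories at a time
accepted = (1..5).count { testing.pull_work }
puts accepted  # => 2: WIP never exceeds the kanban limit
```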

Figure 1: Design-In-Progress in Agile and Traditional Projects.

Figure 1 shows how DIP builds up in two projects. The agile project, using a pull model, never processes more than five stories concurrently. From this it is possible to surmise that the team either consists of five solo developers, or ten developers working in pairs. The process runs smoothly, with the DIP winding down to zero at the end of each iteration.

The non-agile project is different. DIP is allowed to build unchecked, and gets much higher than the DIP in the agile project. Thus, the non-agile project risks losing more money if requirements change. We can also see that in iterations two and three, DIP builds until late in the iteration, and then suddenly drops. There is a big batch of material building up and getting released at the end of the iteration. This indicates that there is a process step at the end that can't keep up with the steps before it. This is likely to be the testers, working frantically to test what the developers produce.

If the testers are overloaded with work, and the project still makes the iteration goals every time, it is most likely that the testers have been pressured into skipping tests. This in turn may indicate that the testers are using a brute force "test everything" approach. This is bad, because it may indicate that the quality of the code the team produces is low. It is much better if the developers use defect prevention techniques (unit testing, pair programming, refactoring, etc.) to keep the code quality acceptable. Of course, it may also indicate that the testers just do not know how to test statistically significant samples. Either way, it is up to the management to step in and fix the process.

Note that the DIP does not quite reach zero at the end of an iteration. There is a backlog of unfinished work building up. This is a project destined for large delays. Traditional methodologies, like RUP, may cause enormous amounts of DIP to build, but they have no mechanism for monitoring it! This is why project delays often come as a surprise to management very late in a project. (That, and the fact that Management By Fear causes many project managers to actively hide information about problems in a project.)

How To Measure DIP

As you can see, monitoring the DIP can tell you a lot about the state of a project. Measuring the DIP is easy. In the project I am currently working in, we use an andon board, i.e. a whiteboard with a table that has a column for each step in the development process. At the beginning of an iteration, the team writes a sticky note for each story and places it in the leftmost column, the backlog column. When someone begins to work on a story, he or she moves the corresponding sticky note to the next column. Eventually, each story has travelled to the Done! column.

To measure the DIP, all I have to do is keep track of when stories are moved from one column to the next. I use a spreadsheet to do this. Then I use a small Ruby program to generate a DIP graph (and several other graphs).
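As a rough illustration, here is how such a DIP calculation might look in Ruby. The input format, a start day and a done day per story, is an assumption for this sketch; my actual spreadsheet layout is not shown here:

```ruby
# Sample transition data: the day each story left the backlog and
# the day it reached Done (nil if it is still in progress).
stories = [
  { started: 1, done: 3 },
  { started: 1, done: 5 },
  { started: 2, done: 4 },
  { started: 4, done: nil },
]

# DIP on a given day = stories that have been started on or before
# that day, but are not yet done.
def dip_on(stories, day)
  stories.count do |s|
    s[:started] <= day && (s[:done].nil? || s[:done] > day)
  end
end

(1..5).each { |day| puts "day #{day}: DIP = #{dip_on(stories, day)}" }
```

Feeding one such DIP value per day to a charting library (I use Gruff) produces a graph like Figure 1.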

Thursday, March 01, 2007

Five Things Managers (Usually) Don't Get About Agile

Clash 1: Goals
Traditional development methods set three goals:
  • Keep within budget
  • Implement all requirements (often specified before the project starts)
  • Make the deadline
Agile projects set the following goal:
  • Maximize the Return On Investment (ROI)
To maximize the ROI, an agile project changes the following variables:
  • Cost
  • Scope
  • Time
In other words, what traditional software development, and most companies, set as fixed goals are just the things an agile project needs to change to reach its goal.

Clash 2: Push vs. Pull
Another major difference is that most companies, and traditional development methods, are based on push systems, while agile is based on pull systems:
  • In a push system, each step in a production line does what the previous step tells it to do.
  • In a pull system, each step in a production line does the work that the following step signals it is ready to receive.
The situation when orders start travelling from both ends of the command chain can at best be described as chaotic.

Clash 3: Cost Efficiency vs. Lead Time Reduction
A third difference is that traditional methodologies, and traditionally managed companies, seek to raise the cost efficiency, while agile methodologies seek to reduce lead times. There is a connection between the two:
  • When cost efficiency goes up, lead times will also go up
  • When lead times are reduced, cost efficiency goes down
It should be IOTTMCO (Intuitively Obvious To The Most Casual Observer) that if corporate management seeks to push cost efficiency up, while a project team seeks to push lead times down, there will be a clash.

Clash 4: Responsibilities
A fourth difference is that agile is based on systems thinking, and views most problems as systemic, and therefore the responsibility of the system owners, which is of course the management. Traditional management, on the other hand, views most problems as special cause problems, and leaves it to the work force to do the firefighting.

Clash 5: Attitude to Knowledge and Training

Scientific Management-based management seeks to make people easily replaceable by reducing the amount of training each person needs to do his or her job. (This shows very clearly in the RUP philosophy of dividing work into very narrowly defined roles.) In original Scientific Management, the idea was that profound knowledge of how processes work should reside with management. Today, the SM philosophy has led to management reducing its own knowledge about how systems work to practically nothing.

In contrast, agile emphasizes very broad training, so that each individual can fit as many jobs as possible. Management is expected to have very deep knowledge of systems thinking, Lean, TOC, and of course of agile philosophy and practices.

www.henrikmartensson.org is Down, but Kallokain is Up

As you may have noticed, www.henrikmartensson.org has been down for some time. The site went down because some of the software broke when the host system was upgraded. Before putting the site up again, I will fix some broken links and other problems.

My schedule is rather full these days, mostly with the joys of fatherhood, but also with a major writing project, and a few other things, so getting the site going again will take some time.
I will spend time blogging again though. The past few months have been incredibly hectic, but the past few weeks my life has settled down to a somewhat saner pace. Besides, my writing addiction is stronger than ever.

Truth, Damned Truth, and Statistics

Statistics may not be what floats your boat, but it can tell you some important things about software development projects.

In this article, I'll show how even a very simple analysis of some basic project measurements can be exceedingly useful. The data in the following analysis are fake, but the situations they describe come from real projects.

A commercial software development project is a system that turns requirements into money. Therefore it makes sense to use economic measures, at least for a bird's-eye view of a project. If you have read my earlier postings, you will be familiar with the basic equation describing the state of a business system, the Return On Investment (ROI) equation:

ROI = (T - OE) / I

T = Throughput, the rate at which the project generates money. Because it is hard to put a monetary value on each unit of client valued functionality, this is commonly measured in Story Points, Function Points, or Feature Descriptions instead. I'll use Story Points, because it is the unit I am most comfortable with.

OE = Operating Expenses, the money expended by the project as it produces T. This includes all non-variable costs, including wages.

I = Inventory, money tied up in the system. This is the value of unfinished work, and assets that cannot be easily liquidated, like computers and software. Also called "Investment".
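For concreteness, the equation can be written as a small Ruby function. The sample figures below are invented, purely to show the shape of the calculation:

```ruby
# The ROI equation above: ROI = (T - OE) / I.
# T and OE are per period; I is the money tied up in the system.
def roi(t, oe, i)
  (t - oe) / i.to_f
end

# Example: EUR 50,000 of value delivered in a period, EUR 35,000
# operating expenses, EUR 60,000 tied up in the system.
puts roi(50_000, 35_000, 60_000)  # => 0.25
```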

In this installment, I'll focus on Throughput, and a close relative, the Defect Rate. In a following article, I will discuss measuring Investment and Operating Expenses.

Throughput Diagram with Confidence Interval

Let's begin by looking at Throughput. The total Throughput of a project is the value added by the project. That is, if you get €300,000 for the project, then T = €300,000. A project is not an all-or-nothing deal. For example, the €300,000 project may have six partial deliveries. That would make each delivery worth €50,000. Normally, each partial delivery comprises several units of client valued functionality. A unit of client valued functionality is, for example, a use case. Thus, a use case represents monetary value.

Use cases do not make good units for measuring Throughput, because they vary in size. Measuring Throughput in use cases would be like measuring one's fortune in bills, without caring about the denomination. However, a Story Point (SP), defined, for the purposes of this article, as the amount of functionality that can be implemented during one ideal working hour, has a consistent size. That is, a 40 SP use case is, on average, worth twice as much as a 20 SP use case. (This is of course a very rough approximation, but it is good enough for most development projects.)

We can estimate the size of use cases (or stories, if you prefer the XP term), in SP. Once we have done that, it is possible to measure Throughput. Just total the SPs of the use cases the team finishes in one iteration. The word "finished" means "tested and ready to deploy in a live application". No cheating!


Figure 1: Throughput Per Week

Figure 1 shows the Throughput per week for a project over a 12 week period. As you can see, the productivity varies a lot. The diagram has two lines indicating the upper and lower control limits of the Throughput. Within the limits, or within the confidence interval, as it is also called, the project is in statistical control.

The control limits in this case are ±3 standard deviations from the mean Throughput. What this means is that if the Throughput stays within the confidence interval each week, we have a stable production process, and we can say, with roughly 99.7% certainty, that future productivity will stay within the limits set by the confidence interval.
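Computing such control limits takes only a few lines of Ruby. The weekly figures below are made up for the sketch, and I use the population standard deviation for simplicity; a statistics package such as Statarray does the same job:

```ruby
# Control limits as described above: mean +/- 3 standard deviations
# of the weekly Throughput (in Story Points).
weekly_sp = [12, 18, 9, 15, 22, 11, 14, 17, 13, 16, 10, 19]

mean     = weekly_sp.sum.to_f / weekly_sp.size
variance = weekly_sp.sum { |x| (x - mean)**2 } / weekly_sp.size
sigma    = Math.sqrt(variance)

ucl = mean + 3 * sigma  # upper control limit
lcl = mean - 3 * sigma  # lower control limit

# Points outside the limits signal an out-of-control process.
out_of_control = weekly_sp.reject { |x| x.between?(lcl, ucl) }
puts "mean=#{mean.round(1)} LCL=#{lcl.round(1)} UCL=#{ucl.round(1)}"
puts "points outside the limits: #{out_of_control.size}"
```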

If the Throughput is outside the control limits, as it is in Figure 1, the development process is out of control. This means that it is not possible to make predictions about the productivity of a week in the future based on past productivity. It also means it is useless for management to ask the developers how much work they can finish the next week.


Figure 2: Throughput Per Month

A project that is unpredictable in a short perspective may well be predictable if you take a longer perspective. Figure 2 shows Throughput data for the same project, over the same time period as Figure 1. The difference is that the Throughput data has been aggregated into monthly figures. As you can see, the productivity for each month is pretty even, and well within the statistical control limits. The development team can, with roughly 99.7% certainty, promise to deliver at least 47 SP each month. They can also promise an average production rate of 59 SP per month.

Given the Throughput, and the total number of SP for the project, it is possible to predict how long the project will take with fairly good accuracy. Obviously, such measurements must be continuously updated, because circumstances change. Requirements are added or removed, the team may lose or add members, and code complexity tends to increase over time. All these factors, and many more, affect the Throughput, and may also cause a project to change from a controlled to an uncontrolled state.
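A minimal forecast along these lines, in Ruby: the 47 and 59 SP/month figures come from the example above, while the backlog size is invented for the sketch:

```ruby
# Rough completion forecast: remaining Story Points divided by the
# measured monthly Throughput.
remaining_sp = 400  # invented backlog size

months_at_average = remaining_sp / 59.0  # at the mean rate
months_at_floor   = remaining_sp / 47.0  # at the lower control limit

puts format('between %.1f and %.1f months', months_at_average, months_at_floor)
```

As the text says, the forecast is only as good as the latest measurements, so it should be recomputed every month.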

Note that just because a project is in a controlled state, it does not mean the project is going well. A project that consistently has 0 Throughput is in a controlled state. Having the project in a controlled state does mean that we can make predictions about the future.

Having a project in a controlled state also means that if a project has problems, the causes are most likely to be systemic. That is, the problems arise from poor organization, a poor process, or outdated policy constraints, too little training, the wrong training, etc. Statistical Process Control (SPC) people call systemic causes "common causes".

Common cause problems must be fixed by the process owners, i.e. the management. The developers can't do much about it because the problem is out of their control.

When something happens that puts a project outside the predicted confidence interval, the cause is a random event. SPC people call random events "special causes". Special causes have to be dealt with on a case by case basis.

In practice, special causes are fairly rare. In the book Out of the Crisis (1986), W. Edwards Deming states that about 6% of all problems in production processes are due to special causes. The vast majority, 94%, of all problems have systemic causes. Many of the problems we experience in software projects are due to confusing special causes with common causes, i.e. causes of systemic failure.

A couple of years ago I worked in a project plagued with problems. Out of all the problems besetting the development team every day, only one was a special cause problem: we lost an afternoon's work because we had to vacate the building. There was a fire in an adjacent building belonging to another company. The fire was a special cause of delay because it occurred only once. Had fires been a recurring problem, it would have been a common cause problem, and management would have had the responsibility to deal with it. (Three solutions off the top of my head: get the other company evicted; teach the other company safety procedures so there are no more fires; move to other, safer premises.)

Let's focus on common causes. They are the most interesting, because the vast majority of problems fall into this category. The problem with common causes is that management usually fails to identify them for what they are. The failure to identify common causes is of course itself a systemic failure, and has a common cause. (I leave it to you to figure out what it is. It should not be too hard.) The result is that management resorts to firefighting techniques, which leave the root cause unaffected. Thus, it won't be long until the problem crops up again.

The first thing to do with a problem is to notice that it is there. The second thing to do is to put it in the right category. A diagram with a confidence interval can help you do both.

Once you know the problem is there, you can track it down using some form of root cause analysis, for example Five Whys, or a Current Reality Tree (from the TOC Thinking Tools). My advice is to start with Five Whys, because it is a very simple method, and then switch to the TOC Thinking Tools if the problem proves hard to identify, you suspect multiple causes, or it is not immediately obvious how to deal with the problem.

Defect Diagram With Confidence Interval

The Throughput diagram does not give a complete picture of how productive a project is. It is quite common for a development team to produce new functionality very quickly, but with poor quality. This means a lot of the development effort can be spent on rework. I won't go into the economic details in this article, but fixing defects may account for a substantial part of the cost of a project. In addition, defects cause customers to be dissatisfied, which can cause even greater losses. A high defect rate is also an indicator of a high level of complexity in the code. This complexity reduces Throughput, and in most cases it is not necessary. (I haven't seen a commercial mid to large software development project yet that did not spend a lot of effort dealing with complexity that did not have to be there in the first place.)


Figure 3: Defect Graph

Figure 3 shows a defect graph. These are defects caught by a test team doing inspection type testing, or by customers doing acceptance testing or using the code. It is important to note that the graph shows when defects were created, not when they were detected. This matters because if you do not know when defects were created, you won't know whether the process improvements you make have any effect on the defect rate. If you measure when defects are detected, as most projects do, it may be years before you see any effect from an improved process.

In this case, the defect rate is within the control limits all the time, which means the defects that do occur are a systemic problem. The control limits are rather far apart, indicating a process with a lot of variation in results. Reducing the defect rate is clearly a management problem.

The Limitations of Calculating the Confidence Interval

The confidence interval method has several limitations. First of all, you need a process that varies within limits. If you measure something that has a continuous increasing or declining trend, confidence intervals won't be very useful.

Second, the method can detect only large shifts, on the order of 1.5 standard deviations or more. For example, in Figure 3, the number of defects seems to be declining, but all data points are within the confidence interval. It is impossible to say whether the number of defects is really going down, or if there are just two lucky months in a row. Thus, if some method of defect prevention was introduced at the beginning of month 2, it is not possible to tell whether those measures had any real effect. A more sophisticated statistical method, like an Exponentially Weighted Moving Average (EWMA), could probably have determined this.

Third, it is assumed that the data points are relatively independent of each other. This is likely to be the case in a well run project, but in a badly organized project, the Throughput may be influenced by a wave phenomenon caused by excessive Inventory. (I'll discuss that in the next article in this series.) When such conditions exist, the confidence interval loses its meaning. On the upside, excessive Inventory shows up very clearly in a Design-In-Progress graph, so management can still get on top of the problem.
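For the curious, the EWMA mentioned above is easy to compute. Here is a Ruby sketch; the smoothing factor 0.2 is a common textbook default, not a recommendation from this article:

```ruby
# Exponentially Weighted Moving Average over a series of defect
# counts. Recent observations are weighted more heavily, so the
# average reacts to small, sustained shifts.
def ewma(data, lambda_ = 0.2)
  data.each_with_object([]) do |x, out|
    prev = out.last || data.first.to_f  # seed with the first value
    out << lambda_ * x + (1 - lambda_) * prev
  end
end

defects_per_month = [14, 12, 9, 7]
smoothed = ewma(defects_per_month)
puts smoothed.map { |z| z.round(2) }.inspect
```

On an EWMA control chart, it is the smoothed values, not the raw counts, that are compared against (tighter) control limits.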

Calculating a confidence interval for a chart is still useful in many cases. It is a simple method, and any statistical package worth its salt supports it. (I used the Statarray Ruby package, and drew the graphs with Gruff. You can find them on RubyForge.)