06.12.07

Review: Some basic determinants of computer programming productivity

Posted in programmer productivity, review at 4:08 pm by ducky

Some basic determinants of computer programming productivity by Earl Chrysler (1978) measured the (self-reported) time it took professional programmers to complete COBOL code in the course of their work and looked for correlations.

He correlated the time it took with characteristics of the program and, not surprisingly, found a bunch of things that correlated with the number of hours that it took: the number of files, number of records, number of fields, number of output files, number of output records, number of output fields, mathematical operations, control breaks, and the number of output fields without corresponding input fields. This is not surprising, but it is rather handy to have had someone do this.

He then looked at various features of the programmers themselves to see what correlated. He found that experience, experience at that company, experience in programming business applications, experience with COBOL, years of education, and age all correlated, with age correlating the most strongly. (The older, the faster.) This was surprising. I’ve seen a number of other academic studies that seemed to show no effect of age or experience. The Vessey paper and the Schenk paper (which I will blog about someday, really!), for example, have some “experts” with very little experience and some “novices” with lots of experience.

The academic studies, however, tend to have a bunch of people getting timed doing the same small task. Maybe the people who are fastest in small tasks are the ones who don’t spend much time trying to understand the code — which might work for small tasks but be a less successful strategy in a long-term work environment.

Or the paper is just messed up. Or all the other papers are messed up.

Gotta love research.

Update: Turley and Bieman in Competencies of Exceptional and Non-Exceptional Software Engineers (1993) also say that experience is the only biographical predictor of performance. (”Exceptional engineers are more likely than non-exceptional engineers to maintain a ‘big picture’, have a bias for action, be driven by a sense of mission, exhibit and articulate strong convictions, play a pro-active role with management, and help other engineers.”)

Update update: Wolverton in The cost of developing large-scale software found that the experience of the programmer didn’t predict the routine unit cost. It looks like he didn’t control for how complex the code written was.

06.05.07

Review: Vessey’s Expertise in debugging computer programs: a process analysis

Posted in programmer productivity, review at 4:07 pm by ducky

I really wanted to like Expertise in Debugging Computer Programs: a Process Analysis by Iris Vessey (Systems, Man and Cybernetics, IEEE Transactions on, Vol. 16, No. 5. (1986), pp. 621-637.) Van Mayrhauser and Vans said in Program Comprehension During Software Maintenance and Evolution indicated that Vessey had shown that experts use a breadth-first approach more than novices.

I was a little puzzled, though, because Schenk et al, in Differences between novice and expert systems analysts: what do we know and what do we do? cited the exact same paper, but said that novices use a breadth-first approach more than experts! Clearly, they both couldn’t be right, but I was putting my money on Schenk just being wrong.

It also looked like the Vessey paper was going to be a good paper just in general, that it would tell me lots of good things about programmer productivity.

Well, it turns out that the Vessey paper is flawed. She studied sixteen programmers, and asked them debug a relatively simple program in the common language of the time — COBOL. (The paper was written in 1985.) It is not at all clear how they interacted with the code; I have a hunch that they were handed printouts of the code. It also isn’t completely clear what the task termination looked like, but the paper says at one point “when told they were not correct”, so it sounds like they said to Vessey “Okay, the problem is here” and Vessey either stopped the study or told them “no, keep trying”.

Vessey classified the programmers as expert and novice based on the number of times they did certain things while working on the task. (Side note: in the literature that I’ve been reading, “novice” and “expert” have nothing to do with how much experience the subjects have. They are euphemisms for “unskilled” and “skilled” or perhaps “bad” and “good”.)

She didn’t normalize by how long they spent to finish the task, she just looked at the count of how many times they did X. There were three different measures for X: how often did the subject switch what high-level debugging task they were doing (e.g. “formulate hypothesis” or “amend error”); start over; change where in the program they were looking.

She then noted that the experts finished much faster than the novices. Um. She didn’t seem to notice that the that the count of the number of times that you do X during the course of a task is going to correlate strongly with how much time you spend on the task. So basically, I think she found that people who finish faster, finish faster.

She also noted that the expert/novice classification was a perfect predictor of whether the subjects made any errors or not. Um, the time they took to finish was strongly correlated with whether they made any errors or not. If the made an error, they had to try again.

Vessey said that 15/16 experts could be classified by a combination of two factors: whether they used a breadth-first search (BFS) for a solution or a depth-first search (DFS) whether they used systems thinking or not However, you don’t need both of the tests; just the systems-thinking test accurately predicts 15/16. All eight of the experts always used BFS and systems-thinking, but half of the novices also used BFS, while only one of the novices used systems-thinking.

Unfortunately, Vessey didn’t do a particularly good job of explaining what she meant by “system thinking” or how she measured it.

Vessey also cited literature that indicated that the amount of knowledge in programmer’s long-term memory affected how well they could debug. In particular, she said that the chunking ability was important. (Chunking is a way to increase human memory capacity by re-encoding the data to match structures that are already in long-term memory, so that you merely have to store a “pointer” to the representation of the aggregate item in memory, instead of needing to remember a bunch of individual things. For example, if I ask you to remember the letters C, A, T, A, S, T, R, O, P, H, and E, you will probably just make a “pointer” to the word “catastrophe” in your mind. If, on the other hand, I ask you to remember the letters S, I, T, O, W, A, J, C, L, B, and M, that will probably be much more difficult fo you.)

Vessey says that higher chunking ability will manifest itself in smoother debugging, which she then says will be shown by the “count of X” measures as described above, but doesn’t justify that assertion. She frequently conflates “chunking ability” with the count of X, as if she had fully proven it. I don’t think she did, so her conclusions about chunking ability are off-base.

One thing that the paper notes is that in general, the novices tended to be more rigid and inflexible about their hypotheses. If they came up with a hypothesis, they stuck with it for too long. (There were also two novices who didn’t generate hypotheses, and basically just kept trying things somewhat at random.) This is consistent with what I’ve seen in other papers.

04.24.07

Review: Ko’s A framework and methodology for studying the causes of software errors in programming systems

Posted in programmer productivity, review at 2:12 pm by ducky

I’m reading A framework and methodology for studying the causes of software errors in programming systems by Ko and Myers that classifies the causes of software errors, and there is a diagram (Figure 4) that I’m really taken with. They talk about there being three different kinds of breakdowns: skill breakdowns (what I think of as “unconscious” errors, like typos), rule breakdowns (using the wrong approach/rule, perhaps because of a faulty assumption or understanding), and knowledge breakdowns (which usually reflect limitations of mental abilities like inability to hold all the problem space in human memory at one time).

Leaving out the breakdowns, the diagram roughly looks like this:

Specification activities
Action Interface Information
Create Docs Requirement specifications
Understand Diagrams Design specifications
Modify Co-workers  
Implementation activities
Action Interface Information
Explore Docs Existing architecture
Design Diagrams Code libraries
Reuse Online help Code
Understand Editor  
Create    
Modify    
Runtime (testing and debugging) activities
Action Interface Information
Explore Debugger Machine behavior
Understand Output devices Program behavior/td>
Reuse Online help Code

I’m a bit fuzzy on what exactly the reuse “action” is, and how “explore” and “understand” are different, but in general, it seems like a good way to describe programming difficulties. I can easily imagine using this taxonomy when looking at my user study.

02.16.07

Review: DeMarco and Lister’s Programmer performance and the effects of the workplace

Posted in programmer productivity, review at 5:12 pm by ducky

I got started on looking at productivity variations again and just couldn’t stop. I found Programmer performance and the effects of the workplace by DeMarco and Lister (yes, the authors of Peopleware). The paper is well worth a read.

In their study, they had 83 pairs of professional programmers work all on the same well-specified programming problem using their normal language tools in their normal office environment. Each pair was two people from the same company, but they were not supposed to work together.

  • They found a strong correlation between the two halves of a company pair, which may be in part fromcorrelation of productivity across a company
    • a pretty stunning correlation between the office environment and productivity
    • (or perhaps due to different companies having radically different tools/libraries/languages/training/procedures, which they didn’t discuss)
  • The average time-to-complete was about twice the fastest time-to-complete.
  • Cobol programmers as a group took much longer to finish than the other programmers. (Insert your favorite Cobol joke here.)

productivity differences

Over and over, I keep seeing that the median time to complete a single task is on the order of 2x to 4x times the fastest, not 100x. This study seems to imply that a great deal of that difference is due not to the individual’s inherent capabilities, but rather the environment where they work.

Review: Dickey on Sackman (via Bowden)

Posted in programmer productivity, review at 4:17 pm by ducky

To reiterate, there’s a paper by Sackman et al from 1966 that people have seized upon to show a huge variation in programmer productivity, a paper by Dickey in 1981 that refuted Sackman pretty convincingly, and an article by Curtis in the same issue as Dickey’s. I didn’t talk much about the Dickey paper, but Tony Bowden has a good blog posting on the Dickey paper, where Dickey reports on a more reasonable interpretation of numbers from the Sackman’s data.

(Basically, Sackman compared the time it took to complete a task using a batch system against the time it took using a timeshare system. This was interesting in 1966 when they were trying to quantify the benefit of timeshare systems, but it’s not good to look at those numbers and say, “Ah, see, there is a huge variation in programmers!”)

Because I like making pretty histograms, here are the Sackman numbers via Dickey via Bowden — the way they ought to be interpreted. These show the time to complete for two tasks, the “Algebra” task and the “Maze” task.

The small sample size hurts, but (as in the Curtis data and the Ko data) I don’t see an order of magnitude difference in completion speed.

02.15.07

Review: Robillard’s How Effective Developers Investigate Source Code: An Exploratory Study

Posted in programmer productivity, review at 6:18 pm by ducky

(See my previous programmer productivity article for some context.)

Martin Robillard did a study in conjunction with my advisor. In it, he had five programmers work on a relatively complex task for two hours. Two of the programmers finished in a little over an hour, one finished in 114 minutes, and two did not finish in two hours:

Robillard carefully looked at five subtasks that were part of doing the main task; there was a very sharp distinction between the three who finished and the two who did not. The two who didn’t finish only got one of the subtasks “right”. S for “Success” means everything worked. I for “Inelegant” means it worked but was kind of kludgy. B for “Buggy” means that there were cases where it didn’t work; U for “Unworkable” means that it usually didn’t work; NA for “Not Attempted” means they didn’t even try to do that subtask.

Coder ID Time to finish Check box Disabling Deletion Recovery State reset Years exper.
#3 72 min S! S! S! S! S! 5
#2 62 min S! S! S! S! B 3
#4 114 min I S! B S! B 5
#1 125 min (timed out) S! U U U NA 1
#5 120 min (timed out) S! U U NA NA 1

Because coder #1 and coder #5 timed out, I don’t know how much of a conclusion I can draw from this data about what the range of time taken is. From this small sample size, it does look like experience matters.
This study did have some interesting observations:

  • Everyone had to spend an hour looking at the code before they started making changes. Some spent this exploration period writing down what they were going to change, then followed that script during the coding phase. The ones who did were more successful than the ones who didn’t.
  • The more successful coders (#2 and #3) spread their changes around as appropriate. The others tried to make all of the changes in one place.
  • The more successful coders looked at more methods, and they were more directed about which ones they looked at. The second column in the table below is a ratio of the number of methods that they looked at via cross-references and keyword searching over the total number of methods that they looked at. The less-successful coders found their methods more frequently by scrolling, browsing, and returning to an already-open window.
    Coder ID Number of methods examined intent-driven:total ratio Time to finish
    #3 34 31.7% 72 min
    #2 27.5 23.3% 62 min
    #4 27.5 30.8% 114 min
    #1 8.5 2.0% 125 min (timed out)
    #5 17.5 10.7% 120 min (timed out)
  • From limited data, they conclude that skimming the source isn’t very useful — that if you don’t know what you are looking for, you won’t notice it when it passes your eyeballs.

12.15.06

Review: Ko’s An exploratory study of how developers seek, relate, and collect relevant information during software maintenance tasks

Posted in programmer productivity, review at 4:58 pm by ducky

I found another paper that has some data on variability of programmers’ productivity, and again it shows that the differences between programmers isn’t as big as the “common wisdom” made me think.

In An Exploratory Study of How Developers Seek, Relate, and Collect Relevant Information during Software Maintenance Tasks, Andrew Ko et al report on the times that ten experienced programmers spent on five (relatively simple tasks). They give the average time spent as well as the standard deviation, and — like the Curtis results I mentioned before. (Note, however, that I believe Ko et al include programmers who didn’t finish in the allotted time. This will make the average lower than it would be if they waited for people to finish, and make the standard deviation appear smaller than it really is.)

Task name Average time Standard deviation std dev / avg
Thickness 17 8 47%
Yellow 10 8 80%
Undo 6 5 83%
Line 22 12 55%
Scroll 64 55 76%

These numbers are the same order of magnitude as the Curtis numbers (which were 53% and 87% for the two tasks).

Ko et al don’t break out the time spent per-person, but they do break out actions per person. (They counted actions by going through screen-video-captures, ugh, tedious!) While this won’t be an exact reflection of time, it presumably is related: I would be really surprised if 12 actions took longer than 208 actions.

I’ve plotted the number of actions per task in a bar graph, with each color representing a different coder. The y-axis shows the time spent on each task, with negative numbers if they didn’t successfully complete the task. (Click on the image to see a wider version.)

Ko paper results

Looking at the action results closely, I was struck by the impossibility of ranking the programmers.

  • Coder J (white) was the only one who finished the SCROLL task (and so the only one who finished all five), so you’d think that J was really hot stuff. However, if you look at J’s performance on other tasks, it was solid but not overwhelmingly fast.
  • Coder A (orange) was really fast on YELLOW and UNDO, so maybe Coder A is hot stuff. However, Coder A spun his or her wheels badly on the LINE task and never finished.
  • Coder G (yellow) got badly badly stuck on LINE as well, but did a very credible job on most of the other tasks, and it looks like he or she knew enough to not waste too much time on SCROLL.
  • Coder B (light purple) spent a lot of actions on THICKNESS and YELLOW, and didn’t finish UNDO or SCROLL, but B got LINE, while A didn’t.

One thing that is very clear is that there is regression to the mean, i.e. the variability on all the tasks collectively is smaller than the variability on just one task. People seem to get stuck on some things and do well on others, and it’s kind of random which ones they do well/poorly on. If you look the coders who all finished THICKNESS, YELLOW, UNDO, and SCROLL, the spread is higher on the individual tests than the aggregate:

  • The standard deviation of the number of actions taken divided by the average number of actions is 36%, 69%, 60%, and 25%, respectively.
  • If you look at the number of actions each developer takes to do *all* tasks, however, then the standard deviation is only 21% of the average.
  • The most actions taken (”slowest”) divided by the least actions (”fastest”) is 37%, 20%, 26%, and 54% respectively.
  • The overall “fastest” coder did 59% as many actions as the “slowest” overall coder.

It also seemed like there was more of a difference between the median coder and the bottom than between the median and the top. This isn’t surprising: there are limits to how quickly you can solve a problem, but no limits to how slowly you can do something.

What I take from this is that when making hiring decisions, it is not as important to get the top 1% of programmers as it is to avoid getting the bottom 50%. When managing people, it is really important to create a culture where people will give to and get help from each other to minimize the time that people are stuck.