CPSC 1010/1020 Bits of Programming Wisdom

Mark Smotherman
v. 1.5
December 4, 2021

This is a collection of links, ideas, and people that have influenced my programming and that I want to pass on to my students in CPSC 1010 and 1020. Of course, it is a personal review and not an exhaustive review of all the wisdom and experience that is available in the published literature and on the web.

C++ Specific Guides
- C++ core guidelines
- Google C++ style guide
People who have influenced my programming
Techniques
General issues

C++ Specific Guides

C++ Core Guidelines

Bjarne Stroustrup (the designer of C++) and Herb Sutter have collected "a set of guidelines for using C++ well". See the guidelines here. The intent of the guidelines is described as:

The guidelines are focused on relatively high-level issues, such as interfaces, resource management, memory management, and concurrency. Such rules affect application architecture and library design. Following the rules will lead to code that is statically type safe, has no resource leaks, and catches many more programming logic errors than is common in code today. And it will run fast – you can afford to do things right.

Google C++ Style Guide

Google provides a C++ Style Guide here. In discussing the goals for their style guide, the Google authors identify two basic reasons for being consistent in coding:

Optimize for the reader, not the writer
Our codebase (and most individual components submitted to it) is expected to continue for quite some time. As a result, more time will be spent reading most of our code than writing it. We explicitly choose to optimize for the experience of our average software engineer reading, maintaining, and debugging code in our codebase rather than ease when writing said code. "Leave a trace for the reader" is a particularly common sub-point of this principle: When something surprising or unusual is happening in a snippet of code (for example, transfer of pointer ownership), leaving textual hints for the reader at the point of use is valuable (std::unique_ptr demonstrates the ownership transfer unambiguously at the call site).

Be consistent with existing code
Using one style consistently through our codebase lets us focus on other (more important) issues. Consistency also allows for automation: tools that format your code or adjust your #includes only work properly when your code is consistent with the expectations of the tooling. In many cases, rules that are attributed to "Be Consistent" boil down to "Just pick one and stop worrying about it"; the potential value of allowing flexibility on these points is outweighed by the cost of having people argue over them.

Google currently recommends C++17:

Currently, code should target C++17, i.e., should not use C++2x features, with the exception of designated initializers. The C++ version targeted by this guide will advance (aggressively) over time.
Do not use non-standard extensions.
Consider portability to other environments before using features from C++14 and C++17 in your project.

People Who Have Influenced My Programming

Fred Brooks

Fred Brooks was an architect for the IBM Stretch and Harvest computers in the 1950s and the project manager for IBM's System/360 hardware and software efforts in the 1960s. He left IBM to found the Computer Science department at UNC. His efforts in computer architecture, 3D interactive computer graphics, HCI, software engineering, virtual worlds, and the design process have led to numerous awards, including the highest award in computer science, the ACM A.M. Turing Award. A short biography is here, and his wikipedia entry is here.

Brooks has wonderfully described the joys and woes of programming in his book, The Mythical Man Month:

The Joys of the Craft
Why is programming fun? What delights may its practitioner expect as his reward?
First is the sheer joy of making things. As the child delights in his mud pie, so the adult enjoys building things, especially things of his own design. I think this delight must be an image of God's delight in making things, a delight shown in the distinctness and newness of each leaf and each snowflake.
Second is the pleasure of making things that are useful to other people. Deep within, we want others to use our work and to find it helpful. In this respect the programming system is not essentially different from the child's first clay pencil holder "for Daddy's office."
Third is the fascination of fashioning complex puzzle-like objects of interlocking moving parts and watching them work in subtle cycles, playing out the consequences of principles built in from the beginning. The programmed computer has all the fascination of the pinball machine or the jukebox mechanism, carried to the ultimate.
Fourth is the joy of always learning, which springs from the nonrepeating nature of the task. In one way or another the problem is ever new, and its solver learns something: sometimes practical, sometimes theoretical, and sometimes both.
Finally, there is the delight of working in such a tractable medium. The programmer, like the poet, works only slightly removed from pure thought-stuff. He builds his castles in the air, from air, creating by exertion of the imagination. Few media of creation are so flexible, so easy to polish and rework, so readily capable of realizing grand conceptual structures. (As we shall see later, this very tractability has its own problems.)
Yet the program construct, unlike the poet's words, is real in the sense that it moves and works, producing visible outputs separate from the construct itself. It prints results, draws pictures, produces sounds, moves arms. The magic of myth and legend has come true in our time. One types the correct incantation on a keyboard, and a display screen comes to life, showing things that never were nor could be.
Programming then is fun because it gratifies creative longings built deep within us and delights sensibilities we have in common with all men.
The Woes of the Craft
Not all is delight, however, and knowing the inherent woes makes it easier to bear them when they appear.
First, one must perform perfectly. The computer resembles the magic of legend in this respect, too. If one character, one pause, of the incantation is not strictly in proper form, the magic doesn't work. Human beings are not accustomed to being perfect, and few areas of human activity demand it. Adjusting to the requirement for perfection is, I think, the most difficult part of learning to program.
Next, other people set one's objectives, provide one's resources, and furnish one's information. One rarely controls the circumstances of his work, or even its goal. In management terms, one's authority is not sufficient for his responsibility. It seems that in all fields, however, the jobs where things get done never have formal authority commensurate with responsibility. In practice, actual (as opposed to formal) authority is acquired from the very momentum of accomplishment.
The dependence upon others has a particular case that is especially painful for the system programmer. He depends upon other people's programs. These are often maldesigned, poorly implemented, incompletely delivered (no source code or test cases), and poorly documented. So he must spend hours studying and fixing things that in an ideal world would be complete, available, and usable.
The next woe is that designing grand concepts is fun; finding nitty little bugs is just work. With any creative activity come dreary hours of tedious, painstaking labor, and programming is no exception.
Next, one finds that debugging has a linear convergence, or worse, where one somehow expects a quadratic sort of approach to the end. So testing drags on and on, the last difficult bugs taking more time to find than the first.
The last woe, and sometimes the last straw, is that the product over which one has labored so long appears to be obsolete upon (or before) completion. Already colleagues and competitors are in hot pursuit of new and better ideas. Already the displacement of one's thought-child is not only conceived, but scheduled.
This always seems worse than it really is. The new and better product is generally not available when one completes his own; it is only talked about. It, too, will require months of development. The real tiger is never a match for the paper one, unless actual use is wanted. Then the virtues of reality have a satisfaction all their own.
Of course the technological base on which one builds is always advancing. As soon as one freezes a design, it becomes obsolete in terms of its concepts. But implementation of real products demands phasing and quantizing. The obsolescence of an implementation must be measured against other existing implementations, not against unrealized concepts. The challenge and the mission are to find real solutions to real problems on actual schedules with available resources.

The Mythical Man Month is a reflection on why the software effort for S/360 had so many more difficulties than the hardware effort. It presented Brooks's Law: "Adding manpower to a late software project makes it later." The full text of the first edition of the book (1975) can be found here (funded by Gordon Bell). For a 20th-anniversary edition of Mythical Man Month in 1995, Brooks added four chapters, including his 1986 essay "No Silver Bullet – Essence and Accidents of Software Engineering" in which he writes, "There is no single development, in either technology or in management technique, that by itself promises even one order-of-magnitude improvement in productivity, in reliability, in simplicity." and a reflection on his essay after a nine-year period, "'No Silver Bullet' Refired", in which he again stated that there was no "magical solution" to improving programming productivity. Recently, at ICSE 2018, he extended his prediction that no silver bullet will be found within the next ten years.

Paul Graham

Paul Graham is a pioneer in software as a service and one of the founders of Y Combinator, a startup incubator. A short biography is located here, and his wikipedia entry is here.

One of his early essays recommended "Programming Bottom-Up," and he gives four advantages (citation removed):

By making the language do more of the work, bottom-up design yields programs which are smaller and more agile. A shorter program doesn't have to be divided into so many components, and fewer components means programs which are easier to read or modify. Fewer components also means fewer connections between components, and thus less chance for errors there. As industrial designers strive to reduce the number of moving parts in a machine, experienced Lisp programmers use bottom-up design to reduce the size and complexity of their programs.
Bottom-up design promotes code re-use. When you write two or more programs, many of the utilities you wrote for the first program will also be useful in the succeeding ones. Once you've acquired a large substrate of utilities, writing a new program can take only a fraction of the effort it would require if you had to start with raw Lisp.
Bottom-up design makes programs easier to read. An instance of this type of abstraction asks the reader to understand a general-purpose operator; an instance of functional abstraction asks the reader to understand a special-purpose subroutine.
Because it causes you always to be on the lookout for patterns in your code, working bottom-up helps to clarify your ideas about the design of your program. If two distant components of a program are similar in form, you'll be led to notice the similarity and perhaps to redesign the program in a simpler way.

Steve Maguire

Steve Maguire is a Microsoft veteran who authored two books: Writing Solid Code, and Debugging the Development Process. A short biography is located here, and his wikipedia entry is here.

In the introduction to Writing Solid Code, Maguire states (emphasis in original):

The most critical requirement for writing bug-free code is to become attuned to what causes bugs. All of the techniques and guidelines presented in this book are the result of programmers asking themselves two questions over and over again, year after year, for every bug found in their code:

How could I have automatically detected this bug?
How could I have prevented this bug?

For example, one technique he recommends is that your program should scan and validate any data tables at startup. This catches any bug in the data as early as possible. Maguire provides a number of memory logging routines in his book, but I would now recommend that you use valgrind instead (see below).

Steve McConnell

Steve McConnell is a Microsoft and Boeing veteran who later started his own software company, Construx Software. He wrote Code Complete as a practical handbook on the construction of software along with five other books, and he served as editor-in-chief for IEEE Software. A short biography is located here, and his wikipedia entry is here.

Code Complete is over 800 pages in length and full of coding wisdom. I find it hard to pick one best idea, but let me highlight part of his discussion of software quality:

There's no such thing as a free lunch, and even if there were, there's no guarantee that it would be any good. Software development is a far cry from haute cuisine, however, and software quality is unusual in a significant way. The General Principle of Software Quality is that improving quality reduces development costs.
Understanding this principle depends on understanding a key observation: The best way to improve productivity and quality is to reduce the time spent reworking code, whether the rework arises from changes in requirements, changes in design, or debugging. The industry-average productivity for a software product is about 8 to 20 lines of code per person per day. It takes only a matter of minutes to type in 8 to 20 lines of code, so how is the rest of the day spent?
A software product takes much longer to develop than a simple program and involves a lot more than merely typing in the code. But that's not what usually fills the rest of the day.
The rest of the the day is usually spent debugging. Debugging takes about 50 percent of the time spent in a traditional, naive software-development cycle. Getting rid of debugging by preventing errors improves productivity. Therefore, the most obvious method of shortening a development schedule is to improve the quality of the product and decrease the amount of time spent debugging and reworking the software.

Links to his articles in IEEE Software and other periodicals can be found here. The front matter, chapter 5, and the index of Code Complete 2 can be found here.

David Parnas

David Parnas is an award-winning academician who has taught at six universities and has studied the development of software over the course of four decades. He has emphasized that academic studies and recommendations must have an impact on the work of professional programmers, even making this call for relevance a part of his acceptance speech in response to being recognized for two of the "most influential" papers at the leading software engineering conference, ICSE. A short biography is located here, and his wikipedia entry is here.

Parnas has had lasting impact on software design with his concept of information hiding, which is the idea that programmers should identify and encapsulate design decision "secrets" to reduce complexity and to restrict the impact of the most-likely changes to the design. Here are excerpts from his 2009 presentation on "Making Information Hiding Work" (emphasis in original):

Information Hiding is often considered an empirical result.
It is a theoretical result!
Theorem: If you can prove program A correct knowing only the interface to program B, then if you change B without changing its interface, A need not change (proof still valid).
Requires empirical verification
• That people can know what to hide and what they are hiding.
• That the interfaces can be efficient
• That we can describe interfaces without revealing secrets.
...
We still have bad software because Information Hiding is not done well.
• Secrets show (even in textbook examples).
• The abstraction is not documented.
...
What would make a module structure good?
• Parts can be designed independently.
• Parts can be implemented independently.
• There are no duplicate or almost alike pieces of code (clones, cut/pasties)
• Parts can be tested independently.
• Parts can be changed independently.
• Integration is easy if the modules met their specification
• Maintainers can find the relevant code and change it quickly

Joel Spolsky

Joel Spolsky is a veteran of Microsoft and founder of his own software company, Fog Creek Software. He is the author of the "Joel on Software" blog and started the programmer Q&A site Stack Overflow. He has authored or edited five books on software and user interface issues. A short biography is located here, and his wikipedia entry is here.

He recommends that students learn more than just Java and that they work to fully comprehend pointers and recursion:

But beyond the prima-facie importance of pointers and recursion, their real value is that building big systems requires the kind of mental flexibility you get from learning about them, and the mental aptitude you need to avoid being weeded out of the courses in which they are taught. Pointers and recursion require a certain ability to reason, to think in abstractions, and, most importantly, to view a problem at several levels of abstraction simultaneously. And thus, the ability to understand pointers and recursion is directly correlated with the ability to be a great programmer.
Nothing about an all-Java CS degree really weeds out the students who lack the mental agility to deal with these concepts. As an employer, I've seen that the 100% Java schools have started churning out quite a few CS graduates who are simply not smart enough to work as programmers on anything more sophisticated than Yet Another Java Accounting Application, although they did manage to squeak through the newly-dumbed-down coursework.

He also emphasizes the importance of fixing bugs early in his essay, The Joel Test: 12 Steps to Better Code:

5. Do you fix bugs before writing new code?
..."zero defects" meant that at any given time, the highest priority is to eliminate bugs before writing any new code. Here's why.
In general, the longer you wait before fixing a bug, the costlier (in time and money) it is to fix.
For example, when you make a typo or syntax error that the compiler catches, fixing it is basically trivial.
When you have a bug in your code that you see the first time you try to run it, you will be able to fix it in no time at all, because all the code is still fresh in your mind.
If you find a bug in some code that you wrote a few days ago, it will take you a while to hunt it down, but when you reread the code you wrote, you'll remember everything and you'll be able to fix the bug in a reasonable amount of time.
But if you find a bug in code that you wrote a few months ago, you'll probably have forgotten a lot of things about that code, and it's much harder to fix. By that time you may be fixing somebody else's code, and they may be in Aruba on vacation, in which case, fixing the bug is like science: you have to be slow, methodical, and meticulous, and you can't be sure how long it will take to discover the cure.
And if you find a bug in code that has already shipped, you're going to incur incredible expense getting it fixed.

Techniques

This section is less "wisdom" and more about my personal opinions and some approaches I have adopted across the years.

Conditions

It is helpful to classify different conditions being tested as dependent (i.e., mutually exclusive) or independent. In the former case the set of tests can be structured as a series of if-else statements, while in the latter case the set of tests should be structured as a series of if statements.

Although discouraged by style guidelines, I have found it helpful to use spaces to pad out the conditions for better visual effect. Placing simple statement blocks on the same lines as the conditions and vertically aligning all the blocks also add to the visual table effect, e.g.,


if      ( newNum  < oldNum ) { decrFlag  = 1; }
else if ( newNum == oldNum ) { equalFlag = 1; }
else                         { incrFlag  = 1; }

After testing, you can use a "code beautifier" tool like indent to reformat the code to meet the expected style guidelines.

If the dependent conditions being tested are simple values, then the nest of if-else statements can usually be replaced by a switch statement or perhaps by a lookup table for a more concise program segment.

An advanced technique is to encode various conditions into an index variable using bitwise or-ing into specific fields of the index value and then using the index value to access the appropriate entry in a lookup table. E.g., the code segment above becomes:


if      ( newNum  < oldNum ) { indexValue = indexValue | 4; }
else if ( newNum == oldNum ) { indexValue = indexValue | 2; }
else                         { indexValue = indexValue | 1; }

Loops

In C and C++ a for loop is an abbreviation of a while loop. Therefore, the loop:

for ( i = 0; i < count; i++ ) { // loop body }

expands into:

i = 0; while ( i < count ) { // loop body i++; }

I rarely use do-while loops. Note that the condition test in a do-while loop is executed after the body of the loop has been executed; so, unlike a while loop, a do-while loop has to execute at least once. A colleague sent me this humorous comparison.

Functions

In Code Complete, Steve McConnell lists sixteen reasons to use functions or procedures, including making code more readable, avoiding duplicate code, limiting the effects of changes, and simplifying complicated tests. He recommends naming functions to describe the return value and naming procedures with a strong verb followed by the object. He recommends that you establish a consistent naming convention for common operations and that you use no more than seven parameters for any function.

Authors and sources vary on suggested maximum routine sizes. I remember being advised to limit PL/I routines to at most one page of line printer output (66 lines). McConnell reviewed several studies on routine sizes and error rates; while he saw no consistent pattern across the studies, he recommends that you limit routines to 200 lines. The Google C++ Style Guide recommends that functions should be less than 40 lines, and the C++ Core Guidelines state that 1-5 line functions should be considered normal.

The Google C++ Style Guide and the C++ Core Guidelines have numerous other recommendations about function design and preferred parameter-passing choices in C++.

Linked Lists

There are a range of linked list implementation choices:

Singly-linked versus doubly-linked
Only a head pointer versus both a header pointer and a tail pointer
Using one or more dummy nodes versus none
Making a doubly-linked list circular versus not

Other courses will discuss these choices in more depth, but here are the general considerations:

Doubly-linked lists makes arbitrary deletions more efficient
A tail pointer makes FIFO inserts more efficient
A dummy header node removes the special case handling of a null head pointer to represent an empty list
- the data field in the dummy node can be repurposed using union to hold a current count of nodes in the list
A doubly-linked circular list requires only a head pointer for efficient FIFO inserts and removes

A simple structure found in many intro courses is a singly-linked list with a head pointer, no dummy nodes, and non-circular. This simple structure is perfect for a LIFO stack with push and pop operations at the head of linked list.

This simple structure is also sometimes used for a FIFO queue with an insert operation at the tail and a remove operation at the head. However, the insert operation requires a list traversal if there is only a head pointer. As mentioned above, adding a tail pointer makes the inserts at the tail more efficient (in particular, inserts become O(1) as opposed to O(N)).

A testing tip for linked lists is to have at least one test vector that populates a list, empties it, and then re-populates it. In grading student programs over the years, restoring the empty list condition after removing the last node of a list is a common source of bugs (e.g., the head pointer may be correctly set to null but the tail pointer is a dangling reference).

When feasible, consider using C++ vectors or one of the various container classes in the C++ STL in place of writing your own linked list routines.

Knuth's linear search

Knuth's linear search makes a time-space tradeoff by introducing an extra location in an unsorted array to be searched. The search key is first loaded into this special location at the end of the unsorted array, and the search loop is merely a while loop that ends when the key is found. If the key is found prior to the extra location, then the search was successful. If the search finds the key only in the last location, the search was unsuccessful. The two tests per iteration required by the loop in a normal linear search have thus been reduced to one. However, Knuth's version is still O(N).

Reporting errors

There are a number of ways to report errors encountered by a function. Printing an error message and exiting is likely the initial way most of us learned to program, but production code needs a robust approach that allows the calling routine to handle and recover from errors. Better approaches include:

Make the error code a special case of the return value. This is applicable only when the normal return value has a limited range of validity so that an out-of-range value can be used to indicate an error, e.g., the NULL return value for malloc().
Make the error code the return value and make what would normally be the return value into an output parameter.
Make a pointer to a caller-allocated error code variable the last parameter to a function. This parameter can be defaulted to nullptr in languages such as C++, and the error code can be stored by the called function only when the pointer is non-null.
Use multiple return values with the error code being one of the values. C++17 provides structured bindings, a feature that elegantly handles multiple return values. (See an example here.)
Throw an exception in languages that support exceptions. Note that the Google C++ Style Guide recommends that Google coders not use C++ exceptions.

Separate source files

Splitting your program into separate source files provides these advantages:

a useful component can be easily reused when it is already packaged in a separate source file and separate header file
components can be easily unit-tested with individual test drivers designed for the separate source files
the amount of recompilation is minimized when a change occurs in only one source file (note: use makefiles to automate this process!)
the source files provide a nice structure for dividing a project between several team members

It is easier to detect errors and debug when you unit test. When you combine previously tested and debugged components, the remaining errors will likely stem from misunderstandings about the interface specifications (i.e., the number, ordering, and data types of the actual parameters in the procedure and function calls).

Classes

A class declaration is an approximation of an abstract data type, which is a data type that specifies the values that the data type can hold and the operations that can be performed, without specifying how the data type is implemented.

By separating the method definitions from the class declaration, a C++ class can partially satisfy the ideal of an abstract data type. However, a class declaration does not completely hide the implementation since it discloses the private attributes in the declaration. Furthermore, limits on the values that the private attributes can hold are not specified in the class declaration.

Good class design means that there is a class invariant that must be maintained by all methods; thus, unrestricted access should not be allowed. Use a struct instead of a class if you plan to provide unrestricted getter and setter methods for each private variable. To illustrate this, consider a point data structure with x and y coordinates. If there is no invariant that must be maintained (e.g., the point must lie in Quadrant I), then the point should be defined as a struct rather than a class. See Erik Engheim, "To Hell With Setters and Getters," Medium, November 4, 2017.

Abstract Data Type C++ Class

specifies the values that the data type can hold not in general; invariants can help

specifies the operations that can be performed yes, through the list of public methods

without specifying how the data type is implemented partially, through the separate definition of the class methods; however,

private attributes and methods are visible in the class declaration
public getter and setter methods also reveal the implementation

Abstract Data Type	C++ Class
specifies the values that the data type can hold	not in general; invariants can help
specifies the operations that can be performed	yes, through the list of public methods
without specifying how the data type is implemented	partially, through the separate definition of the class methods; however, private attributes and methods are visible in the class declaration public getter and setter methods also reveal the implementation

ASCII-Art UML in class declarations

Kevin Plis encouraged me to introduce UML to CPSC 1020 students by including a simple version of a UML class diagram in the comments for class declarations. For example,

// UML
//
//   +---------------+
//   | Cup           |
//   +---------------+
//   | - max_volume_ |     private attributes
//   | - volume_     |       names use Google C++ style with _ suffix
//   +---------------+
//   | + Cup()       |     public methods
//   | + Capacity()  |       abbreviated versions without parameter
//   | + Contains()  |       and return value data types
//   | + Increase()  |
//   | + Decrease()  |
//   +---------------+
//
...
//
// invariants
//
//   max_volume_ >= 3
//
//   0 <= volume_ <= max_volume_

File input patterns

There are three common patterns in reading input sequentially:

Count of input values preceding the input – the count must be present and correct

Input file Code pattern

4 10 20 30 40 scanf( "%d", &count ); for ( i = 0; i < count; i++ ) { scanf( "%d", &value ); // use value }

Input file			Code pattern
`4 10 20 30 40`			`scanf( "%d", &count ); for ( i = 0; i < count; i++ ) { scanf( "%d", &value ); // use value }`

Sentinel value – you must identify an out-of-range value (e.g., zero, negative)

Input file Code pattern

10 20 30 40 -1 scanf( "%d", &value ); while ( value != -1 ) { // use value scanf( "%d", &value ); }

End of file (EOF) – this is the most flexible approach; no extra item is needed in the input file, and the end of the input is recognized by the end of the file; note that the EOF condition is signaled by control-D when reading from the keyboard

Input file Code pattern

10 20 30 40 while ( scan( "%d", &value ) != EOF ) { // use value }

Input file			Code pattern
`10 20 30 40 -1`			`scanf( "%d", &value ); while ( value != -1 ) { // use value scanf( "%d", &value ); }`

Input file			Code pattern
`10 20 30 40`			`while ( scan( "%d", &value ) != EOF ) { // use value }`

Filter pattern in Unix

On the Unix command line you can redirect the output of one process to the input of another process. The connection between the two processes is called a pipe and is represented by the vertical bar. You can often solve a problem by connecting multiple existing Unix commands via pipes. head, grep, sort, tail, uniq, and wc are Unix commands that are often helpful when combined with each other.

Furthermore, many programs are written for the Unix environment using the filter pattern so that they can easily participate in a pipeline. This means that a program is written for a single purpose and that it reads a simple text stream from stdin and writes a simple text stream to stdout. (Note that stderr does not flow through the pipeline without extra redirection.) The problem-solving power comes from the ability to easily combine several single-purpose programs on the command line.

See this lesson on Unix pipes and filters. See also Eric S. Raymond's discussion of the filter pattern here and my discussion here.

Tailoring program execution

There are multiple ways to set a parameter that governs the execution of a program (for example, the size of an array):

A defined constant using the pre-processor, e.g., #define SIZE 1000
A constant integer variable, e.g., const int SIZE = 1000;
A value given at the start of the input file.
A value from the command line. Often, there is a default value defined within the program that will be used unless it is overridden by a value on the command line.
A value given in a special file called the configuration file. This file is separate from the regular input file. As the program prepares to initialize its data structures, it opens the configuration file, reads the parameters, and then closes the file. The parameter values are then used to control the initialization of the program before the processing of the regular input file starts. Sometimes the configuration file will have the same name as the executable program but will have a file name extension of ".config".

The first two ways require recompilation if there is a change in parameter value. The latter three are ways to customize the execution of a program without recompiling.

DIY unit testing tool

If you aren't using a commercial or open source unit testing tool, you may think that testing will require building separate test programs. However, rather than hard code your test vectors into separate programs, you can build a DIY (Do It Yourself) unit testing tool structured as an interpreter. The interpreter implements a mini-language you design for exercising the module being tested. The tokens in the mini-language are commands to invoke the different methods of the module and the parameters for those methods. The test vectors will then be structured as concise "programs" in the mini-language and stored in separate text files. You can use a shell script that runs multiple test vectors using the interpreter for regression testing.

For example, consider testing a stack module. The mini-language could contain commands such as:

+ value to push a value on the stack
- to pop a value from the stack
t to copy the value from the top of the stack
e to assert that the stack is empty
v value to assert that the last value-producing command yielded this value

Then a simple test vector could be:

e + 3 t v 3 - e

And the shell script could be:

./test_interpreter < test_vector_1 ./test_interpreter < test_vector_2 ...

Debugging

Know the common bugs, for programmers in general and for you specifically. These common errors are the things to consider and rule out first. For example, wrong type declarations, uninitialized variables, off-by-one errors, memory leaks, null pointers, and wrong parameters passed (wrong type, wrong order, too few, too many).

Turn on all the compiler warnings (e.g., -Wall). Use your own assertion statements to validate input, parameters, assumptions, preconditions, postconditions, and even "impossible" conditions (i.e., sanity checks).

If you have a long list of compiler errors, fix the first few then look down the remainder for obvious errors that you can also fix. After you have fixed all the obvious errors and those at the top of the list, recompile. Some of the non-obvious error messages may have been generated because statement parsing in the compiler was thrown off by one of the early errors, and you may now see a much shorter list.

Use version control or consider saving a copy of your original code. You may find that the "fixes" you made while debugging are wrong and that you want to restart from the original code, or you want to use diff to compare the original and latest copies.

Think before you throw in extra print statements. Using assertions and/or a debugger may be a better use of your time. Alternatively, try to recreate the bug in a smaller or more constrained context. You may want to write a trivial program to verify your understanding of a language concept. You may want to write a special test case that will confirm or deny your guess about what went wrong.

If you do decide to add print statements, consider a "binary search" approach to limit the amount of information you need to review. For example, add the print statement to the middle of your program, and then see if the error occurs before or after this point.

If you are convinced that you need to add a large number of print statements, consider using macros and conditional compilation. Or perhaps consider defining a verbose mode for your program, which can be enabled by a command line flag. Use special formats for the extra debugging output so that you can easily visually scan for unusual patterns or other clues. The extra output lines can also be piped through grep to screen for critical messages.

Alternatively, you can write the extra information to a log file, or to a "flight recorder" internal data structure (i.e., a circular buffer). If you choose logging, consider logging the sizes of your data structures at key points to gain clues from imbalances, perhaps using a graphical approach for summarizing the extra information (e.g., a histogram). This is usually done by writing a script to post-process the log file.

Consider logging each entry into key functions or procedures, where the log record contains not just the routine name but also the values of the input parameters. You may want to additionally log each exit from these key routines where the log record contains the return value or values of the output parameters. Reviewing these values may give you clues about any bugs in the calling routines.

Consider using signature checking for dynamic data structures, that is, defining an extra field that will contain a structure-specific identifier (i.e., the signature, or what McConnell calls a dog tag and others call a canary or sentinel). The identifier can be a character string or an unusual hex number. When a pointer to a data structure is passed to a routine, have the routine first validate the data structure by checking for the correct signature. Just prior to freeing the data structure, be sure to check and then change the signature; this will help catch double-free and use-after-free errors. McConnell recommends including signatures at the beginning and the end of dynamically allocated areas to catch overwrite errors. A checksum field can also be used to catch corrupted data areas.

After you have experienced the woes of debugging several times, use the memory of these "war stories" to counteract the resentment you may feel towards taking the extra time to unit test.

Valgrind

Valgrind is a suite of tools for analyzing memory accesses, function/procedure calling, and threading in programs. Valgrind can help you detect memory leaks, use of uninitialized memory, and many other common errors. Find out more about valgrind here.

Caches, branch prediction, and big O

Big O analysis helps you choose an algorithm with a small growth rate, but it ignores constant factors. In particular, program performance is greatly impacted by memory caches and branch prediction in modern processors. However, big O analysis ignores this. Once you identify an algorithm or set of algorithms with a suitable growth rate, you should look for "cache friendly" and "branch prediction friendly" implementations of these algorithms.

For example, the running time for matrix multiplication can be greatly reduced when picking the best loop nesting order and/or adding cache blocking (also called tiling). For common algorithms there may be compiler optimizations and autotuners that generate efficient code. PHiPAC, Portable High Performance ANSI C, and ATLAS, Automatically Tuned Linear Algebra Software, are example autotuners for common dense linear algebra applications.

Recognize the different binding times

Binding is a general term for specifying the value of an attribute. Binding can be early or late as well as static or dynamic. (You can think of the differences between compilation and interpretation.) Dynamic binding typically provides more flexibility and a better debugging environment, but it typically requires a table to map an attribute to its current value.

Typically binding times are:

edit time, e.g., editing macros
macro processing time, e.g., macro expansion and conditional compilation
compile time
static link time
program load time
execution time
- function call time, e.g., dynamic linking
- operating system event time, e.g., the start and end of the operating system scheduling quantum
- hardware event time, e.g., I/O interrupt, page fault, cache miss

Encoding state machines and program generation

An advanced technique is to encode conditions or even state machines in binary and use a simple for loop to exhaustively enumerate the conditions. You partition the binary representation of the index value into single bit fields when you need true/false conditions or multiple-bit fields when you need parameter ranges (you can reject the unneeded enumerations when you have a non-power-of-2 parameter range). For example, consider generating a true/false condition using a first bit and a parameter with range from 0-2 using a subsequent field of two bits.

Index (Binary) Condition Parameter Reject?

0 (0 00) F 0

1 (0 01) F 1

2 (0 10) F 2

3 (0 11) F 3 yes

4 (1 00) T 0

5 (1 01) T 1

6 (1 10) T 2

7 (1 11) T 3 yes

Index	(Binary)	Condition	Parameter	Reject?
0	(0 00)	F	0
1	(0 01)	F	1
2	(0 10)	F	2
3	(0 11)	F	3	yes
4	(1 00)	T	0
5	(1 01)	T	1
6	(1 10)	T	2
7	(1 11)	T	3	yes

The encoding of a state machine would use bit fields of the index value to represent the output and transition tables. An MS thesis at CMU implemented this approach for comparing all possible 2-bit branch predictors, which would need 20 bits:

   each possible fsm is encoded as a 20-bit string

   inputs and predictions: 0 = untaken, 1 = taken
   four states: 0,1,2,3 (encoded in two bits)
   state 0 is the intial state

                output table:
    .---------- prediction for state 0 (0 or 1)
    |  .------- prediction for state 1 (0 or 1)
    |  |  .---- prediction for state 2 (0 or 1)
    |  |  |  .- prediction for state 3 (0 or 1)
    |  |  |  |
   p0 p1 p2 p3 n0[] n1[] n2[] n3[]
                |    |    |    |
                |    |    |    |  transition table:
                |    |    |    `- next states for state 3
                |    |    `------ next states for state 2
                |    `----------- next states for state 1
                `---------------- next states for state 0

  where ni[] is a 4-bit string of
    2-bit next state for state i and input 0
    2-bit next state for state i and input 1

Using this encoding technique, you can execute the various possibilities directly or you can structure the code with the for loop as a generator program that writes out complete C programs to different source files. Another alternative is to structure the generator to write a program to a temporary file and use system() calls in Linux to compile and run the generated program. The generated program should then write the index value (i.e., its "signature") to an output file along with its other output. Then only the output files need to be saved and can be postprocessed for various metrics with the best metrics tagged with the temporary program signatures.

Jacob Sorber Videos

Based on his experience with helping students in his courses, my colleague at Clemson, Jacob Sorber, has recorded a number of videos about programming in C. The playlist is here.

General Issues

Don't neglect design

There is a spectrum of approaches to software design, from the rigid "waterfall" approach, where the design is expected to be complete before coding starts, to the freewheeling "code and fix", where little attention is paid to the overall software design. There are problems at both edges of the spectrum, and many experienced programmers favor an incremental or spiral approach, where an initial design is refined during its implementation.

"Code and fix" – Years ago, mainframe programmers had to wait hours or even days to get results from the run of a compiler. Consequently, they were more careful about avoiding bugs when writing a program. Charles Mann describes the unintended negative effect of today's immediate feedback when compiling code in "Why Software Is So Bad":

Instead of meticulously planning code, programmers stayed up in caffeinated all-night hacking sessions, constantly bouncing results off the compiler. Again and again, the compiler would spit back error messages; the programmers would fix the mistakes one by one until the software compiled properly. "The attitude today is that you can write any sloppy piece of code and the compiler will run diagnostics," says SRI's Neumann. "If it doesn't spit out an error message, it must be done correctly, right?"
As programs grew in size and complexity, however, the limits of this "code and fix" approach became evident. On average, professional coders make 100 to 150 errors in every thousand lines of code they write, according to a multiyear study of 13,000 programs by Humphrey of Carnegie Mellon.
... The inadequate design in the final products, critics argue, reflects inadequate planning in the process of creating them. According to a study by the Standish Group, a consulting firm in West Yarmouth, MA, U.S. commercial software projects are so poorly planned and managed that in 2000 almost a quarter were canceled outright, creating no final product. The canceled projects cost firms $67 billion; overruns on other projects racked up another $21 billion. But because "code and fix" leads to such extensive, costly rounds of testing, even successful projects can be wildly inefficient. Incredibly, software projects often devote 80 percent of their budgets to repairing flaws they themselves produced-a figure that does not include the even more costly process of furnishing product support and developing patches for problems found after release.
"System testing goes on for almost half the process," Humphrey says. And even when "they finally get it to work, there's still no design." In consequence, the software can't be updated or improved with any assurance that the updates or improvements won't introduce major faults. "That's the way software is designed and built everywhere-it's that way in spaceships, for God's sake."

Unfortunately, most student projects are done in the code-and-fix style because they are used once and then thrown away, what my colleague Brian Malloy calls "Dixie cup" programming. Steve McConnell uses the analogy of building a doghouse (i.e., code and fix) versus building a skyscraper. I find the building analogy useful, e.g.,

Consider the ease of use and safety when designing. For example, if turning on a light is merely a matter of connecting two wires, why do we use light switches on the wall rather than having two wires hang down from each light fixture?
It is important to find design flaws early. Consider moving the location of a wall in a building. Such a change will get progressively more costly as construction continues. Don't waste time and money fixing problems that could have been avoided by a better initial design.

Agile – There are numerous incremental approaches to software development. Agile development is one of those approaches that is currently popular. Bertrand Meyer has analyzed Agile methods and written a book, Agile! The Good, the Hype and the Ugly. He rejects Agile's replacements of software requirements and specifications by user stories and tests; however, he praises Agile's short iteration cycles and its association of tests with functionality. An ACM webcast by Meyer presenting the key ideas from his book can be found here.

Design by contract – My colleague Murali Sitaraman highly recommends design by contract, which Bertrand Meyer introduced in the Eiffel programming language in the 1980s. There are now a number of third-party libraries and preprocessors for implementing design by contract in other programming languages, such as C++. Design by contract relies on the programmer explicitly listing (ideally prior to any coding) the preconditions, postconditions, and invariants for a software component. This causes the programmer to pay careful attention to the design of a software component. Design assumptions are placed directly in the code and are automatically used for assertion checking during software development. Direct language support reduces the clutter of individual assertion/if statements that would otherwise be needed, and the assertion checking can be turned off for better performance in trusted production code. For more information and examples, see here.

Object-oriented design – Brooks termed object-oriented (O-O) programming a "brass bullet" in his "'No Silver Bullet' Refired" reflection, since it promises to allow programmers to build with bigger pieces. Brooks considered why O-O hasn't been a silver bullet and shared an excerpt from correspondence with Parnas:

The answer is simple. It is because [O-O] has been tied to a variety of complex languages. Instead of teaching people that O-O is a type of design, and giving them design principles, people have taught that O-O is the use of a particular tool. We can write good or bad programs with any tool. Unless we teach people how to design, the languages matter very little. The result is that people do bad designs with these languages and get very little value from them. If the value is small, it won't catch on.

Further work in O-O design has led to the development of design patterns, with over 50 patterns identified in the wikipedia article.

Learn the vocabulary of programming

Tom Dalling, author of Programming for Beginners, writes in a blog post about the importance of vocabulary:

Doctors do not discuss diagnoses with each other in simple terms. Medical schools don't teach in simple terms, either. Medicine is complicated, so they communicate with a complicated set of terminology. Everyone understands the term "broken leg," but that is too vague for a medical diagnosis. Using precise terminology like "transverse compound fracture of the tibia" allows medical staff and students to communicate accurately.
Programming is not as complex as the human body, but it is still complex. This complexity is why programming has so much terminology. There are lots of concepts, and all of those concepts have names. Just like in medicine, we use terminology to communicate accurately, because simple terms are often too vague.
... there are no three tips that will fix this problem. You just have to invest time into learning the vocabulary.

In his "'No Silver Bullet' Refired" reflection, Brooks noted the problem of programmers needing a larger vocabulary to facilitate reuse of large building blocks:

Higher-level languages have larger vocabularies, more complex syntax, and richer semantics.
As a discipline, we have not pondered the implications of this fact for program reuse. To improve quality and productivity, we want to build programs by composing chunks of debugged function that are substantially higher than statements in programming languages. Therefore, whether we do this by object class libraries or procedure libraries, we must face the fact that we are radically raising the sizes of our programming vocabularies.
Vocabulary learning constitutes no small part of the intellectual barrier to reuse. So today people have class libraries with over 3000 members. Many objects require specification of 10 to 20 parameters and option variables.
Anyone programming with that library must learn the syntax (the external interfaces) and the semantics (the detailed functional behavior) of its members if they are to achieve all of the potential reuse.
This task is far from hopeless. ... We need research to appropriate for the software reuse problem the large body of knowledge as to how people acquire language.

... and choose your own names carefully

Good choices of names in a design can lessen the burden of vocabulary learning. Ralph Johnson and Brian Foote highlight this in their paper on "Designing Reusable Classes":

Although protocols are important for defining interfaces within programs, they are even more important as a way for programmers to communicate with [each] other. Shared protocols create a shared vocabulary that programmers can reuse to ease the learning of new classes. Just as mathematicians reuse the names of arithmetic operations for matrices, polynomials, and other algebraic objects, so Smalltalk programmers use the same names for operations on many kinds of classes. Thus, a programmer will know the meaning of many of the components of a new program the first time it is read.

The Google C++ Style Guide says:

The most important consistency rules are those that govern naming. The style of a name immediately informs us what sort of thing the named entity is: a type, a variable, a function, a constant, a macro, etc., without requiring us to search for the declaration of that entity. The pattern-matching engine in our brains relies a great deal on these naming rules.

Finally, be aware that software engineering is a broad body of knowledge

The Software Engineering Body of Knowledge (SWEBOK) is an attempt to define and describe fifteen knowledge areas that together comprise software engineering. The chapter that describes the software construction knowledge area can be found here. This chapter describes five fundamentals for building software:

minimizing complexity,
anticipating change,
constructing for verification,
designing for reuse, and
making use of standards.

Other chapters cover knowledge areas such as requirements, testing, and professional practice.

A Limited Collection

It's clear that this page is very limited in what I have presented. I am leaving out contributions to programming skills and software engineering from Bentley, Dijkstra, Hoare, Kernighan and Plauger, Knuth, Scott Meyers, Wirth, and so many others. I hope that what I have included is helpful to you, the reader, and encourages you in the craft of programming.

[Mark's homepage] [CPSC homepage] [Clemson Univ. homepage]

mark@clemson.edu