Contributing to the Linux Kernel - Diff and Patch

Joe Pranevich - First published by Linux Journal Magazine, Issue #74 (Jun 2000)

Everyone who knows about Linux also knows about the ways Linux is ``different'' from other, more commercial operating systems. Because the Linux kernel is open source, it is possible for each and every user to become a contributor. Certainly, nearly everyone reading this knows so; it's sort of like preaching to the choir. However, the fact is that most Linux users, even those skilled in the programming arts, have never contributed to the code of the Linux kernel, even though most of us have had days where we thought to ourselves, ``gee, I wish Linux could do this...'' Through this article and others, I hope to convince some of you to take a look at the Linux kernel in a new, more proactive light.

What are some valid reasons for not contributing to the kernel efforts? First, maybe you can't legally. Many programmers sign contracts that limit their ability to code outside of work, even on non-commercial projects. This is the main reason I chose a profession that has relatively little to do with programming, other than the occasional Perl script. Second, it is possible you don't know how. Many Linux users are relatively new programmers trained in traditional computer science. I know from my own CS education that many schools tend to teach ``modern'' programming skills--I was one of the few in my particular school who chose (or knew how) to program without an IDE (integrated development environment). Sad, but true. Third, many professional programmers now tend to work with revision control systems in the workplace, and may be hesitant to contribute to projects (such as the kernel development effort) which still use the ``bare metal'' approach. Last and most likely, many programmers with the skills to hack Linux don't have the time to do so. These are all valid reasons why perfectly qualified programmers with good ideas, a fresh outlook and a desire to contribute have chosen not to. Nothing I can say can help them get past some of those issues, but I hope I can make kernel programming more accessible to at least a small percentage of people.

This is the first article in a series, and I will attempt to dispel some of the mystery behind revision control. Many open-source projects, including the Linux kernel, still use the diff and patch method of content control for a variety of reasons. Most open-source projects still accept patches in this format, even if they distribute code via CVS or some other revision-control system. First, diff and patch provide a project maintainer with an immense amount of control. Patches can be submitted and distributed via e-mail and in plain text; maintainers can read and judge the patch before it ever gets near a tree. Second, there's never a worry about access control or the CVS server going down. Third, it's readily available, generally doesn't require any special tools that aren't distributed as part of every GNU system, and has been used for years. However, bare-bones revision control makes it difficult to track changes, maintain multiple branches or do any other ``advanced'' things provided by Perforce, CVS or other revision control systems.

diff and patch are a set of command-line programs designed to generate and integrate changes into a source tree. There are multiple ``diff'' formats supported by the GNU utilities. One major advantage of diff and patch over newer revision-control systems is that diff, especially the unified diff format, allows kernel maintainers to look at changes easily without blindly integrating them.

The diff Family

For the uninitiated, diff and patch are just two of the commands in a complete set of GNU utilities. While they are the most commonly used in practice, other tools are often employed in specific situations. For the purposes of this document, I won't concentrate on these utilities, but will treat them only briefly. For a more complete look, check out your local set of man and info pages.

diff is the first command in the set. It has a simple purpose: to create a file (often confusingly called a patch or a diff) which contains the differences between two text files or two groups of text files. These files are constructed to make it easy to merge the differences into the older file. diff can write in multiple formats, although the unified difference format is generally preferred. The patches this command generates are much easier to distribute than whole files, and they allow maintainers to see quickly and easily what changed and to make a judgment.

patch is diff's logical complement, although oddly, it didn't come along until well after diff was in relatively common use. patch takes a patch file generated by diff and applies it against a file or group of files. patch will notify the user if there is a conflict, although it is often smart enough to resolve simple conflicts. Additionally, patch can act in the reverse; with an updated file and the original patch, this command can revert a file back to its pristine form.

cmp is diff's counterpart for binary files. As applications for binary files in source control are limited, this command is often not used in that environment. Usually, projects that include binary files (for example, a logo) have some other mechanism for updating these components. (Keep in mind that the XPM image format common in Linux applications can actually be text-based and can be controlled using the above commands.)

diff3 is a variant on diff that allows for computing and merging the differences among three files. Personally, I tend to use diff for these purposes, but there are likely reasons why this command is useful in specific situations with which I have not yet dealt.

And finally, sdiff is an interactive version of patch that allows for smarter patching using your very own brain.

These tools have many uses other than content control. I do not want to slight them by implying they are not useful. But, like many tools, they shine only in certain circumstances. (Like that annoying fine screwdriver you get with the set for which you've never seen a screw small enough to use it on, tools are only as good as the situations they are applied against.)

diff Concepts

When talking about different patch formats, a number of concepts deserve some thought. The point is probably moot: nearly all open-source projects that use diff and friends have long since settled on a patch format. However, these are some of the qualities in a patch that makes one format more useful than another.

The first thing admired in a patch format is context. Context consists of the extra lines (often three) before and after the difference blocks in a patch. While context adds greatly to the size of the patch file, it allows patches to be not entirely dependent on the exact file on which the patch was based. This quality is very important in a revision-control system, where it is expected that your working files will be slightly different from the master copy. Patching programs can use these context lines to guess where the offending lines can go--and usually get it right.

Second, patches should be reversible. Reversing is useful when you want to go back to a previous version of the source, or when you mistakenly flipped the options to diff and generated an inherently reversed patch (don't laugh--we've all done it). Not all patch formats are reversible, however.

Patch files should be efficient: small and easily readable, but not so large as to be unwieldy for projects with large numbers of changes (such as the Linux kernel). There is a tradeoff here, of course. Patches without context are more efficient, but definitely not very useful in source maintenance.

Finally, readability is a very important aspect for this style of revision control. Patches should be obvious when it comes to figuring out what changed and should not require the user to flip his or her perspective from one block of text to another to figure out the differences. Making a format human-readable is much more difficult than creating a computer-readable one, especially when you are trying to balance all the other format components.

Each of the various diff formats ranks differently in each of these metrics, and different projects may choose different formats. The Linux kernel, for example, uses the unified difference format.

The diff Formats

The POSIX, or ``normal'', diff is the default format used by the diff utilities. It's a terse format without any lines of context, but is reversible. I've often seen it used as a sort of ``generic'' patch format, since non-GNU versions of the diff utilities can parse it.

The context diff format is another reversible format similar to the POSIX one, but which supports context around changes. Some projects prefer this format, especially if they include developers outside the GNU sphere. Using GNU diff, this format can be specified with the --context option.

The unified diff format is another contextual format that is generally more readable than the context variety. Unlike with context diff, this format displays all the changes in a single block, thus eliminating many of the redundant lines with context diffs. Because of the relative merits of this format for revision control and easy reading of patches, this is the preferred format for the Linux kernel and many GNU projects. On the down side, many non-GNU patch programs are unable to recognize this format. With GNU diff, this format can be specified using the --unified option.

The side-by-side format is great for human reading of patches, but is not readily usable for revision control. It displays the originating file and the changes side by side. This format is mostly just for human patch browsing, and the patch program doesn't actually support it. GNU diff users can enable this with the --side-by-side option.

The ed format is an old format that outputs a script for the ed text editor rather than a special patch format. This option was needed before the modern patch utility was created. Since it outputs a script, it contains no context information and no reversal information. GNU provides this option only for compatibility, and it may be invoked with the --ed switch.

The forward ed format is similar to the ed format, but is even more useless. patch can't process files in this format; neither can ed. If you insist, however, GNU diff still can generate it with the --forward-ed option.

The RCS format is the one used by the revision-control system RCS and its derivatives. It's generally not used standalone, and patch can't actually read the format.

The preprocessor format is not quite a patch format and not quite a script. Instead, it is an output file that contains the contents of both files separated by C preprocessor directives such as #ifdef, #endif, #elsif, etc. It is possible to compile either version of the file by setting preprocessor variables (#define) in the source file or by using the -D switch of the GNU compiler. Obviously, this format isn't of much use for revision control, but can occasionally be useful when you want to test changes.

Using diff

The actual usage of diff and patch is not complicated. While these commands, like many GNU tools, support many options allowing users to refine the way these tools work, these options are not actually required for everyday use. For more information on all the command-line options of these utilities, check out their info pages.

At its simplest, a diff command line for comparing two files would be:

diff old.txt new.txt > oldnew.patch
This will create a patch in the POSIX format that could later be applied to files similar to old.txt. Please note that the output of diff generally appears on standard output (STDOUT) and we have used redirection to get that information into the patch file. For GNU projects, we generally want the results in unified diff format, so we add the -u (or --unified) option to the command line:

diff -u old.txt new.txt > oldnew.patch
Generally, however, a comparison of two source trees is often desired. These trees would be multiple revisions of a single project or something similar. The command I generally prefer for this would be:

diff -ruN old new > oldnew.txt
In this example, I have added two new switches. The first, -r (or --recursive), indicates we want to take a recursive look at directories instead of files. The last switch, -N (or --new-file), indicates we do not want to ignore whether files have been added or removed from either set. In that case, if the new directory included a file called foo.txt but the old directory didn't, the patch would behave as if there was a zero-byte file called foo.txt in the old directory and add it into the patch.

To actually get good use out of the diff command as a form of revision control, a bit of legwork must be done first. I'll discuss this later on.

Using patch

Generally, once a diff is generated or downloaded, the process of patching the file is even simpler. Based on our first example above, we could do something like this:

patch < oldnew.patch
This command would read the patch file from the standard input and apply it to whatever files were in the current directory. Most patch formats include information on the name of the file being patched. In our first example, it would have specified that old.txt was the original file, and the patch command will look for a file by that name here. If that file could not be found, it would then prompt you for the name of the file to which the patch should be applied.

A number of things can go wrong during the patch process. Occasionally, a diff may be made backwards, or you may want to reverse a patch. By using the --reverse option to patch, you can make new.txt old.txt again. Additionally, the patch utility can detect whether the file being patched already contains the patch you are applying. In this case, patch will ask you whether you want it to reverse or attempt to apply the patch anyway. Finally, the patch could fail. If this happens, a file named old.txt.rej (or something similar) will be created, and patch will exit. At that point, it is up to you to look at the contents of the .rej file (which will be in a patch format) and manually apply the contents to the source. (I have occasionally gotten a reject file to apply by using a larger --fuzz value of patch, but this can lead to patch application errors and subtle bugs that you'll be scratching your head about later.) Once the problems are worked through, however, you will have effectively merged two sets of changes into one.

Obviously, these examples are a bit contrived. In real-world practice, patch files are generally not applied by the same people who made them. Instead, you will probably be either a provider of a specific patch (a source maintainer) or one who applies a specific patch (an end user).

Using patch-kernel

Although somewhat beyond the scope of this document, the Linux kernel actually includes a script which will aid you in keeping it up to date with the latest revisions. The patch-kernel script is located in the /usr/src/linux/scripts directory and will apply, in order, all the patches necessary to bring your kernel up to the latest revision, provided you have already downloaded them and put them in that directory. If you don't actually intend to participate in the Linux development effort, but just want to keep up with the latest and greatest source, this handy script will allow you to bypass the real workings of patch until you start developing. Once you start adding your own changes to your source tree, I recommend you use the manual method.

Maintenance of Source Trees

Getting back to the point of using diff and cousins as a sort of bare-bones revision-control system, I should mention there is obviously no one right or wrong way to maintain separate branches of a source tree. Low-level tools like these provide you with the framework to define the way you work, rather than forcing you into using one method or another. My suggestions here are only suggestions, and have been useful to me in the past when writing patches for open-source projects. However, I haven't exactly clocked the man-hours of Alan Cox working on the kernel or any other project, so there may be a better way. Please feel free to e-mail me at the address listed below with your thoughts.

When dealing with source trees, my general rule is always to maintain more than one. Unlike CVS or other revision-control systems, there is often no going back when you make a mistake. It is very easy to add instability to a stable patch, and no easy way to roll back your changes to a ``last good'' configuration. For example, let's imagine I have a source tree for a hypothetical project called ``Foogram''. If I were maintaining this with CVS, I would need only one tree and a CVS server (which obviously maintains multiple trees). Since we don't have the advantage of a separate server with the bare-metal approach, I would generally have at least two directories for the project: foogram and foogram-last. (These are my personal naming conventions.) The last-released version would be in foogram, it's known to be stable and it's what I have to generate all my patches against when I want to distribute them. The foogram-last directory would contain my latest changes. This two-tiered approach is often effective enough for general use.

However, many projects actually get too complicated for this approach. I have been known to create a third (or fourth or fifth) directory which contains the latest changes relative to foogram-last. For our purposes, I'll call this directory foogram-work. It often contains unstable and recent changes I've made to my source trees, which I want to keep separate from my more stable patches. It's important to keep a baseline directory to generate patches against, but this tends to lead to a rapidly expanding number of directories that other directories are relative against, and a mess for patching the manual way. In this example, I would try to maintain foogram-last as the baseline for the foogram-work directory by making sure I merge forward stable changes as I make them. If this were to become over-complicated, I would have to create a copy of foogram-last, called foogram-stable, which contained the copy of foogram-last that foogram-work was drawn against, so I could use created patches between the last and the stable directory to apply against the working directory in order to keep it up to date.

Confused yet? I sure am! Obviously, that's an extreme example, and many programmers will naturally simplify the process based on their own needs. Many projects do not require that level of complexity from individual developers. If you are reading this and still getting anything useful from it, you probably won't need to go to that complicated extreme to develop your own open-source modification patches, and instead will use the easy method.

The easy method is how it's possible to get away with keeping only one tree around, and this has worked for me on nearly every light project where I needed to modify only one or two source files as part of my patch. In that case, I recommend just making a copy of the source files you are editing, with a different extension (such as .old); and in the main tree, creating a diff that way using individual file patches. These small patches can be concatenated together when distributed to project maintainers. When doing it this way, you should be careful to run diff from the root directory of the source tree, so that the patch will be able to figure out later where the changed files were, especially if they were in different directories.

Conclusion

Much of the information here can be extracted from info pages and common sense. However, I hope that by documenting my own experiences with open-source projects and patches, I can encourage that small percentage of you who have the skill and desire to program the kernel, but have not chosen to do so. In the future, I hope to cover some of the other facets of kernel development that may be turning developers away, and I would be interested in hearing the reasons you might be nervous about lending your brain to some of these projects. Until then, happy hacking.