OmniMark Developer Resources: Literate Programming Using OmniMark

1.	Introduction
2.	Template Processing
3.	Weaving
4.	Tangling
5.	Handling Cross-References
6.	Main Loop
7.	Generating Output Filenames
8.	Epilogue

1. Introduction

Traditionally, programs are written to be read solely by the computer. In contrast, literate programming is a programming technique or methodology wherein the program is considered to be a work of literature, meant to be read by humans. Rather than being written in a top-down or bottom-up fashion, a literate program is written in a way that clarifies its presentation to the reader: material is presented as it is needed by the reader, rather than when it is needed by the program compiler. At its most basic, the idea is that the programmer will truly understand the program being written only once it can be explained to someone else.

Donald Knuth devised the literate programming methodology while working on his TeX and METAFONT typesetting systems. The methodology was embodied in his WEB programming system, based on the Pascal programming language. (Oddly enough, this meaning of the term WEB is largely forgotten, even though it pre-dates the introduction of the World Wide Web by at least five years.) Knuth used WEB to rewrite TeX and METAFONT, both of which are published in literate programming form.

Literate programming has the advantage of keeping the design of the program together with its implementation. Proponents of literate programming argue that the tools encourage design decisions and algorithm descriptions to be included in the program itself, rather than haphazardly provided in external documents. In fact, since the tools and methodology of literate programming encourage the programmer to examine and think more carefully about the code being written, it is argued that the quality of the resulting code improves. This in turn eases maintenance of the resulting programs: TeX is considered by some to be the system that comes closest to being bug-free.

A literate program consists of one input file and two output files. The input file consists of blocks of code and textual descriptions of how the code works. The first output file, called the web, is a version of the program formatted for human consumption. The second output file is the executable version. The job of the literate programming tools (referred to as weave and tangle) is to transform the input file into these two formats: weave produces the web, and tangle produces the executable version. Since document transformations are what OmniMark excels at, it seems reasonable to write a literate programming tool suite in OmniMark. In our case, weave and tangle will be combined into a single program.

For our purposes, a literate program is an input document conforming to the following simple SGML DTD:

The document type is program. The output attribute of the program element is used to specify the name of the tangled output file; the name of the weaved file is generated.

The section element allows a program to be subdivided into smaller components. We could add a mechanism for cross-referencing from one section to another, but this wouldn't add anything to the discussion that follows; consider it left as an exercise for the reader.

Most of the remaining elements in this DTD (i.e., title (title), paragraph (p), bold text (b), italicised text (i), and fixed-width font text (tt)) are standard: they allow for rudimentary structural or stylistic markup. Later, we will see that these elements are essentially passed through to the weaved output, to be handled by a later formatting phase. If a more sophisticated markup scheme is desired or required, its new elements would be treated in a similar manner. Again, these issues are outside the scope of this presentation.

The code element is used to mark up code blocks. The id attribute gives an identity to the code block: the tangling process takes the id attribute for a code block and defines an SGML entity with that name whose value is either the content of the code element, or a link to that content. The code block is then used by referencing the entity in your text; this is called a code block reference.

A code block reference is treated differently by the tangling and weaving processes. In the tangling process, a code block reference expands to the code block's content. In the weaving process, on the other hand, a code block reference expands to a link to the code block instead.
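
The difference between the two processes can be sketched in Python. This is an illustration only, not the OmniMark implementation described in this article: the dictionary of code blocks, the block contents, the regular expression used to recognise entity references, and the `[see block: ...]` link text are all assumptions made for the example. It also assumes that references may be nested and that every referenced id is defined.

```python
import re

# Hypothetical code blocks, keyed by id. The contents are placeholders.
blocks = {
    "main-process": "read input\n&handle-input;\nwrite output",
    "handle-input": "parse input",
}

# Assumed shape of a code block reference: an SGML-style entity reference.
REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def tangle(text, blocks):
    """Expand every code block reference to the block's content, recursively."""
    return REFERENCE.sub(lambda m: tangle(blocks[m.group(1)], blocks), text)


def weave(text):
    """Replace every code block reference with a link to the block."""
    return REFERENCE.sub(lambda m: "[see block: " + m.group(1) + "]", text)


tangled = tangle(blocks["main-process"], blocks)
woven = weave(blocks["main-process"])
```

Here `tangled` contains the text of `handle-input` in place of the reference, while `woven` keeps the surrounding text and substitutes only a pointer to the block.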

This defines an entity main-process which we can then use elsewhere in our literate program. If we use this entity inside of a paragraph,

with the code block reference being a link to the actual code block. The format of the code block reference is conventional, and dates back to Knuth's original literate programming tools: each code block is assigned a unique number, and each code block reference includes both that number and the code block's id. This same format is used to define the code block's value: for the code block defined above, the weaved output might look like

We can append to a code block by simply re-using the same id attribute value. So, if subsequent to our previous code block we have

Note that the = on the first line has been changed to a +=, indicating that this code block is appending its content to a previously-defined code block. The tangling process takes care of concatenating these code blocks together in its final output.
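
The appending behaviour can be modelled with a short Python sketch. It is not the article's OmniMark code: the list of `(id, content)` pairs, the block contents, and the choice to join the pieces with a newline are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical stream of code elements in document order: (id, content).
code_elements = [
    ("main-process", "open files"),
    ("helper", "define helper"),
    ("main-process", "close files"),
]

parts = defaultdict(list)  # id -> pieces of content, in document order
headers = []               # operator a weaver might show for each element

for block_id, content in code_elements:
    # The first element with a given id defines the block ("=");
    # any later element with the same id appends to it ("+=").
    headers.append((block_id, "+=" if block_id in parts else "="))
    parts[block_id].append(content)

# Tangling concatenates the pieces of each block in document order.
tangled = {block_id: "\n".join(pieces) for block_id, pieces in parts.items()}
```

Keeping the pieces in a list until the end means the order of the tangled content always follows the order of the exposition.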

Keeping in mind that a literate program is written so as to be read by a human rather than a computer, appending to a code block in this fashion is useful when it clarifies the exposition to describe different portions of the block in different parts of the literate program.

The name attribute of the code element is used to override the format of the code block reference in the weaved output: if the name attribute is present, the code block reference will be formatted using the code block's unique number and the value of the name attribute. This allows us to give code block references more readable names than allowed by SGML's name attribute datatype.
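
As a rough Python sketch of this rule: the function below chooses between the id and the name attribute when building the visible text of a reference. The function name and the `<number: text>` layout are placeholders invented for this example, not the format the tool actually emits.

```python
def reference_label(number, block_id, name=None):
    """Build the visible text of a code block reference.

    The "<number: text>" layout is a placeholder, not the tool's real format.
    """
    # Prefer the name attribute when it is present; otherwise fall back to the id.
    text = name if name is not None else block_id
    return "<{}: {}>".format(number, text)


plain = reference_label(1, "main-process")
named = reference_label(1, "main-process", name="The main process")
```

Because the name attribute is free text, it can contain spaces and punctuation that an id could not.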

The output attribute of the code element instructs the tangling process to redirect the contents of the code element to the file named by the attribute's value. This is useful when part of the literate program describes some file (e.g., a DTD) that is auxiliary to the main program. Without this, all code blocks would end up in the same tangled output file. The do-tangle attribute covers the similar case where a code block helps the exposition, but its contents are not needed in the tangled output file.

Neither the output attribute nor the do-tangle attribute has any effect on the weaved output.
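
The routing performed by these two attributes can be sketched in Python as follows. The dictionary representation of a code element, the file names, and the treatment of do-tangle as a boolean are assumptions for the example; the attribute's actual declared values are given by the DTD.

```python
from collections import defaultdict

# Stands in for the output attribute of the program element (hypothetical name).
DEFAULT_OUTPUT = "program.out"

# Hypothetical code elements: content plus the two optional attributes.
code_elements = [
    {"content": "main code", "output": None, "do_tangle": True},
    {"content": "dtd declarations", "output": "doc.dtd", "do_tangle": True},
    {"content": "illustrative only", "output": None, "do_tangle": False},
]


def route(code_elements, default_output):
    """Group tangled content by destination file, skipping untangled blocks."""
    files = defaultdict(list)
    for element in code_elements:
        if not element["do_tangle"]:
            continue  # still shown to the reader, but omitted from tangled output
        files[element["output"] or default_output].append(element["content"])
    return {name: "\n".join(pieces) for name, pieces in files.items()}


files = route(code_elements, DEFAULT_OUTPUT)
```

Only the first two elements reach a file: the first goes to the default tangled output, the second to its own auxiliary file, and the third is dropped from tangling entirely.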

Using SGML as the input format greatly simplifies our task. If a code block contains data that would otherwise be interpreted as markup, it can be enclosed in a CDATA marked section: this provides a simple escaping mechanism.

Note the default entity declaration at the end of the DTD: it's this default entity declaration that makes this literate programming tool so simple.
