Literate Programming Using OmniMark
1. Introduction
Traditionally, programs are written to be read solely by the
computer. In contrast, literate programming is a programming technique or
methodology wherein the program is considered to be a work of
literature, meant to be read by humans. Rather than being
written in a top-down or bottom-up fashion, a literate program
is written in a way that clarifies its presentation to the
reader: material is presented as it is needed by the reader,
rather than when it is needed by the program compiler. At its
most basic, the idea is that the programmer will truly
understand the program being written only once it can be
explained to someone else.

Donald Knuth devised the literate programming methodology
while working on his TeX and METAFONT typesetting
systems. The methodology was embodied in his WEB
programming system, based on the Pascal programming language.
(Oddly enough, this meaning of the term WEB is largely
forgotten, even though it pre-dates the introduction of the World Wide Web by at least five years.) Knuth used WEB
to rewrite both TeX and METAFONT, both of which are
published in literate programming form.

Literate programming has the advantage of keeping the design
of the program together with its implementation. Proponents of
literate programming argue that the tools encourage design
decisions and algorithm descriptions to be included into the
program itself, rather than haphazardly provided in external
documents. In fact, since the tools and methodology of literate
programming encourage the programmer to examine and think more
carefully about the code being written, it is argued that the
quality of the resulting code is increased. This has the effect
of easing maintenance of the resulting programs: TeX is
considered by some to be the system that comes closest to being
bug-free.

A literate program consists of one input file and two output
files. The input file consists of blocks of code and textual
descriptions of the functioning of the code. The first output
file, called the web, is a version of the program
formatted for human consumption. The second output file is the
executable version. The job of the literate programming tools
(referred to as weave and tangle) is to transform the
input file into the desired format. Since document
transformations are what OmniMark excels at, it seems reasonable
to write a literate programming tool suite in OmniMark. In our
case, weave and tangle will be combined into a single
program.

For our purposes, a literate program is an input document
conforming to the following simple SGML DTD:
<!element program - - (title, section+)>
<!attlist program output cdata #required>
<!element section - - (title, (p | code)+)>
<!element title o o (#pcdata)>
<!element p - o (b | i | tt | #pcdata)+>
<!element (b | i | tt) - - (#pcdata)>
<!element code - - (#pcdata)>
<!attlist code id        name                 #implied
               name      cdata                #implied
               output    cdata                #implied
               do-tangle (tangle | no-tangle) tangle>
<!entity #default system "">
The document type is program. The output attribute
of the program element is used to specify the name of the
tangled output file; the name of the weaved file is generated.

The section element allows a program to be
subdivided into smaller components. We could add a mechanism for
cross-referencing from one section to another, but this wouldn't
add anything to the discussion that follows; consider it left as
an exercise for the reader.

Most of the remaining elements in this DTD (i.e., title
(title), paragraph (p), bold text (b),
italicised text (i), and fixed-width font text (tt))
are standard: they allow for rudimentary structural or stylistic
markup. Later, we will see that these elements are essentially
passed through to the weaved output, to be handled by a later
formatting phase. If a more sophisticated markup scheme is
desired or required, its new elements would be treated in a
similar manner. Again, these issues are outside the scope of
this presentation.

The code element is used to mark up code blocks. The id
attribute gives an identity to the code block: the
tangling process takes the id attribute for a code block
and defines an SGML entity with that name whose value is either
the content of the code element, or a link to that
content. The code block is then used by referencing the entity in
your text; this is called a code block reference.

A code block reference is treated differently by the tangling
and weaving processes. In the tangling process, a
code block reference expands to the code block's content. In the
weaving process, on the other hand, a code block reference
expands to a link to the code block instead.

Suppose we have the following code block:
<code id="main-process">
process
output "Hello, World!"
</code>
This defines an entity main-process which we can
then use elsewhere in our literate program. If we use this
entity inside of a paragraph,
<p> This is a reference to the &main-process; code block.
our weaved output might look like
This is a reference to the <1 main-process> code block.
with the code block reference being a link to the actual code
block. The format of the code block reference is conventional,
and dates back to Knuth's original literate programming tools:
each code block is assigned a unique number, and a code block
reference includes both that number and the code block's
id. This same format is used to define
the code block's value: for the code block defined above, the
weaved output might look like
<1 main-process> =
process
output "Hello, World!"
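The two treatments of a code block reference can be sketched in a few lines. What follows is a Python illustration, not the OmniMark implementation; the function name expand_references, the mode strings, and the two dictionaries are assumptions made for the example. References are written as &id;, as in the paragraph shown earlier.

```python
import re

# Matches an SGML general entity reference such as &main-process;
REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand_references(text, blocks, numbers, mode):
    """Expand code block references in text.

    blocks maps a code block id to its content; numbers maps an id to
    the unique number assigned to that block. Both are hypothetical
    structures for this sketch. mode is "tangle" or "weave".
    """
    def replace(match):
        block_id = match.group(1)
        if block_id not in blocks:
            return match.group(0)  # not a code block: leave untouched
        if mode == "tangle":
            # Tangling: the reference becomes the block's content.
            return blocks[block_id]
        # Weaving: the reference becomes a numbered link to the block.
        return "<%d %s>" % (numbers[block_id], block_id)

    return REFERENCE.sub(replace, text)


blocks = {"main-process": 'process\noutput "Hello, World!"\n'}
numbers = {"main-process": 1}
paragraph = "This is a reference to the &main-process; code block."

print(expand_references(paragraph, blocks, numbers, "weave"))
# This is a reference to the <1 main-process> code block.
```

The sketch performs a single pass, so a reference nested inside another code block's content would not be expanded; a fuller version would need to recurse.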
We can append to a code block by simply re-using the same id attribute value. So, if subsequent to our previous code
block we have
<code id="main-process">
process-end
output "Goodbye!"
</code>
in our literate program, the weaved output for this block
might look like
<1 main-process> +=
process-end
output "Goodbye!"
Note that the = on the first line has been changed to a
+=, indicating that this code block is appending its
content to a previously-defined code block. The tangling process
takes care of concatenating these code blocks together in its
final output.

Keeping in mind that a literate program is written so as to be
read by a human rather than a computer, appending to a code
block in this fashion is useful when it clarifies the exposition
to describe different portions of the block in different parts
of the literate program.

The name attribute of the code element is used to
override the format of the code block reference in the weaved
output: if the name attribute is present, the code block
reference will be formatted using the code block's unique number
and the value of the name attribute. This allows us to
give code block references more readable names than allowed by
SGML's name attribute datatype.

The output attribute of the code element instructs
the tangling process to redirect the contents of the code
element to the file named by the attribute's value. This is
useful when part of the literate program describes some
file (e.g., a DTD) which is auxiliary to the main program.
Without this, all code blocks would end up in the same tangled
output file. The do-tangle attribute is for the similar
case where a code block would help the exposition, but its
contents are not needed in the tangled output file. Neither the
output attribute nor the do-tangle attribute has any effect on
the weaved output.

Using SGML as the input format greatly simplifies our task. If
a code block contains data that would otherwise be interpreted
as markup, it can be enclosed in a CDATA marked section: this
provides a simple escaping mechanism.

Note the default entity declaration at the end of the
DTD: it's this default entity declaration that makes this
literate programming tool so simple.
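The tangling rules described in this section (appending by id, redirecting with output, and suppressing with do-tangle) can be summarised in a short Python sketch. This is an illustration under stated assumptions, not the OmniMark code: CodeBlock, concatenate_by_id, collect_outputs, and the file names are all hypothetical, and the sketch assumes that a block without an output attribute goes to the file named on the program element. It does not expand code block references or model where referenced blocks are placed.

```python
from collections import namedtuple

# Hypothetical record for one <code> element; the fields mirror the
# attributes declared in the DTD, with do_tangle defaulting to True.
CodeBlock = namedtuple("CodeBlock", "id content output do_tangle",
                       defaults=(None, True))


def concatenate_by_id(code_blocks):
    """Join the contents of blocks that share an id, in document order."""
    merged = {}
    for block in code_blocks:
        merged[block.id] = merged.get(block.id, "") + block.content
    return merged


def collect_outputs(code_blocks, default_output):
    """Group block contents by the file they would be tangled into.

    Blocks marked no-tangle are skipped; a block's output attribute,
    when present, overrides the program-level default file name.
    """
    files = {}
    for block in code_blocks:
        if not block.do_tangle:
            continue
        target = block.output or default_output
        files[target] = files.get(target, "") + block.content
    return files


program = [
    CodeBlock("main-process", 'process\noutput "Hello, World!"\n'),
    CodeBlock("main-process", 'process-end\noutput "Goodbye!"\n'),
    CodeBlock("dtd", "<!element title o o (#pcdata)>\n", output="program.dtd"),
    CodeBlock("aside", "; shown for exposition only\n", do_tangle=False),
]
merged = concatenate_by_id(program)
files = collect_outputs(program, "hello.xom")
print(sorted(files))  # ['hello.xom', 'program.dtd']
```

Here the two main-process blocks are joined into one body, the dtd block lands in its own file, and the aside block appears in neither file, although all four blocks would still appear in the weaved output.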