It is as easy to create a set of independent filters and stream data through them sequentially as it is to write a single filter with multiple rules. This allows you to choose the most natural algorithm to solve each content engineering challenge you encounter.
To enable streaming in this fashion, OmniMark provides sink and source types. Here is a function of type string source, which means that the function returns a source of string data. It also takes an argument of type string source, meaning that it expects to be passed a source of string data. The purpose of the function is to remove excess white space from string data:
define string source function compress-whitespace (value string source s) as
   repeat scan s
   match blank* "%n" blank*
      output "%n"
   match blank+
      output "%_"
   match [any \ white-space]+ => chars
      output chars
   again
This function can be called in any context that expects a data source, such as a submit action. It can accept any source as an argument, such as #main-input:
process
   submit compress-whitespace (#main-input)
This program will stream input from #main-input, through the function compress-whitespace (), to submit, where it can be processed by find rules. The find rules will receive a stream of data from which all excess whitespace has been removed by the function compress-whitespace (). Data flows through the program in a completely streaming fashion, with no buffering of data.
This means that you can now connect any number of streaming filters in a chain. Suppose that you want to process an unstructured document to create an XML representation and then create an HTML output. You could do this with a traditional OmniMark context-translate program; however, this would mean that you could only have one find rule pass and one markup rule pass at the data. But with string source functions, you can connect as many text filters or markup parsers together as you want. In this case, the most natural algorithm might be:
1. Remove excess white space (compress-whitespace ()):
define string source function compress-whitespace (value string source s) as
   repeat scan s
   ; ...
2. Process the output of compress-whitespace () to wrap XML tags around the elements of the input data in the simplest possible fashion (text2xml ()):
define string source function text2xml (value string source s) as
   submit s
   ; ...
3. Process the output of text2xml () to tidy up the XML, removing unneeded elements and adding structure and ID attributes (tidy-xml ()):
define string source function tidy-xml (value string source s) as
   do xml-parse scan s
   ...
4. Process the output of tidy-xml () to create HTML (xml2html ()):
define string source function xml2html (value string source s) as
   do xml-parse scan s
   ...
You would then invoke those functions as a chain of streaming filters with a simple output action:
process
   output xml2html (tidy-xml (text2xml (compress-whitespace (#main-input))))
The flow of data here is from right to left (as the program is written). Each function, starting with compress-whitespace () on the right, takes a string source as its input and returns a string source to the function on its left.
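For readers who know general-purpose languages better than OmniMark, the same right-to-left chaining can be sketched with Python generators. This is an analogy only, not OmniMark code. It assumes that blank means a space or a tab, and the second stage, wrap_lines, is a hypothetical stand-in for a later filter such as text2xml (); its <line> tags are invented for illustration.

```python
from typing import Iterable, Iterator


def compress_whitespace(pieces: Iterable[str]) -> Iterator[str]:
    """Collapse blank runs to one space and drop blanks around newlines."""
    pending_blank = False   # a blank run is waiting to be emitted as one space
    after_newline = False   # blanks directly after a newline are dropped
    for piece in pieces:
        for ch in piece:
            if ch in " \t":
                if not after_newline:
                    pending_blank = True
            elif ch == "\n":
                pending_blank = False   # blanks before a newline are dropped
                after_newline = True
                yield "\n"
            else:
                if pending_blank:
                    yield " "
                    pending_blank = False
                after_newline = False
                yield ch
    if pending_blank:
        yield " "   # a trailing blank run still becomes a single space


def wrap_lines(pieces: Iterable[str]) -> Iterator[str]:
    """Hypothetical second stage: wrap each non-empty line in <line> tags."""
    line = []
    for piece in pieces:
        for ch in piece:
            if ch == "\n":
                if line:
                    yield "<line>" + "".join(line) + "</line>\n"
                line = []
            else:
                line.append(ch)
    if line:
        yield "<line>" + "".join(line) + "</line>\n"


def run_pipeline(pieces: Iterable[str]) -> str:
    # Nested calls read right to left: the innermost filter runs first.
    return "".join(wrap_lines(compress_whitespace(pieces)))
```

Each generator pulls data from the one nested inside it only as needed, so nothing forces the whole input into memory. Unlike the OmniMark filters, wrap_lines holds one line at a time while it waits for the newline.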
Another way to structure this program would be to write the xml2html () function as a string sink rather than as a string source. This means that the function becomes a destination to which data is sent, and processes that data before sending it on to another sink. Here is the xml2html () function written as a sink function:
define string sink function xml2html (value string sink destination) as
   using output as destination
      do xml-parse scan #current-input
         output "%c"
      done
This function can be used anywhere a string sink (data destination for strings) is expected, such as a using output as statement, and can accept any string sink expression as an argument, such as #main-output:
process
   using output as xml2html (#main-output)
      output tidy-xml (text2xml (compress-whitespace (#main-input)))
Here again, data is streamed through the chain of streaming filters implemented by the string source functions to the current output scope, which is the string sink function xml2html (), which in turn streams it to #main-output. Once again, the data is never buffered. The output data streams from left to right (as the program is written) from the xml2html () function to the main output.
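The idea of a sink that filters data on its way to another sink can also be sketched in Python, again purely as an analogy. The class EscapeSink below is hypothetical: it stands in for a filtering sink such as xml2html (), but it only escapes markup characters rather than parsing XML. The standard contextlib.redirect_stdout plays a role loosely similar to using output as.

```python
import contextlib
import html
import io


class EscapeSink:
    """Hypothetical filtering sink: escapes text, then forwards it to dest."""

    def __init__(self, dest):
        self.dest = dest  # any object with a write(str) method

    def write(self, text: str) -> int:
        # Transform each piece as it arrives and pass it on immediately.
        self.dest.write(html.escape(text, quote=False))
        return len(text)

    def flush(self) -> None:
        if hasattr(self.dest, "flush"):
            self.dest.flush()


def demo_single_sink() -> str:
    main_output = io.StringIO()   # stands in for #main-output
    # Rough analog of: using output as xml2html (#main-output)
    with contextlib.redirect_stdout(EscapeSink(main_output)):
        print("a < b & c", end="")
    return main_output.getvalue()
```

Whatever is written inside the with block goes to the filtering sink first and reaches the final destination only after it has been transformed.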
Since the current output scope of an OmniMark program can include more than one sink, you can define multiple
string sink functions and stream data to them simultaneously. In the following example, the original source is
converted to XML, then that XML is streamed directly to a file, to an HTML output function, and to an XSL/FO
output function, creating three different output formats simultaneously:
process
   using output as xml2html (file #args[2]) & xml2fo (file #args[3]) & file #args[4]
      output tidy-xml (text2xml (compress-whitespace (file #args[1])))
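Sending one stream to several sinks at once can be pictured with a small Python sketch. It is an analogy, not a description of how OmniMark implements the & operator. Tee and MapSink are hypothetical helpers, and str.upper and str.lower are placeholder transformations standing in for xml2html () and xml2fo ().

```python
import io


class Tee:
    """Forwards every write to several destinations."""

    def __init__(self, *dests):
        self.dests = dests

    def write(self, text: str) -> int:
        for dest in self.dests:
            dest.write(text)
        return len(text)


class MapSink:
    """Hypothetical filtering sink: applies fn to each piece, then forwards it."""

    def __init__(self, dest, fn):
        self.dest = dest
        self.fn = fn

    def write(self, text: str) -> int:
        self.dest.write(self.fn(text))
        return len(text)


def demo_fan_out():
    html_out, fo_out, xml_out = io.StringIO(), io.StringIO(), io.StringIO()
    output = Tee(
        MapSink(html_out, str.upper),   # placeholder for the HTML sink
        MapSink(fo_out, str.lower),     # placeholder for the XSL/FO sink
        xml_out,                        # the unfiltered copy
    )
    # Each piece is delivered to all three destinations as it is produced.
    for piece in ["<Doc>", "Hello", "</Doc>"]:
        output.write(piece)
    return html_out.getvalue(), fo_out.getvalue(), xml_out.getvalue()
```

Each destination receives every piece as soon as it is written, so the three outputs are built in a single pass over the data.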