Representation is the essence of programming.
-- Fred Brooks
Implementing Extreme Perl is a simple matter of programming. Practicing XP
clarifies its values. Perl's virtues come alive as you read and write
it. The subject matter language crystallizes as the subject matter
oriented program evolves along with the programmers' mastery of the
problem domain.
This chapter coalesces the book's themes (XP, Perl, and SMOP) into a
single example: a DocBook XML to HTML translator. We walk through the
planning game, split the story into tasks, discuss coding style,
design simply, pair program, test first, and iterate. The subject
matter oriented program (SMOP) is a hash of XML to HTML tags
interpreted by a declarative algorithm. Along the way, declarative
programming is defined and related to other paradigms (object-oriented
and imperative programming).
The Problem
The example in this chapter converts DocBook XML[1] (the source form of
this book) to HTML. The example went beyond this chapter, and the
version here was enhanced to produce the review copies for the entire
book.[2]
DocBook is a technical document description language. I described the
paragraphs, sections, terms, etc. of this book with tags as follows:
<blockquote><para>
To learn about <firstterm>XML</firstterm>, visit
<systemitem role="url">http://www.w3c.org/XML</systemitem>.
</para></blockquote>
It's unreadable in source form, and my resident reviewer in chief pointed
this out after I asked her to review the first chapter in source form.
She was expecting something like this:
To learn about XML, visit
http://www.w3c.org/XML.
Since eating your own dog food is a great way to make sure it's
palatable, my resident reviewer in chief agreed to act as the on-site
customer[3]
for a DocBook to HTML translator.
I am happy to say she is satisfied with the output of the
program.[4]
Planning Game
The planning game was brief. There was only one story card (shown
completed):
The Story Card
Note the simplicity of the story. One sentence is usually sufficient
to describe the problem.
During the planning game, we decided almost immediately the output
would be HTML. We briefly discussed simply stripping the XML tags to
produce a plain text document. We dropped this idea quickly, because
the output would not be very easy to read; for example, footnotes
would appear in the middle of the text. We touched on alternative
formatting languages, such as Rich Text Format (RTF), but the
simplicity of HTML and its relatedness to XML made the decision for
us. HTML provides enough formatting to give a sense of the layout,
which is all Joanne needed to read and print the chapters.
Dividing the Story into Tasks
We already had a chapter as a sample input. This made it easy to
define the tasks. I needed the tasks to be bite-sized chunks, because
my programming partners were only available for brief periods.
The task split was simple. I scanned the chapter looking for
problematic tags. The first task specifies simple tags. The other
tasks specify one problematic tag each. Only DocBook
tags used in the chapter were included, and each tag can be found in
at least one test case.
Coding Style
The story card does not mention declarative programming. It also
doesn't specify what language, operating system, or hardware is to be
used. The customer simply wants readable and printable chapters. She
doesn't care how we do it.
Too often we begin coding without an explicit discussion about how
we're going to code, that is, what language and what style. For this project,
we chose Perl for the following reasons:
XML maps naturally to Perl's built-in data structures (lists and
hashes),
CPAN has several ready-to-use XML parsers,
It's easy to generate HTML in Perl, and
I needed an example for this book.
The last reason is important to list. One of the core values of XP is
communication. By listing my personal motivation, I'm being honest
with my team. Miscommunication often comes from hidden agendas.
Simple Design
The problem lends itself to simplicity. XML and HTML are declarative
languages, and an important property of declarative languages is ease
of manipulation. For example, consider the following DocBook snippet:
<para>
<classname>XML::Parser</classname> parses XML into a tree.
</para>
The relationships are clear, and the mapping to HTML is simply:
<p>
<tt>XML::Parser</tt> parses XML into a tree.
</p>
One could translate the tags with a simple tag for tag mapping, such as:
s{<(/?)para>}{<$1p>}g;
s{<(/?)classname>}{<$1tt>}g;
This design is too simple, however. It assumes the XML is well-formed, which
it often isn't when I write it. For example, if I were to leave off
</classname>, the closing
</tt> would be missing in the HTML,
and the output would be in
tt font for the rest of the document.
The classic response to this is: garbage in, garbage out.
However, we did better without added complexity[5],
and the solution evolved with minimal changes.
We favored hubris and impatience over doing the simplest thing that
could possibly work. A little chutzpah is not bad every now and then
as long as you have benchmarks to measure your progress. If the
implementation size grew too rapidly, we would have backed off to the
simpler solution. If we blew our task estimates, we'd have to ask
whether we had underestimated the complexity of the more radical approach.
The design we chose starts with the output of the CPAN package,
XML::Parser. If the XML is not well-formed,
XML::Parser dies. There is no output when
the input is garbage.
XML::Parser preserves the semantic structure of
the input. The translation is an in-order tree traversal,
so the output is likely to be well-formed HTML, which also is a tree.
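For example, a caller who wants to report a parse failure rather than simply abort can trap the die. The following is a minimal sketch; the file name and error message are illustrative and not part of the translator:
use XML::Parser ();
# parsefile() dies if the XML is not well-formed, so wrap it in eval
# when you want to handle the error yourself.
my($tree) = eval {
    XML::Parser->new(Style => 'Tree')->parsefile('DocBook/broken.xml');
};
die("invalid DocBook source: $@")
    unless $tree;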
Imperative versus Declarative
To help understand the benefits of declarative languages, let's
consider an alternate problem. Suppose I was writing this book in
troff[6],
an imperative text formatting language:
.PP
\fBXML::Parser\fR parses XML into a tree.
The commands in troff are not relational. The
.PP does not bracket the paragraph it introduces.
troff interprets the .PP as a
paragraph break, and what follows need not be a paragraph at all. The
command is imperative, because it means do something right now
irrespective of context.
The \fB and \fR commands do not
relate to the value they surround, either. \fB
turns on bold output, and \fR turns off all
emphasis statefully. Drop one or the other, and the
troff is still well-formed.
troff commands are unrelated to one another except
by the order in which they appear in the text.
Writing a troff parser is straightforward. The
complete grammar is
not much more complicated than the example above. Translating
troff to HTML is much more difficult.[7]
For example, consider the following troff:
\fBXML::Parser\fI is sweet!\fR
The equivalent HTML is:
<b>XML::Parser</b><i> is sweet!</i>
A simple command-to-tag translation is insufficient. The program
must maintain the state of the font in use, and output the corresponding
closing tag (</b>) when the font changes
before appending the font tag (<i>).
The same is true for font sizes, indentation
level, line spacing, and other stateful troff
commands. The program has to do two jobs: map commands to tags and
emulate the state management of troff.
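To make that bookkeeping concrete, here is a rough sketch of the font-state emulation such a converter would need. The tag table and variable names are invented for illustration and are not part of this chapter's program:
my($_FONT_TO_TAG) = {
    B => 'b',     # \fB starts bold
    I => 'i',     # \fI starts italic
    R => '',      # \fR returns to roman, i.e. no tag at all
};
my($_current_font) = '';    # global state the converter must drag along
sub _switch_font {
    my($font) = @_;
    # Close whatever tag is open before opening the next one.
    my($html) = $_current_font ? "</$_current_font>" : '';
    $_current_font = $_FONT_TO_TAG->{$font};
    return $html . ($_current_font ? "<$_current_font>" : '');
}
# s{\\f([BIR])}{_switch_font($1)}eg then rewrites the font escapes.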
As you'll see, the XML to HTML translator does not maintain global
input state to perform its job. The XML tags are translated based on
minimal local context only. The only relational information required
is the parent of the current tag. The mapping is stateless and therefore
simpler due to XML's declarativeness.
Pair Programming
Programming courses rarely mention declarative programming.
Imperative programming is the norm. It is all too easy to use
imperative forms out of habit or as a quick fix, especially when
working alone under pressure. You may need to refactor several times
to find appropriate declarative forms.
Dogmatic pursuit of declarative forms is not an end in itself,
however. Sometimes it's downright counter-productive. Since Perl allows us
to program in multiple paradigms, it is tricky to choose when to
program with objects, when to program imperatively, and when to
program declaratively.
For these reasons, it's helpful to program in pairs when
coding declarative constructs.
It takes time to learn how to code declaratively, just like it takes time
to test-first, code simply, and refactor. The learning process is
accelerated when you program in pairs.
All tasks and tests in this chapter were implemented in pairs. I would
like to thank Alex Viggio for partnering with me on the last three
tasks and Justin Schell for helping with the first. Thanks to Stas Bekman
for being my virtual partner in the final refactoring-only iteration.
Test First, By Intention
The first task involves simple tags only. This allowed us to
address the basic problem: mapping XML tags to HTML tags. Each XML tag in
the first test case maps to zero, one, or two HTML tags.
The first test case input is the trivial DocBook file:
<chapter>
<title>Simple Chapter</title>
<simplesect>
<para>Some Text</para>
</simplesect>
</chapter>
Here's the expected result:
<html><body>
<h1>Simple Chapter</h1>
<p>Some Text</p>
</body></html>
The test case input and expected result are stored in
two files named 01.xml and
01.html, respectively. In my experience, it pays
to put the test cases in a separate subdirectory
(DocBook), and to check the test
cases into the collective repository. If the program runs amok and
overwrites these files, you can always retrieve them from the
repository. Also, storing all the test data and programs in the
repository ensures programmer workstations are stateless. This allows you to
switch tasks and/or workplaces easily.
The unit test program is:
#!perl -w
use strict;
use Bivio::Test;
use Bivio::IO::File;
Bivio::Test->new('Bivio::XML::DocBook')->unit([
'Bivio::XML::DocBook' => [
to_html => [
['DocBook/01.xml'] => [Bivio::IO::File->read('DocBook/01.html')],
],
],
]);
The function to_html takes a file name as an
argument and returns a string reference, which simplifies testing.
There is no need to create a temporary output file nor delete it when
the test is over. A testable design is a natural outcome of
test-first programming.
Bivio::IO::File->read returns the contents of
the file name passed to it. The output is a scalar reference. If the
file does not exist or an I/O error occurs, the function dies.
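Bivio::IO::File is part of bOP. If you don't have it, an equivalent helper is only a few lines; this sketch is not the bOP implementation, just the contract the test relies on:
sub _read_file {
    my($file_name) = @_;
    open(my $fh, '<', $file_name)
        or die("$file_name: $!");
    local($/) = undef;    # slurp mode: read the whole file at once
    my($data) = <$fh>;
    close($fh);
    return \$data;        # a scalar reference, like Bivio::IO::File->read
}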
Statelessness
Bivio::IO::File->read is stateless, or
idempotent. It always does the same thing
given the same arguments. This isn't to say the file that it reads
is stateless. Files and I/O are stateful. Rather, the operation
retains no state itself. If the underlying file does not change,
exactly the same data is returned after each call.
Many of Perl's I/O functions are stateful. For example, Perl's
read returns different values when called with
the same arguments, because it maintains an internal buffer and file pointer.
Each call returns the current buffer contents and advances an internal
buffer index, possibly filling the buffer if it is empty. If the
underlying file changes between calls, the old buffered contents are
returned regardless. read buffers the file (uses
state) to
improve performance (decrease time), and we pay a price: the data read may not
match a valid file state (old and new states may be mixed).
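A small demonstration of that statefulness (the file name is just an example): two read calls with the same length argument return different data, because the file pointer advances between them.
open(my $fh, '<', 'DocBook/01.xml') or die("open: $!");
read($fh, my $first, 16);     # fills $first with bytes 0..15
read($fh, my $second, 16);    # same shape of call, different result: bytes 16..31
close($fh);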
Bivio::IO::File->read cannot be used in all
cases. Sometimes the file is too large to fit in memory, or the file
may be a device that needs to be read/written alternately in a
conversational mode. For our test program,
Bivio::IO::File->read meets our needs, and
the declarative operation simplifies the code and
ensures data integrity.[8]
In terms of XP, stateless programming supports unit testing.
It is easier to test stateless than stateful
operations. Internal state changes, such as caching results in buffers,
multiply the inputs and outputs implicitly. It's harder
to keep track of what you are testing. Stateful test cases must take
ordering into account, and tests of implicit state changes make unit
tests harder to read.
Stateless APIs can be tested independently. You only need to consider
the explicit inputs and outputs for each case, and the cases can be
written in any order. The tests are easier to read and maintain.
XML::Parser
Before we dive into the implementation, we need to understand the
output of XML::Parser. It parses the XML into a
tree that constrains our
implementation choices. Given 01.xml,
the following data structure is returned by parsefile:
[
chapter => [
{},
0 => "\n",
title => [
{},
0 => 'Simple Chapter',
],
0 => "\n",
simplesect => [
{},
0 => "\n",
para => [
{},
0 => 'Some Text',
],
0 => "\n",
],
0 => "\n",
],
];
The tree structure is realized by nesting arrays. The first element in the
array is a hash of attribute tags, which we can ignore for this first
implementation, because 01.xml doesn't use XML attributes.
The special tag 0 indicates raw
text, that is, the literal strings bracketed by the XML tags.
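To get a feel for the structure, here is how a caller might pull the tree apart; the variable names are only illustrative:
use XML::Parser ();
my($tree) = XML::Parser->new(Style => 'Tree')->parsefile('DocBook/01.xml');
my($root_tag, $root_children) = @$tree;    # 'chapter' and its contents
my($attributes) = $root_children->[0];     # always a hash; {} for 01.xml
# The rest of @$root_children alternates tag => subtree, with the special
# tag 0 marking raw text such as the "\n" strings above.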
First SMOP
The implementation of the first task begins with the map from DocBook
to HTML, which is the subject matter oriented program (SMOP):
my($_TO_HTML) = {
chapter => ['html', 'body'],
para => ['p'],
simplesect => [],
title => ['h1'],
};
The subject matter (tags and their relationships) is expressed
succinctly without much syntactic clutter. Perl's simple quoting comma
(=>) describes the program's intention to replace
the tag on the left with the list of tags on the right. As an
exercise, try to translate this SMOP to another programming language.
You'll find that Perl's expressiveness is hard to beat.
First Interpreter
The SMOP above is an example of descriptive
declarativeness, just like HTML and XML. The primary
advantage of descriptive languages is that they are easy to evaluate.
The first interpreter is therefore quite short[9]:
sub to_html {
my($self, $xml_file) = @_;
return _to_html(XML::Parser->new(Style => 'Tree')->parsefile($xml_file));
}
sub _to_html {
my($tree) = @_;
my($res) = '';
$res .= _to_html_node(splice(@$tree, 0, 2))
while @$tree;
return \$res;
}
sub _to_html_node {
my($tag, $tree) = @_;
return HTML::Entities::encode($tree)
unless $tag;
die($tag, ': unhandled tag')
unless $_TO_HTML->{$tag};
# We ignore the attributes for now.
shift(@$tree);
return _to_html_tags($_TO_HTML->{$tag}, '')
. ${_to_html($tree)}
. _to_html_tags([reverse(@{$_TO_HTML->{$tag}})], '/');
}
sub _to_html_tags {
my($names, $prefix) = @_;
return join('', map({"<$prefix$_>"} @$names));
}
The execution is driven by the XML, not our SMOP.
to_html starts the recursive (in order
tree traversal) algorithm by calling parsefile with the
appropriate arguments. The XML tree it returns is passed to
_to_html. The tags are translated by the SMOP as
they are encountered by _to_html_node, the
workhorse of the program. The tag names in the SMOP are converted to
HTML tags (surrounded by angle brackets, <>)
by _to_html_tags.
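Using the module is a one-liner, which the unit test exercises for us. Assuming the module and the test file are in place:
use Bivio::XML::DocBook ();
my($html_ref) = Bivio::XML::DocBook->to_html('DocBook/01.xml');
print($$html_ref);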
Functional Programming
The subroutine _to_html_tags is a
pure function, also known as a function without side
effects. A pure function is a declarative
operation, which is defined formally as follows
[10]:
A declarative operation is independent
(does not depend on any execution state outside itself),
stateless (has no internal execution state that is
remembered between calls), and deterministic
(always gives the same results when given the same arguments).
_to_html_tags only depends on its inputs. Its
output (the HTML tags) is the only state change to the program. And,
we expect it to do exactly the same thing every time we call it.
Functional programming is the branch of
declarative programming that uses pure functions exclusively. One of
Perl's strengths is its functional programming support.
For example,
map and join allowed us
to implement _to_html_tags functionally.
Other Perl operations, such as foreach, support
imperative programming only. Code that uses
foreach must rely on stateful side-effects for
its outputs. To illustrate this point, let's look at
_to_html_tags implemented with
foreach:
sub _to_html_tags_imperatively {
my($names, $prefix) = @_;
my($res) = '';
foreach my $name (@$names) {
$res .= "<$prefix$name>";
}
return $res;
}
The foreach does its job of iterating through
$names, and nothing more. It abdicates any
responsibility for achieving a result. The surrounding code must
introduce a variable ($res) to extract the output
from inside the loop. The variable adds complexity that is
unnecessary in the functional version.
Outside In
Like most programmers, I was trained to think imperatively. It's
hard to think declaratively after years of programming languages like
C and Java. For example, _to_html in our
first interpreter uses the
imperative while, because a functional version
didn't spring to mind. This was the simplest thing that could
possibly work. It was Stas who suggested the functional
refactoring in Final Implementation.
Functional programming requires a paradigm shift from traditional
imperative thinking.
_to_html_tags_imperatively concatenates its
result on the inside of the
foreach. The functional
_to_html_tags concatenates the result on the
outside of the map.
Functional programming is like turning an imperative program inside
out.[11]
Or, as some of my co-workers have noted, it's like programming
while standing on your head.
May I, Please?
The inside out analogy helps us refactor. We can use it to
simplify imperative programs. To program functionally from
the outset, a different analogy may help: think in terms of
requests, not demands. Paul Graham states this eloquently, "A
functional program tells you what it wants; an imperative program
tells you what to do."[12]
When we apply this analogy to the example, we see that
_to_html_tags_imperatively tells us it
formats tag names one at a time, and it appends them to the end of a string.
When it's done with that, it'll return the result.
The functional _to_html_tags has a list of tag
names and wants a string to return, so it asks
join, a function that concatenates a list into a
string. join asks for a separator and a list
of tags to concatenate. map wants to format
tag names, so it asks for a formatter
({"<$prefix$_>"}) and a list of tag names.
All we're missing is some polite phrases like please and may I, and we
can expand this analogy to familial relationships. Imperative kids
tell their parents, "I'm taking the car." Declarative
kids politely ask, "May I borrow the car, please?" By
communicating their desires instead of demands, declarative kids give
their parents more leeway to implement their requests.
Pure functions, and declarative programs in general, are more flexible
than their imperative cousins. Instead of demanding a calling order
that is implicitly glued together with state (variables), declarative
programs define relationships syntactically. This reduces the problem
of refactoring from an implicit global problem of maintaining state
transitions to an explicit local one of preserving syntactic
relationships. Functional programs are easier to refactor.
Second Task
The second task introduces asymmetric output. The test case input
file (02.xml) is:
<chapter>
<title>Chapter with Epigraph</title>
<epigraph>
<para>
Representation <emphasis>is</emphasis> the essence of programming.
</para>
<attribution>Fred Brooks</attribution>
</epigraph>
<simplesect>
<para>Some Text</para>
</simplesect>
</chapter>
The output file (02.html) we expect is:
<html><body>
<h1>Chapter with Epigraph</h1>
<p>
Representation <b>is</b> the essence of programming.
</p>
<div align=right>-- Fred Brooks</div>
<p>Some Text</p>
</body></html>
The XML attribution tag doesn't map to a simple
HTML div tag, so the existing SMOP language didn't
work. But first we had to update the unit test.
Unit Test Maintenance
To add the new case to the unit test, we copied the line
containing the first test case, and changed the filenames:
#!perl -w
use strict;
use Bivio::Test;
use Bivio::IO::File;
Bivio::Test->new('Bivio::XML::DocBook')->unit([
'Bivio::XML::DocBook' => [
to_html => [
['DocBook/01.xml'] => [Bivio::IO::File->read('DocBook/01.html')],
['DocBook/02.xml'] => [Bivio::IO::File->read('DocBook/02.html')],
],
],
]);
Woops! We fell into the dreaded copy-and-paste trap. The new line is
identical to the old except for two characters out of 65. That's too
much redundancy (97% fat and 3% meat). It's hard to tell the
difference between the two lines, and as we add more tests it will be
even harder. This makes it easy to forget to add a test, or
we might copy-and-paste a line and forget to change it.
We factored out the common code to reduce redundancy:
#!perl -w
use strict;
use Bivio::Test;
use Bivio::IO::File;
Bivio::Test->new('Bivio::XML::DocBook')->unit([
'Bivio::XML::DocBook' => [
to_html => [
map({
my($html) = $_;
$html =~ s/xml$/html/;
[$_] => [Bivio::IO::File->read($html)];
} sort(<DocBook/*.xml>))
],
],
]);
This version of the unit test is maintenance free. The test
converts all .xml files in the
DocBook subdirectory. All we need to do is
declare them, i.e., create the
.xml and .html files.
We can execute the cases in any order, so we chose to sort them to
ease test case identification.
Second SMOP
We extended the SMOP grammar to accommodate asymmetric output. The
new mappings are shown below:
my($_TO_HTML) = _to_html_compile({
attribution => {
prefix => '<div align=right>-- ',
suffix => '</div>',
},
chapter => ['html', 'body'],
emphasis => ['b'],
epigraph => [],
para => ['p'],
simplesect => [],
title => ['h1'],
});
attribution maps to a hash that defines the
prefix and suffix. For the other tags, the prefix and suffix are
computed from a simple name. We added
_to_html_compile which is called once at
initialization to convert the simple tag mappings (arrays) into the
more general prefix/suffix form (hashes) for efficiency.
Second SMOP Interpreter
We extended _to_html_node to handle asymmetric
prefixes and suffixes. The relevant bits of code are:
sub _to_html_compile {
my($config) = @_;
while (my($xml, $html) = each(%$config)) {
$config->{$xml} = {
prefix => _to_html_tags($html, ''),
suffix => _to_html_tags([reverse(@$html)], '/'),
} if ref($html) eq 'ARRAY';
}
return $config;
}
sub _to_html_node {
my($tag, $tree) = @_;
return HTML::Entities::encode($tree)
unless $tag;
die($tag, ': unhandled tag')
unless $_TO_HTML->{$tag};
# We ignore the attributes for now.
shift(@$tree);
return $_TO_HTML->{$tag}->{prefix}
. ${_to_html($tree)}
. $_TO_HTML->{$tag}->{suffix};
}
_to_html_compile makes
_to_html_node simpler and more efficient, because
it no longer calls _to_html_tags with the ordered
and reversed HTML tag name lists. Well, I thought it was more
efficient. After performance testing, the version in
Final Implementation turned out to be just as
fast.[13]
The unnecessary compilation step adds complexity without improving
performance. We added it at my insistence.
I remember saying to Alex, "We might as well add
the compilation step now, since we'll need it later anyway."
Yikes! Bad programmer! Write "I'm not going to need it"
one hundred times in your PDA. Even in pairs, it's hard to avoid
the evils of pre-optimization.
Spike Solutions
As long as I am playing true confessions, I might as well note that
I implemented a spike solution to this problem
before involving my programming partners. A spike solution in XP is a
prototype that you intend to throw away. I wrote a spike to see how
easy it was to translate DocBook to HTML. Some of my partners knew about it,
but none of them saw it.
The spike solution affected my judgement. It had a compilation step,
too. Programming alone led to the pre-optimization. I was too
confident that it was necessary when pairing with Alex.
Spike solutions are useful, despite my experience in this case. You
use them to shore up confidence in estimates and feasibility of a
story. You write a story card for the spike, which estimates the cost
to research possibilities. Spike solutions reduce risk through
exploratory programming.
Third Task
The third task introduces contextually related XML tags. The DocBook
title tag is interpreted differently depending on
its enclosing tag. The test case input file
(03.xml) is:
<chapter>
<title>Chapter with Section Title</title>
<simplesect>
<programlisting>
print(values(%{{1..8}}));
</programlisting>
<para>
Some other tags:
<literal>literal value</literal>,
<function>function_name</function>, and
<command>command-name</command>.
</para>
<blockquote><para>
A quoted paragraph.
</para></blockquote>
</simplesect>
<sect1>
<title>Statelessness Is Next to Godliness</title>
<para>
A new section.
</para>
</sect1>
</chapter>
The expected output file (03.html) is:
<html><body>
<h1>Chapter with Section Title</h1>
<blockquote><pre>
print(values(%{{1..8}}));
</pre></blockquote>
<p>
Some other tags:
<tt>literal value</tt>,
<tt>function_name</tt>, and
<tt>command-name</tt>.
</p>
<blockquote><p>
A quoted paragraph.
</p></blockquote>
<h2>Statelessness Is Next to Godliness</h2>
<p>
A new section.
</p>
</body></html>
The chapter title
translates to an HTML h1 tag. The
section title
translates to an h2 tag.
We extended our SMOP language to handle these two contextually different
renderings of title.
Third SMOP
We discussed a number of ways to declare the contextual
relationships in our SMOP. We could have added a
parent attribute to the hashes (on the right)
or nested title within a
hash pointed to by the chapter tag. The syntax
we settled on is similar to the one used by XSLT.[14]
The XML tag names can be prefixed with a parent tag name, for example,
"chapter/title". The SMOP became:
my($_XML_TO_HTML_PROGRAM) = _compile_program({
attribution => {
prefix => '<div align=right>-- ',
suffix => '</div>',
},
blockquote => ['blockquote'],
'chapter/title' => ['h1'],
chapter => ['html', 'body'],
command => ['tt'],
emphasis => ['b'],
epigraph => [],
function => ['tt'],
literal => ['tt'],
para => ['p'],
programlisting => ['blockquote', 'pre'],
sect1 => [],
'sect1/title' => ['h2'],
simplesect => [],
});
Third SMOP Interpreter
We refactored the code a bit to encapsulate the contextual lookup in
its own subroutine:
sub to_html {
my($self, $xml_file) = @_;
return _to_html(
'',
XML::Parser->new(Style => 'Tree')->parsefile($xml_file));
}
sub _eval_child {
my($tag, $children, $parent_tag) = @_;
return HTML::Entities::encode($children)
unless $tag;
# We ignore the attributes for now.
shift(@$children);
return _eval_op(
_lookup_op($tag, $parent_tag),
_to_html($tag, $children));
}
sub _eval_op {
my($op, $html) = @_;
return $op->{prefix} . $$html . $op->{suffix};
}
sub _lookup_op {
my($tag, $parent_tag) = @_;
return $_XML_TO_HTML_PROGRAM->{"$parent_tag/$tag"}
|| $_XML_TO_HTML_PROGRAM->{$tag}
|| die("$parent_tag/$tag: unhandled tag");
}
sub _to_html {
my($tag, $children) = @_;
my($res) = '';
$res .= _eval_child(splice(@$children, 0, 2), $tag)
while @$children;
return \$res;
}
# Renamed _compile_program and _compile_tags_to_html not shown for brevity.
The algorithmic change is centralized in
_lookup_op, which wants a tag and its parent to
find the correct relation in the SMOP. Precedence is given to
contextually related tags ("$parent_tag/$tag")
over simple XML tags ($tag). Note that the root
tag in to_html is the empty string
(''). We defined it to avoid complexity in the
lower layers. _lookup_op need not be specially
coded to handle the empty parent tag case.
The Metaphor
This task implementation includes several name changes. Alex didn't
feel the former names were descriptive enough, and they lacked coherency.
To help think up good names, Alex suggested that our program was
similar to a compiler, because it translates a high-level language
(DocBook) to a low-level language (HTML).
We refactored the names to reflect this new
metaphor. $_TO_HTML became
$_XML_TO_HTML_PROGRAM,
_to_html_compile became
_compile_program, and so on. An
$op is the implementation of an operator, and
_lookup_op parallels a compiler's symbol table
lookup. _eval_child evokes a compiler's
recursive descent algorithm.
The compiler metaphor helped guide our new name choices. In an XP
project, the metaphor substitutes for an architectural overview
document. Continuous design means that the architecture evolves with
each iteration, sometimes dramatically, but a project still needs to
be coherent. The metaphor brings consistency without straitjacketing
the implementation. In my opinion, you don't need a metaphor at the
start of a project. Too little is known about the code or the
problem. As the code base grows, the metaphor may present itself
naturally as it did here.
Fourth Task
The fourth and final task introduces state to generate the HTML for
DocBook footnotes. The test case input file
(04.xml) is:
<chapter>
<title>Chapter with Footnotes</title>
<simplesect>
<para>
Needs further clarification.
<footnote><para>
Should appear at the end of the chapter.
</para></footnote>
</para>
<itemizedlist>
<listitem><para>
First item
</para></listitem>
<listitem><para>
Second item
</para></listitem>
</itemizedlist>
<para>
Something about XML.
<footnote><para>
Click here <systemitem role="url">http://www.w3c.org/XML/</systemitem>
</para></footnote>
</para>
<para>
<classname>SomeClass</classname>
<varname>$some_var</varname>
<property>a_property</property>
<filename>my/file/name.PL</filename>
<citetitle>War & Peace</citetitle>
<quote>I do declare!</quote>
</para>
</simplesect>
</chapter>
The expected output file (04.html) is:
<html><body>
<h1>Chapter with Footnotes</h1>
<p>
Needs further clarification.
<a href="#1">[1]</a>
</p>
<ul>
<li><p>
First item
</p></li>
<li><p>
Second item
</p></li>
</ul>
<p>
Something about XML.
<a href="#2">[2]</a>
</p>
<p>
<tt>SomeClass</tt>
<tt>$some_var</tt>
<tt>a_property</tt>
<tt>my/file/name.PL</tt>
<i>War & Peace</i>
"I do declare!"
</p>
<h2>Footnotes</h2><ol>
<li><a name="1"></a><p>
Should appear at the end of the chapter.
</p></li>
<li><a name="2"></a><p>
Click here <a href="http://www.w3c.org/XML/">http://www.w3c.org/XML/</a>
</p></li>
</ol>
</body></html>
The footnotes are compiled at the end in a
Footnotes section. Each footnote is linked through
HTML anchor tags (#1 and #2).
Incremental indexes and relocatable output were the new challenges in
this implementation.
Fourth SMOP
We pulled another blade out of our Swiss Army chainsaw for this task. Perl's
anonymous subroutines were used to solve the footnote problem. The
subroutines bound to chapter and
footnote use variables to glue
the footnotes to their indices and the footnotes section to the end of
the chapter. Here are the additions to the SMOP:
chapter => sub {
my($html, $clipboard) = @_;
$$html .= "<h2>Footnotes</h2><ol>\n$clipboard->{footnotes}</ol>\n"
if $clipboard->{footnotes};
return "<html><body>$$html</body></html>";
},
citetitle => ['i'],
classname => ['tt'],
footnote => sub {
my($html, $clipboard) = @_;
$clipboard->{footnote_idx}++;
$clipboard->{footnotes}
.= qq(<li><a name="$clipboard->{footnote_idx}"></a>$$html</li>\n);
return qq(<a href="#$clipboard->{footnote_idx}">)
. "[$clipboard->{footnote_idx}]</a>";
},
itemizedlist => ['ul'],
listitem => ['li'],
property => ['tt'],
quote => {
prefix => '"',
suffix => '"',
},
systemitem => sub {
my($html) = @_;
return qq(<a href="$$html">$$html</a>);
},
varname => ['tt'],
We didn't see a simple functional solution. Although it's certainly
possible to avoid the introduction of $clipboard,
we let laziness win out over dogma. There was no point in smashing
our collective head against a brick wall when an obvious solution was
staring right at us. Besides, you've got enough functional
programming examples already, so you can stop standing on your head
and read this code right side up.
Fourth SMOP Interpreter
The interpreter changed minimally:
sub to_html {
my($self, $xml_file) = @_;
return _to_html(
'',
XML::Parser->new(Style => 'Tree')->parsefile($xml_file),
{});
}
sub _eval_op {
my($op, $html, $clipboard) = @_;
return $op->($html, $clipboard)
if ref($op) eq 'CODE';
return $op->{prefix} . $$html . $op->{suffix};
}
$clipboard is initialized as a reference to an empty hash by
to_html. If $op is
a CODE reference, _eval_op
invokes the subroutine with $clipboard and the html
generated by the children of the current tag. The anonymous
subroutines bound to the tags can then use all of Perl to fulfill
their mapping obligation.
Object-Oriented Programming
$clipboard is a reference to a simple hash. An alternative
solution would be to instantiate
Bivio::XML::DocBook, and to store
footnote_idx and footnotes
in its object fields.
Objects are very useful, but they would be overkill here. To
instantiate Bivio::XML::DocBook in Perl, it's
traditional to declare a factory method called
new to construct the object. This would clutter
the interface with another method. We also have the option in Perl to
bless a hash reference inline to instantiate the object. In either case, an
objectified hash reference is more complex than a simple hash, and does not add
value. The semantics are not attached to the hash but are embedded in
the anonymous subroutines. Objects as simple state containers are
unnecessarily complex.
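For comparison, the objectified version would need something like the following sketch, which is hypothetical and shown only to illustrate the extra machinery:
# A factory method and a blessed hash add ceremony without adding meaning.
sub new {
    my($proto) = @_;
    return bless({
        footnote_idx => 0,
        footnotes => '',
    }, ref($proto) || $proto);
}
# versus the plain hash the interpreter actually passes around:
my($clipboard) = {};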
Additionally, object field values are less private than those stored
in $clipboard. An object has fields to enable
communication between external calls, for example, a file handle has
an internal buffer and an index so that successive
read calls know what to return. However, it's
common to abuse object fields for intra-call communication, just like
global variables are abused in structured languages (C, FORTRAN,
Pascal, etc.). In most pure object-oriented languages, there's no
practical alternative to object fields to pass multiple temporary
values to private methods. Choice is one of Perl's strengths, and a
simple hash localizes the temporary variable references to the
subroutines that need them.
Hashes and lists are the building blocks of functional programming.
Perl and most functional languages include them as primitive
data types. It's the simple syntax of a Perl hash that makes the SMOPs in
this chapter easy to read. In many languages,
constructing and using a hash is cumbersome, and SMOP languages like
this one are unnatural and hard to read, defeating their purpose.
In object-oriented programming, state and function are inextricably
bound together. Encapsulating state and function in objects
is useful. However, if all you've got is a hammer, every problem
looks like a nail. In functional programming, state and function are
distinct entities. Functional languages decouple function reuse from
state sharing, giving programmers two independent tools instead of one.
Success!
The first iteration is complete. We added all the business value the
customer has asked for. The customer can translate a complete
chapter. Time for a victory dance! Yeeha!
Now sit down and stop hooting. We're not through yet. The customer
gave us some time to clean up our code for this book. It's time for a
little refactoring. We missed a couple of things, and the code could
be more functional.
Virtual Pair Programming
The second iteration evolved from some review comments by Stas.
I wrangled him into partnering with me after he suggested the code
could be more functional. The one hitch was that Stas lives in
Australia, and I live in the U.S.
Pair programming with someone you've never met and who lives on the
other side of the world is challenging. Stas was patient with me, and
he had paired remotely before.[15]
His contribution was worth the hassle, and I learned a lot from the
experience.
The fact that he lived in Australia was an added bonus.
After all, he was already standing on his head from my perspective,
and he was always a day ahead of me.
Open Source Development with XP
Correspondence coding is quite common. Many open source projects,
such as GNU, Apache, and Linux, are developed by people who live apart
and sometimes have never met, as was the case with Stas and me.
Open source development is on the rise as a result of our increased
communications capabilities. The Internet and the global
telecommunication network enable us to practice
XP remotely almost as easily as we can locally.
Huge collective repositories, such as http://www.sourceforge.net and http://www.cpan.org, enable geographically
challenged teams to share code as easily as groups of developers
working in the same building. Sometimes it's easier to share on the
Internet than within some corporate development environments I've
worked in! Open source encourages developers to program egolessly.
You have to expect feedback when you share your code. More
importantly, open source projects are initiated, used, and improved
because a problem needs to be solved, often quickly.
Resources are usually limited, so a simple story is all that is
required to begin development.
Open source and XP are a natural fit. As I've noted before, Perl--one
of the oldest open source projects--shares many of XP's values.
CPAN is Perl's collective repository. Testing is a core practice in
the Perl community. Simplicity is what makes Perl so accessible to
beginners. Feedback is what makes Perl so robust. And so on.
Open source customers may not speak in one voice, so you need to
listen to them carefully, and unify their requests. But pair
programming is possible with practice.
Geographically challenged programmers can communicate as effectively
as two programmers sitting at the same computer. It's our attitude
that affects the quality of the communication. Stas and I wanted to
work together, and we communicated well despite our physical
separation and lack of common experience. Open source works for the
same reason: programmers want it to.
To learn more about open source development, read
the book Open Sources: Voices from the Open
Source Revolution, edited by Chris DiBona et al.,
available in paperback and also online at
http://www.oreilly.com/catalog/opensources.
Deviance Testing
We forgot to test for deviance in the prior iteration.
XML::Parser
handles missing or incomplete tags, so we don't need to test for them
here. The unit test should avoid testing other APIs to keep
the overall test suite size as small as possible.
However, XML::Parser
treats all tags equally, and
Bivio::XML::DocBook
should die if a tag is not in the SMOP. We added the following test
(05-dev.xml) to validate this case:
<chapter>
<unknowntag></unknowntag>
</chapter>
The case tests that _lookup_op throws an
exception when it encounters unknowntag.
The unit test had to change to expect a die
for deviance cases. We also made the code more functional:
#!perl -w
use strict;
use Bivio::Test;
use Bivio::IO::File;
Bivio::Test->new('Bivio::XML::DocBook')->unit([
'Bivio::XML::DocBook' => [
to_html => [
map({
["$_.xml"] => $_ =~ /dev/
? Bivio::DieCode->DIE
: [Bivio::IO::File->read("$_.html")];
} sort(map(/(.*)\.xml$/, <DocBook/*.xml>))),
],
],
]);
The map inside the sort
returns the case base names (DocBook/01,
DocBook/05-dev, etc.), and the outer
map reconstructs the filenames from them.
This purely functional solution is shorter than the previous
version.
If the case file name matches the /dev/ regular
expression, the map
declares the deviance case by returning a
Bivio::DieCode as the expected value.
Otherwise, the input file is conformant, and the
map returns the expected HTML wrapped in an
array.
Bivio::Test lets us declare deviance and
conformance cases similarly. When picking or building your test
infrastructure, make sure deviance case handling is built in. If it's
hard to test APIs that die, you'll probably write
fewer tests for the many error branches in your code.
Final Implementation
The final SMOP and interpreter are shown together with comments and
POD, with the changes highlighted:
package Bivio::XML::DocBook;
use strict;
our($VERSION) = sprintf('%d.%02d', q$Revision: 1.10 $ =~ /\d+/g);
=head1 NAME
Bivio::XML::DocBook - converts DocBook XML files to HTML
=head1 SYNOPSIS
use Bivio::XML::DocBook;
my($html_ref) = Bivio::XML::DocBook->to_html($xml_file);
=head1 DESCRIPTION
C<Bivio::XML::DocBook> converts DocBook XML files to HTML. The mapping is only
partially implemented. It's good enough to convert a simple chapter.
=cut
#=IMPORTS
use Bivio::IO::File ();
use HTML::Entities ();
use XML::Parser ();
#=VARIABLES
my($_XML_TO_HTML_PROGRAM) = {
attribution => {
prefix => '<div align="right">-- ',
suffix => '</div>',
},
blockquote => ['blockquote'],
'chapter/title' => ['h1'],
chapter => sub {
my($html, $clipboard) = @_;
$$html .= "<h2>Footnotes</h2><ol>\n$clipboard->{footnotes}</ol>\n"
if $clipboard->{footnotes};
return "<html><body>$$html</body></html>";
},
citetitle => ['i'],
classname => ['tt'],
command => ['tt'],
emphasis => ['b'],
epigraph => [],
filename => ['tt'],
footnote => sub {
my($html, $clipboard) = @_;
$clipboard->{footnote_idx}++;
$clipboard->{footnotes}
.= qq(<li><a name="$clipboard->{footnote_idx}"></a>$$html</li>\n);
return qq(<a href="#$clipboard->{footnote_idx}">)
. "[$clipboard->{footnote_idx}]</a>";
},
function => ['tt'],
itemizedlist => ['ul'],
listitem => ['li'],
literal => ['tt'],
para => ['p'],
programlisting => ['blockquote', 'pre'],
property => ['tt'],
quote => {
prefix => '"',
suffix => '"',
},
sect1 => [],
'sect1/title' => ['h2'],
simplesect => [],
systemitem => sub {
my($html) = @_;
return qq(<a href="$$html">$$html</a>);
},
varname => ['tt'],
};
=head1 METHODS
=cut
=for html <a name="to_html"></a>
=head2 to_html(string xml_file) : string_ref
Converts I<xml_file> from XML to HTML. Dies if the XML is not well-formed or
if a tag is not handled by the mapping. See the initialization of
$_XML_TO_HTML_PROGRAM for the list of handled tags.
=cut
sub to_html {
my($self, $xml_file) = @_;
return _to_html(
'',
XML::Parser->new(Style => 'Tree')->parsefile($xml_file),
{});
}
#=PRIVATE SUBROUTINES
# _eval_child(string tag, array_ref children, string parent_tag, hash_ref clipboard) : string
#
# Look up $tag in context of $parent_tag to find operator, evaluate $children,
# and then evaluate the found operator. Returns the result of _eval_op.
# Modifies $children so this routine is not idempotent.
#
sub _eval_child {
my($tag, $children, $parent_tag, $clipboard) = @_;
return HTML::Entities::encode($children)
unless $tag;
# We ignore the attributes for now.
shift(@$children);
return _eval_op(
_lookup_op($tag, $parent_tag),
_to_html($tag, $children, $clipboard),
$clipboard);
}
# _eval_op(any op, string_ref html, hash_ref clipboard) : string
#
# Wraps $html in HTML tags defined by $op. If $op is a ARRAY, call
# _to_tags() to convert the simple tag names to form the prefix and
# suffix. If $op is a HASH, use the explicit prefix and suffix. If $op
# is CODE, call the subroutine with $html and $clipboard. Dies if
# $op's type is not handled (program error in $_XML_TO_HTML_PROGRAM).
#
sub _eval_op {
my($op, $html, $clipboard) = @_;
return 'ARRAY' eq ref($op)
? _to_tags($op, '') . $$html . _to_tags([reverse(@$op)], '/')
: 'HASH' eq ref($op)
? $op->{prefix} . $$html . $op->{suffix}
: 'CODE' eq ref($op)
? $op->($html, $clipboard)
: die(ref($op) || $op, ': invalid $_XML_TO_HTML_PROGRAM op');
}
# _lookup_op(string tag, string parent_tag) : hash_ref
#
# Lookup $parent_tag/$tag or $tag in $_XML_TO_HTML_PROGRAM and return.
# Dies if not found.
#
sub _lookup_op {
my($tag, $parent_tag) = @_;
return $_XML_TO_HTML_PROGRAM->{"$parent_tag/$tag"}
|| $_XML_TO_HTML_PROGRAM->{$tag}
|| die("$parent_tag/$tag: unhandled tag");
}
# _to_html(string tag, array_ref children, hash_ref clipboard) : string_ref
#
# Concatenate evaluation of $children and return the resultant HTML.
#
sub _to_html {
my($tag, $children, $clipboard) = @_;
return \(join('',
map({
_eval_child(@$children[2 * $_ .. 2 * $_ + 1], $tag, $clipboard);
} 0 .. @$children / 2 - 1),
));
}
# _to_tags(array_ref names, string prefix) : string
#
# Converts @$names to HTML tags with prefix ('/' or ''), and concatenates
# the tags into a string.
#
sub _to_tags {
my($names, $prefix) = @_;
return join('', map({"<$prefix$_>"} @$names));
}
1;
To keep the explanation brief and your attention longer, here is the
list of changes we made in the order they appear above:
The attribution mapping did not produce fully compliant
HTML. Attribute values must be surrounded by quotes.
The compilation of $_XML_TO_HTML_PROGRAM
was eliminated. This version is less complex, and is not perceptibly
slower.
_eval_op implements the SMOP operator based on its
type. Stas and I had a (too) long discussion about the formatting and
statement choices. Do you prefer the version above or would you like
to see an if/elsif/else construct? The former is functional, and the
latter is imperative.
_to_html was refactored to be a pure function by
replacing the while and $res
with a join and a map.
The implementation is no longer destructive. The
splice in the previous version modified
$children.
_eval_child is still destructive, however.
_to_tags was renamed from
_compile_tags_to_html.
Separate Concerns
This completes the second and final iteration of our DocBook XML to
HTML translator. The second iteration didn't change anything the
customer would notice, but it improved the program's quality.
Pride of craftsmanship is a motivating factor for most people. The
customer benefits directly when programmers are given the freedom to
fix up their code like this. Quality is the intangible output of
motivated people.
Craftsmanship plays a role in many professions. For example, one of
my favorite pastimes is baking bread. It's hard to bake well. There
are so many variables, such as ambient temperature, humidity, and
altitude. A skilled baker knows how to balance them all.
Anybody can bake bread by following a recipe. Just buy the ingredients,
and follow the step-by-step instructions. These days even inexpensive
kitchen appliances can do it.
While fresh bread from a bread machine tastes fine, it wouldn't win a
competition against a skilled baker.
My bread wouldn't either, but I might beat out a bread machine.
Becoming a skilled baker takes practice. Following a recipe
isn't enough. Indeed, most good bakers instinctively adjust the
recipe for temperature, humidity, altitude, and so on. They probably
won't follow the instructions exactly as written either. A simple recipe
tells them what the customer wants, that is, the ingredient
combination, but the know-how of a good baker would fill a book.
When you separate the what from the how, you get qualitative
differences that are impossible to specify. In the case of our
translator, the SMOP is the what and the interpreter is the how. The
quality is the succinctness of the mapping from DocBook to HTML.
The program is less than 100 lines of Perl without documentation, and
we can add new mappings with just one line. You can't get more
concise than that.
XP achieves quality by asking the customer what she wants and allowing
programmers to implement it the best they know how. The feedback
built in to XP gives the customer confidence that she's getting what
she wants, much like the feedback from testing tells the
programmers the code does what they want.
In plan-driven methodologies, the lines between the what and the how
are blurred. Specifications are often an excuse for the customers and
analysts to attempt to control how the programmers code.
While the aim is to ensure quality, the result is often the opposite.
The programmers act like unthinking automatons following
the specifications to the letter, even when they know the spec is
wrong.
The programmers are craftsmen, and XP respects their knowledge and
experience. The customer is also a craftsman, and XP teaches
programmers to respect her skills, too. XP separates concerns to
allow people to excel at their jobs.
Travel Light
When two craftsmen communicate, you don't hear much. Acronyms
abound. Their common experience lets them skip over the details lay
people need spelled out for them, like a recipe.
Perl, XP, and SMOP are terse. In Perl, you don't call the
regular_expression function, you say
//.
A skilled Perl programmer reads and writes //
instinctively. An XP customer writes brief story cards without a
thought about whether the programmers will understand them.
She knows the programmers will ask for elaboration.
There's no need for big fat stateful specifications and programs to
slow down the pipeline from the customer to the programmers to the
computer to deliver value to the users.
An experienced traveler knows that the more baggage you carry, the
harder it is to change planes, switch trains, and climb mountains.
Extreme Perl works best when you drop those bags, and hit the ground
running. Brief plans change easily when reality happens.
Concise code adapts to changed plans rapidly. Travel light, and get
there faster.
Footnotes
[1] DocBook: The Definitive Guide
by Norman Walsh and Leonard Muellner, available online at
http://www.docbook.org/tdg/en/html/docbook.html
[2] The most recent version is available with bOP.
[3] To be sure, I added it to her wedding vows
ex post facto.
[4] One reviewer would have preferred Perl POD. However, XP only works
when the customer speaks in one voice, so I ignored him for the sake
of matrimonial harmony.
[5] One of the reviewers implemented the simple approach, and the two
solutions are of comparable size and complexity.
[6] troff is a
text-formatting language for UNIX man pages and other documents.
After 25 years, it's still in use. For more information, visit
http://www.kohala.com/start/troff/troff.html.
[7] Eric Raymond's doclifter performs an even
more herculean task: the program converts troff to
DocBook. doclifter uses the implicit relations of
common usage patterns to extract higher-level semantics, such as
knowing that man page references usually match the regular expression
/\w+\(\d+\)/. The 6,000 line program is
written declaratively in Python, and can be downloaded from
http://www.tuxedo.org/~esr/doclifter.
[8] For purists, the implementation of
Bivio::IO::File->read does not lock the file,
although it could. But read can't, because it
is defined imperatively, and it cannot assume the file can be
read into the buffer in its entirety.
[9] For brevity, I've excluded Perl boilerplate, such as
package and use statements.
The final version is listed in full regalia, including header comments
for the subroutines.
[10] From Concepts, Techniques, and Models of Computer
Programming by Peter Van Roy and Seif Haridi, draft version
dated January 6, 2003, p. 109, available at
http://www.info.ucl.ac.be/people/PVR/book.pdf.
[11] In On Lisp, Paul Graham explains and
demonstrates this inside-out concept (page 34).
The book is out of print, but you can download it from
http://www.paulgraham.com/onlisp.html.
[12] On Lisp, Paul Graham, p. 33.
[13] Thanks to Greg Compestine for asking the questions: What are the
alternatives, and how do you know it is faster?
[14] XSLT (Extensible Stylesheet Language Transformations) is an XML
programming language for translating XML into XML and other output
formats (e.g., PDF and HTML). For more information, see
http://www.w3.org/Style/XSL/.
[15] Stas Bekman co-wrote the
book Practical mod_perl with Eric Cholet, who
lives in France. Stas is also an active contributor to the mod_perl code
base and documentation
(http://perl.apache.org).