A Codeaholic’s Book

The Banana Book was half completed when I started wondering how to render it. Up to that point, mainly by default, I had used Microsoft Word for text entry, because it is versatile and convenient, and because it includes a powerful spell checker and thesaurus, an especially useful feature for non-native English writers like me.

Word is a great tool for entering text, but it is substandard for rendering. Its output is adequate for reports or business letters, but it lacks the balanced, restful-to-the-eye overall impression one gets from a professionally typeset document, be it a book, brochure, article or thesis.

Still, many books are printed using Word. The results are poor compared to the output of top-of-the-line typesetting tools. The gold standards among computer scientists (and most scientists, for that matter) are Donald Knuth’s TeX and its extension LaTeX, especially (but not exclusively) for publications that include mathematical material.

Typesetting has fascinated software engineers ever since the trade was invented. It is a fantastic problem: formalizing rules to reach an aesthetic standard that is measured by human pleasure and comfort. It requires a level of detail and precision that can only appeal to the dedicated scientist. Among other achievements, Knuth devoted a full chapter of a book to a thorough description of the mathematics that governs the typesetting of the letter S (fortunately, it is the only Latin letter that requires such sophistication in its specification). And these efforts go beyond text: LilyPond is a fantastic software system used to typeset the most beautiful music scores ever.

Even Word’s output may not be all that poor in absolute terms. After all, many people don’t seem to mind writing and reading books typeset using Word alone. It may very well be that it is just too frustrating to settle for less when you’ve seen how much better your text can look using LaTeX, or another similarly advanced tool. And the very same goes for music typeset using LilyPond.

At some point, I bought Guy Kawasaki’s book “APE: Author, Publisher, Entrepreneur”, where I found a good deal of no-nonsense wisdom on the book writing journey. It contains many useful facts and tips, confirmed some of my opinions and made me change my mind on others. Something felt terribly wrong though: he never questions Word as the appropriate tool for typesetting books and, even worse, the process for producing an e-book from a Word document is described as a reasonably short but still totally manual procedure using Adobe’s InDesign: clicks, cuts and pastes, and more of the same.

And he did not even start to address advanced features I badly needed, like embedded math, complex figures, references, etc.

To a true codeaholic, this manual process is as wrong as can be. Mr Kawasaki may be a writing genius. He may know with absolute certainty when a manuscript is final, and he can then just spend the hour or two it takes to process it, prepare the various deliverables out of it, and there you go.

But I’m not confident in that way. Whatever I write I expect to be reread, changed, corrected, amended. I just can’t predict when a version will be truly final, ever. It is work in progress, again and again.

And going through a manual procedure, even if it takes as little as 30 minutes, every time the input changes, just isn’t an option for me. I need more comfort, more flexibility. I need the ability to change the manuscript, over and over again, without ever being blocked by having wrongly assumed that a given version must have been final.

The codeaholic way


Since Mr Kawasaki’s approach does not work for me, I did something radically different. In a nutshell, this is how The Banana Book is processed:

  • The source text is maintained using Word, with a handful of styles to identify the semantics (titles, etc.). The document is as legit as possible: footnotes are footnotes, math is typeset using math symbols, images are designed externally and pasted in the document, etc. The Word document can be printed as is for proof-reading.

Compare this with LaTeX, where the source is of little use for proofreading: it must always be performed on the typeset version of the document.

  • A VBA (Visual Basic for Applications) script goes through this Word document, and uses the exposed object model to produce an XML representation of the document, including the various paragraph styles, the footnotes, the math, funky characters, etc.
  • The same VBA script also produces the list of images (some of which are originally maintained as PowerPoint slides, while others are graphs produced using Graphviz, such as the simple flowchart shown above, or bitmapped screen captures) by using Word’s capability of exporting the document as a single HTML page with embedded images. The HTML page is discarded, but the exported images are then made available to the further formatting and typesetting passes of the process.
  • This XML file is cleaned up, normalized, and simplified by a script. (The VBA script described above is complex enough. It generates valid XML and that’s all it does. Ensuring that its output fulfills more semantic properties would have made it unreasonably complicated.)
  • The cleaned-up XML file describing the full document (just over 700 KB in size) is then fed to a script that produces a LaTeX source file, which is processed to produce the final PDF file to be sent to the printer, et voilà!
  • The three other outputs (HTML, EPUB and Kindle) are variations of each other, as they are all HTML-based. The Kindle version is produced by generating a slightly altered EPUB version, which is then fed to the KindleGen utility made available by Amazon.
  • When producing any of these HTML outputs, tables are extracted one by one, and the LaTeX renderer is used to convert them into separate bitmap images, to be included in the HTML document.

The tables were converted to bitmaps because the published specifications of the EPUB format indicated that plain HTML tables would not be supported, even though most of the readers I tried did.
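
To make the XML-to-LaTeX step more concrete, here is a minimal sketch in Python. It is not the script used for the book: the element names (`doc`, `p`, `footnote`), the `style` attribute and the style-to-command mapping are all hypothetical, chosen only to illustrate the idea of walking a cleaned-up XML tree and emitting LaTeX with proper escaping.

```python
import xml.etree.ElementTree as ET

# Characters that LaTeX treats specially, mapped to safe replacements.
LATEX_ESCAPES = {
    "\\": r"\textbackslash{}",
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
    "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}

# Hypothetical mapping from paragraph styles to LaTeX wrappers (assumption).
STYLE_TO_LATEX = {
    "title": ("\\chapter{", "}"),
    "heading": ("\\section{", "}"),
    "body": ("", ""),
}


def escape(text):
    """Escape LaTeX special characters, one character at a time."""
    return "".join(LATEX_ESCAPES.get(ch, ch) for ch in text or "")


def inline(node):
    """Render a node's text and its children in document order."""
    parts = [escape(node.text)]
    for child in node:
        if child.tag == "footnote":
            parts.append("\\footnote{" + inline(child) + "}")
        else:
            parts.append(inline(child))
        # Text that follows the child element inside the same parent.
        parts.append(escape(child.tail))
    return "".join(parts)


def to_latex(xml_source):
    """Convert the top-level <p> elements of the document to LaTeX."""
    root = ET.fromstring(xml_source)
    blocks = []
    for para in root.findall("p"):
        before, after = STYLE_TO_LATEX.get(para.get("style"), ("", ""))
        blocks.append(before + inline(para) + after)
    return "\n\n".join(blocks)


SAMPLE = """<doc>
  <p style="title">Costs &amp; benefits</p>
  <p style="body">Savings reach 50%<footnote>See the #1 report.</footnote> overall.</p>
</doc>"""

if __name__ == "__main__":
    print(to_latex(SAMPLE))
```

Escaping character by character, rather than with chained string replacements, avoids escaping the backslashes that earlier replacements have just introduced.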

This is the helicopter view of the process. As always, the devil lies in the details.

Dealing with references

Word provides an extensive facility for managing bibliographic references. It is far more sophisticated than my modest needs require, and awkward to drive programmatically (for instance, to convert these references into BibTeX entries for LaTeX). I soon decided not to use it, and built a simplistic bespoke system instead. Bibliographic references are put in footnotes, prefixed with a bracketed code that identifies the kind of reference, such as:

[B:Aho2006] “Compilers: Principles, Techniques and Tools” (a.k.a. The Dragon Book), Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman, Gradiance 2006, ISBN-13: 978-0321493026

or a web reference:

[W:Schamel2012] http://www.atchistory.org/History/checklst.htm.

These “[B:…]” and “[W:…]” markers are the only extensions, the sole artefacts put in the text that mean nothing to Word but are required by my rendering machinery. Other than this, all I used are native Word artefacts, decoded and mapped when going to the XML representation (explicit font changes, exotic characters, tables, images, footnotes…).

Ironically, I ended up not even targeting LaTeX’s own BibTeX system for references. It was much simpler to just format the two lists of references (bibliographic and web-based) directly.

These footnotes receive special treatment when mapped onto the intermediate XML representation. Segregating web references from bibliographic ones allows me to process them differently, using a different color when printed or making them clickable when targeting HTML or one of the e-book formats.
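
The marker convention is simple enough to be recognized with a regular expression. The following Python sketch shows one way to do it; the function and class names are mine and purely illustrative, and it assumes the marker always sits at the very start of the footnote text, as in the two examples above.

```python
import html
import re
from typing import NamedTuple

# "[B:key] ..." marks a bibliographic reference, "[W:key] ..." a web one.
MARKER = re.compile(
    r"^\[(?P<kind>[BW]):(?P<key>[^\]]+)\]\s*(?P<body>.*)$", re.DOTALL
)


class Reference(NamedTuple):
    kind: str  # "B" (bibliographic) or "W" (web)
    key: str   # e.g. "Aho2006"
    body: str  # the rest of the footnote


def parse_footnote(text):
    """Return a Reference, or None for an ordinary footnote."""
    match = MARKER.match(text.strip())
    if match is None:
        return None
    return Reference(match["kind"], match["key"], match["body"].strip())


def split_references(footnotes):
    """Segregate bibliographic and web references, skipping plain footnotes."""
    books, web = [], []
    for text in footnotes:
        ref = parse_footnote(text)
        if ref is None:
            continue
        (books if ref.kind == "B" else web).append(ref)
    return books, web


def render_html(ref):
    """Web references become clickable links; others stay plain text."""
    if ref.kind == "W":
        url = html.escape(ref.body, quote=True)
        return f'<a href="{url}">{url}</a>'
    return html.escape(ref.body)


if __name__ == "__main__":
    books, web = split_references([
        "[B:Aho2006] Compilers: Principles, Techniques and Tools",
        "An ordinary footnote.",
        "[W:Schamel2012] http://www.atchistory.org/History/checklst.htm",
    ])
    print([ref.key for ref in books], [render_html(ref) for ref in web])
```

Keeping the two lists separate from the start is what makes it cheap to give each kind its own rendering for each output format.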

The devil lies in hyphenation

This automated process, which starts from the Word document, does not allow tweaking the end result. I just can’t force a non-breaking space, or apply any other local fix, to override the generally applicable automated behavior for the sake of aesthetics in one specific spot.

It is not much of a restriction when targeting e-books. The text must be resizable, reformatted on the fly if the user changes the font size or uses another reading device. One must rely on the dynamic formatting engine of the reader to do the best possible job in all circumstances.

In print, formatting and more specifically hyphenation are different matters altogether. TeX’s, and consequently LaTeX’s, hyphenation algorithm is a form of black art (a full PhD thesis has been written on this sole topic). When in doubt, especially when typesetting short lines calibrated for a book rather than long lines calibrated for A4 or Letter-sized paper, it will overflow beyond the right margin rather than produce ugly-looking output with too many spaces between words. TeX and LaTeX were never meant to be used as rendering engines, fed with input generated automatically by a process that cannot react to error messages. They were meant to be used by humans, who could see that some output was not visually adequate and take action accordingly.

When its hyphenation algorithm does not achieve something adequate out of the box, TeX will produce something (deliberately?) ugly, while providing the user with a number of mechanisms to tweak the source ad hoc and improve the rendering. But as explained above, these ad hoc tweaks were not available to me.
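
An automated process can at least find out where TeX gave up, because each line that overflows the margin is reported in the log file as an “Overfull \hbox” warning. The sketch below, in Python, scans a log for these warnings. It is an illustration rather than a part of the process described here; the function name and the tolerance threshold are assumptions of mine.

```python
import re

# TeX reports overflowing lines in its log in this form:
#   Overfull \hbox (12.5pt too wide) in paragraph at lines 40--44
OVERFULL = re.compile(
    r"Overfull \\hbox \((?P<excess>\d+(?:\.\d+)?)pt too wide\) "
    r"in paragraph at lines (?P<first>\d+)--(?P<last>\d+)"
)


def overfull_boxes(log_text, tolerance_pt=1.0):
    """List (first_line, last_line, excess_pt) for overflows above tolerance.

    The line numbers refer to the generated LaTeX source, not to the
    original Word document.
    """
    hits = []
    for match in OVERFULL.finditer(log_text):
        excess = float(match["excess"])
        if excess > tolerance_pt:
            hits.append((int(match["first"]), int(match["last"]), excess))
    return hits


if __name__ == "__main__":
    sample_log = (
        "Overfull \\hbox (12.5pt too wide) in paragraph at lines 40--44\n"
        "Overfull \\hbox (0.4pt too wide) in paragraph at lines 90--91\n"
    )
    for first, last, excess in overfull_boxes(sample_log):
        print(f"lines {first}-{last}: {excess}pt beyond the margin")
```

A report like this does not fix anything by itself, but it points at the paragraphs that need one of the remedies discussed next.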

Since I could not alter the input on a case-by-case basis as the casual LaTeX user would have, I resorted to three devices to address hyphenation-related formatting issues:

  • In addition to specifying hyphenation ad hoc in the text, LaTeX also supports the declaration of a hyphenation strategy for specific words, to be put in the document’s preamble, and which applies to all occurrences of these words in the entire document:

\hyphenation{ple-tho-ra opti-mi-za-tion re-pro-du-ci-ble phi-lo-so-pher}

  • By default, TeX won’t hyphenate a word that contains a dash or a period. By overriding this behavior, I allowed for more words to be hyphenated, somewhat alleviating the pressure and giving TeX more leeway when trying to produce an aesthetically appealing result.
  • And finally, when nothing else worked, I did what I had vowed never to do, namely change the text, replace words or even remove full sentences for the sole sake of allowing TeX to hyphenate adequately. I resorted to such extreme measures a dozen times in total. This is of course the last resort. All this work would be lost if the book were to be printed in another format, with a different font or font size, etc.
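
Since the first of these devices lives in the preamble, it fits well in a generated document: the list of problem words can be kept in a plain list and turned into a single `\hyphenation` declaration at build time. Here is a small Python sketch of that idea. The function name is hypothetical, and restricting words to unaccented letters is a simplifying assumption.

```python
import re

# A pattern is made of letters, with hyphens marking the allowed break points.
VALID_PATTERN = re.compile(r"^[A-Za-z]+(?:-[A-Za-z]+)*$")


def hyphenation_declaration(patterns):
    """Build one \\hyphenation{...} line from patterns like 'ple-tho-ra'.

    Duplicates are merged (the last pattern given for a word wins) and the
    result is sorted by word so that the generated preamble is stable.
    """
    by_word = {}
    for pattern in patterns:
        if not VALID_PATTERN.match(pattern):
            raise ValueError(f"not a valid hyphenation pattern: {pattern!r}")
        word = pattern.replace("-", "").lower()
        by_word[word] = pattern.lower()
    body = " ".join(by_word[word] for word in sorted(by_word))
    return "\\hyphenation{" + body + "}"


if __name__ == "__main__":
    print(hyphenation_declaration(
        ["ple-tho-ra", "opti-mi-za-tion", "ple-tho-ra"]
    ))
```

Rejecting malformed entries early is deliberate: a stray character in the list is easier to spot in a Python error than in a LaTeX one.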


The story told here is not made up or sugar-coated to make for a more compelling narrative. This really is how this book is produced, and it works. Over the last few months of its design, I have produced hundreds of intermediate versions, sometimes after just adding or removing a comma, by a fully automated process.

As a matter of fact, this process, even if fully automated, is a bit too long for comfort: it takes almost five minutes, over two-thirds of which are spent in the VBA export script, which I should/would/will/might optimize at some point in the future.

It has delivered on its main promise, namely the automated generation of the various publishable artefacts from a single central source. This has made the editing process way more flexible, as changes, even essential ones, could be performed at any stage, without ever having to worry about manual work that would need to be redone or thrown away. It has also provided nice touches, such as clickable web references, which make for a more intuitive read on a tablet or as HTML pages.

This does not mean that the process itself is flawless. It can still be improved, for instance with support for an index in the electronic versions of the book.

Admittedly, there is less of a need for such a facility for electronic publishing, as words can be searched in the text in ways that no paper printed version will allow.

The solution described here may be qualified as a mild form of over-engineering, something I would refer to as predictive engineering: a solution to a problem one could have predicted, but without absolute certainty. It could have been a wasted effort, the automation of a process that may not have required automation in the first place. In this case though, it has proven its value over and over again by producing hundreds of versions of the text and saving hours of manual labor in the process.

This is not the unique solution to a unique problem. Using different tools, a different process, for a different kind of book perhaps, one could achieve good results as well.

Let me know your experiences!

15-07-2016