This appendix describes fft2html
, an Icon
program that translates free-format text into HTML.
Free-format text is text that is formatted by hand using white space and other graphics characters (e.g., asterisks as bullets); the alternative to free-formatted text is text containing explicit formatting commands, such as those provided by HTML or Tex.
<fft2html.icn
>= [D->]
define white_space ' \t'
record block(type, text)
procedure main()
local blocks
blocks := []
while put(blocks, read_block(&input))
recognize_headers(blocks)
recognize_pagenos(blocks)
recognize_image(blocks)
recognize_list_items(blocks)
recognize_definition_list_items(blocks)
blocks := recognize_block_paragraphs(blocks)
recognize_paragraphs(blocks)
blocks := recognize_definition_lists(blocks)
blocks := recognize_lists(blocks)
recognize_simple_tables(blocks)
recognize_shopping_lists(blocks)
blocks := recognize_ei_definitions(blocks)
paragraph_block_paragraphs(blocks)
to_html(blocks, &output)
end
The first translation step is to group the input lines into blocks. A line is the sequence of input characters between two successive newline characters; implicitly, input always starts and ends with a newline. A blank line is a line containing no characters, or only white-space characters (i.e., space, vertical and horizontal tabs, form feeds, and so on). A block is the sequence of lines appearing between two consecutive blank lines; only blocks containing one or more lines are of interest in this problem, so implicitly ``block'' means ``non-empty block.''
Read and return a block; the lines in the block come from the input file
i
.
<fft2html.icn
>+= [<-D->]
procedure read_block(i)
local b, line
while (line := (read(i) | fail)\1) & is_blank(line)
b := [detab(line)]
while (line := read(i)) & not(is_blank(line)) do put(b, detab(line))
return block(&null, b)
end
A blank line contains either no characters or only white-space characters.
<fft2html.icn
>+= [<-D->]
procedure is_blank(l)
return ((*l = 0) | (l ? (tab(many(white_space)) & pos(0))))
end
A text block b
is a header if
b
has no type assigned and the text associated with b
contains
exactly one line, and
b
contains at least two words, the first of which is
made up of only digits and periods, and
1.
and .5
are not header levels, but 1.0
is).
1.0
), the header's
level (i.e., the number of periods in the header tag; note that header tags
such as 1.0
have level 1), and the header text.
<fft2html.icn
>+= [<-D->]
record header_block(level, tag, text)
procedure recognize_headers(blocks)
local b, ln, lv, ok
every b := !blocks do
if /b.type & (*b.text = 1) then {
ln := split_first_word(b.text[1])
if (ln[1] ? (tab(many('0123456789.')) & pos(0))) & (*ln[2] > 1) then {
lv := split_levels(ln[1])
if *lv > 1 then
ok := 1
every if *(!lv) = 0 then ok := 0
if ok = 1 then {
b.type := header_block(*lv, ln[1], ln[2])
if lv[-1] == "0" then b.type.level -:= 1
}
}
}
end
Split the input line l
just after the first word of l
; return the
pieces in a two-element list, the second element of which may be empty if l
contains only one word. White space before and after the first word is deleted
in the output; all other white space is maintained.
<fft2html.icn
>+= [<-D->]
procedure split_first_word(l)
local pieces
(l || " ") ? {
tab(many(white_space))
wd := tab(upto(white_space))
tab(many(white_space))
return [wd, (tab(0)[1:-1] | "")]
}
return pieces
end
<fft2html.icn
>+= [<-D->]
procedure split_levels(lvl)
local l
l := []
(lvl || ".") ? while (put(l, tab(upto('.'))) & move(1))
return l
end
A text block b
is a paragraph if
b
has no type assigned, and
b
have the
same (more or less) indentation, and
b
is indented eight (or
so) spaces with respect to the rest of the lines.
A paragraph has no interesting features other than its text, so the paragraph data structure needs no fields. Unfortunately, Icon records need at least one field.
<fft2html.icn
>+= [<-D->]
record paragraph_block(an_unused_field)
procedure recognize_paragraphs(blocks)
local b, indents, i
every b := !blocks do
if /b.type then {
indents := get_indentation(b.text)
i := indents[1] - (if *indents > 1 then indents[2] else 0)
if (7 <= i) & (i <= 9) then
b.type := paragraph_block()
}
end
A block paragraph is a paragraph in which the first line is not indented. A block paragraph results when a regular paragraph is split by a list or a page break, for example.
A text block b
is a block paragraph if
b
has no type assigned, and
b
have no indentation.
If a block paragraph is separated from a paragraph or another block paragraph by a page number or an image, move the block paragraph over the page number or image and merge it with the other paragraph.
<fft2html.icn
>+= [<-D->]
record block_paragraph_block(indentation)
procedure recognize_block_paragraphs(blocks)
local b, indent, i, s
s := *blocks
every i := s to 1 by -1 do
if /blocks[i].type then
if indent := uniform_indentation(blocks[i].text) then
if (type(blocks[i - 1].type) == ("pageno_block" | "image_block")) then
if ((type(blocks[i - 2].type) == "paragraph_block") &
(indent = 0)) |
((type(blocks[i - 2].type) == "block_paragraph_block") &
(indent = blocks[i - 2].type.indentation)) then {
blocks[i - 2].text |||:= blocks[i].text
blocks := blocks[1:i] ||| blocks[i+1:0]
}
else blocks[i].type := block_paragraph_block(indent)
else blocks[i].type := block_paragraph_block(indent)
return blocks;
end
Create and return a list of integers i
with the
property that i[j]
equals the number of space characters at the
start of the line text[j]
. This code looks at only space
characters; other white-space characters --- tab, in particular --- are not
taken into account by this code.
<fft2html.icn
>+= [<-D->]
procedure get_indentation(text)
local indents
indents := []
every (!text ? put(indents, (many(' ') - 1) | 0)\1)
return indents
end
Return a copy of text line l
in which all tabs have been replaced by an
appropriate number of spaces (each tab stop is assumed to be eight characters
to the right of previous tab stop, starting at character position one on the
left).
<fft2html
>=
procedure detab(l)
local i
while i := upto('\t', l) do {
l := (l[1 : i] | "") || repl(" ", 8 - ((i - 1) % 8)) || (l[i+1 : 0] | "")
return l
end
A text block b
is a page number if
b
has no type assigned, and
b
, and
<fft2html.icn
>+= [<-D->]
record pageno_block(pno)
procedure recognize_pagenos(blocks)
local b, ln
every b := !blocks do
if /b.type & (*b.text = 1) then {
ln := split_first_word(b.text[1])
if (ln[1][1] == "M") & is_page_no(ln[2]) then
b.type := pageno_block(ln[2])
}
end
A text block b
is an image if
b
has no type assigned, and
b
, and
image
and the
second of which is a file name containing the image data.
image_block
data structure holds the name of the file
containing the image.
<fft2html.icn
>+= [<-D->]
record image_block(fname)
procedure recognize_image(blocks)
local b, ln
every b := !blocks do
if /b.type & (*b.text = 1) then {
ln := split_first_word(b.text[1])
if (ln[1] == "image")then
b.type := image_block(ln[2])
}
end
A string s
is a page number if it's an integer possibly proceeded by a
letter and a hyphen (e.g. B-19
, an appendix page number).
<fft2html.icn
>+= [<-D->]
procedure is_page_no(s)
(*s > 0) | fail
(s ? (tab(many(&digits)) & pos(0))) & return
(s ? (tab(many('ABCDEFGHIJKLMNOPQRSTUVWXYZ')) & ="-" &
tab(many(&digits)) & pos(0))) & return
end
A text block b
is a list item if
b
has no type assigned, and
b
have the
same (more or less) indentation, and
b
is out-dented
with respect to the rest of the lines, and
A list entry has no interesting features other than its text, so the paragraph data structure needs no fields. Unfortunately, Icon records need at least one field.
<fft2html.icn
>+= [<-D->]
record list_item_block(tag, indentation, text)
procedure recognize_list_items(blocks)
local b, indents, ln
every b := !blocks do
if /b.type then {
indents := get_indentation(b.text)
if (*indents = 1) | (indents[1] < indents[2]) then {
ln := split_first_word(b.text[1])
if (ln[1] ? (tab(many(&digits ++ &letters)) & ="." & pos(0))) then {
b.type := list_item_block(ln[1][1:-1], indents[1], copy(b.text))
b.type.text[1] := ln[2]
}
}
}
end
A list is recognized by the sequence of its list items; generally, a consecutive sequence of list items belong to the same list.
<fft2html.icn
>+= [<-D->]
record list_start_block(unused_field)
record list_end_block(unused_field)
procedure recognize_lists(blocks)
return _recognize_lists(blocks, "list_item_block",
list_start_block(), list_end_block())
end
<fft2html.icn
>+= [<-D->]
procedure _recognize_lists(blocks, item_type, list_start, list_end)
local i
<migrate page numbers>
return listify(blocks, item_type, list_start, list_end)
end
Unfortunately, there are a number of things that can confuse the simple list-recognition algorithm described in the previous paragraphs. If a list is split across pages, the page number will interrupt the list-entry sequence. Because the idea of a page is flexible in hypertext, one way to deal with a list split across pages is to merge it back into a single list by migrating the page number to the end of the split list. Note this algorithm works only if the list is split across one page; otherwise, the migrating page number bumps into the following page number and the migration stops.
<migrate page numbers>= (U->) every i := 1 to *blocks - 2 do if (type(blocks[i].type) == item_type) & (type(blocks[i + 1].type) == "pageno_block") & (type(blocks[i + 2].type) == item_type) then blocks[i + 1] :=: blocks[i + 2]
Given the block list blocks
, surround the sequences of list items of type
itemt
with the start- and end-list markers lists
and liste
respectively. Changes in list-item indentation are used to keep track of
nested lists.
The stack pagenos
keeps track of page numbers encountered amid the
list-item sequences, as would occur when a list is spread over several pages.
It is difficult to know what to do locally with the stacked-up page numbers;
the best answer is to globally renumber. However, to get going, only the
oldest page number in the stack is restored to blocks
.
<fft2html.icn
>+= [<-D->]
procedure listify(blocks, itemt, lists, liste)
local newblks, indents, b, pagenos
newblks := []
indents := [0]
pagenos := []
every b := !blocks do {
if type(b.type) == itemt then {
if b.type.indentation > indents[1] then {
put(newblks, block(copy(lists)))
push(indents, b.type.indentation)
}
else if b.type.indentation < indents[1] then {
put(newblks, block(copy(liste)))
pop(indents)
if *indents = 1 & *pagenos > 0 then {
put(newblks, pagenos[-1])
pagenos := []
}
}
put(newblks, b)
}
else if type(b.type) == "pageno_block" then
if *indents = 1 then put(newblks, b)
else push(pagenos, b)
else {
while *indents > 1 do {
put(newblks, block(copy(liste)))
pop(indents)
}
if *pagenos > 0 then {
put(newblks, pagenos[-1])
pagenos := []
}
put(newblks, b)
}
}
every 2 to *indents do put(newblks, block(copy(liste)))
if *pagenos > 0 then put(newblks, pagenos[-1])
return newblks
end
A shopping list is a small, informal list, along the lines of
alphaA shopping list differs from a regular list; a shopping list has no bullets or tags, shopping-list items are one line and short, and shopping lists have no inter-item spacing.
milk
walk the dog
<fft2html.icn
>+= [<-D->]
record shopping_list_block(indentation)
procedure recognize_shopping_lists(blocks)
every b := !blocks do
if (type(b.type) == "block_paragraph_block") & (b.type.indentation > 0) &
unfilled_text(b.text) & not(internal_indent(b.text)) then
b.type := shopping_list_block(b.type.indentation)
end # recognize_shopping_lists
Translate the structure in blocks
into HTML, writing the result to out
.
<fft2html.icn
>+= [<-D->]
procedure to_html(blocks, out)
write(out, "<html>")
write(out, "<body>")
every b := !blocks do {
case type(b.type) of {
"block_paragraph_block":
if b.type.indentation = 0 then {
write(out, "<p>")
every write(out, !b.text)
write(out, "</p>")
}
else {
write(out, "<blockquote>")
every write(out, !b.text)
write(out, "</blockquote>")
}
"paragraph_block": {
write(out, "<p>")
every write(out, !b.text)
write(out, "</p>")
}
"ei_definition_block": {
write(out, "<blockquote><dl><dt>", b.text[1], "<dd>")
every write(out, b.text[2 to 5], "<br>")
write(out, "</dl></blockquote>")
}
"header_block":
write(out, "<h", b.type.level, ">",
b.text[1],
"</h", b.type.level, ">")
"pageno_block": {
}
"image_block":
write(out, "<p align = center><img src = \"", b.type.fname, "\"></p>")
"list_start_block":
write(out, "<ol>")
"list_item_block": {
write(out, "<li>")
every write(out, !b.type.text)
}
"list_end_block":
write(out, "</ol>")
"definition_list_start_block":
write(out, "<blockquote><dl>")
"definition_list_item_block": {
write(out, "<dt>", b.type.tag, "<dd>")
every write(out, !b.type.text)
}
"definition_list_end_block":
write(out, "</dl></blockquote>")
"shopping_list_block": {
write(out, "<blockquote>")
every write(out, !b.text, "<br>")
write(out, "</blockquote>")
}
"simple_table_block":
do_html_simple_table(out, b)
default:
write(&errout, "unknown block type \"", type(b.type), "\".")
}
}
write(out, "</body>")
write(out, "</html>")
end
Generate HTML code for the simple table simtab
; the HTML code is
written to out
.
<fft2html.icn
>+= [<-D->]
procedure do_html_simple_table(out, simtab)
local e, cols, left, right;
cols := simtab.type.right_column_edges ||| [0]
write(out, "<blockquote><table>")
every e := !simtab.text do {
write(out, "<tr>")
left := 1
every right := !cols do {
write(out, "<td align = left>", e[left:right], "</td>")
left := right
}
write(out, "</tr>")
}
write(out, "</table></blockquote>")
end
A text block b
is a definition-list item if
b
has no type assigned, and
b
have the
same (more or less) indentation, and
b
is out-dented
with respect to the rest of the lines
The tags distinguish a list item from a definition-list item. A list-item tag is a single letter or digit, while a definition-list-item tag can contain several words (although it is assumed to be small enough to fit between the left edge of the page and the indentation amount). List items should be recognized before definition-list items.
This code also assumes that definition lists are not nested, and so assigns each item a constant indentation of ten.
<fft2html.icn
>+= [<-D->]
record definition_list_item_block(tag, indentation, text)
procedure recognize_definition_list_items(blocks)
local b, indents, ln
every b := !blocks do
if /b.type & (*b.text > 1) then {
indents := get_indentation(b.text)
if (indents[1] < indents[2]) & (*set(indents[2:0]) = 1) then {
tag := b.text[1][1:indents[2]]
b.type := definition_list_item_block(tag, 10, copy(b.text))
b.type.text[1] := b.text[1][indents[2]:0]
}
}
end
A list is recognized by the sequence of its list items; generally, a consecutive sequence of list items belong to the same list.
<fft2html.icn
>+= [<-D->]
record definition_list_start_block(unused_field)
record definition_list_end_block(unused_field)
procedure recognize_definition_lists(blocks)
return _recognize_lists(blocks, "definition_list_item_block",
definition_list_start_block(),
definition_list_end_block())
end
A simple table is a block paragraph in which every line in the block paragraph has a space in the same set of two or more columns. To avoid trivial definitions, the block paragraph must have more than one line.
<fft2html.icn
>+= [<-D->]
record simple_table_block(right_column_edges)
procedure recognize_simple_tables(blocks)
local b, sp, common
every b := !blocks do {
# write(&errout, type(b.type))
if (type(b.type) == "block_paragraph_block") & (*b.text > 2) then {
sp := space_profile(b.text)
common := sp[1]
every common := ol_intersect(common, !sp)
# write(&errout, "*common = ", *common)
if *common > 1 then {
# write(&errout, "indent = ", b.type.indentation,
# ", common[1] = ", common[1])
if b.type.indentation = common[1] - 1 then pop(common)
b.type := simple_table_block(common)
}
}
}
end # recognize_simple_tables
A block of text is unfilled if a majority of the lines in the text are significantly shorter than the maximum line size (here assumed to be 80 characters).
``Significantly shorter'' has a technical meaning: if the first word in the
line following line l
could fit on line l
without exceeding the maximum
line size, then line l
is significantly shorter than the maximum line
size. The code below, however, cheats and assumes a line of less than seventy
characters is significantly shorter.
<fft2html.icn
>+= [<-D->]
procedure unfilled_text(text)
local little, l
little := 0
every l := !text do if *l < 70 then little +:= 1
return ((little > 0) & (*text/little < 2))
end # unfilled_text
A block of text is uniformly indented if each line
has more or less the same amount of indentation. More precisely, a strict
majority of lines are indented i
spaces and the remainder of the
lines are indented i
plus or minus one. Return the uniform
indentation i
if text
is uniformly indented, or fail
of text
is not uniformly indented.
<fft2html.icn
>+= [<-D->]
procedure uniform_indentation(text)
local minor, i
indents := frequencies(get_indentation(text))
if *indents > 3 then fail
minor := 0
every i := 1 to *indents - 1 do minor +:= indents[i][2]
if indents[-1][2] <= minor then fail
every i := 1 to *indents - 1 do
if (abs(indents[-1][1] - indents[i][1]) > 1) then fail
return indents[-1][1]
end
A block of text has an internal indentation if the second word of every
line starts at the same column (give or take). If text
has an internal
indent, return it, otherwise fail.
<fft2html.icn
>+= [<-D->]
procedure internal_indent(text)
local ilist, f
ilist := []
every !text ? (tab(many(white_space))\1 & tab(upto(white_space))\1 &
tab(many(white_space))\1 & put(ilist, &pos))
if *text = *ilist then {
f := frequencies(ilist)
if *f = 1 then return f[1][1]
if (*f = 2) & (abs(f[1][1] - f[2][1]) < 2) then return f[2][1]
}
end # internal_indent
Given a list of integers ilist
, return a list of pairs giving the frequency
of occurrence in ilist. The first coordinate of a pair is an integer from
ilist
, the second coordinate is the number of times the integer occurred in
ilist
. The pairs are sorted in increasing order on frequency.
<fft2html.icn
>+= [<-D->]
procedure frequencies(ilist)
local t, i
t := table(0)
every i := !ilist do t[i] +:= 1
return sort(t, 2)
end
<fft2html.icn
>+= [<-D->]
procedure paragraph_block_paragraphs(blocks)
local b
every b := !blocks do
if type(b.type) == "block_paragraph_block" & (*b.text = 1) then {
if b.type.indentation = 0 then
stop("unindented stand-alone block paragraph:\n \"" ||
b.text[1] || "\"")
b.type := paragraph_block(b.type.indentation)
}
end
An external-interface definition is text of the form
<fft2html.icn
>+= [<-D->]
record ei_definition_block(unused_field)
procedure recognize_ei_definitions(blocks)
local new_blks, i, j, b, saved
new_blks := []
i := 1
while i <= *blocks do {
b := blocks[i]
if (type(b.type) == "block_paragraph_block") & (*\(b.text) = 1) &
(b.text[1] ? (tab(many(' '))\1 & ="External Interface: ")) then {
put(new_blks, block(ei_definition_block(), [b.text[1]]));
j := 1
saved := []
while j <= 4 do {
b := blocks[i +:= 1]
if type(b.type) == ("pageno_block" | "image_block") then put(saved, b)
else {
put(new_blks[-1].text, b.text[1])
j +:= 1
}
}
new_blks |||:= saved
}
else
put(new_blks, b)
i +:= 1
}
return new_blks
end
A space profile for a block of text containing n lines is an n element list in which list element i corresponds to text line i. Each list element is itself a list of integers; each integer represents the position of the right-most end of a non-empty, maximal sequence of spaces within the corresponding text line. For example, j in list element i means text line i has a sequence of spaces that ends a column j; that is, column j contains a space and column j+1 contains a non-space character (note this assumes first that all white-space characters have either been eliminated or translated into spaces, and second that text lines don't end with spaces).
The integers within a list item are given in increasing order; that is, they follow the left-to-right order of the text line; the list items within the list follow the top-to-bottom order of the lines in the text.
<fft2html.icn
>+= [<-D->]
procedure space_profile(text)
local l, sp, i
sp := []
every l := !text do {
put(sp, [])
l ? repeat {
tab(upto(' ') | &pos)
i := many(' ') | break
put(sp[-1], i)
tab(i)
}
}
return sp
end # space_profile
Return the intersection of two increasingly ordered integer lists l1
and l2
; the intersection is itself returned as an increasingly ordered
integer list.
<fft2html.icn
>+= [<-D]
procedure ol_intersect(l1, l2)
local common, e1, e2, next_l1, next_l2
common := []
next_l1 := create !l1
next_l2 := create !l2
e1 := @next_l1 | (return common)
e2 := @next_l2 | (return common)
repeat
if e1 < e2 then e1 := @next_l1 | (return common)
else if e1 > e2 then e2 := @next_l2 | (return common)
else {
put(common, e1)
e1 := @next_l1 | (return common)
e2 := @next_l2 | (return common)
}
end