.\"
.\" troff -ms % | lpr
.\"
.\" revision date - change whenever this file is edited
.ds RD 5 April 1991
.nr PO 1.2i	\" page offset 1.2 inches
.nr PD .7v	\" inter-paragraph distance
.\"
.EH 'RTF-to-troff Translation'- % -''
.OH ''- % -'RTF-to-troff Translation'
.OF 'Revision date:\0\0\*(RD''Printed:\0\0\n(dy \*(MO 19\n(yr'
.EF 'Revision date:\0\0\*(RD''Printed:\0\0\n(dy \*(MO 19\n(yr'
.\"
.\" subscript strings
.ds < \s-2\v'.4m'
.ds > \v'-.4m'\s+2
.\"
.\" I - italic font (taken from -ms and changed)
.de I
.nr PQ \\n(.f
.if t \&\\$3\\f2\\$1\\fP\&\\$2
.if n .if \\n(.$=1 \&\\$1
.if n .if \\n(.$>1 \&\\$1\c
.if n .if \\n(.$>1 \&\\$2
..
.TL
rtf2troff
.sp .5v
An RTF to troff Translator
.AU
Paul DuBois
dubois@primate.wisc.edu
.AI
Wisconsin Regional Primate Research Center
Revision date:\0\0\*(RD
.NH
Introduction
.LP
.I rtf2troff
is a document translator that takes an input file in RTF format
and writes output suitable for processing by
.I troff .
It has a number of features, including:
.IP \(bu
Production of output that, when run through
.I troff ,
on rare occasions possesses a mild resemblance to the original document.
.IP \(bu
Voluminous, inefficient and largely incomprehensible source code (available for free and overpriced at that).
.IP \(bu
Often-incomprehensible output, especially for tables.
.IP \(bu
Complete lack of support for formulas.
.IP \(bu
Support for underlining and strikethrough
that generates prize-winning amounts of output.
Besides its incredible bulk, this has the additional security feature
of being impossible to make sense of for editing.
The reckless user is, however, given the option of disabling this valuable
form of protection.
.IP \(bu
Inability to write
.I nroff -specific
output, or output specific to the
.I \-me ,
.I \-mm
or
.I \-ms
macro packages.
.IP \(bu
Blissful ignorance of the fact that there are any fonts other than the
default, except for purposes of boldface and italics.
.IP \(bu
Ability to completely lose any unrecognized input, or, for variety, core dump.
.IP \(bu
Merges text of footnotes right into the main body of the document.
You didn't really want 'em at the bottom of the page anyway, did you?
.LP
Perhaps over time this list will shrink and this section can be removed.
But don't hold your breath.
.LP
The intended audience of this document is not those who might use
.I rtf2troff
for daily work, but programmers who want to know why it's written the way
it is.
As such, it discusses aspects of the implementation.
The source code should be consulted for further reference.
Actually, I suspect you're reading this because you've already looked
at the source and couldn't make any sense of it!
.NH
Random Implementation Notes
.LP
In the output produced by
.I rtf2troff ,
a distinction is made between
.I content
(or document) text and
.I formatting
(or control)
text.
Content text consists of the characters that are actually supposed
to appear in the finished document.
Formatting text affects how those characters appear.
Formatting text may be inline with content text (e.g., ``\efR'', ``\es+3'')
or on a line by itself (e.g., ``.ft R'', ``.ps +3'').
.NH 2
State Maintenance Issues
.LP
It is possible to write out control language whenever any changes are
made to document, section, paragraph or character formatting
properties, but that
would result in more output than is necessary.
Instead,
.I rtf2troff
maintains notions of two kinds of state:
an
.I internal
state, which is the current state of formatting properties as indicated
by control words encountered in the RTF input stream,
and
.I written
state, which tracks the state corresponding to the
.I troff
control language that has been written to the output (i.e., the state
that
.I troff
will be in).
.LP
Changes to formatting properties are simply accumulated in the
internal state without writing any output.
When content text is to be written out, a check is made for any
discrepancy between the accumulated changes in the internal state, and the
written state.
If there are any differences, control language is generated to bring
the written state into sync with the internal state, before writing
the content text.
This guarantees that the correct formatting properties will apply to
the text, and minimizes the amount of control language generated.
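.LP
The idea can be sketched roughly as follows.
This is an illustration in Python, not the C code of
.I rtf2troff ;
the property names, the function
.I sync_state
and the use of absolute values in the generated requests are all
assumptions made to keep the example short.
.DS
```python
def sync_state(internal, written):
    """Return the control lines needed to make `written` match `internal`.

    Both arguments are plain dicts mapping a property name to its value.
    `written` is updated in place, mirroring what troff would now believe.
    """
    requests = {"font": ".ft {}", "size": ".ps {}", "indent": ".in {}i"}
    out = []
    for prop, template in requests.items():
        if written.get(prop) != internal.get(prop):
            out.append(template.format(internal[prop]))
            written[prop] = internal[prop]
    return out


internal = {"font": "R", "size": 10, "indent": 0.5}
written = dict(internal)

# Formatting changes only touch the internal state; nothing is emitted yet.
internal["font"] = "B"
internal["font"] = "I"      # the earlier change to B is never written
internal["size"] = 12

lines = sync_state(internal, written)   # called just before content text
print(lines)
```
.DE
.LP
Only two requests come out, one for the font and one for the point size,
even though three changes were made; a second call immediately
afterwards produces nothing.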
.LP
Control language to set up the initial state is flushed before anything
else comes out.
It's flushed when any of the following are about to be written:
(i) any content text for the main body of the document;
(ii) anything at all for headers or footers;
(iii) the beginning of a table.
The initial state is written using absolute values.
State changes are generally written using relative changes to the
current state values.
Use of relative values allows manual changes to be made to the initial
part of the output and have the rest of the document be affected.
For instance, you can change the initial line indent, and the rest
of the document will follow the change.
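.LP
A small Python sketch of the absolute-versus-relative distinction,
using the line indent as the example.
The function names are hypothetical and the real program handles many
more properties than this.
.DS
```python
def initial_request(indent_inches):
    """Initial state: written as an absolute value."""
    return ".in {:g}i".format(indent_inches)


def change_request(old_inches, new_inches):
    """Later change: written relative to the current value."""
    delta = new_inches - old_inches
    return ".in {:+g}i".format(delta)


output = [
    initial_request(0.5),
    change_request(0.5, 1.0),
    change_request(1.0, 0.5),
]
print(output)
```
.DE
.LP
The three requests are ``.in 0.5i'', ``.in +0.5i'' and ``.in -0.5i''.
Editing only the first one by hand shifts every later indent by the
same amount, which is the point of writing the changes relatively.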
.LP
RTF files may contain groups.
Normally, a group inherits the state of the group containing it, and
changes made within the group are discarded when the group ends.
To mimic this, a stack of internal states is maintained by
.I rtf2troff .
When a group begins, a new internal state is pushed on the stack,
with the same values as the previous state.
This action does not cause any change to the state values, but the
occurrence of document, section,
paragraph and character formatting control symbols does.
When a group ends, the current state is popped off the stack, and the
previous state becomes the current state.
This may well change the current state values, if changes were made
within the group; when the next content text is written, control
language to undo those changes is generated.
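.LP
The stack discipline might look something like this in Python.
The class
.I StateStack ,
its method names and the two properties shown are invented for
illustration; the actual C data structures hold far more.
.DS
```python
import copy


class StateStack:
    """Stack of internal states; index 0 holds the untouched defaults."""

    def __init__(self, defaults):
        self.stack = [copy.deepcopy(defaults)]

    @property
    def current(self):
        return self.stack[-1]

    def begin_group(self):
        # "{" : the new state starts as a copy of the enclosing one.
        self.stack.append(copy.deepcopy(self.current))

    def end_group(self):
        # "}" : changes made inside the group are discarded.
        if len(self.stack) == 1:
            raise ValueError("group end without matching group begin")
        self.stack.pop()


states = StateStack({"bold": False, "size": 12})
states.begin_group()            # outermost "{" of the document
states.begin_group()            # a nested group
states.current["bold"] = True
inside = states.current["bold"]
states.end_group()
outside = states.current["bold"]
print(inside, outside)
```
.DE
.LP
Boldface is on inside the nested group and off again once the group
ends.
No output is produced by the push or the pop themselves; the difference
is only noticed when the next content text is about to be written.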
.LP
Internal state 0 is special.
It contains all the RTF default values and is the base state in which the
writer starts.
Moreover, it is never
changed because the first token that should be found in an RTF document
is ``{'', which
causes a new state (state 1) to be pushed on the stack immediately.
Thus the contents of state 0 can be used to restore section, paragraph
and character formatting defaults.
.LP
\fBSection Defaults\*-\fRThe section properties of state 0
are set to the RTF defaults
and are used to restore the section state when ``\esectd'' occurs.
.LP
\fBParagraph Defaults\*-\fRThe paragraph properties of state 0
are set to the RTF defaults
and are used to restore the paragraph state when ``\epard'' occurs.
The ``Normal'' style is then applied,
since the real defaults include not only the static initial values,
but also the formatting produced by that style.
.LP
\fBCharacter Defaults\*-\fRThe character properties of state 0
are set to the RTF defaults
and are used to restore the character state when ``\eplain'' occurs.
.LP
Some groups, such as headers and footers, do
.I not
inherit the
formatting properties of the enclosing group, presumably because the
output that results from those groups is not contiguous with that of
the preceding or following groups\*-they generate output that appears
possibly far away.
To force non-inheritance of the enclosing group's formatting
properties,
the effects of the ``\epard'' and ``\eplain'' tokens are
applied at the beginning of this kind of group.
The specification says that ``\esectd'' should also be applied, but I
don't believe it.
Why should the section break style, title page special value, etc. be
reset just because you're collecting a header?
.LP
A related problem for such groups is that any changes made to the
written state while processing them must
not be allowed to affect the formatting of text following the
group.
In other words, the state that
.I troff
ends up in when the group ends needs to be rewound to the state it was
in when the group began.
To allow changes to the written state to be forgotten properly at the
end of the group, two things must happen.
.I troff
must be told to revert to the pre-group written state, and
.I rtf2troff
must revert its own notion of written state.
To allow
.I troff
to revert its state, such groups are processed using environment
switches within diversions, to collect the group output in a separate
environment and to allow the environment to be restored.
To revert the
.I rtf2troff
written state, a copy of the state is saved before and restored after
processing the group.
.LP
.I rtf2troff
currently needs only one level of diversion, so only a single
state copy is needed.
However, the implementation uses a stack in case a more general mechanism is
needed in the future.
.NH 2
Output Line Length Control
.LP
RTF paragraphs can contain very long strings of text.
To make
.I rtf2troff
output more readable and editable, long paragraphs are broken into multiple
lines.
All output is written for fill mode, so these lines will be joined back
together by
.I troff .
However, for this to work, lines must be broken carefully.
The best ``natural break'' is when there is a single space between
words.
Several implications follow from this observation.
.IP (i)
Must not break to a new line when the next character to be written is
whitespace, or
.I troff
will not join the lines back together (whitespace at the beginning of
an input line forces a break).
.IP (ii)
Must not break in the middle of a word, or there will be extra whitespace
in the middle of it when lines are joined.
.IP (iii)
Must not lose whitespace at end of broken lines (\fItroff\fR tosses
whitespace at ends of lines), so only break when there is a single
space between words.
.IP (iv)
Must not put out an extra newline at the end of the paragraph, if the output
line was just broken after the last character written.
.LP
To complicate matters, it is also desirable that the following be true:
If underlining or strikeout is enabled, the ugly sequences to do them
should be forced onto separate lines for each character, no matter what,
to make the output more editable.
Since this may violate the conditions above (e.g., if only the middle
of a word is underlined), use of ``\ec'' may be necessary.
However, ``\ec'' should only be written when absolutely necessary,
to avoid cluttering up the output.
.LP
Three variables are used to keep track of output written to the current
paragraph.
.I inPara
is true if any characters have been written out to
the current paragraph.
.I oLen
is the number of characters written to the current output line of the
paragraph.
It is zero if the current line is empty.
.I breakOK
is non-zero if it's OK to break a line when the next content character
is written (if the next character isn't a space).
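.LP
The breaking rules (i) to (iv) can be sketched in Python as below.
This is a simplification, not the real algorithm: the length of the
string
.I current
plays the role of
.I oLen ,
the test for a single space stands in for
.I breakOK ,
and the function name and the
.I max_len
threshold are assumptions.
Underlining and strikeout are ignored entirely.
.DS
```python
def break_paragraph(text, max_len=20):
    """Split `text` into output lines, breaking only at single spaces."""
    lines = []
    current = ""
    for i, ch in enumerate(text):
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        single_space = (
            ch == " "
            and prev_ch not in ("", " ")
            and next_ch not in ("", " ")
        )
        if single_space and len(current) >= max_len:
            # The newline stands in for the space: troff, in fill mode,
            # joins the two lines with exactly one space again.
            lines.append(current)
            current = ""
        else:
            current += ch
    if current:
        lines.append(current)
    return lines


text = "alpha beta  gamma delta epsilon zeta"
lines = break_paragraph(text, max_len=10)
print(lines)
```
.DE
.LP
The double space after ``beta'' is never chosen as a break point, so no
whitespace is lost and no output line begins with a space.
Joining the resulting lines with single spaces gives back the original
text.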
.LP
When a character is about to be written for a paragraph, if
.I inPara
is zero, it is assumed that a paragraph is just beginning; otherwise it is
assumed that part but not all of a paragraph has been written out.
This assumption is used when writing both content and formatting text.
If a content character is about to be written and
.I inPara
is zero, beginning-of-paragraph processing takes place:
space before paragraphs is written out; if the paragraph
has a top border, a line is drawn; control language is generated
to set the temporary indent (the RTF ``first line indent'').
.I inPara
is also checked when writing formatting text that must be on a line by itself:
if
.I inPara
is not zero, content text has been written which
must be ``flushed'' by writing a newline.
.NH 2
Underlining/Strikethrough
.LP
.I rtf2troff
can do continuous or word underlining.
If a document conversion is simply for printing purposes,
it makes sense to leave underlining conversion on.
If you want to edit the converted file, you may want to turn underlining
conversion off (with the
.B \-u
option),
because it may be difficult to edit the portions of text
that are underlined.
Another reason to turn underlining off is that some printers seem to
take a long time to print documents containing a lot of underlining.
.LP
These remarks apply to strikeout text as well.
.NH 2
Page Orientation
.LP
Some versions of Word (e.g., WfM) don't seem to write out the
``\elandscape'' control word properly, so when
.I rtf2troff
writes out the initial state-setting control language, it checks to see
whether the page height is less than the width.
If so, it prints a message and assumes landscape is on.
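.LP
In Python the check amounts to something like the following.
The function name and the wording of the message are made up;
the dimensions are in twips, as in RTF.
.DS
```python
def is_landscape(paper_width, paper_height, landscape_flag=False):
    """Decide the orientation from the flag or, failing that, the page shape."""
    if landscape_flag:
        return True
    if paper_height < paper_width:
        print("page height is less than width; assuming landscape")
        return True
    return False


wide = is_landscape(15840, 12240)
tall = is_landscape(12240, 15840)
print(wide, tall)
```
.DE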
.NH 2
Tab Handling
.LP
By default, tab stops are set every 0.5 inch, left justified, and motion
only (no leader character).
There are
.I maxTab
tabs (\fImaxTab\fR \(eq 20, which might be enough).
When the state is pushed, the new state inherits the previous state's tab
settings, but if any tabs are set explicitly in the new state, they override
inherited tabs.
A flag is used to know when the first tab is being set in the new state.
When that happens, a new tab set is started instead of
adding the tab to the end of the existing set.
.LP
Tabs may have position and justification.
For a given tab, the position is the
.I last
attribute specified.
Thus, when a justification indicator is encountered in the input
stream, it's entered into the
.I next
tab slot, but the tab count is not incremented.
When the position is specified,
the count is incremented and any justification previously specified
automatically becomes part of the new tab stop.
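.LP
A Python sketch of this bookkeeping follows.
The class
.I TabSet
and its method names are hypothetical, and resetting the pending
justification to left after each tab stop is an assumption rather than
something taken from the source.
.DS
```python
MAX_TAB = 20    # maxTab in the text


class TabSet:
    def __init__(self, inherited=None):
        # Each tab is a (position, justification) pair.
        self.tabs = list(inherited or [])
        self.first_in_state = True      # no tab set explicitly yet
        self.pending_just = "left"      # justification for the next slot

    def set_justification(self, just):
        # Stored for the next tab; the tab count does not change.
        self.pending_just = just

    def set_position(self, position):
        if self.first_in_state:
            # The first explicit tab replaces the inherited set.
            self.tabs = []
            self.first_in_state = False
        if len(self.tabs) < MAX_TAB:
            self.tabs.append((position, self.pending_just))
        self.pending_just = "left"      # back to the default


parent = TabSet()
parent.set_position(720)
child = TabSet(parent.tabs)             # state pushed: tabs inherited
inherited = list(child.tabs)
child.set_justification("right")
child.set_position(1440)
child.set_position(2880)
print(inherited, child.tabs)
```
.DE
.LP
The child state starts out with the parent's single tab stop, but as
soon as it sets a tab of its own the inherited stop is dropped.
The ``right'' justification attaches to the stop at 1440 only, because
the position arrives after it.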
.LP
In RTF, tabs may also be associated with a leader character, but
.I rtf2troff
maintains only a single tab leader character.
Thus, if multiple tab leader characters are specified for a paragraph,
only the last one is used, and it applies to all the tabs.
This is certainly unsatisfactory, but that's the way it is.
.LP
Bar tabs are ignored.
Decimal tabs are treated as right-justified tabs.
.NH 2
Tables
.LP
.I rtf2troff
includes some table support, but tables are hard to do well, so you'll
easily find examples which confuse it and convert poorly.
At least
it was hard for
.I me
to figure out how to do them well, but then, I don't understand
.I tbl .
Does anyone?
.LP
Tables are written out under the assumption that you have the
.I tbl
program available.
Each row of a table is written out as its own .TS/.TE pair.
Cell contents are written out using
.I tbl 's
``T{''...``T}'' construct.
.LP
One problem with writing tables using the
T{/T} mechanism is that
.I tbl
tries to keep font, point size and vertical spacing changes within one
cell from affecting the next.
Since RTF tables may well expect changes to these three parameters to
carry over to following cells,
.I tbl 's
invisibly resetting these introduces a problem.
The solution to this is straightforward, but results in output that's
even uglier than usual:
force out the current font, point size and vertical spacing at the
beginning of each cell.
It's also necessary to do this after the ``.TE'' since that seems
to mess with point size and vertical spacing.
.LP
One optimization that ought to be possible (says he) to minimize the
amount of output generated is to compare the current values to the
values in force at the beginning of the table and only write them out
if they're different.
Unfortunately, for reasons I don't understand, this doesn't always
work.
Thus brute force prevails.
.LP
Another problem arises in connection with column widths.
.I tbl
assumes a default column separation of 3 ens, in the current point
size.
.I rtf2troff
knows the width that a cell should be and can write the table
specification as such.
Unfortunately, the column separation is added to that width, it's not
a part of it, so the table ends up wider than it should be.
A separation of zero can be used, which will make the table the right
width, but the text is too jammed together.
If there are borders, the text hits the borders.
.LP
Specifying a 1-en separation spaces the text away from the cell
borders, but then the table is again too wide.
This could be handled if there were some easy way of subtracting 1 en
from the column width, but there isn't.
Cell widths are absolute values, whereas ens depend on the current
point size.
You can't specify a column width in the table heading, e.g., as
``l1w(1.5i-1n)'', either.
It
.I almost
always works, but some tables botch terribly.
The ``solution'' adopted in
.I rtf2troff
is to try to guess how big an en is in the current point size and
subtract it from the column width before writing the table header.
This works pretty well if you can get accurate width data for your
version of
.I troff .
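.LP
The arithmetic can be sketched in Python like this.
It assumes that cell widths arrive in twips (1440 to the inch) and that
an en is half the current point size; the latter is exactly the sort of
guess described above and may well be off for a particular
.I troff .
The function name is hypothetical.
.DS
```python
TWIPS_PER_INCH = 1440.0
POINTS_PER_INCH = 72.0


def column_width_inches(cell_twips, point_size, sep_ens=1):
    """Cell width in inches with `sep_ens` ens of separation taken out.

    Assumption: one en is half the point size, which is only a guess at
    what a given troff really uses.
    """
    en_inches = (point_size / 2.0) / POINTS_PER_INCH
    return cell_twips / TWIPS_PER_INCH - sep_ens * en_inches


width = column_width_inches(2160, 12)
spec = "l{}w({:.3f}i)".format(1, width)
print(spec)
```
.DE
.LP
For a 1.5 inch cell in 12 point type this gives a column specification
of ``l1w(1.417i)'': the 1-en separation plus the reduced width add up to
the width the cell was supposed to have.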
.LP
Cell borders are handled to a small extent.
One problem is that each row of an RTF table is written as a separate
.I tbl
table, and the .TS/.TE macros can result in a little space before and after
the table if you use a macro package such as
.B \-me .
This means that ugly gaps between rows may result.
.LP
Tabs within cells are ignored, i.e., mishandled.
.LP
Merged cells are botched.
.LP
In general,
the table-writing code tends to do fairly well with simple tables and
less well with more complex ones.
The problem with the output generated for tables is that it is very
``busy'' and if it's incorrect, it's not always evident how to correct
it, should you wish to do so manually.
.LP
Some newer Xerox 4045 printers have problems with complex tables generated
by
.I xroff .
This is not specific to output generated by
.I rtf2troff ,
so if tables appear to be botched, it might not be
.I rtf2troff 's
fault.
.NH 2
Special Character Translation
.LP
Special characters (those in the range 128..255)
do not have a direct ASCII representation
and are usually written for
.I troff
as a special escape sequence.
For instance, the plus-or-minus symbol ``\(+-'' can be written out
as ``\e(+-''.
.I rtf2troff
uses lookup tables to map special character values onto these escape
sequences.
.LP
The association between character values and escape sequences is subject
to quite a bit of variation, due to three factors:
.IP \(bu
There are four possible RTF character sets, and they can use different values
to represent a special character.
For instance, the divide symbol ``\(di'' is character value 247 in the ANSI
character set and 214 on the Macintosh.
Thus, different lookup tables are needed for different character sets.
Ugh.
.IP \(bu
Different versions of
.I troff
vary in the set of escape sequences they support.
Thus, different lookup tables may be needed for whatever version(s) of
.I troff
are locally available.
For instance,
.I eroff
may understand different sequences than
.I xroff .
Ugh ugh.
.IP \(bu
Some macro packages like
.I \-me
and
.I \-ms
provide their own escape sequences for special characters.
If
.I rtf2troff
is told to write macro-package-specific output, special character mapping
should be tailored to that package.
Currently,
.I rtf2troff
doesn't know much about macro packages, but it may someday, so this factor
needs to be kept in mind.
Ugh ugh ugh.
.LP
Due to these factors, a scheme is used
to allow character mapping to be done, and to allow site-specific
modifications to be made to accommodate local conventions.
This scheme is complicated and incomprehensible, but it does have the
virtue that it works.
.LP
For each
.I troff
variant supported, there is a set of character mappings, one mapping
for each of the four RTF character sets.
These are the ``default'' mappings.
There are also three other sets of four mappings, one set for each of the
.I \-me ,
.I \-mm ,
and
.I \-ms
macro packages.
The set for each package has a mapping for each character set.
These are ``override'' mappings, and should have entries only for those
special characters for which the macro package provides special escape
sequences.
When a special character is mapped, the override table is consulted first.
If an entry for the character is found, it's used.
Otherwise the default mapping is used.
If nothing is found there, either, ``<<UNKNOWN>>'' is returned, which
should be sufficiently ugly to call your attention to it so you know
some editing needs to be done.
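.LP
The lookup order can be sketched in Python as follows.
The two default entries use the character values and escape sequences
mentioned earlier; the override entry is invented purely to show the
precedence, and the function name is hypothetical.
The backslash is built with
.I chr(92)
only so that this document does not have to quote it.
.DS
```python
ESC = chr(92)   # the backslash that starts a troff escape sequence

# Default map for one formatter and one character set (ANSI values).
DEFAULT_ANSI = {177: ESC + "(+-", 247: ESC + "(di"}

# Hypothetical override map for a macro package; the entry is made up.
OVERRIDE_ANSI = {177: ESC + "*(PM"}


def map_special(code, override, default):
    """Look in the override map first, then the default map."""
    if code in override:
        return override[code]
    if code in default:
        return default[code]
    return "<<UNKNOWN>>"


results = [map_special(c, OVERRIDE_ANSI, DEFAULT_ANSI) for c in (177, 247, 200)]
print(results)
```
.DE
.LP
Character 177 is found in the override map, 247 falls through to the
default map, and 200 is in neither and comes back as ``<<UNKNOWN>>''.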
.LP
The maps are selected from the command line with the
.B \-t
and
.B \-me|\-mm|\-ms
options.
For instance, if your site supports
.I xroff
and
.I pstroff ,
you can select tables for one or the other with
.LP
.DS
rtf2troff -t xroff
rtf2troff -t pstroff
.DE
.LP
It might be more convenient simply to create shell scripts, e.g.,
.I rtf2xroff ,
which would contain
.LP
.DS
#!/bin/sh
exec rtf2troff -t xroff "$@"
.DE
.LP
Internally, character mappings are selected using three functions.
They must be called in the order described.
If a
.I troff
name is given (with the
.B \-t
argument),
.I SelectFormatterMaps()
is called.
This selects the proper entry from the mapping table and selects
the set of default maps.
Then, if a macro package argument is given, it should be passed
to
.I SelectMacPackMaps()
so that the proper override tables are used.
During RTF file parsing, when the character set symbol is found,
pass its minor number to
.I SelectCharSetMaps()
to select the charset-specific maps from the default and override maps.
You should be confused enough by the preceding that you will not be
surprised when I adjure you to consult the source code for more information.
.LP
To modify
.I rtf2troff
for local variations, you must
make sure the mapping table in
.I trf-charmap.c
has an entry for each of your
.I troff
versions.
You may have to construct new mappings yourself.
.LP
If you do make such modifications, please let me know about them.
.NH 2
Error Messages
.LP
Tokens not recognized by the reader are echoed.
.LP
``Uh-oh!  RTF group nesting exceeded maximum level,'':
the maximum stack depth needs to be increased.
.LP
``unbalanced brace level'':
the RTF file is malformed; some ``{'' is not matched by a ``}''.
.LP
``unrestored environment'':
indicates a bug in
.I rtf2troff .
.LP
``unrestored indirection'':
indicates a bug in
.I rtf2troff .
.LP
If you get a message ``Trap Loop Death detected'', it means the header
and footer overlap.
This is legal in RTF, but
.I troff
doesn't like it and can end up in an infinite loop because the header
triggers the footer trap, which triggers the header trap, which...
.LP
Most other messages come from inside the reader, e.g., if the stylesheet
or font table readers get confused.
.NH 2
Registers and Macros
.LP
A number of register and macro names are used by
.I rtf2troff .
If these conflict with those used by any preprocessors you might use,
you can change them by editing
.I rtf.h .