.\"
.\" troff -ms % | lpr
.\"
.\" revision date - change whenever this file is edited
.ds RD 5 April 1991
.nr PO 1.2i \" page offset 1.2 inches
.nr PD .7v \" inter-paragraph distance
.\"
.EH 'RTF-to-troff Translation'- % -''
.OH ''- % -'RTF-to-troff Translation'
.OF 'Revision date:\0\0\*(RD''Printed:\0\0\n(dy \*(MO 19\n(yr'
.EF 'Revision date:\0\0\*(RD''Printed:\0\0\n(dy \*(MO 19\n(yr'
.\"
.\" subscript strings
.ds < \s-2\v'.4m'
.ds > \v'-.4m'\s+2
.\"
.\" I - italic font (taken from -ms and changed)
.de I
.nr PQ \\n(.f
.if t \&\\$3\\f2\\$1\\fP\&\\$2
.if n .if \\n(.$=1 \&\\$1
.if n .if \\n(.$>1 \&\\$1\c
.if n .if \\n(.$>1 \&\\$2
..
.TL
rtf2troff
.sp .5v
An RTF to troff Translator
.AU
Paul DuBois
dubois@primate.wisc.edu
.AI
Wisconsin Regional Primate Research Center
Revision date:\0\0\*(RD
.NH
Introduction
.LP
.I rtf2troff
is a document translator that takes an input file in RTF format and
writes output suitable for processing by
.I troff .
It has a number of features, including:
.IP \(bu
Production of output that, when run through
.I troff ,
on rare occasions possesses a mild resemblance to the original document.
.IP \(bu
Voluminous, inefficient and largely incomprehensible source code
(available for free and overpriced at that).
.IP \(bu
Often-incomprehensible output, especially for tables.
.IP \(bu
Complete lack of support for formulas.
.IP \(bu
Support for underlining and strikethrough that generates prize-winning
amounts of output.
Besides its incredible bulk, this has the additional security feature
of being impossible to make sense of for editing.
The reckless user is, however, given the option of disabling this
valuable form of protection.
.IP \(bu
Inability to write
.I nroff -specific
output, or output specific to the
.I \-me ,
.I \-mm
or
.I \-ms
macro packages.
.IP \(bu
Blissful ignorance of the fact that there are any fonts other than the
default, except for purposes of boldface and italics.
.IP \(bu
Ability to completely lose any unrecognized input, or, for variety,
core dump.
.IP \(bu
Merges text of footnotes right into the main body of the document.
You didn't really want 'em at the bottom of the page anyway, did you?
.LP
Perhaps over time this list will shrink and this section can be removed.
But don't hold your breath.
.LP
The intended audience of this document is not those who might use
.I rtf2troff
for daily work, but programmers who want to know why it's written the
way it is.
As such, it discusses aspects of the implementation.
The source code should be consulted for further reference.
Actually, I suspect you're reading this because you've already looked
at the source and couldn't make any sense of it!
.NH
Random Implementation Notes
.LP
In the output produced by
.I rtf2troff ,
a distinction is made between
.I content
(or document) text and
.I formatting
(or control) text.
Content text consists of the characters that are actually supposed to
appear in the finished document.
Formatting text affects how those characters appear.
Formatting text may be inline with content text (e.g., ``\efR'', ``\es+3'')
or on a line by itself (e.g., ``.ft R'', ``.ps +3'').
.NH 2
State Maintenance Issues
.LP
It is possible to write out control language whenever any changes are
made to document, section, paragraph or character formatting
properties, but that would result in more output than is necessary.
Instead,
.I rtf2troff
maintains notions of two kinds of state: an
.I internal
state, which is the current state of formatting properties as
indicated by control words encountered in the RTF input stream, and
.I written
state, which tracks the state corresponding to the
.I troff
control language that has been written to the output (i.e., the state
that
.I troff
will be in).
.LP
Changes to formatting properties are simply accumulated in the internal
state without writing any output.
When content text is to be written out, a check is made for any
discrepancy between the accumulated changes in the internal state and
the written state.
If there are any differences, control language is generated to bring
the written state into sync with the internal state, before writing
the content text.
This guarantees that the correct formatting properties will apply to
the text, and minimizes the amount of control language generated.
.LP
Control language to set up the initial state is flushed before anything
else comes out.
It's flushed when any of the following are about to be written:
(i) any content text for the main body of the document;
(ii) anything at all for headers or footers;
(iii) the beginning of a table.
The initial state is written using absolute values.
State changes are generally written using relative changes to the
current state values.
Use of relative values allows manual changes to be made to the initial
part of the output and have the rest of the document be affected.
For instance, you can change the initial line indent, and the rest of
the document will follow the change.
.LP
RTF files may contain groups.
Normally, a group inherits the state of the group containing it, and
changes made within the group are discarded when the group ends.
To mimic this, a stack of internal states is maintained by
.I rtf2troff .
When a group begins, a new internal state is pushed on the stack, with
the same values as the previous state.
This action does not cause any change to the state values, but the
occurrence of document, section, paragraph and character formatting
control symbols does.
When a group ends, the current state is popped off the stack, and the
previous state becomes the current state.
This may well change the current state values, if changes were made
within the group; when the next content text is written, control
language to undo those changes is generated.
.LP
Internal state 0 is special.
It contains all the RTF default values and is the base state in which
the writer starts.
Moreover, it is never changed because the first token that should be
found in an RTF document is ``{'', which causes a new state (state 1)
to be pushed on the stack immediately.
Thus the contents of state 0 can be used to restore section, paragraph
and character formatting defaults.
.LP
\fBSection Defaults\*-\fRThe section properties of state 0 are set to
the RTF defaults and are used to restore the section state when
``\esectd'' occurs.
.LP
\fBParagraph Defaults\*-\fRThe paragraph properties of state 0 are set
to the RTF defaults and are used to restore the paragraph state when
``\epard'' occurs.
The ``Normal'' style is then applied, since the real defaults include
not only the static initial values, but also the formatting produced
by that style.
.LP
\fBCharacter Defaults\*-\fRThe character properties of state 0 are set
to the RTF defaults and are used to restore the character state when
``\eplain'' occurs.
.LP
Some groups, such as headers and footers, do
.I not
inherit the formatting properties of the enclosing group, presumably
because the output that results from those groups is not contiguous
with that of the preceding or following groups\*-they generate output
that appears possibly far away.
To force non-inheritance of the enclosing group's formatting
properties, the effects of the ``\epard'' and ``\eplain'' tokens are
applied at the beginning of this kind of group.
The specification says that ``\esectd'' should also be applied, but I
don't believe it.
Why should the section break style, title page special value, etc. be
reset just because you're collecting a header?
.LP
A related problem for such groups is that any changes made to the
written state while processing them must not be allowed to affect the
formatting of text following the group.
In other words, the state that
.I troff
ends up in when the group ends needs to be rewound to the state it was
in when the group began.
To allow changes to the written state to be forgotten properly at the
end of the group, two things must happen.
.I troff
must be told to revert to the pre-group written state, and
.I rtf2troff
must revert its own notion of written state.
To allow
.I troff
to revert its state, such groups are processed using environment
switches within diversions, to collect the group output in a separate
environment and to allow the environment to be restored.
To revert the
.I rtf2troff
written state, a copy of the state is saved before and restored after
processing the group.
.LP
.I rtf2troff
currently needs only one level of diversion, so only a single state
copy is needed.
However, the implementation uses a stack in case a more general
mechanism is needed in the future.
.NH 2
Output Line Length Control
.LP
RTF paragraphs can contain very long strings of text.
To make
.I rtf2troff
output more readable and editable, long paragraphs are broken into
multiple lines.
All output is written for fill mode, so these lines will be joined
back together by
.I troff .
However, for this to work, lines must be broken carefully.
The best ``natural break'' is when there is a single space between words.
Several implications follow from this observation.
.IP (i)
Must not break to a new line when the next character to be written is
whitespace, or
.I troff
will not join the lines back together (whitespace at the beginning of
an input line forces a break).
.IP (ii)
Must not break in the middle of a word, or there will be extra
whitespace in the middle of it when lines are joined.
.IP (iii)
Must not lose whitespace at end of broken lines (\fItroff\fR tosses
whitespace at ends of lines), so only break when there is a single
space between words.
.IP (iv)
Must not put out an extra newline at the end of the paragraph, if the
output line was just broken after the last character written.
.LP
To complicate matters, it is also desirable that the following be true:
If underlining or strikeout is enabled, the ugly sequences to do them
should be forced onto separate lines for each character, no matter
what, to make the output more editable.
Since this may violate the conditions above (e.g., if only the middle
of a word is underlined), use of ``\ec'' may be necessary.
However, ``\ec'' should only be written when absolutely necessary, to
avoid cluttering up the output.
.LP
Three variables are used to keep track of output written to the
current paragraph.
.I inPara
is true if any characters have been written out to the current paragraph.
.I oLen
is the number of characters written to the current output line of the
paragraph.
It is zero if the current line is empty.
.I breakOK
is non-zero if it's OK to break a line when the next content character
is written (if the next character isn't a space).
.LP
When a character is about to be written for a paragraph, if
.I inPara
is zero, it is assumed that a paragraph is just beginning, otherwise
that part but not all of a paragraph has been written out.
This assumption is used when writing both content and formatting text.
If a content character is about to be written and
.I inPara
is zero, beginning-of-paragraph processing takes place:
space before paragraphs is written out;
if the paragraph has a top border, a line is drawn;
control language is generated to set the temporary indent
(the RTF ``first line indent'').
.I inPara
is also checked when writing formatting text that must be on a line by
itself: if
.I inPara
is not zero, content text has been written which must be ``flushed''
by writing a newline.
.NH 2
Underlining/Strikethrough
.LP
.I rtf2troff
can do continuous or word underlining.
If a document conversion is simply for printing purposes, it makes
sense to leave underlining conversion on.
If you want to edit the converted file, you may want to turn
underlining conversion off (with the
.B \-u
option), because it may be difficult to edit the portions of text that
are underlined.
Another reason to turn underlining off is that some printers seem to
take a long time to print documents containing a lot of underlining.
.LP
These remarks apply to strikeout text as well.
.NH 2
Page Orientation
.LP
Some versions of Word (e.g., WfM) don't seem to write out the
``\elandscape'' control word properly, so when
.I rtf2troff
writes out the initial state-setting control language, it checks to
see whether the page height is less than the width.
If so, it prints a message and assumes landscape is on.
.NH 2
Tab Handling
.LP
The default tab stops are every .5 inch, left justified, motion only
(no leader character).
There are
.I maxTab
tabs (\fImaxTab\fR \(eq 20, which might be enough).
When the state is pushed, the new state inherits the previous state's
tab settings, but if any tabs are set explicitly in the new state,
they override inherited tabs.
A flag is used to know when the first tab is being set in the new state.
When that happens, a new tab set is started instead of adding the tab
to the end of the existing set.
.LP
Tabs may have position and justification.
For a given tab, the position is the
.I last
attribute specified.
Thus, when a justification indicator is encountered in the input
stream, it's entered into the
.I next
tab slot, but the tab count is not incremented.
When the position is specified, the count is incremented and any
justification previously specified automatically becomes part of the
new tab stop.
.LP
In RTF, tabs may also be associated with a leader character, but
.I rtf2troff
only maintains a single tab character.
Thus, if multiple tab leader characters are specified for a paragraph,
only the last one is used, and it applies to all the tabs.
This is certainly unsatisfactory, but that's the way it is.
.LP
Bar tabs are ignored.
Decimal tabs are treated as right-justified tabs.
.NH 2
Tables
.LP
.I rtf2troff
includes some table support, but tables are hard to do well, so you'll
easily find examples which confuse it and convert poorly.
At least it was hard for
.I me
to figure out how to do them well, but then, I don't understand
.I tbl .
Does anyone?
.LP
Tables are written out under the assumption that you have the
.I tbl
program available.
Each row of a table is written out between .TS/.TE pairs.
Cell contents are written out using
.I tbl 's
``T{''...``T}'' construct.
.LP
One problem with writing tables using the T{/T} mechanism is that
.I tbl
tries to keep font, point size and vertical spacing changes within one
cell from affecting the next.
Since RTF tables may well expect changes to these three parameters to
carry over to following cells,
.I tbl 's
invisibly resetting these introduces a problem.
The solution to this is straightforward, but results in output that's
even uglier than usual: force out the current font, point size and
vertical spacing at the beginning of each cell.
It's also necessary to do this after the ``.TE'' since that seems to
mess with point size and vertical spacing.
.LP
One optimization that ought to be possible (says he) to minimize the
amount of output generated is to compare the current values to the
values in force at the beginning of the table and only write them out
if they're different.
Unfortunately, for reasons I don't understand, this doesn't always work.
Thus brute force prevails.
.LP
Another problem arises in connection with column widths.
.I tbl
assumes a default column separation of 3 ens, in the current point size.
.I rtf2troff
knows the width that a cell should be and can write the table
specification as such.
Unfortunately, the column separation is added to that width; it's not
a part of it, so the table ends up wider than it should be.
A separation of zero can be used, which will make the table the right
width, but the text is too jammed together.
If there are borders, the text hits the borders.
.LP
Specifying a 1-en separation spaces the text away from the cell
borders, but then the table is again too wide.
This could be handled if there were some easy way of subtracting 1 en
from the column width, but there isn't.
Cell widths are absolute values, whereas ens depend on the current
point size.
You can't specify a column width in the table heading, e.g., as
``l1w(1.5i-1n)'', either.
It
.I almost
always works, but some tables botch terribly.
The ``solution'' adopted in
.I rtf2troff
is to try to guess how big an en is in the current point size and
subtract it from the column width before writing the table header.
This works pretty well if you can get accurate width data for your
version of
.I troff .
.LP
Cell borders are handled to a small extent.
One problem is that each row of an RTF table is written as a separate
.I tbl
table, and the .TS/.TE macros can result in a little space before and
after the table if you use a macro package such as
.B \-me .
This means that ugly gaps between rows may result.
.LP
Tabs within cells are ignored, i.e., mishandled.
.LP
Merged cells are botched.
.LP
In general, the table-writing code tends to do fairly well with simple
tables and less well with more complex ones.
The problem with the output generated for tables is that it is very
``busy'' and if it's incorrect, it's not always evident how to correct
it, should you wish to do so manually.
.LP
Some newer Xerox 4045 printers have problems with complex tables
generated by
.I xroff .
This is not specific to output generated by
.I rtf2troff ,
so if tables appear to be botched, it might not be
.I rtf2troff 's
fault.
.NH 2
Special Character Translation
.LP
Special characters (those in the range 128..255) do not have a direct
ASCII representation and are usually written for
.I troff
as a special escape sequence.
For instance, the plus-or-minus symbol ``\(+-'' can be written out as
``\e(+-''.
.I rtf2troff
uses lookup tables to map special character values onto these escape
sequences.
.LP
The association between character values and escape sequences is
subject to quite a bit of variation, due to three factors:
.IP \(bu
There are four possible RTF character sets, and they can use different
values to represent a special character.
For instance, the divide symbol ``\(di'' is character value 247 in the
ANSI character set and 214 on the Macintosh.
Thus, different lookup tables are needed for different character sets.
Ugh.
.IP \(bu
Different versions of
.I troff
vary in the set of escape sequences they support.
Thus, different lookup tables may be needed for whatever version(s) of
.I troff
are locally available.
For instance,
.I eroff
may understand different sequences than
.I xroff .
Ugh ugh.
.IP \(bu
Some macro packages like
.I \-me
and
.I \-ms
provide their own escape sequences for special characters.
If
.I rtf2troff
is told to write macro-package-specific output, special character
mapping should be tailored to that package.
Currently,
.I rtf2troff
doesn't know much about macro packages, but it may someday, so this
factor needs to be kept in mind.
Ugh ugh ugh.
.LP
Due to these factors, a scheme is used to allow character mapping to
be done, and to allow site-specific modifications to be made to
accommodate local conventions.
This scheme is complicated and incomprehensible, but it does have the
virtue that it works.
.LP
For each
.I troff
variant supported, there is a set of character mappings, one mapping
for each of the four RTF character sets.
These are the ``default'' mappings.
There are also three other sets of four mappings, one set for each of the
.I \-me ,
.I \-mm ,
and
.I \-ms
macro packages.
The set for each package has a mapping for each character set.
These are ``override'' mappings, and should have entries only for those
special characters for which the macro package provides special escape
sequences.
When a special character is mapped, the override table is consulted first.
If an entry for the character is found, it's used.
Otherwise the default mapping is used.
If nothing is found there, either, ``<>'' is returned, which should be
sufficiently ugly to call your attention to it so you know some editing
needs to be done.
.LP
The maps are selected from the command line with the
.B \-t
and
.B \-me|\-mm|\-ms
options.
For instance, if your site supports
.I xroff
and
.I pstroff ,
you can select tables for one or the other with
.LP
.DS
rtf2troff -t xroff
rtf2troff -t pstroff
.DE
.LP
It might be more convenient simply to create shell scripts, e.g.,
.I rtf2xroff ,
which would contain
.LP
.DS
#!/bin/sh
exec rtf2troff -t xroff "$@"
.DE
.LP
Internally, character mappings are selected using three functions.
They must be called in the order described.
If a
.I troff
name is given (with the
.B \-t
argument),
.I SelectFormatterMaps()
is called.
This selects the proper entry from the mapping table and selects the
set of default maps.
Then, if a macro package argument is given, it should be passed to
.I SelectMacPackMaps()
so that the proper override tables are used.
During RTF file parsing, when the character set symbol is found, pass
its minor number to
.I SelectCharSetMaps()
to select the charset-specific maps from the default and override maps.
You should be confused enough by the preceding that you will not be
surprised when I adjure you to consult the source code for more
information.
.LP
To modify
.I rtf2troff
for local variations, you must make sure the mapping table in
.I trf-charmap.c
has an entry for each of your
.I troff
versions.
You may have to construct new mappings yourself.
.LP
If you do make such modifications, please let me know about them.
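.LP
The lookup order described earlier in this section (override table
first, then the default table, then ``<>'') can be sketched in a few
lines.
The Python fragment below is only an illustration under stated
assumptions, not the code in
.I trf-charmap.c :
the function name, the symbolic keys and the sample override entry are
all invented here, and only the plus-or-minus and divide sequences are
taken from the text.
.DS
```python
# Hypothetical sketch of the override-then-default lookup.
# Table contents are illustrative assumptions, not the real maps.
UNMAPPED = "<>"  # marker returned when no table has an entry


def map_special(name, override_map, default_map):
    """Return the escape sequence for a special character.

    The macro-package override map is consulted first, then the
    formatter's default map; UNMAPPED is returned if neither has it.
    """
    if name in override_map:
        return override_map[name]
    return default_map.get(name, UNMAPPED)


# Default map for some troff variant (sequences taken from the text).
default_map = {"plusminus": "\\(+-", "divide": "\\(di"}
# Override map for some macro package (entry invented for the example).
override_map = {"divide": "\\*(dv"}

print(map_special("divide", override_map, default_map))     # override entry
print(map_special("plusminus", override_map, default_map))  # default entry
print(map_special("bullet", override_map, default_map))     # in neither map
```
.DE
.LP
Keeping the override map separate means that a macro package only has
to list the characters it treats specially; everything else falls
through to the formatter's default map.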
.NH 2
Error Messages
.LP
Tokens not recognized by the reader are echoed.
.LP
``Uh-oh! RTF group nesting exceeded maximum level,'': the maximum stack
depth needs to be increased.
.LP
``unbalanced brace level'': the RTF file is malformed; some ``{'' is
not matched by a ``}''.
.LP
``unrestored environment'': indicates a bug in
.I rtf2troff .
.LP
``unrestored indirection'': indicates a bug in
.I rtf2troff .
.LP
If you get a message ``Trap Loop Death detected'', it means the header
and footer overlap.
This is legal in RTF, but
.I troff
doesn't like it and can end up in an infinite loop because the header
triggers the footer trap, which triggers the header trap, which...
.LP
Most other messages come from inside the reader, e.g., if the
stylesheet or font table readers get confused.
.NH 2
Registers and Macros
.LP
A number of register and macro names are used by
.I rtf2troff .
If these conflict with those used by any preprocessors you might use,
you can change them by editing
.I rtf.h .