@c *************************************************************************
@c CHAPTER: Introduction to VVcode
@c *************************************************************************
@c node-name, next, previous, up
@node Introduction, , Top, Top
@chapter Introduction to VVcode

Reliable and faithful exchange of binary files between computers over networks is a well-known problem, especially if the computers use different operating systems and are connected to different networks via a gateway.  Unfortunately, inter-networking and electronic mail are very much children of the 1960s: they might have had to wait until the 1970s for their naissance, but their progenitors were mentally locked in to the concept of the 7-bit ASCII code for conveying textual information.

The @TeX{} community has long been aware of this problem when trying to exchange ``machine-independent'' @file{.dvi} files and font-related data such as @file{.tfm} and @file{.pk} files.  It has sometimes been possible to exchange this binary data by using encoding schemes that allow the data to be represented using a subset of the seven-bit ASCII character set.

Academics and authors in many fields have hitherto been able to pass @file{.tex} files back and forth by electronic mail---apart from a few minor quirks and blemishes, such @TeX{} source files pass unharmed across the planet's networks.  Problems are encountered when mail passes through certain gateway machines which introduce irreversible character corruptions.  Particularly notorious is the Janet/Bitnet gateway, which has the unfortunate habit of converting @samp{^} to @samp{~} and @samp{~} to @samp{%}: since it leaves @samp{%} itself unaffected, this makes recovery of the original file a non-trivial exercise.  It sometimes also changes the brace characters @samp{@{@}} into odd characters with codes above 128: this is particularly embarrassing, of course, for @file{.tex} files!
For some years many @TeX{} users, particularly those working in languages other than English, and thus familiar with character set encodings containing more than the basic ASCII set, have been agitating for @TeX{} to be able to handle input in their mother tongues, using their own languages' character sets.  In 1989, Knuth announced @TeX{} V3, and implementors world-wide beavered away to bring each implementation up to date.  @TeX{} V3 now supports eight-bit character sets, and so @file{.tex} source files are now effectively `binary' files and will therefore suffer from the same exchange problems experienced with @file{.dvi} files.  All those authors who had previously been able to cooperate, despite being separated by hundreds or thousands of miles, might once again be forced to entrust floppy disks to the vagaries of the world's postal systems (although one shouldn't underestimate the bandwidth of the Royal [or other] Mail system).  Unless and until the various e-mail protocols, networks and software are converted to support uncorrupted transmission of character codes @code{0x20 @dots{} 0x7e} and @code{0xa1 @dots{} 0xfe}, it will have to become the norm for @file{.tex} sources to be encoded for transmission by e-mail.  This problem is of course well known outside the @TeX{} community.

@c =========================================================================
@c SECTION: The Aston Archive
@c =========================================================================
@section The Aston Archive

The author is a volunteer assistant to Peter Abbott in running the world's principal repository of @TeX{}-related material at Aston University in Birmingham.  The archive (host: @code{TeX.Ac.Uk}) holds several hundred megabytes of text and binary files including:

@itemize @bullet
@item
program sources for @TeX{}, @code{METAFONT}, DVI drivers and many other utilities;
@item
binary executables for a variety of popular operating systems (e.g.
Atari, Macintosh, MS-DOS, Unix, VAX/VMS and VM/CMS);
@item
@code{METAFONT} sources for Computer Modern and other fonts;
@item
binary font files (mainly @file{.tfm} and @file{.pk}) for a number of different output devices;
@item
text macro and style files.
@end itemize

The archive provides access to these files via the following services:

@itemize @bullet
@item
NIFTP@footnote{Network Independent File Transfer Protocol---in the UK, one does not perform the pseudo-login that Internet users are accustomed to using with the FTP protocol: instead, one issues a ``transfer request'' for a file to be sent to or from the remote machine---the transfer itself takes place asynchronously.  One nice consequence is that such transfers can be queued for overnight execution, leaving daytime bandwidth free for e-mail and true remote interactive logins.} from Janet hosts---typically 300 megabytes of data are transferred every month; this would probably be much greater if we were not limited by the bandwidth of our 9600Bd connection to Janet.
@item
FTP and Telnet access from Internet hosts.
@item
An interactive browsing service via Janet PAD, including the facility to send files out using NIFTP (and later FTP).
@item
An interactive browsing service via dialup modem lines, including the facility to download files using Kermit and similar protocols.
@item
An e-mail file server which typically sends 150 megabytes of data per month to sites all over the world (though predominantly to EARN/Bitnet sites).
@item
A magnetic media distribution service via surface carriers.  Copies of the entire archive have been sent to embryonic @TeX{} communities in Czechoslovakia, Hungary and Poland.
@end itemize

We have experienced many problems trying to support all of these file types, operating systems and access methods.  The e-mail file server clearly needs a reliable method of encoding files if its many customers are not to be denied access to the non-text files in the archive.
Binary files such as @file{.pk} font files are stored in different ways to accommodate the requirements of the different operating systems supported.  Currently we maintain multiple font directory trees for the Macintosh, MS-DOS, Unix and VAX/VMS, with all the attendant problems of synchronization, disk space and archivists' time.  We need a single storage format which allows export to all of our supported operating systems.

@c =========================================================================
@c SECTION: Specification for a Coding Scheme
@c =========================================================================
@section Specification for a Coding Scheme

In mid-1990, the archivists came to the conclusion that a universal encoding scheme was required to accommodate the many different kinds of file and file organizations that needed to be supported by the archive.  Niel Kempson formulated the first draft of this specification at that time; the requirements of the encoding scheme may be summarized as follows:

@table @strong
@item Preserving File Structure
It is insufficient, especially for an archive holding text and binary files for a variety of machine types, merely to encode the data as a simple stream of bytes:

@itemize @bullet
@item
Virtually all operating systems (except Unix) make a distinction between binary and text files, so the coding scheme should recognize and maintain this distinction.
@item
Unix and most PC-based operating systems treat files as streams of bytes with no further structure imposed.  On the other hand, certain widely-used operating systems (e.g. VAX/VMS and VM/CMS) have record-oriented file systems where different types of file are stored in a format appropriate to the type of file@footnote{It is often argued that the increase in efficiency more than offsets the increase in complexity.}.
For these operating systems, we consider it essential that the encoding scheme should identify, preserve and record the most commonly used file organizations.  The decoding program should be able to use this information to create the output file using the organization appropriate to the operating system in use.  If the information is of no consequence to the receiving system, the default file structure (if any) should be created.  If the sending system does not have structure in its files, the receiving system may provide suitable defaults automatically.  In all cases the programs should permit the user to override or supplement file structure information.
@item
Whenever possible, these details of structure should be determined automatically by the encoding program; at the very least, an indication of whether the file is text or binary shall be provided, even under an operating system such as Unix that need make no such distinction for its own use, to allow decoding to an appropriate file organization on those systems that @emph{do} make such a distinction.
@end itemize

@item Coding Scheme
Whatever method is used must allow encoded data to be e-mailed:

@itemize @bullet
@item
It should be possible to specify the coding table to be used to encode the data.  The coding table used shall be recorded with each part of the encoded data.
@item
If a recorded coding table is found while decoding, it should be used to construct an appropriate decoding table.  Simple one-to-one character corruptions should be corrected as long as only one of the input characters is mapped to any one output character.
@item
The recommended encoding uses only the following characters:

@quotation
@code{+-0123456789}@*
@code{abcdefghijklmnopqrstuvwxyz}@*
@code{ABCDEFGHIJKLMNOPQRSTUVWXYZ}
@end quotation

Such an encoding, as originally used for XXcode, has been shown to pass successfully through all the gateways which are known to corrupt characters.
@end itemize

@item Integrity of Encoded Data
We want to ensure that the @emph{whole} encoded file passes through the e-mail network.

@itemize @bullet
@item
Encoded lines should be prefixed by an appropriate character string to distinguish them from unwanted lines such as mail headers and trailers.  Whilst not essential, this feature does assist the decoding program in ignoring these spurious data.
@item
Lines should not end with whitespace characters, as some mailers and operating systems strip off trailing whitespace.
@item
The encoding program should calculate parameters of the input file, such as the number of bytes and the CRC, and record them at the end of the encoded data.  The decoding program should calculate the same parameters from the decoded data and compare the values obtained with those recorded at the end of the encoded data.
@end itemize

@item Making Files Mailable
A mechanism is needed to overcome some gateways' refusal to handle large files.

@itemize @bullet
@item
The encoding program should be able to split the encoded output into parts, each no larger than a specified maximum size.  Splitting the output into smaller parts is useful if the encoded data is to be transmitted using electronic mail or over unreliable network links that do not stay up long enough to transmit a large file.  The recommended default maximum part size is 30kB.
@item
The decoding program should be able to decode a multi-part encoded file very flexibly.  It should @emph{not} be necessary to:

@enumerate
@item
strip out mail headers and trailers;
@item
combine all of the parts into one file in the correct order;
@item
process each part of the encoded data as a separate file.
@end enumerate

@item
In addition, any file specifications from the operating system on which the VVE file was created must not prevent the file from being decoded.
@end itemize

@item Miscellaneous
Further considerations include:

@itemize @bullet
@item
Support for character sets other than ASCII is essential if the encoding scheme is to be useful to IBM hosts.  The encoding program should label the character set used by the encoded data, and both encoder and decoder should support conversion between the local character set and another character set.  For example, a user on an EBCDIC host should be able to encode text files for transmission to another EBCDIC host, or to convert them to ASCII before encoding and transmission to an ASCII host.  Similarly, that user should be able to decode text files from ASCII and EBCDIC machines, creating EBCDIC output files.
@item
Where possible, the original file's timestamp should be encoded and used by the decoding program when recreating the file: this will permit archives to retain the originator's time of creation for files, and thus permit the users (not to mention the archivists) to identify more clearly when a new version of a file has been made available.  Timezones should be supported where possible.
@item
The encoding and decoding programs should be able to read and write files that are compatible with one or more of the well-established coding schemes (e.g. UUcode, XXcode).
@item
The source code for the programs should be freely available.  It should also be portable and usable with as many computers, operating systems and compilers as possible.
@end itemize
@end table

@c =========================================================================
@c SECTION: The Search Commences
@c =========================================================================
@section The Search Commences

Naturally, the first step was to examine the existing coding schemes in comparison with the above ideal specification.
Such schemes fell into two broad classes: @dfn{portable schemes}, which were intended to permit the encoding of files on any computer architecture into a form that could be transmitted electronically, and decoded on the same or a different architecture; and @dfn{platform-specific schemes}, which provided rather better support for transferring files between two computers using the same architecture and operating system.

@subsection Portable Coding Schemes

The most commonly used coding schemes supported by a variety of platforms are:

@itemize @bullet
@item
@code{BOO}
@item
@code{UU}
@item
@code{XX}
@end itemize

Most implementations of these schemes known to the authors are designed for use with stream file systems.  These programs have no means of recording, let alone preserving, record structure and are thus unsuitable for our purposes.  This is not surprising, since @code{UUcode} and its mutation @code{XXcode} were developed specifically for exchanging files between Unix systems.  In fairness to these schemes, they are well suited to the transmission of text files and certain unstructured binary files.

Standard @code{UUcode} encodes files using characters @samp{ } @dots{} @samp{_} of ASCII.  This can result in one or more spaces appearing at the ends of lines: some mailers decide that this is information not worth transmitting, with a consequent inability to reconstruct the original file.  Files containing characters such as @samp{^} are often irreversibly corrupted by mail gateways; this problem led to the development of @code{XXcode}, which uses a rather more robust character set, namely:

@quotation
@code{+-0123456789}@*
@code{abcdefghijklmnopqrstuvwxyz}@*
@code{ABCDEFGHIJKLMNOPQRSTUVWXYZ}
@end quotation

The encoding table used is recorded with the encoded data to allow the detection of character corruptions, and the correction of reversible character transpositions.
Whilst superficially a step forward, @code{XXcode} offered little more than most existing versions of @code{UUcode}, which already supported coding tables.  Its major contribution was in formalizing the encoding table, and in particular its default table was proof against all the known gateway-induced corruptions.

@subsection Platform-Specific Coding Schemes

Encoding schemes have also been developed to support the transfer of files possessing some structure, which therefore cannot be reconstructed correctly when encoded by the portable schemes.  When the encoding and decoding programs of such a platform-specific scheme are each used on the same computer and operating system type, files may be encoded and transmitted with a great deal of confidence that the decoded file will reproduce the original's structure and attributes in their entirety.  Examples of such programs are @code{TELCODE} and @code{MFTU} for VMS, @code{NETDATA} for IBM mainframes, and @code{Stuffit} and @code{MacBinary} for the Macintosh.  But these programs have the major disadvantage that they have each been implemented @emph{only} on the single architecture for which they were designed: thus the only two of these schemes that could be used on the VMS-based Aston Archive would be of minimal interest elsewhere!  The Archive's content is in some respects artificially inflated by the presence of @file{.hqx} files for Macintoshes, @file{.boo} files for MS-DOS, etc., which have to be held in pre-encoded form for transfer by those requiring them.

@c =========================================================================
@c SECTION: VVcode is Born
@c =========================================================================
@section VVcode is Born

Realizing that none of the existing portable schemes came close enough to our ideal, Niel Kempson circulated an early version of our specification on various mailing lists towards the end of 1990.
When the anticipated ``nil return'' was all that resulted, Brian Hamilton Kelly went ahead and created a rudimentary @code{VVencode} by modifying an existing VAX Pascal implementation of @code{uuencode}.  After generating the companion @code{VVdecode}, he then re-implemented the programs in Turbo C under the MS-DOS operating system on the IBM-PC, and thereby was able to prove that the new scheme was both viable and sufficient.  This version didn't support file formats, time stamping, file splitting, character sets or CRC checking.

@subsection A Production VVcode

Following this minor feasibility study, Niel Kempson re-engineered the pair of programs from scratch (adding certain features of the evolving specification), paying particular attention to making the code portable across a wide variety of operating systems.  Particular care was taken to avoid the use of supposedly ``standard'' C functions that experience had shown behaved differently under individual manufacturers' implementations, or were even non-existent in some.  Therefore the code may sometimes appear to be performing certain operations in a very long-winded way; it's very easy to look at it and say ``why didn't the author use the @code{foo()} function, which does this much more efficiently?'', but this function may not even exist under another implementation of C, or may behave in a subtly different manner.

The core functions of @code{VVcode} are implemented as a collection of routines written in as portable a fashion as possible, together with a separate module of a few operating-system-specific routines for file I/O, timestamping, the command-line or other interface, etc.  Porting @code{VVcode} to a new platform should require only that this latter module be re-implemented, in most cases by adapting an existing one.

@code{VVcode} implements all of the features listed in the specification, apart from the ability to generate @code{UUcode}- and @code{XXcode}-compatible files.
However, the decoding program is backwards compatible and can decode files generated by @code{UUcode} and @code{XXcode}.

@subsection Arguments against VVcode

When the advent of the @code{VVcode} system was first aired in the various electronic digests, some heated debate followed, along the lines that a new encoding scheme was unnecessary since @code{UUcode}/@code{XXcode} sufficed @emph{for them}.  However, all these correspondents were Unix users who had interpreted the @samp{VV} as meaning ``Vax-to-Vax'' by analogy with @samp{uu}@footnote{@samp{V} was chosen simply because it followed @samp{U}; at one time, we'd seriously considered calling it @code{YAFES}---Yet Another File Encoding Scheme!} and who felt that such a scheme should be private to VAXen.  The authors' reply was to the effect that the encoding scheme was intended to support the needs of archives like Aston's, and as such had to provide:

@enumerate
@item
an automated tool (it would be somewhat difficult to expect our users to be able to tell the encoder what sort of file structure it was handling, when this concept was entirely alien to many of them);
@item
facilities to encode binaries for many operating systems;
@item
mail-server features, such as the splitting of large files;
@item
operation across the widest possible combination of platforms.
@end enumerate

The overhead of using the @code{VVcode} system is at most a couple of hundred bytes more than using @code{UUcode}, and the extra functionality and @emph{universality} with respect to @code{UUcode} or @code{XXcode} thereby comes almost for free.
@c =========================================================================
@c SECTION: Availability of VVcode
@c =========================================================================
@section Availability of VVcode

At present, the @code{VVcode} system is available only in C, but it has been shown to run successfully on the following combinations of hardware, operating system and compiler:

@table @strong
@item Macintosh
At the time of writing (May 1991) John Rawnsley of the University of Warwick had commenced development of a Macintosh port, which will encode the resource and data forks in a manner that will permit the former to be ignored by non-Macintosh systems.

@item MS-DOS
@itemize @bullet
@item
IBM PS/2, PC (and clones); MS-DOS 3.3, 4.01, 5.00; Borland Turbo C 1.5, 2.0, Borland C++ 1.0, 2.0, 3.0 and Microsoft C 5.1, 6.0
@end itemize

@item OS/2
@itemize @bullet
@item
IBM PS/2, PC (and clones); OS/2 2.0; Microsoft C 6.0 and GNU C 2.1
@end itemize

@item Unix
@itemize @bullet
@item
Sun 3; SunOS 3.x and 4.0.3; native C and GNU C
@item
Sun Sparcstation 1; SunOS 4.1; native C and GNU C
@item
SCO Unix V/386 v3.2.2; Microsoft C compiler
@end itemize

@item VAX/VMS
@itemize @bullet
@item
All VAXen; VMS 5.2--5.4-1; VAX C V3.0--V3.2 and GNU C 1.40
@end itemize

@item VM/CMS
@itemize @bullet
@item
VM/CMS; Whitesmith C compiler v1.0 (This implementation was ported by Rainer Sch@"opf; basing it upon the Unix implementation, this took him about one day.)
@end itemize
@end table