\chapter{\TeX/\LaTeX\ code for components of organic chemical structure diagrams}\label{ch:txltx} \section{Conventions for drawing the diagrams}\label{sc:convntns} The chemical structure of a molecule is defined by the spatial arrangement of the atoms and the bonding between them. Chemists use several standard methods for representing the structures two-dimensionally by diagrams called structural formulas; and this thesis will develop mechanisms for printing such diagrams using the \TeX/\LaTeX\ system. A very common structure representation, sometimes called a dash structural formula, uses the element symbols for the atoms and a dash for each covalent bond in the compound. Thus the dash represents the pair of shared electrons that constitutes the bond. Two dashes ($=$) represent a double bond and three dashes ($\equiv $) a triple bond. --- It is usually neither necessary nor practical to represent each bond in a molecule explicitly by a dash. Some molecules and some bonds are so common that a complete dash formula would not be used except at a very introductory level of presenting chemical information. A condensed structural formula is one alternative. It does not contain dashes but uses the convention that atoms bonded to a carbon are written immediately after that carbon and otherwise atoms are written from left to right in the order in which they occur in the real structure. The following two structural formulas are a dash formula and a condensed formula, respectively, for the same compound, ethanol. \[ \parbox{4.5cm} { \begin{picture}(400,900)(0,-110) \put(0,0) {\cbranch{H}{S}{H}{S}{C}{S}{}{S}{H} } \put(240,0) {\cbranch{H}{S}{}{Q}{C}{S}{O---H}{S}{H} } \end{picture} } \hspace{1.5cm} {\rm CH_{3}CH_{2}OH} \] Multiple bonds are usually not implied unless a very common group, such as the cyano group, is shown. It can be found as $-$C$\equiv $N or simply as $-$CN. 
Another alternative to a complete dash formula is a diagram where the symbols for carbon and for hydrogen on carbon are not shown. Each corner and each open-ended bond in these diagrams implies a carbon atom with as many hydrogen atoms bonded to it as there are free valences. This representation is the customary one for ring structures (structures with a closed chain of atoms). Thus, the following two diagrams both represent the compound cyclopropane. \[ \hetthree{Q}{H}{H}{H}{H}{S}{S}{C} \hspace{3cm} \yi=330 \threering{Q}{Q}{Q}{Q}{Q}{Q}{Q}{Q}{Q} \] \reinit The three different kinds of structure representation can be combined in one diagram, such that in part of the diagram all bonds are represented by dashes and all atoms by an element symbol, in another part a condensed structural formula fragment is used, and in still another part a cyclic fragment with implied carbon and hydrogen atoms occurs. The rest of this chapter describes how \LaTeX\ can be used to position and typeset the bond lines and condensed formula strings that are the components of structure diagrams. It should be mentioned that there are no binding rules for many aspects of the two-dimensional representations of a chemical structure. Structures and fragments of structures can be oriented in different ways depending on the availability of space, the emphasis given to a certain part of a structure, or the spatial relationship of the parts to each other. Thus, a cyclopropane ring can be represented in various orientations, $\bigtriangleup $, $\bigtriangledown $, and others. Also, the angles between the bond lines can be different in different representations of one and the same compound. Since most molecules do not have all their atoms lying in one plane it would not even be possible to reproduce all bond angles in a two-dimensional representation. 
The structures shown in this thesis adopt the orientations and bond angles found to prevail in Solomons' textbook (Solomons~84), the organic chemistry text used for several years at the University of Tennessee.
Some conventions indicate the real, three-dimensional structure (the stereochemistry) of a molecule in the two-dimensional representation: a dashed line and a wedge in place of a full bond line mean that the real bond extends below or above the plane, respectively.
\section{Bond line drawing and positioning}
\subsection{Review of \TeX/\LaTeX\ facilities for line-drawing} \label{sc:review}
The easiest way to produce horizontal and vertical lines representing chemical bonds is by the use of keyboard characters and simple control sequences provided by \TeX.
By typing one, two, or three hyphens, a normal hyphen, a medium dash designed for number ranges, and a punctuation dash are produced, - -- ---, respectively.
When a hyphen is typed in \TeX's math mode, it is interpreted as a minus sign and the spacing around it will be different from text mode.
--- The equal sign can represent a double bond for chemistry typesetting.
It can be typed in text mode and in math mode, again resulting in different spacing around the symbol.
--- The control sequence \verb+\equiv+ can be used as a triple bond ($\equiv $).
It has to be typed in math mode.
Vertical lines are available through the keyboard character \verb+|+ or the control sequences \verb+\vert+ and \verb+\mid+, all three to be entered in math mode.
A double vertical bar is produced by \verb+\|+ or \verb+\Vert+, again both in math mode.
The spacing around all these symbols can be controlled by adding extra (positive or negative) space with the horizontal spacing commands.
The symbols, just as any other part of a line, can also be raised or lowered relative to the normal baseline.
The length and height of the symbols, however, depend on the font currently in use.
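As a minimal sketch of these adjustments (the spacing and raise amounts are arbitrary illustrative values, not ones used elsewhere in this thesis), the following input sets the same bond in text mode and in math mode and then raises and pads a triple bond by hand.
\begin{verbatim}
CH$_3$--CH$_3$            % medium dash typed in text mode
${\rm CH_3-CH_3}$         % the hyphen becomes a minus sign in math mode
% triple bond raised by 0.1ex with 1pt of extra space on each side
${\rm HC}\hspace{1pt}\raisebox{0.1ex}{$\equiv$}\hspace{1pt}{\rm CH}$
\end{verbatim}
Because \verb+\raisebox+ puts the symbol into a box, \TeX\ no longer adds its own relation spacing around it, which is why the space is supplied explicitly here.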
Where control of length and height of the bond lines is needed, \TeX's or \LaTeX's command sequences for printing horizontal and vertical ``rules'' can be used.
The systems recognize several length units, including the inch, centimeter, millimeter, and printer point (Knuth~84, p.~57).
One printer point (pt), an often used unit in typesetting, measures about 0.35~mm.
--- \LaTeX's rule-printing command has the format
$$\hbox{\verb+\rule[raise-length]{width}{height}+}$$
Depending on the width and height chosen, it produces either a horizontal or a vertical rule.
Using the \verb+\rule+ command one can also print multiple bond lines of user-controlled length, e.g., $\dbond{16}{10} $, $\tbond{16}{11} $, with the short control sequences \verb+\dbond+ and \verb+\tbond+ defined in this thesis.
The vertical spacing between the bonds depends on the current line spacing in the document and may have to be adjusted.
The control sequences are set up for math mode.
When bond lines other than horizontal and vertical ones are to be printed, and when a coordinate system is needed to control placement of structure components relative to one another, \LaTeX's picture environment (Lamport~86, pp.~101--111) is a necessity.
Coordinates and lengths in a picture environment are dimensionless numbers, multiples of a unitlength that the user has to define before entering the environment.
This is done with the \verb+\setlength+ command.
In this study, \verb+\setlength{\unitlength}{0.1pt}+ is the definition used for most diagrams.
Such a small unitlength was chosen to give fine control over the appearance of the diagram.
The picture environment starts with the statement
$$\hbox{\verb+\begin{picture}(width,height)+}$$
where picture width and height reserve space on the page and are specified in terms of unitlengths.
Optionally, one can include the coordinates of the lower left corner of the picture:
$$\hbox{\verb+\begin{picture}(width,height)(x+$_i$\verb+,y+$_i$\verb+)+}$$
The default value for these coordinates is (0,0).
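The two mechanisms of this subsection, rules and the picture frame, can be tried out with a few lines of input. This is a minimal sketch under stated assumptions: \verb+\mysbond+ and \verb+\mydbond+ are hypothetical helpers written for this illustration (they are not the definitions of \verb+\sbond+ and \verb+\dbond+ used in this thesis), and the raise values and picture dimensions are arbitrary.
\begin{verbatim}
% Hypothetical rule-based bonds; the argument is the bond length.
\newcommand{\mysbond}[1]{\rule[0.5ex]{#1}{0.4pt}}
\newcommand{\mydbond}[1]{%
  \rule[0.3ex]{#1}{0.4pt}\hspace{-#1}\rule[0.7ex]{#1}{0.4pt}}

${\rm CH_2}\mydbond{14pt}{\rm CH}\mysbond{14pt}{\rm CH_3}$

% An empty picture that reserves 90pt by 60pt on the page.
% With the corner at (-300,-100), x runs from -300 to 600
% and y runs from -100 to 500.
\setlength{\unitlength}{0.1pt}
\begin{picture}(900,600)(-300,-100)
\end{picture}
\end{verbatim}
The second rule of the double bond is backed up by the bond length with a negative \verb+\hspace+, so that both rules occupy the same horizontal span at different heights.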
Objects are placed into the picture with the \verb+\put+ command with their reference point at the coordinates (x,y): \verb+\put(x,y){picture object}+. The picture objects of most interest to this study are straight lines. They are drawn by the \verb+\line+ command: $$\hbox{\verb+\line(x+$_s$\verb+,y+$_s$\verb+){length}+}$$ where the coordinate pair specifies the slope of the line, and the nonnegative value of length specifies the length of the projection of the line on the x-axis for all nonvertical lines, and the length of the line for vertical lines. The reference point of a line is one of its ends. Thus the statement $$\hbox {\verb+\put(x,y)+ \verb+{\line(x+$_s$\verb+,y+$_s$\verb+){len}}+} $$ draws a line that begins at (x,y), has a slope of ${\rm y_s\mbox{/}x_s}$, and extends for length len as explained above. Only a limited number of slopes is available through the line fonts in \LaTeX. The possible values for ${\rm x_s}$ and ${\rm y_s}$ are integers between $-6$ and $+6$, inclusive. These values translate into 25 different absolute angle values, which are listed in Appendix~\ref{ap:slopes}. \subsection{Bonds in structural formulas written on one line} \label{sc:onelinebonds} The application of some of the bond-drawing mechanisms for this simplest type of structural diagrams is illustrated in Figure~\ref{fg:oneline}. \begin{figure}\centering \begin{picture}(900,700) \put(0,600) {a \ $CH\equiv C-CH=CH_{2}$} \put(0,350) {b \ $CH$\raise.1ex\hbox{$\equiv$}$C-CH=CH_{2}$} \put(0,100) {c \ $CH\tbond{14}{20} C\sbond{14} CH\dbond{14}{19} CH_{2}$} \end{picture} \caption{One-line structural formulas} \label{fg:oneline} \end{figure} For Figure~\ref{fg:oneline}a only keyboard characters and the \TeX\ command \verb+\equiv+ were used to produce the bonds. Figure~\ref{fg:oneline}b shows a slight improvement through raising the triple bond. 
Figure~\ref{fg:oneline}c was printed using the \verb+\sbond+, \verb+\dbond+, and \verb+\tbond+ command sequences from this thesis, choosing a length of 14pt for the bonds.
It can be seen that each of the formulas in Figure~\ref{fg:oneline} is a creditable representation of the structure.
Depending on the design of the page, the reason for displaying the structure at a particular place, and the emphasis put on features of the structure in the text, one would choose shorter or longer bonds and take more or less trouble to produce the structure.
The picture environment is not needed for one-line structural formulas, unless one of these formulas has to be attached to another structural fragment, as in Figure~\ref{fg:picline}.
Then the coordinate system of the picture environment makes it possible to fit the two fragments together.
\begin{figure}
\hspace{5cm}
\parbox{70 pt} { \begin{picture}(400,200) \put(-155,0) {$CH_{3}-CH-CH_{2}-CH_{2}-CH_{2}-CH_{3}$} \end{picture} }
\hspace{5cm} \yi=200 \pht=600 \sixring{Q}{Q}{Q}{Q}{Q}{}{D}{D}{D} \\
\caption{One-line structure in picture environment} \label{fg:picline}
\end{figure}
\subsection{Bonds in acyclic structures with vertical branches}
Structure diagrams with vertical, single- or double-bonded, branches, going up or down, are frequently seen.
Several experiments with \TeX\ and \LaTeX\ were made to see how this type of structure can be handled.
One method is to align the vertical bonds by using the mechanisms for tabbing or for printing tables and matrices.
Here a structure such as the one shown in Figure~\ref{fg:vertbranch} is treated as a set of columns as indicated by the vertical dividing lines drawn into the second version of this structure in Figure~\ref{fg:vertbranch}.
\begin{figure} \hspace{1cm} \begin{minipage}{180pt} \begin{tabbing} $CH_{3}CH_{2}$\= $CH$\= $CHCH_{2}$\= $CHCH_{2}CH_{3}$\+ \kill $Br$\> \> $CH_{3}$ \\ \hspace{2pt}$\vert $\> \> \hspace{2pt}$\vert $ \- \\ $CH_{3}CH_{2}$\> $CH$\> $CHCH_{2}$\> $CHCH_{2}CH_{3}$\+ \+ \\ \hspace{2pt} $\vert $ \\ $CH_{2}CH_{3}$ \end{tabbing} \end{minipage} \hspace{2.5cm} \begin{minipage}{180pt} \begin{tabbing} $CH_{3}CH_{2}$\= $\vert CH$\= $\vert CHCH_{2}$\= $\vert CHCH_{2}CH_{3}$ \+ \kill $\vert Br$\> $\vert $ \> $\vert CH_{3}$ \\ $\vert $\hspace{2pt}$\vert $\> $\vert $ \> $\vert $\hspace{2pt} $\vert $ \- \\ $CH_{3}CH_{2}$\> $\vert CH$ \> $\vert CHCH_{2}$\> $\vert CHCH_{2}CH_{3}$ \+ \\ $\vert $\> $\vert $\hspace{2pt}$\vert $ \> $\vert $ \\ $\vert $\> $\vert CH_{2}CH_{3}$ \end{tabbing} \end{minipage} \caption{Vertical branches}\label{fg:vertbranch} \end{figure} The structure diagram in Figure~\ref{fg:vertbranch} uses \verb+\vert+ for the vertical bonds and \LaTeX's tabbing environment for the alignment. One can also use ``rules'' as the vertical bonds in order to give the horizontal and vertical bonds the same lengths. Furthermore, vertical bonds can also be double bonds. The following examples illustrate these features. $$ \tbranch{O}{D}{H_{2}N-}{C-NH_{2}}{}{}{1} \hspace{2cm} \tbranch{}{}{CH_{3}-CH_{2}-}{C-CH_{3}}{D}{NH}{1} \hspace{2cm} \tbranch{}{}{H-}{C=\ }{S}{Br}{1}\tbranch{}{}{}{C-H}{S}{Br}{1} $$ Similar structures were also generated with \TeX's \verb+\halign+ mechanism which forms templates for the columns rather than setting tab stops. For the purpose of printing the structure diagrams, no clearcut advantage was seen in one or the other method of alignment. In each case the vertical spacing depends on the line spacing in the document. The alternative method of producing these structures is the use of the picture environment. It provides better control over horizontal and vertical spacing and over bond lengths. 
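For comparison, a vertical branch can be set in a picture environment with a few \verb+\put+ statements. The following is only a sketch: the compound (2-propanol) is an illustrative choice, and all coordinates were estimated for a ten point roman font rather than measured, so the branch will need adjusting for the font in use.
\begin{verbatim}
\setlength{\unitlength}{0.1pt}
\begin{picture}(1100,500)(0,-300)
  \put(0,0){${\rm CH_3-CH-CH_3}$}   % main chain on one line
  \put(340,-30){\line(0,-1){140}}   % vertical bond below the middle C
  \put(300,-260){${\rm OH}$}        % substituent at the end of the bond
\end{picture}
\end{verbatim}
The bond length (140 units, or 14pt) and the gap between bond and symbols are set directly by the coordinates and do not depend on the line spacing of the document.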
Also, as illustrated in Section~\ref{sc:onelinebonds}, using a picture environment makes it possible to attach one structural fragment to another at a specific place.
Thus, although the picture environment is not necessary for drawing structures with vertical branches, it has several advantages, and writing \LaTeX\ code for this implementation is not more difficult than writing the code for the tabbing method of alignment.
\subsection{Bonds in structures containing slanted bond lines}
Structure diagrams with slanted bond lines are frequently used for acyclic compounds and have to be used to depict almost all cyclic structures.
Two examples are shown here:
$$ \cdown{$CH_{3}$}{S}{$N^{+}$}{D}{$O$}{S}{$O^{-}$} \hspace{3cm} \sixring{$COOH$}{$OCOCH_{3}$}{Q}{Q}{Q}{Q}{S}{S}{C} $$
In developing diagrams for such structures in this thesis the conventions described in Section~\ref{sc:convntns} are followed.
Thus the symbol for carbon is not printed for the carbons that are ring members, but it is usually printed in acyclic structures, unless the acyclic structure fragment is a long chain, or space for the diagram is limited.
The picture environment is always needed for slanted lines.
It was explained in Section~\ref{sc:review} that \LaTeX\ can draw lines only with a finite number of slopes.
This is not a severe limitation for creating the structure diagrams, since the conventions for structure representation allow variations in the angles.
The representation does not have to reflect the true atomic coordinates.
In fact many chemistry publications contain structure diagrams with angles significantly deviating from the real bond angles, even where those could have been used easily.
Thus, Solomons' text (Solomons~84) often shows the carboxylic acid group in this form
\pht=600 \[ \cright{}{S}{C}{D}{O}{S}{OH} \] \pht=900
with an angle of about $90^\circ$ between the OH and the double-bonded O, whereas the true angle is close to $120^\circ$.
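As a worked illustration of what the limited set of slopes means in practice, consider the slope pair $(5,3)$ that the ring diagrams in this chapter use for their slanted sides. The bond length of 200 units is an illustrative choice; recall from Section~\ref{sc:review} that the length argument of \verb+\line+ is the projection of the line on the x-axis. Writing $\ell$ for the real length of the bond and $L$ for the \LaTeX\ length,
\[ \theta = \arctan\frac{3}{5} \approx 30.96^\circ , \qquad L = \ell\cos\theta = \frac{5\,\ell}{\sqrt{34}} \approx 0.857\,\ell . \]
For $\ell = 200$ this gives $L \approx 171$, and a line starting at $(x,y)$ then ends near $(x+171,\,y+103)$, since $171\cdot 3/5 \approx 102.6$. The slanted side thus deviates by about $1^\circ$ from the $30^\circ$ direction that the corresponding side of a regular hexagon would have.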
--- The angles used in this thesis for the regular hexagon of the sixring deviate by $\pm 1^\circ$ from $120^\circ$ because of \LaTeX's limited number of slopes.
This difference is not big enough to be detected as a flaw.
To write the \LaTeX\ statement for a slanted bond line, one chooses the origin and the slope and then uses trigonometric functions to calculate the \LaTeX\ ``length'' of the line for the desired real length.
Once the \LaTeX\ length is determined, the coordinates of the end point of the line can be calculated in case the end point is needed as the origin of a connecting line.
--- The origin and length of slanted double bonds were also calculated with standard methods from trigonometry.
As an example, Figure~\ref{fg:calcpos} shows how coordinates of the origin were calculated for the inside part of a ring double bond that is at a distance $d$ from the outside bond.
\setlength{\unitlength}{1pt}
\begin{figure}
\begin{picture}(300,250)(0,-100)
\thicklines
\put(0,0) {\line(5,3) {120}}
\put(120,72) {\line(5,-3) {120}}
\put(240,0) {\line(0,-1) {100}}
\put(215,-6) {\line(-5,3) {88}}
\thinlines
\put(120,72) {\circle*{4}}
\put(125,72) {($x$,$y$)}
\put(127,47) {\circle*{4}}
\put(132,47) {($x_d$,$y_d$)}
\put(120,72) {\line(0,-1) {16}}
\put(120,72) {\line(-3,-5){9}}
\put(111,56) {\line(1,0) {16}}
\put(127,56) {\line(0,-1) {9}}
\put(111,56) {\line(5,-3) {16}}
\put(111,62) {\scriptsize d}
\put(116,46) {\scriptsize d}
\put(112,17) {{\small $\theta =30^{\circ}$}}
\put(114,28) {\vector(0,1){27}}
\put(270,35) {$x_{d}=x-d\sin ${\small $\theta $}$+d\cos ${\small $\theta $}}
\put(270,5) {$y_{d}=y-d\sin ${\small $\theta $}$-d\cos ${\small $\theta $}}
\end{picture}
\caption{Calculating position and length of double bond.} \label{fg:calcpos}
\end{figure}
\reinit
The \LaTeX\ command \verb+\multiput+ is similar to \verb+\put+ and provides a shortcut for the coding of structures where several bond lines of the same slope and
length occur at regular intervals.
The command has the format
$$\hbox {\verb+\multiput(x,y)(+$\Delta$\verb+x,+$\Delta$\verb+y){n}+ \verb+{object}+}$$
where (x,y) is the position of the first object, ($\Delta$x,$\Delta$y) is the displacement from each object to the next, and n is the number of objects, {\em e.g.,\/} lines.
A structure diagram for which several \verb+\multiput+ statements are appropriate is the structure of vitamin~A shown in Figure~\ref{fg:multidiag}.
\begin{figure}
\hspace{2cm}
\parbox{5cm} {
\begin{picture}(900,900)(-300,-300)
\put(342,200) {\line(0,-1) {200}}
\put(342,0) {\line(-5,-3) {171}}
\put(171,-103) {\line(-5,3) {171}}
\put(0,0) {\line(0,1) {200}}
\put(0,200) {\line(5,3) {171}}
\put(171,303) {\line(5,-3) {171}}
\put(322,180) {\line(0,-1) {160}}
\put(342,0) {\line(5,-3) {128}}
\put(171,303) {\line(5,3) {128}}
\put(171,303) {\line(-5,3) {128}}
\multiput(342,200)(342,0){5}{\line(5,3){171}}
\multiput(513,303)(342,0){4}{\line(5,-3){171}}
\multiput(527,270)(342,0){4}{\line(5,-3){135}}
\multiput(855,303)(684,0){2}{\line(0,1){160}}
\put(1881,275){=O}
\end{picture} }
\caption{Diagram using $\backslash $multiput} \label{fg:multidiag}
\end{figure}
The size of objects in a picture environment can be scaled in a simple way by changing the unitlength.
Figure~\ref{fg:scaling} illustrates scaling and two problems associated with it.
Changing the unitlength changes the length of the lines only, not the width of the lines or the size of text characters.
Thus ``it does not provide true magnification and reduction'' (Lamport~86, p.~102).
However, the size of the text characters can be varied separately, as will be discussed in the next section of this chapter.
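The effect of the unitlength can be tried out with a short sketch. The macro \verb+\hexsketch+ is a hypothetical helper introduced only for this illustration; its six lines are the ring outline from Figure~\ref{fg:multidiag}, and the two unitlength values are arbitrary.
\begin{verbatim}
% Hypothetical helper: ring outline only, no substituents.
\newcommand{\hexsketch}{%
  \begin{picture}(342,406)(0,-103)
    \put(342,200){\line(0,-1){200}}
    \put(342,0){\line(-5,-3){171}}
    \put(171,-103){\line(-5,3){171}}
    \put(0,0){\line(0,1){200}}
    \put(0,200){\line(5,3){171}}
    \put(171,303){\line(5,-3){171}}
  \end{picture}}

% The braces keep each unitlength setting local to one copy.
{\setlength{\unitlength}{0.1pt}\hexsketch}\hspace{1cm}
{\setlength{\unitlength}{0.2pt}\hexsketch}
\end{verbatim}
The second ring has sides twice as long as the first, while the lines themselves are drawn with the same thickness in both.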
\begin{figure} \pht=750\centering
\setlength{\unitlength}{.07pt} \sixring{$OH$}{Q}{Q}{Q}{Q}{$Br$}{S}{D}{S}
\hspace{1.5cm}
\setlength{\unitlength}{0.08pt} \sixring{$OH$}{Q}{Q}{Q}{Q}{$Br$}{S}{D}{S}
\hspace{1.5cm} \yi=150
\setlength{\unitlength}{0.15pt} \sixring{$OH$}{Q}{Q}{Q}{Q}{$Br$}{S}{D}{S}
\caption{Scaling (unitlength=0.07pt, 0.08pt, 0.15pt)} \label{fg:scaling}
\end{figure}
The smallest diagram in Figure~\ref{fg:scaling} illustrates a limitation that is unfortunate for the printing of structure diagrams.
The shortest slanted line that can be printed by \LaTeX's line fonts is one with an x-axis projection of about 3.6~mm.
If a shorter slanted line is requested, \LaTeX\ prints nothing.
A chemist would occasionally want to draw shorter lines, especially for the purpose of generating dashed lines indicating stereochemical features.
\section{Atomic symbols and condensed structural fragments} \label{sc:fragments}
The printing of condensed structural fragments requires special consideration, since many of them contain subscripts.
\TeX\ considers the printing of subscripts a part of mathematics typesetting which has to be done in the special math mode.
As is well known, typesetting of mathematics documents is one of the strong points of \TeX; the fonts of type for the math mode are designed to agree with all conventions of high quality mathematics publishing.
Each typestyle in math mode consists of a family of three fonts (Knuth~84, p.~153): a textfont for normal symbols, a scriptfont for first-level sub- and superscripts, and a scriptscriptfont for higher-level sub- and superscripts.
When structural fragments such as ${\rm C_{2}H_{5}}$ are typeset, the textfont is used for the C and the~H.
As \TeX\ enters math mode it selects \verb+\textfont1+ as the textfont unless otherwise instructed.
\verb+\textfont1+ is defined by the \TeX\ macros as math italic, a typestyle that prints letters (not numbers) similar to the italic style, but with certain features adapted for mathematics typesetting. The italic style letters, lower and upper case, are the ones commonly seen in typeset mathematical formulas. Chemical formulas on the other hand are not usually printed with slanted letters. In this thesis, two methods were employed to produce chemistry-style letters in \TeX's math mode which has to be used because of the presence of subscripts. For a document that contains many chemical formulas it is convenient to redefine \verb+\textfont1+ at the beginning of the \TeX\ input file. The statement \verb+\textfont1=\tenrm+ was used at the beginning of the input file that produced this document and causes \TeX\ to select the roman font as the textfont in math mode. The roman typestyle is the one normally used by \TeX\ outside of math mode and it is the style in which this thesis is printed. The ten point size, which is slightly smaller than the eleven point size of the text in this document, was chosen because it appears to look better for the chemical formulas which consist largely of capital letters. When different typesizes are used in this way, all the atomic symbols and formulas in any one structure, even those without subscripts, have to be printed in math mode so that they all have the same size. It could be a problem with this method of selecting the roman font for math mode that the lowercase Greek letters (and some other symbols used in mathematics) are not available in this font. To print these one can temporarily redefine textfont1 to math italic with the statement \verb+\textfont1=\tenmi+. One can also switch to a math font different from the default \verb/\textfont1/. Using one of \LaTeX's font definitions, \verb+\small+, a statement \{\verb+\small$\theta$+\} will print the Greek letter. 
Another method for avoiding the math italic style for letters in chemical formulas is to select the roman style in each individual instance where a formula has to be printed in math mode. A statement such as \verb+${\rm C_2H_5}$+ produces ${\rm C_{2}H_{5}}$ at the size of type currently used in the document. When the typestyle is thus selected within math mode, enclosed by dollar signs, \TeX\ changes the style of the letters of the alphabet only; the lowercase Greek letters and math symbols remain available. The size of the letters in chemical formulas can be changed with the ten size declarations provided by \LaTeX\ (Lamport~86, p.~200) or with \TeX's declarations. (Some of \TeX's declarations are not defined in \LaTeX\ (Lamport~86, p.~205)). The size declaration has to be written outside of math mode. One place in chemistry typesetting where a smaller typesize is desirable is the writing on reaction arrows. The size in the following example is scriptsize: $$ \advance \yi by 100 HC\equiv CH + H_{2}O \parbox{92pt} {\cto{Hg^{++}}{18\%\ H_{2}SO_{4},\ 90^\circ}{14}} CH_{3}-CHO $$ Finally, condensed structural formulas sometimes have to be right-justified to be attached to the main structural diagram. Figure~\ref{fg:rightjus} illustrates this for the positioning of the substituent in the 4-position of the pyrazole ring. \LaTeX\ makes this positioning convenient with the \verb+\makebox+ command, especially in the picture environment where the command has the format $$\hbox{\verb+\makebox(width,height)[alignment]{content}+}$$ (Lamport~86, p.~104). The one-line piece of text that constitutes the content of the (imaginary) box can be aligned with the top, bottom, left side, or right side of the box. 
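A minimal sketch of this right-justification follows. The offsets of the box relative to the start of the bond are the ones used for the substituent in Figure~\ref{fg:rightjus}; the fragment ${\rm CH_{3}CO}$ and the picture dimensions are illustrative choices.
\begin{verbatim}
\setlength{\unitlength}{0.1pt}
\begin{picture}(700,300)(-500,-50)
  \put(0,0){\line(-5,3){128}}   % bond ending near (-128,77)
  % The box spans x = -430 to -130 and y = 34 to 121. With [r]
  % the text is pushed against the right edge of the box, next
  % to the end of the bond, and is centered vertically.
  \put(-430,34){\makebox(300,87)[r]{${\rm CH_3CO}$}}
\end{picture}
\end{verbatim}
Because the right edge of the box is fixed, a longer or shorter formula string grows or shrinks to the left and stays attached to the bond without any change to the coordinates.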
\begin{figure}\centering
\parbox{\xbox pt} {
\begin{picture}(\pw,\pht)(-\xi,-\yi)
\put(200,-84) {\line(5,3) {110}} % bond 1,2
\put(342,200) {\line(0,-1) {140}} % bond 3,2
\put(342,200) {\line(-1,0) {342}} % bond 3,4
\put(0,200) {\line(0,-1) {200}} % bond 4,5
\put(0,0) {\line(5,-3) {140}} % bond 5,1
\put(135,-130) {$N$} % N-1 in ring
\put(310,-30) {$N$} % N-2 in ring
\put(171,-137) {\line(0,-1) {83}} % subst. on
\put(150,-283) {$C_{6}H_{5}$} % on N-1
\put(370,-17) {\line(5,-3) {100}} % subst. on
\put(475,-100) {$C_{6}H_{5}$} % N-2
\put(335,211) {\line(5,3) {128}} % outside
\put(349,189) {\line(5,3) {128}} % double O
\put(475,250) {$O$} % on C-3
\put(0,200) {\line(-5,3) {128}} % single subst.
\put(-430,234) {\makebox(300,87)[r]{$CH_{3}COCH_{2}CH_{2}$}}
\put(-7,11) {\line(-5,-3){128}} % outside
\put(7,-11) {\line(-5,-3){128}} % double O
\put(-200,-130){$O$} % on C-5
\end{picture} } % end pyrazole macro
\caption{Right-justification of substituent formula} \label{fg:rightjus}
\end{figure}