GNU Source-highlight, given a source file, produces a document with syntax highlighting. The colors and the styles can be specified (bold, italics, underline) by means of a configuration file, and some other options can be specified at the command line. The output format can be HTML, XHTML and ANSI color escape sequences.
The program already recognizes many programming languages (e.g., C++, Java, Perl, etc.) and file formats (e.g., log files, ChangeLog, etc.). Since version 2.0, it also allows you to specify your own input source language via a simple syntax described later in this manual (Language Definitions).
The complete list of languages (in fact, file extensions) natively supported by this version of Source-highlight (2.0), as reported by --lang-list, is the following:
Supported languages (file extensions) and associated language definition files
C = cpp.lang
H = cpp.lang
bison = bison.lang
c = cpp.lang
caml = caml.lang
cc = cpp.lang
changelog = changelog.lang
cpp = cpp.lang
flex = flex.lang
fortran = fortran.lang
h = cpp.lang
hh = cpp.lang
hpp = cpp.lang
htm = html.lang
html = html.lang
java = java.lang
javascript = javascript.lang
js = javascript.lang
l = flex.lang
lex = flex.lang
ll = flex.lang
log = syslog.lang
lua = lua.lang
ml = caml.lang
pas = pascal.lang
pascal = pascal.lang
perl = perl.lang
php = php3.lang
php3 = php3.lang
pl = prolog.lang
pm = perl.lang
prolog = prolog.lang
py = python.lang
python = python.lang
rb = ruby.lang
ruby = ruby.lang
sml = sml.lang
syslog = syslog.lang
y = bison.lang
yacc = bison.lang
yy = bison.lang
Please keep in mind that I have not personally tested all these language definitions. I checked that each definition file is syntactically correct (with the command line option --check-lang, see Invoking source-highlight), but I am not sure that every definition actually respects the syntax of its language (e.g., I put together some language definitions by searching for information on the Internet, without ever having programmed in that language). So, if you find that a language definition is not precise, please let me know. Moreover, if you have a program example in a language that is not included in the tests directory, please send it to me so that I can include it in the test suite.
See the file INSTALL for detailed building and installation instructions; anyway if you're used to compiling Linux software that comes with sources you may simply follow the usual procedure, i.e. untar the file you downloaded in a directory and then:
cd <source code main directory>
./configure
make
make install
Note: unless you specify a different install directory with the --prefix option of configure (e.g. ./configure --prefix=<your home>), you must be root to run make install.
Files will be installed in the following directories:
Executables: /prefix/bin
Docs and samples: /prefix/share/doc/source-highlight
Conf files: /prefix/share/source-highlight
The default value for prefix is /usr/local, but you may change it with the --prefix option of configure.
NOTICE: Originally, instead of Source-highlight, there were two separate programs, namely GNU java2html and GNU cpp2html. Two shell scripts with the same names are installed together with Source-highlight in order to facilitate the migration (however, their use is deprecated).
You can download it from GNU's ftp site: ftp://ftp.gnu.org/gnu/src-highlite or from one of its mirrors (see http://www.gnu.org/prep/ftp.html).
I no longer distribute Windows binaries, since they can be easily built with the Cygnus C/C++ compiler, available at http://www.cygwin.com. However, if you don't feel like downloading that compiler, you can request the binaries from me directly by e-mail (find my e-mail address at my home page) and I can send them to you. An MS-Windows port of Source-highlight is also available from http://gnuwin32.sourceforge.net.
Archives are digitally signed by me (Lorenzo Bettini) with GNU gpg (http://www.gnupg.org). My GPG public key can be found at my home page (http://www.lorenzobettini.it).
You can also get the patches, if they are available for a particular release (see below for patching from a previous version).
This project's CVS repository can be checked out through anonymous (pserver) CVS with the following commands. When prompted for a password for anoncvs, simply press the Enter key.
cvs -d:pserver:anoncvs@subversions.gnu.org:/cvsroot/src-highlite login
cvs -z3 -d:pserver:anoncvs@subversions.gnu.org:/cvsroot/src-highlite \
  co src-highlite
Further instructions can be found at http://savannah.gnu.org/projects/src-highlite.
Since version 2.0 Source-highlight relies on regular expressions as provided by boost (http://www.boost.org), so you need to install at least the regex library from boost. Most GNU/Linux distributions provide this library already in a compiled form.
Source-highlight has been developed under GNU/Linux, using gcc (C++), bison (yacc) and flex (lex), and ported to Win32 with the Cygnus C/C++ compiler, available at http://www.cygwin.com. I used the excellent GNU Autoconf and GNU Automake. I also used Autotools (ftp://ftp.ugcs.caltech.edu/pub/elef/autotools), which creates a starting source tree (according to GNU standards) with initial autoconf and automake files. Finally, I used GNU gengetopt (http://www.gnu.org/software/gengetopt) for command line parsing.
I have also started to use doublecpp (http://www.lorenzobettini.it/software/doublecpp), which makes it possible to achieve dynamic overloading.
If you want to use a specific version of the Boost regex library, you can use the configure option --with-boost-regex to specify a particular suffix. For instance,
./configure --with-boost-regex=boost_regex-gcc-1_31
Actually, apart from the boost regex library, you don't need the other tools above to build source-highlight because I provide generated sources, unless you want to develop source-highlight.
If you downloaded a patch, say source-highlight-1.3-1.3.1-patch.gz (i.e., the patch to go from version 1.3 to version 1.3.1), cd to the directory with sources from the previous version (source-highlight-1.3) and type:
gunzip -cd ../source-highlight-1.3-1.3.1-patch.gz | patch -p1
and restart the compilation process (if you had already run configure a simple make should do).
Using source-highlight with less was suggested by Konstantine Serebriany. The script src-hilite-lesspipe.sh is installed together with source-highlight. To use it, set the following environment variables:
export LESSOPEN="| /path/to/src-hilite-lesspipe.sh %s"
export LESS=' -R '
This way, when you use less to browse a file, if it is a source file handled by source-highlight, it will be automatically highlighted.
Christian W. Zuckschwerdt added support for building an .rpm and an .rpm.src. You can issue the following command
rpm -tb source-highlight-2.0.tar.gz
for building an .rpm with binaries and
rpm -ts source-highlight-2.0.tar.gz
for building an .rpm.src with sources.
Martin Gebert is also implementing a KDE interface to source-highlight programs (and he did a wonderful job!), and it is called ksrc2html; if you want to test it: http://murphy.netsolution-net.de.
CGI support was enabled thanks to Robert Wetzel; I haven't tested it personally yet, so you may ask him directly. Moreover he set up some examples at the page http://www.inf.tu-dresden.de/~rw8/java2.html. If you want to use source-highlight as a CGI program, you have to use the executable source-highlight-cgi. You can build such executable by issuing
make source-highlight-cgi
in the src directory.
Moreover, there is also a Java version of java2html; you can find it at http://www.generationjava.com/projects/Java2Html.shtml.
GNU Source-highlight is free software; you are free to use, share and modify it under the terms of the GNU General Public License that accompanies this software (see COPYING).
GNU source-highlight was written and maintained by Lorenzo Bettini http://www.lorenzobettini.it.
Here are some realistic examples of running source-highlight [1].
Source-highlight only does a lexical analysis of the source code, so the program source is assumed to be correct!
Here's how to run source-highlight (for this example we will use C/C++ input files, but this is valid also for other source-highlight input languages):
source-highlight --src-lang cpp --out-format html \
  --input <C++ file> \
  --output <html file> options
For input files, apart from the -i (--input) option and standard input redirection, you can simply specify some files at the command line, also using shell wildcards (for instance *.java). In this case the name of each output file is formed from the name of the source file with .<ext> appended, where <ext> is the extension chosen according to the specified output format (in this example it would be .html).
If the string STDOUT is passed to the -o (--output) option, the output is forced to standard output anyway.
If -s (--src-lang) is not specified, the source language is inferred from the extension of the input file (this, of course, does not work with standard input redirection).
If -f (--out-format) is not specified, the output is produced in HTML.
During execution, source-highlight needs some files in which it finds directives on how to recognize the source language (if not explicitly specified with --src-lang or --lang-def), on how to format specific source elements (e.g., keywords, comments, etc.), and source language definitions. These files are explained in the next sections.
If the directory for such files is not explicitly specified with the command line option --data-dir, these files are searched for in the following order:
If you want to be sure about which file is used during the execution, you can use the command line option --verbose.
You must specify your options for syntax highlighting in the file tags.j2h. Here's the one that comes with this distribution:
keyword blue b ; // for language keywords
type darkgreen ; // for basic types
string red ; // for strings and chars
comment brown i ; // for comments
number purple ; // for literal numbers
preproc darkblue b ; // for preproc directives (e.g. #include, import)
symbol darkred ; // for simbols (e.g. <, >, +)
function black b; // for function calls and declarations
cbracket red; // for block brackets (e.g. {, })

// line numbers
linenum black;

// Internet related
url blue u;

// other elements for ChangeLog and Log files
date blue b ;
time darkblue b ;
ip darkgreen ;
file darkblue b ;
name darkgreen ;

// for Prolog, Perl...
variable darkgreen ;
You can specify your own file (it doesn't have to be named tags.j2h) with the command line option --tags-file; see Invoking source-highlight.
You can also specify the color of normal text by adding this line
normal darkblue ;
As you can see, the syntax of this file is quite straightforward. The style options are:
b = bold
i = italics
u = underline
You may also specify more than one of these options, separated by commas, e.g.
keyword blue u, b ;
These are all possible HTML color logical names handled by source-highlight:
black (#000000)
red (#FF0000)
darkred (#990000)
brown (#660000)
yellow (#FFCC00)
cyan (#66FFFF)
blue (#3333FF)
pink (#CC33CC)
purple (#993399)
orange (#FF6600)
brightorange (#FF9900)
green (#33CC00)
brightgreen (#33FF33)
darkgreen (#009900)
teal (#008080)
gray (#808080)
darkblue (#000080)
You can see these colors in the file colors.html. You can also use the standard #<number> HTML syntax for specifying a color.
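To make the line format described above concrete, here is a minimal Python sketch that splits one tags.j2h line into element name, color and style options. It is only an illustration based on the sample file shown above, not the parser that source-highlight uses; the names parse_tags_line and STYLE_NAMES are hypothetical.

```python
# Illustrative only: a tiny reader for lines such as "keyword blue u, b ; // note".
# The function name and the returned tuple layout are assumptions of this sketch.
STYLE_NAMES = {"b": "bold", "i": "italics", "u": "underline"}


def parse_tags_line(line):
    """Return (element, color, [styles]) or None for blank/comment-only lines."""
    code = line.split("//", 1)[0].strip()   # drop a trailing // comment
    if not code:
        return None
    code = code.rstrip(";").strip()         # drop the terminating semicolon
    parts = code.split(None, 2)             # element, color, optional style list
    element, color = parts[0], parts[1]
    styles = []
    if len(parts) == 3:
        styles = [STYLE_NAMES[s.strip()] for s in parts[2].split(",")]
    return element, color, styles


print(parse_tags_line("keyword blue u, b ;"))
print(parse_tags_line("function black b; // for function calls and declarations"))
print(parse_tags_line("normal #000080 ;"))
```

The sketch treats the color as an opaque word, so both logical names and the #<number> form pass through unchanged.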
This configuration file associates a file extension with a specific language definition file. You can also use such a file extension as the value of the --src-lang option (see Simple Usage). Source-highlight comes with such a file, called lang.map. Of course, you can override the settings of this file by writing your own language map file and specifying it with the command line option --lang-map. Moreover, as explained above, if a file lang.map is present in the current directory, that version will be used.
The format of such file is quite simple:
extension = language definition file
The contents of the default language map file are shown in the Introduction.
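The "extension = language definition file" format is simple enough to read with a few lines of code. The following Python sketch shows one way to load such a map and look up the definition file for a file name; it is an illustration under the assumption that each entry sits on its own line, and the names parse_lang_map and lang_file_for are hypothetical, not part of source-highlight.

```python
# Illustrative only: read "extension = language definition file" entries.
def parse_lang_map(text):
    """Build a dict from extension to language definition file name."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        ext, lang_file = (part.strip() for part in line.split("=", 1))
        mapping[ext] = lang_file
    return mapping


def lang_file_for(filename, mapping):
    """Return the definition file for a file name, or None if unknown."""
    if "." not in filename:
        return None
    ext = filename.rsplit(".", 1)[1]
    return mapping.get(ext)  # lookup is case sensitive: C and c are separate keys


SAMPLE = """
cpp = cpp.lang
h = cpp.lang
java = java.lang
py = python.lang
"""
mapping = parse_lang_map(SAMPLE)
print(lang_file_for("hello.cpp", mapping))
```

The lookup is kept case sensitive because the default map lists both C and c as separate entries.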
These files are crucial for source-highlight since they specify the source elements that have to be highlighted. They also allow you to specify your own language definitions in order to deal with a language that is not handled by source-highlight [2].
I encourage those who write new language definitions or correct/modify existing language definitions to send them to me so that they can be added to the source-highlight distribution!
Since these files require more explanation (which, however, is not necessary for the standard usage of source-highlight), they are explained in detail in a separate part: Language Definitions.
The format for running the source-highlight program is:
source-highlight option ...
source-highlight supports the following options, shown by the output of source-highlight --help:
Let us explain some options in detail (apart from those that should be clear from the --help output itself, and those already explained in Simple Usage).
--doc, -d
--no-doc
The --doc option above is actually implied by other command line options (e.g., --css). If you do not want a complete HTML document to be created in such cases (e.g., you want to include the output in an existing document containing the global CSS style), you can disable it by using --no-doc.
--css, -c
--tab, -t
--output-dir
--failsafe
Since version 2.0, source-highlight uses a specific syntax to specify source language elements (e.g., keywords, strings, comments, etc.). Before version 2.0, language elements were scanned through Flex. This had the drawback that a new flex file had to be written to deal with a new language; even worse, a new language could not be added “dynamically”: you had to recompile the whole source-highlight program.
Now, instead, language elements are specified in a file, loaded dynamically, through a (hopefully) simple syntax. These definitions are then used internally to create, on the fly, the regular expressions that are used to highlight the elements. In particular, we use the regular expressions provided by the Boost library (see Installation). Thus, when writing a language definition file you will surely have to deal with regular expressions. Of course, we use the regular expression syntax of the Boost regex library. We refer to the Boost documentation for such syntax, http://www.boost.org/libs/regex/doc/syntax.html.
Here, we see such syntax in detail, by relying on many examples. This allows a user to easily modify an existing language definition and create a new one. These files typically have the extension .lang.
Each definition basically associates a regular expression with a language element and defines a name for the language element. Such a name will be used to associate a particular style (e.g., bold face, color, etc.) with the highlighting of such elements. You cannot use names that are the same as keywords used in the language definition syntax (e.g., start, as shown later, is a reserved word).
Comments can be given by using #; the rest of the line is considered a comment.
The simplest way to specify language elements is to list the possible alternatives. This is the case, for instance, for keywords. For instance, in java.lang you have:
keyword = "abstract|assert|break|case|catch|class|const",
  "continue|default|do|else|extends|false|final",
  "finally|for|goto|if|implements|instanceof|interface"
keyword = "native|new|null|private|protected|public|return",
  "static|strictfp|super|switch|synchronized|throw",
  "throws|true|this|transient|try|volatile|while"
The elements must be specified in double quotes. You can separate quoted definitions with commas. Alternatively, within a quoted definition, alternatives can be separated with the pipe symbol |.
The above definition defines the language element keyword. Each time an element is found in the source file, it is highlighted with the style for the element with the same name in the output format style file (notice that all elements shown in the examples are taken from the language definition files that come with source-highlight, and there is a style for each of such elements, see Configuration files). If such an element is not specified in the output format style file, it is simply not highlighted (so pay attention to typos :-).
From the above example you may have noticed that language element definitions are cumulative, so the second keyword definition does not replace the first one. (Indeed, in some cases you may want to actually redefine a language element; this is possible, as explained in the following sections.)
Notice that words specified in double quotes have to match exactly in a source file, and they must be isolated (not surrounded by anything but spaces). Thus, for instance, class is matched as a keyword, but in my_class the substring class is not matched as a keyword. From the point of view of regular expressions, a string such as class in a double quoted simple definition is intended as \<(class)\>.
Special characters have to be escaped with the character \. So, for instance, if you want to specify the character |, which is normally used to separate alternatives in double quoted strings, you have to specify \|.
Definitions in double quotes are interpreted literally (thus, e.g., a dot . is interpreted as the character . and not as the regular expression wild card). If you want to enjoy the full power of regular expressions to specify a language alternative, you have to use single quoted strings ('), instead of double quoted strings.
For instance, the following is the definition for a preprocessor directive in C/C++:
preproc = '^[[:blank:]]*#([[:blank:]]*[[:word:]]*)'
Notice that the definition 'class' is different from "class", as explained above. Thus, for instance, 'class' also matches the sub-expression class inside my_class.
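The difference between double quoted and single quoted definitions can be tried out with any regular expression engine. The Python sketch below models it with the standard re module, which is not the Boost regex library: Python has no \< and \> operators, so \b is used as an approximation of the word boundaries. The helper names double_quoted and single_quoted are hypothetical, and the handling of an escaped \| inside double quotes is left out for brevity.

```python
import re


def double_quoted(alternatives):
    """Model of a double quoted definition: literal words, matched as whole words.

    Simplification of this sketch: an escaped pipe (\\|) is not supported.
    """
    escaped = "|".join(re.escape(word) for word in alternatives.split("|"))
    return re.compile(r"\b(?:" + escaped + r")\b")


def single_quoted(expression):
    """Model of a single quoted definition: the text is used as a regex as is."""
    return re.compile(expression)


keyword = double_quoted("class|int")
print(keyword.search("public class Foo"))   # whole word: matches
print(keyword.search("my_class"))           # part of an identifier: no match
print(single_quoted("class").search("my_class"))  # plain regex: matches inside
```

The same model also shows why a dot in double quotes is literal: re.escape turns it into \. before the expression is compiled.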
Finally, at the end of a list of definitions, one can specify the keyword nonsensitive; in that case, the specified strings will be interpreted in a case-insensitive way. For instance, we use this feature in the Pascal language definition, pascal.lang, where keywords are parsed in a case-insensitive way:
keyword = "alfa|and|array|begin|case|const|div",
  "do|downto|else|end|false|file|for|function|get|goto|if|in",
  "label|mod|new|not|of|or|pack|packed|page|program",
  "put|procedure|read|readln|record|repeat|reset|rewrite|set",
  "text|then|to|true|type|unpack|until|var|while|with|writeln|write"
  nonsensitive
It is often useful to define a language element that affects all the remaining characters up to the end of the line. For such definitions, instead of the = you must use the keyword start. For instance, the following is the definition of a single line comment in C++:
comment start "//"
This says that when the two characters // are encountered in the source file, everything from these characters (included) up to the end of the line is highlighted according to the style comment.
The order of language definitions is important, since it is used during regular expression matching. You therefore have to make sure that, if there are definitions that start with the same characters, the longest expression is specified first in the file. For instance, if you write
symbol = "/"
comment start "//"
the first expression will always be matched first, and the second expression will never be matched. The right order is
comment start "//"
symbol = "/"
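The effect of definition order can be reproduced with a small model in which, at each position of the source, the definitions are tried in the order in which they were written and the first one that matches wins. The Python sketch below is only such a model, written under that assumption; it is not how source-highlight is implemented, and the tokenize function is hypothetical.

```python
import re


def tokenize(source, definitions):
    """Scan left to right; at each position the first matching definition wins."""
    compiled = [(name, re.compile(rx)) for name, rx in definitions]
    pos, out = 0, []
    while pos < len(source):
        for name, rx in compiled:
            m = rx.match(source, pos)
            if m and m.end() > pos:
                out.append((name, m.group()))
                pos = m.end()
                break
        else:
            pos += 1  # no definition matched here: plain text, move on
    return out


# "comment start" is modelled as "// up to the end of the line".
wrong_order = [("symbol", r"/"), ("comment", r"//.*")]
right_order = [("comment", r"//.*"), ("symbol", r"/")]

line = "a / b // note"
print(tokenize(line, wrong_order))
print(tokenize(line, right_order))
```

With the wrong order every slash is reported as a symbol and the comment is never recognized; with the right order the single slash is still a symbol and the rest of the line becomes a comment.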
Many elements are delimited by specific character sequences, for instance strings and multiline comments. The syntax for such an element definition is
<name> delim <left delimiter> <right delimiter> \
  {escape <escape character>} \
  {multiline} {nested}
The escape specification allows you to specify the escape character that may precede one of the delimiters inside the element. This is optional.
For instance, this is the definition of C-like strings:
string delim "\"" "\"" escape "\\"
Notice that \ is a special character in definitions, so it has to be escaped. If the escape specification were omitted, the C string "write \"hello\" string" would be highlighted incorrectly (it would be highlighted as the string "write \", the normal character sequence hello\ and the string " string").
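The incorrect split described above can be reproduced with two ordinary regular expressions, one that ignores the escape character and one that honors it. This Python sketch uses the standard re module as a stand-in; the two patterns are assumptions made for illustration and are not the expressions that source-highlight generates from a delim definition.

```python
import re

# Without an escape character: a string ends at the very next double quote.
WITHOUT_ESCAPE = re.compile(r'"[^"]*"')
# With backslash as escape: a backslash protects the character that follows it.
WITH_ESCAPE = re.compile(r'"(?:\\.|[^"\\])*"')

line = r'puts("write \"hello\" string");'

print(WITHOUT_ESCAPE.findall(line))  # two fragments, hello\ is left outside
print(WITH_ESCAPE.findall(line))     # the whole C string as one element
```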
The option multiline specifies that the element can span multiple lines. For instance, PHP strings are defined as follows:
string delim "\"" "\"" escape "\\" multiline
The option nested instructs source-highlight to count multiple occurrences of the delimiters and to match each opening delimiter with its corresponding closing one. For instance, C-like multiline comments are specified as follows:
comment delim "/*" "*/" multiline nested
If nested were not used, the following nested comment would not be highlighted correctly:
/* This is a /* nested comment */ */
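What nested changes can be seen by counting delimiters by hand. The following Python sketch is a simplified model of that counting (the function delimited_span is hypothetical and ignores escape characters): with nested enabled every opening delimiter raises a depth counter, and the element ends only when the counter returns to zero.

```python
def delimited_span(text, left, right, nested):
    """Return the delimited element found in text, or None if unterminated."""
    start = text.find(left)
    if start < 0:
        return None
    depth, pos = 1, start + len(left)
    while pos < len(text):
        if nested and text.startswith(left, pos):
            depth += 1                 # one more level to close
            pos += len(left)
        elif text.startswith(right, pos):
            depth -= 1
            pos += len(right)
            if depth == 0:
                return text[start:pos]
        else:
            pos += 1
    return None


sample = "/* This is a /* nested comment */ */ int x;"
print(delimited_span(sample, "/*", "*/", nested=True))
print(delimited_span(sample, "/*", "*/", nested=False))
```

Without nesting the element stops at the first */, so the trailing */ would be left outside the comment.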
As said above, definitions are cumulative, even when they use different syntactic forms. Thus, for instance, the complete definition for C++-style comments is the following:
comment start "//"
comment delim "/*" "*/" multiline nested
It is possible to define variables to be re-used in many parts of a language definition file. A variable is defined by using
vardef <name of the variable> = <list of definitions>
Once defined, a variable can be used by prepending the symbol $ to its name. For instance,
vardef FUNCTION = '(?:[[:alpha:]]|_)[[:word:]]*[[:blank:]]*(?=\()'
function = $FUNCTION
The capital letters are used only for readability.
It is also possible to concatenate variables and expressions, and reuse variables inside further variable definitions:
vardef basic_time = '[[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}'
vardef time = '\<' + $basic_time + '\>'
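Concatenating variables amounts to building a larger regular expression from smaller pieces. The Python sketch below mirrors the time example with the standard re module, under two stated substitutions: Python does not support POSIX classes such as [[:digit:]] or the \< and \> operators, so [0-9] and \b are used instead.

```python
import re

# Counterpart of: vardef basic_time = '[[:digit:]]{2}:[[:digit:]]{2}:[[:digit:]]{2}'
basic_time = r"[0-9]{2}:[0-9]{2}:[0-9]{2}"
# Counterpart of: vardef time = '\<' + $basic_time + '\>'
time = r"\b" + basic_time + r"\b"

print(re.search(time, "started at 12:30:05 ok"))
print(re.search(time, "id 112:30:055"))        # digits glued on both sides
print(re.search(basic_time, "id 112:30:055"))  # the unanchored piece still matches
```

The word boundaries added around the reused piece are what keep a time from being picked out of the middle of a longer number.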
It is possible to include other language definition files in another file. This inclusion physically inserts the contents of the included file into the current file during parsing, at the exact point of inclusion (just like #include in C/C++). This is useful for re-using definitions in many files. For instance, C++ comment definitions are given in a file c_comment.lang, and this file is included in the Java and C++ definition files. The same happens for numbers and functions. For instance, the file java.lang contains the following include instructions:
include "c_comment.lang"
include "number.lang"
keywords ...
include "function.lang"
Notice that the order of inclusion is crucial, since the order of definitions is crucial. If the function definition were included before the keyword definitions, then the sentence if (exp) would be highlighted as a function invocation.
Sometimes you want some source elements to be highlighted only if they are surrounded by other elements. Source-highlight language definitions also provide this feature.
state|environment <standard definition> begin
  <other definitions>
end
This structure is recursive (so other state/environment definitions can be given within a state/environment). The meaning of a state/environment is that the definitions within begin ... end are matched only if the definitions that define the state/environment have been matched. When entering a state/environment, however, the definitions given outside the state/environment are not matched. The difference between state and environment is that in the latter, normal parts of the source language (i.e., those that do not match any definition) are highlighted according to the style of the definition that defines the environment.
As an example, the following defines the multiline nested C comment, and highlights URLs and e-mail addresses only when they appear inside a comment (notice that this uses file inclusion):
environment comment delim "/*" "*/" multiline nested begin
  include "url.lang"
end
Notice that we used environment because everything else inside a comment has to be formatted according to the comment style.
While states/environments can be avoided in programming language definitions, they are pretty important for highlighting files such as logs and ChangeLog files, since elements have to be highlighted when they appear in a specific position. For instance, for ChangeLog (see changelog.lang), we use a state for highlighting the date, name and e-mail:
state date start '[[:digit:]]{2,4}-?[[:digit:]]{2}-?[[:digit:]]{2}' begin
  string = '<(?:[[:word:]]*|\.)+@(?:[[:word:]]*|\.)+>'
  url = '(?:[[:word:]]|[[:punct:]])+'
end
Notice that definitions that appear inside a state/environment have the same scope as the expressions that define the environment. While this makes sense for start and delim definitions, it makes less sense for simple definitions (i.e., those that simply list all possible expressions): in fact, such expressions do not define a scope. For such definitions, the semantics of state/environment is that the state/environment starts after matching one of the alternatives. And where will it end? In this case you must explicitly exit the environment. For instance, you can say that a specific language definition inside a state/environment, when encountered, also exits the environment (with the keyword exit). You can even exit all the environments with exitall. For instance, the following definition highlights a non-empty string following a web method:
vardef non_empty = '[^[:blank:]]+'
state webmethod = "OPTIONS|GET|HEAD|POST|PUT|DELETE",
  "TRACE|CONNECT|PROPFIND|MKCOL|COPY|MOVE|LOCK|UNLOCK" begin
  string = $non_empty exit
end
If you ever need such advanced features, you may want to take a look at the log.lang definition file, which defines highlighting for several log files (access logs, Apache logs, etc.).
The redef and subst features are useful when you want to define a language by re-using an existing language definition with some changes. Typically you include another language definition file and you redefine/substitute some elements.
When you use redef you erase all the previous definitions of that language element and replace them with the new one. The new language element definition is placed exactly at the point of the new definition. We use this feature, for instance, when we define the sml language by re-using the caml one: they differ only in the keywords [3]. In fact, the contents of sml.lang can be summarized as follows:
include "caml.lang"
redef keyword = "abstraction|abstype|and|andalso..."
Since the new language element definition appears at the exact point of the redefinition, this means that such a regular expression will be matched only if all the previous ones (the ones of the included file) cannot be matched. This may lead to unwanted results in some cases (not in the sml case though). In other words, the following code
keyword = "foo"
keyword = "bar"
type = "int"
redef keyword = "myfoo"
is equivalent to the following one
type = "int"
keyword = "myfoo"
If this is not what you want, you can use subst, which is similar to redef except that it replaces the first previous definition of that language element at the exact point of that first definition (all other possible definitions are simply erased). That is to say that the following code
keyword = "foo"
keyword = "bar"
type = "int"
subst keyword = "myfoo"
is equivalent to the following one
keyword = "myfoo"
type = "int"
It is up to you to decide which one fits best your needs.
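The two equivalences above can be checked with a small model in which a language definition is an ordered list of (name, expression) pairs. This Python sketch only restates the rules given in the text; the functions redef and subst here are hypothetical helpers, and what subst should do when the element was never defined is not covered by the text, so the sketch simply leaves the list unchanged in that case.

```python
def redef(definitions, name, expression):
    """Erase every earlier definition of name, then add the new one at the end."""
    kept = [(n, e) for n, e in definitions if n != name]
    return kept + [(name, expression)]


def subst(definitions, name, expression):
    """Put the new definition where the first old one was; erase the other old ones."""
    result, replaced = [], False
    for n, e in definitions:
        if n != name:
            result.append((n, e))
        elif not replaced:
            result.append((name, expression))
            replaced = True
    return result  # assumption: unchanged if name was never defined


definitions = [("keyword", "foo"), ("keyword", "bar"), ("type", "int")]
print(redef(definitions, "keyword", "myfoo"))
print(subst(definitions, "keyword", "myfoo"))
```

Since matching follows definition order, the only difference between the two results is where the new keyword definition ends up relative to type.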
We use this feature to define javascript in terms of java:
include "java.lang"
subst keyword = "abstract|break|case|catch|class|if..."
Here using redef would have led to the unwanted behavior that if (exp) would be highlighted as a function call, since the function element definition would come before (and thus be matched before) the redefinition of if as a keyword.
By mixing all these features you can unleash your imagination and define highlighting for complex source languages such as Flex and Bison by writing a few lines of code and re-using existing ones. For instance, Flex and Bison have their own syntax and let you write C/C++ code in specific parts of the source file; e.g., in the following example the code between the outermost brackets is C++ code, and should be highlighted following the C++ language definitions (apart from variables that are prefixed with $):
globaltags : options
{
  if (...) {
    setTags( $1 );
  }
}
This is easy to do (taken from flex.lang):
state cbracket delim "{" "}" multiline nested begin
  variable = '\$.'
  include "cpp.lang"
end
Notice that, since we used nested, we can be sure that the C++ language definitions are no longer considered once we have matched the last closing }.
If you find a bug in source-highlight, please send electronic mail to
bug-source-highlight at gnu dot org
Include the version number, which you can find by running source-highlight --version. Also include in your message the output that the program produced and the output you expected.
If you have other questions, comments or suggestions about source-highlight, contact the author via electronic mail (find the address at http://www.lorenzobettini.it). The author will try to help you out, although he may not have time to fix your problems.
The following mailing lists are available:
help-source-highlight at gnu dot org
for generic discussions about the program and for asking for help about it (open mailing list), http://mail.gnu.org/mailman/listinfo/help-source-highlight
info-source-highlight at gnu dot org
for receiving information about new releases and features (read-only mailing list), http://mail.gnu.org/mailman/listinfo/info-source-highlight.
If you want to subscribe to a mailing list just go to the URL and follow the instructions, or send me an e-mail and I'll subscribe you.
[1] Command lines that are too long are split into multiple indented lines separated by a \. Of course, these commands are to be given on one line only.
[2] This is the main difference introduced in version 2.0 with respect to the previous version.
[3] At least, to the best of my knowledge :-)