Lookup Definite Clause Grammars

 

 

 

(c) Carlos Viegas Damásio, October 2003

 

1. Description
This small application implements an extension of Definite Clause Grammars (DCGs) that introduces lookahead symbols in the compiled code. Ordinary DCGs add two arguments to each compiled clause: one for the input list to parse and another for the list that remains to be parsed after the predicate (production) has executed. Our compilation method adds four arguments instead, placed around the arguments of the DCG predicate itself:
  • The current lookahead symbol, i.e. the first symbol in the input, in the 1st predicate argument.
  • The rest of the input list in the 2nd predicate argument.
  • The DCG predicate's own arguments, which appear after the 2nd argument.
  • The lookahead symbol of the remaining string to parse, in the penultimate argument.
  • The remaining list to parse in the last argument.
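
To make the argument layout concrete, the following hand-written sketch contrasts the two schemes for a production such as digit( 0'0 ) --> "0". It is an illustration of the layout listed above, not actual output of the generator; the predicate names digit_dcg/3 and digit_lookup/5 are hypothetical.

```prolog
% Ordinary DCG layout: DCG argument, input list, remaining list.
digit_dcg(0'0, [0'0 | Rest], Rest).
digit_dcg(0'1, [0'1 | Rest], Rest).

% Lookahead layout (assumed shape, written by hand for illustration):
%   1st argument: current lookahead symbol
%   2nd argument: rest of the input list
%   3rd argument: the DCG predicate's own argument
%   4th argument: lookahead symbol of what remains to be parsed
%   5th argument: remaining list to parse
digit_lookup(0'0, [Next | Rest], 0'0, Next, Rest).
digit_lookup(0'1, [Next | Rest], 0'1, Next, Rest).
```

With this layout a goal such as digit_lookup( 0'1, [0'0,-1], D, L, R ) binds D to the code of "1", L to the code of "0" and R to [-1]. Because the lookahead symbol sits in the first argument, a Prolog system with first-argument indexing can select the matching clause directly.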

This technique allows the lookup DCG code to exploit the indexing facilities of most Prolog implementations, and lets the user write grammars in a more natural way, with significant performance improvements. However, in order to use lookahead information, the input string must be terminated with a special symbol (usually -1). To support the development of large applications we have introduced additional syntactic sugar.

To simplify the determination of lookahead symbol information, the lookup DCG compiler resorts to the tabling features of XSB Prolog and is therefore not portable to other Prolog systems. However, the generated code is fully standard and can be used in any Prolog system. This parser generator has been used to implement a full non-validating XML parser.

 

2. Lookup DCG syntax

Productions can have two forms:

  • Head --> Body. These behave as ordinary DCG productions, except that the extra arguments for lookahead symbol propagation are introduced.
  • Head ::= Body. These productions obtain lookahead symbol information from their bodies and use it to optimize the execution of the grammar. This form must be used with care, since a large number of rules might be generated from a single production; as a rule of thumb, one rule is generated for each lookahead symbol in the Body.

The bodies of productions have a syntax similar to that of ordinary DCGs, except that we have introduced additional syntax for terminal symbols, permitting the specification of (unions of) interval ranges. Regarding non-terminals, we allow the inline expansion of non-terminals by their rules. Cuts are allowed in production bodies, as well as actions with the usual { Prolog Code } syntax. The full syntax is described next:

Non-Terminal symbols in the body:

  • + NonTerminal, indicates that NonTerminal rules are expanded inline.
  • NonTerminal, where NonTerminal is an atom specifying a non-terminal symbol

Terminal symbols in the body:

  • [], the empty list is used to represent the empty string.
  • [S1,S2,...,Sn], recognizes the sequence obtained from recognizing S1, S2, ..., Sn.
  • [S1,S2,...,Sn]/[C1,C2,...,Cn], as before, but C1 is the symbol in the input recognized by S1, C2 is the symbol in the input recognized by S2, ..., and Cn is the symbol in the input recognized by Sn.

The third case above is an extension to Prolog DCGs, since we allow the use of ranges in any of the Si symbol expressions above. A symbol expression might be:

  • An atom or character code, as in ordinary DCGs
  • Min-Max, recognizing any character code between Min and Max, and thus these must be integer numbers such that Min <= Max.
  • [Min1-Max1,Min2-Max2,...,MinN-MaxN], recognizing any character code between Min1 and Max1, or Min2 and Max2, ... or MinN and MaxN.
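
The meaning of the three kinds of symbol expression can be sketched as a small membership test. The predicate symbol_matches/2 and its helper range_member/2 are hypothetical names written for this illustration; they are not part of the parser generator and say nothing about how it compiles ranges.

```prolog
% symbol_matches(+Expr, +Code): Code is accepted by symbol expression Expr.
% Case 1: a single range Min-Max of character codes.
symbol_matches(Min-Max, Code) :-
    integer(Min), integer(Max), integer(Code),
    Min =< Code, Code =< Max.
% Case 2: a union of ranges [Min1-Max1,...,MinN-MaxN].
symbol_matches([Range | Ranges], Code) :-
    integer(Code),
    range_member([Range | Ranges], Code).
% Case 3: an atom or character code, matched literally.
symbol_matches(Symbol, Code) :-
    atomic(Symbol),
    Symbol == Code.

% range_member(+Ranges, +Code): Code lies in one of the listed ranges.
range_member([Min-Max | _], Code) :-
    Min =< Code, Code =< Max, !.
range_member([_ | Ranges], Code) :-
    range_member(Ranges, Code).
```

For instance, symbol_matches( [0'A-0'Z,0'a-0'z], 0'Q ) succeeds, while the same expression rejects 0'1 and the end symbol -1.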

The parser generator does not take ranges into account when generating optimized code for productions of the form ::=, so ranges must be used with care in such productions (for them, the behaviour is the same as that of ordinary DCGs).

Production control

The following constructions are allowed in the bodies of productions in order to control the execution of the parser:

  • !, as in ordinary DCGs.
  • { Prolog Code }, actions as in ordinary DCGs.
  • ? [C1,...,Cn], tests whether the input starts with [C1,...,Cn], where C1, ..., Cn are character codes. This does not consume input. This construction is an extension and is mostly used in the form Head --> ? "test", !. which lets the programmer write base conditions without consuming input.
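
The effect of the ? test can be sketched by hand in the lookahead argument layout: the current lookahead symbol and the rest of the input are checked against a prefix and then passed through unchanged. The names starts_with/3, prefix_of/2 and comment_start/4 are hypothetical, and the clause for comment_start/4 is a hand-written stand-in for a production of the form comment_start --> ? "<!--", !. rather than generator output.

```prolog
% starts_with(+Codes, +Lookahead, +Rest):
% the input consisting of Lookahead followed by Rest begins with Codes.
starts_with(Codes, Lookahead, Rest) :-
    prefix_of(Codes, [Lookahead | Rest]).

% prefix_of(+Prefix, +List): Prefix is an initial segment of List.
prefix_of([], _).
prefix_of([C | Cs], [C | Input]) :-
    prefix_of(Cs, Input).

% Assumed shape of a non-consuming test: the outgoing lookahead symbol
% and remaining list are the same as the incoming ones.
% The code list below spells "<!--".
comment_start(Lookahead, Rest, Lookahead, Rest) :-
    starts_with([0'<, 0'!, 0'-, 0'-], Lookahead, Rest), !.
```

A goal such as comment_start( 0'<, [0'!,0'-,0'-,0'x,-1], L, R ) succeeds and returns the input untouched in L and R, whereas it fails on input that does not begin with the tested codes.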

 

 
3. Installation and usage of the Lookup DCG parser generator
  1. Construct your parser according to the previous syntax. The parser may be split across several files and may contain auxiliary Prolog code and declarations. We suggest using the extension .G for these files.
  2. Declare in the parser file the start non-terminal symbol with the declaration :- start( Name/Arity).
  3. Declare in the parser file the end terminal symbol with the declaration :- end( Symbol ), usually -1 if parsing lists of character codes.
  4. The generation of parser code for some productions can be prevented by adding the declaration :- - Name/Arity. This is used, for instance, for removing all the code for fully expanded non-terminal symbols.
  5. The parser generator code must be extracted to a directory and compiled with the goal ?-[lookupdcg].
  6. Generate the parser with the call ?- gen_parser( ['File1.G', 'File2.G',...,'FileN.G'], 'OutFile.P'). The first argument is the list of files that make up the parser to be generated. The compiled code is written to a single file, given in the 2nd argument of the gen_parser/2 predicate. This file must then be compiled.
 
4. Example

The following grammar parses lists of natural numbers and names separated by line terminators, either line feed (0xA) or carriage return (0xD).

% An example Lookup DCG

:- start( example/1 ).
:- end( -1 ).

:- - digit/1.

example( Is ) ::= lf, !, example( Is ).
example( [] ) ::= [].
example( [I|Is] ) --> item( I ), !, lf, example( Is ).


item( I ) ::= !, number( I ).
item( I ) ::= name( I ).


number( N ) --> + digit(D), !, rest_digits( Ds ), { number_codes( N, [D|Ds] ) }.

rest_digits( [D|Ds] ) --> + digit( D ), !, rest_digits( Ds ).
rest_digits( [] ) ::= [].

digit( 0'0 ) --> "0".
digit( 0'1 ) --> "1".
digit( 0'2 ) --> "2".
digit( 0'3 ) --> "3".
digit( 0'4 ) --> "4".
digit( 0'5 ) --> "5".
digit( 0'6 ) --> "6".
digit( 0'7 ) --> "7".
digit( 0'8 ) --> "8".
digit( 0'9 ) --> "9".


name( N ) --> startchar(C), !, rest_name( Cs ), { atom_codes( N, [C|Cs] ) }.

rest_name( [C|Cs] ) --> namechar( C ), !, rest_name( Cs ).
rest_name( [] ) --> [].


startchar( C ) --> [[0'A-0'Z,0'a-0'z]]/[C], !.
namechar( D ) ::= + digit(D), !.
namechar( C ) ::= startchar(C).

lf --> [16'A].
lf --> [16'D].

To generate the parser for this grammar, consult the parser generator file and then call gen_parser/2:

| ?- [lookupdcg].
[lookupdcg loaded]
[readgram loaded]
[predparserint loaded]
[parserexp loaded]

yes
| ?- gen_parser( ['example.G'], 'example.P' ).
example / 1
item / 1
number / 1
rest_digits / 1
name / 1
rest_name / 1
startchar / 1
namechar / 1
lf / 0
yes

The generated code is stored in example.P. We suggest that the user inspect this code and try to understand it. Notice that no rules for digit/1 are generated, since all occurrences of digit in the grammar are expanded inline using the + digit(D) facility. The use of cuts can be very subtle, as can be seen in the rules for item/1 and startchar/1.
To use the parser, the following goal must be invoked: example( FirstSymbol, RestSymbols, Items, -1, [] ), as in the example below:

| ?- example( 10, [0'a,0'0,0'Z,10,0'1,0'0,10,10,0'1,10,-1], Is, -1, [] ). 
Is = [a0Z,10,1] 
yes
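
Calling the parser by hand requires appending the end symbol and splitting off the first symbol of the input. A minimal wrapper for a start symbol of arity 1 could look as follows. The names parse_codes/3, add_end/3 and codes/5 are hypothetical, the end symbol is assumed to be -1, and call/6 is assumed to be available in the target Prolog system. The predicate codes/5 is only a tiny hand-written stand-in for a generated parser so that the sketch runs on its own.

```prolog
% parse_codes(+Start, +Codes, -Result): terminate Codes with -1, use the
% first symbol as lookahead, and call the parser Start in the lookahead
% argument layout, requiring that all input is consumed.
parse_codes(Start, Codes, Result) :-
    add_end(Codes, -1, [First | Rest]),
    call(Start, First, Rest, Result, -1, []).

% add_end(+List, +End, -Terminated): Terminated is List followed by End.
add_end([], End, [End]).
add_end([C | Cs], End, [C | Ds]) :-
    add_end(Cs, End, Ds).

% Stand-in parser (not generated code): collects every code up to -1.
codes(-1, [], [], -1, []).
codes(C, [Next | Rest], [C | Cs], EndLookahead, EndRest) :-
    C =\= -1,
    codes(Next, Rest, Cs, EndLookahead, EndRest).
```

With example.P loaded, the same wrapper would be called as parse_codes( example, Codes, Is ), where Codes is the input without the trailing -1.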

 

5. Copyright

This is an academic and experimental tool. It cannot be used for commercial purposes without the explicit consent of the author.

 

6. Disclaimer

This is an academic and experimental tool. I do not give any guarantee of any form regarding the use of this tool.

 

Last update: October 28th, 2003