CUP User's Manual

Scott E. Hudson
Graphics Visualization and Usability Center
Georgia Institute of Technology
March 1996 (v0.9e release)

1.: Introduction and Example
2.: Specification Syntax
3.: Running CUP
4.: Customizing the Parser
5.: Error Recovery
6.: Conclusion
: References
A.: Grammar for CUP Specification Files
B.: A Very Simple Example Scanner

1. Introduction and Example

This manual describes the basic operation and use of the Java(tm) Based Constructor of Useful Parsers (CUP for short). CUP is a system for generating LALR parsers from simple specifications. It serves the same role as the widely used program YACC [1] and in fact offers most of the features of YACC. However, CUP is written in Java, uses specifications including embedded Java code, and produces parsers which are implemented in Java.

Although covering all aspect of the CUP system, this manual is relatively brief, assumes you have at least a little bit of knowledge of LR parsing, and preferably have a bit of experience with a program such as YACC. A number of compiler construction textbooks (such as [2,3]) cover this material, and discuss the YACC system (which is quite similar to this one) as a specific example.

Using CUP involves creating a simple specifications based on the grammar for which a parser is needed, along with construction of a scanner capable of breaking characters up into meaningful tokens (such as keywords, numbers, and special symbols).

As a simple example, consider a system for evaluating simple arithmetic expressions over integers. This system would read expressions from standard input (each terminated with a semicolon), evaluate them, and print the result on standard output. A grammar for the input to such a system might look like:

  expr_list ::= expr_list expr_part | expr_part
  expr_part ::= expr ';'
  expr      ::= expr '+' term | expr '-' term | term
  term      ::= term '*' factor | term '/' factor | term '%' factor | factor
  factor    ::= number | '-' expr | '(' expr ')'

To specify a parser based on this grammar, our first step is to identify and name the set of terminal symbols that will appear on input, and the set of non terminal symbols. In this case, the non terminals are:

  expr_list, expr_part, expr, term, and factor.

For terminal names we might choose:

  SEMI, PLUS, MINUS, TIMES, DIVIDE, MOD, NUMBER, LPAREN, and RPAREN

Based on these namings we can construct a small CUP specification as follows:

// CUP specification for a simple expression evaluator (no actions)

import java_cup.runtime.*;

/* Preliminaries to set up and use the scanner.  */
init with {: scanner.init();              :};
scan with {: return scanner.next_token(); :};

/* Terminals (tokens returned by the scanner). */
terminal token     SEMI, PLUS, MINUS, TIMES, DIVIDE, MOD,  LPAREN, RPAREN;
terminal int_token NUMBER;

/* Non terminals */
non terminal symbol     expr_list, expr_part;
non terminal int_token  expr, term, factor;

/* The grammar */
expr_list ::= expr_list expr_part | 
              expr_part;
expr_part ::= expr SEMI;
expr      ::= expr PLUS term | 
              expr MINUS term | 
              term;
term      ::= term TIMES factor | 
              term DIVIDE factor | 
              term MOD factor | 
              factor;
factor    ::= NUMBER | 
              MINUS factor | 
              LPAREN expr LPAREN;

We will consider each part of the specification syntax in detail later. However, here we can quickly see that the specification contains three main parts. The first part provides preliminary and miscellaneous declarations to specify how the parser is to be generated, and supply parts of the runtime code. In this case we indicate that the java_cup.runtime classes should be imported, then supply a small bit of initialization code, and some code for invoking the scanner to retrieve the next input token. The second part of the specification declares terminals and non terminals, and associates object classes with each. In this case, we declare our terminals as being represented at runtime by two object types: token and int_token (which are supplied as part of the CUP runtime system), while various non terminals are represented by objects of types symbol and int_token (again supplied from the runtime system). The final part of the specification contains the grammar.

To produce a parser from this specification we use the CUP generator. If this specification were stored in a file parser.cup, then (on a Unix system at least) we might invoke CUP using a command like:

 java java_cup.Main < parser.cup

In this case, the system will produce two Java source files containing parts of the generated parser: sym.java and parser.java. As you might expect, these two files contain declarations for the classes sym and parser. The sym class contains a series of constant declarations, one for each terminal symbol. This is typically used by the scanner to refer to symbols (e.g. with code such as "return new token(sym.SEMI);" ). The parser class implements the parser itself.

The specification above, while constructing a full parser, does not perform any semantic actions -- it will only indicate success or failure of a parse. To calculate and print values of each expression, we must embed Java code within the parser to carry out actions at various points. In CUP, actions are contained in code strings which are surrounded by delimiters of the form {: and :} (we can see examples of this in the init with and scan with clauses above). In general, the system records all characters within the delimiters, but does not try to check that it contains valid Java code.

A more complete CUP specification for our example system (with actions embedded at various points in the grammar) is shown below:

// CUP specification for a simple expression evaluator (w/ actions)

import java_cup.runtime.*;

/* Preliminaries to set up and use the scanner.  */
init with {: scanner.init();              :};
scan with {: return scanner.next_token(); :};

/* Terminals (tokens returned by the scanner). */
terminal token     SEMI, PLUS, MINUS, TIMES, DIVIDE, MOD,  LPAREN, RPAREN;
terminal int_token NUMBER;

/* Non terminals */
non terminal symbol     expr_list, expr_part;
non terminal int_token  expr, term, factor;

/* The grammar */
expr_list ::= expr_list expr_part 
	      | 
              expr_part;

expr_part ::= expr:e 
	      {: System.out.println("= " + e.int_val); :} 
              SEMI              
	      ;

expr      ::= expr:e1 PLUS term:e2    
	      {: RESULT.int_val = e1.int_val + e2.int_val; :} 
	      | 
              expr:e1 MINUS term:e2    
              {: RESULT.int_val = e1.int_val - e2.int_val; :} 
	      | 
              term:e1                  
	      {: RESULT.int_val = e1.int_val; :} 
	      ;

term      ::= term:e1 TIMES factor:e2 
	      {: RESULT.int_val = e1.int_val * e2.int_val; :} 
	      | 
              term:e1 DIVIDE factor:e2 
	      {: RESULT.int_val = e1.int_val / e2.int_val; :} 
	      | 
              term:e1 MOD factor:e2 
	      {: RESULT.int_val = e1.int_val % e2.int_val; :} 
	      | 
              factor:e                 
	      {: RESULT.int_val = e.int_val; :} 
	      ;

factor    ::= NUMBER:n                 
	      {: RESULT.int_val = n.int_val;  :} 
	      | 
              MINUS factor:e             
	      {: RESULT.int_val = -e.int_val; :} 
	      | 
              LPAREN expr:e RPAREN     
	      {: RESULT.int_val = e.int_val;  :} 
	      ;

Here we can see several changes. Most importantly, code to be executed at various points in the parse is included inside code strings delimited by {: and :}. In addition, labels have been placed on various symbols in the right hand side of productions. For example in:

  expr ::= expr:e1 PLUS term:e2   
           {: RESULT.int_val = e1.int_val + e2.int_val; :}

the non terminal expr has been labeled with e1, while term has been labeled with e2. The left hand side symbol of each production is always implicitly labeled as RESULT.

Each symbol appearing in a production is represented at runtime by an object (on the parse stack). These labels allow code embedded in a production to refer to these objects. Since expr and term were both declared as int_token, they are both represented by an object of class int_token. These objects are created as the result of matching some other production. The code in that production fills in various fields of its result object, which are in turn used here to fill in a new result object, and so on. Overall this is a very common form of syntax directed translation related to attribute grammars and discussed at length in compiler construction textbooks such as [2,3].

In our specific example, the int_token class includes an int_val field which stores an int value. We use this field to compute the value of the expression from its component parts. In the production above, we compute the int_val field of the left hand side symbol (i.e. RESULT) as the sum of the values computed by the expr and term parts making up this expression. That value in turn may be combined with other to compute a final result.

The final step in creating a working parser is to create a scanner (also known as a lexical analyzer or simply a lexer). This routine is responsible for reading individual characters, removing things things like white space and comments, recognizing which terminal symbols from the grammar each group of characters represents, then returning token objects representing these symbols to the parser. Example code for a workable (if not elegant or efficient) scanner for our example system can be found in Appendix B.

Like the very simple one given in Appendix B, all scanners need to return objects which are instances of java_cup.runtime.token (or one of its subclasses). The runtime system predefines three such classes: token which contains no specific information beyond the token type (and some internal information used by the parser), int_token which also records a single int value, and str_token which records a single string value.

The code contained in the init with clause of the specification will be executed before any tokens are requested. Each token will be requested using whatever code is found in the scan with clause. Beyond this, the exact form the scanner takes is up to you.

In the next section a more detailed and formal explanation of all parts of a CUP specification will be given. Section 3 describes options for running the CUP system. Section 4 discusses the details of how to customize a CUP parser, while Section 5 considers error recovery. Finally, Section 6 provides a conclusion.

2. Specification Syntax

Now that we have seen a small example, we present a complete description of all parts of a CUP specification. A specification has four sections with a total of eight specific parts (however, most of these are optional). A specification consists of:

package and import specifications,
user code components,
symbol (terminal and non-terminal) lists, and
the grammar.

Each of these parts must appear in the order presented here. (A complete grammar for the specification language is given in Appendix A.) The particulars of each part of the specification are described in the subsections below.

Package and Import Specifications

A specification begins with optional package and import declarations. These have the same syntax, and play the same role, as the package and import declarations found in a normal Java program. A package declaration is of the form:

    package name;

where name name is a Java package identifier, possibly in several parts separated by ".". In general, CUP employs Java lexical conventions. So for example, both styles of Java comments are supported, and identifiers are constructed beginning with a letter, dollar sign ($), or underscore (_), which can then be followed by zero or more letters, numbers, dollar signs, and underscores.

After an optional package declaration, there can be zero or more import declarations. As in a Java program these have the form:

    import package_name.class_name;

    import package_name.*;

The package declaration indicates what package the sym and parser classes that are generated by the system will be in. Any import declarations that appear in the specification will also appear in the source file for the parser class allowing various names from that package to be used directly in user supplied action code.

User Code Components

Following the optional package and import declarations are a series of optional declarations that allow user code to be included as part of the generated parser (see Section 4 for a full description of how the parser uses this code). As a part of the parser file, a separate non-public class to contain all embedded user actions is produced. The first action code declaration section allows code to be included in this class. Routines and variables for use by the code embedded in the grammar would normally be placed in this section (a typical example might be symbol table manipulation routines). This declaration takes the form:

    action code {: ... :};

where {: ... :} is a code string whose contents will be placed directly within the action class class declaration.

After the action code declaration is an optional parser code declaration. This declaration allows methods and variable to be placed directly within the generated parser class. Although this is less common, it can be helpful when customizing the parser -- it is possible for example, to include scanning methods inside the parser and/or override the default error reporting routines. This declaration is very similar to the action code declaration and takes the form:

    parser code {: ... :};

Again, code from the code string is placed directly into the generated parser class definition.

Next in the specification is the optional init declaration which has the form:

    init with {: ... :};

This declaration provides code that will be executed by the parser before it asks for the first token. Typically, this is used to initialize the scanner as well as various tables and other data structures that might be needed by semantic actions. In this case, the code given in the code string forms the body of a void method inside the parser class.

The final (optional) user code section of the specification indicates how the parser should ask for the next token from the scanner. This has the form:

    scan with {: ... :};

As with the init clause, the contents of the code string forms the body of a method in the generated parser. However, in this case the method returns an object of type java_cup.runtime.token. Consequently the code found in the scan with clause should return such a value.

Symbol Lists

Following user supplied code comes the first required part of the specification: the symbol lists. These declarations are responsible for naming and supplying a type for each terminal and non-terminal symbol that appears in the grammar. As indicated above, each terminal and non-terminal symbol is represented at runtime with an object. In the case of terminals, these are returned by the scanner and placed on the parse stack. In the case of non terminals these replace a series of symbol objects on the parse stack whenever the right hand side of some production is recognized. In order to tell the parser which object types should be used for which symbol, terminal and non terminal declarations are used. These take the forms:

    terminal classname name1, name2, ...;

and

    non terminal classname name1, name2, ...;

where classname can be a multiple part name separated with "."s. Since the parser uses these objects for internal bookkeeping, the classes used for non terminal symbols must be a subclass of java_cup.runtime.symbol. Similarly, the classes used for terminal symbols must be a subclass of java_cup.runtime.token (note that java_cup.runtime.token is itself a subclass of java_cup.runtime.symbol).

The Grammar

The final section of a CUP declaration provides the grammar. This section optionally starts with a declaration of the form:

    start with nonterminal;

This indicates which non terminal is the start or goal non terminal for parsing. If a start non terminal is not explicitly declared, then the non terminal on the left hand side of the first production will be used.

The grammar itself follows the optional start declaration. Each production in the grammar has a left hand side non terminal followed by the symbol "::=", which is then followed by a series of zero or more actions, terminal, or non terminal symbols, and terminated with a semicolon (;). Each symbol on the right hand side can optionally be labeled with a name. Label names appear after the symbol name separated by a colon (:). Label names must be unique within the production, and can be used within action code to refer to the runtime object that represents the symbol. If there are several productions for the same non terminal they may be declared together. In this case the productions start with the non terminal and "::=". This is followed by multiple right hand sides each separated by a bar (|). The full set of productions is then terminated by a semicolon.

Actions appear in the right hand side as code strings (e.g., Java code inside {: ... :} delimiters). These are executed by the parser at the point when the portion of the production to the left of the action has been recognized. (Note that the scanner will have returned the token one past the point of the action since the parser needs this extra lookahead token for recognition.)

3. Running CUP

As mentioned above, CUP is written in Java. To invoke it, one needs to use the Java interpreter to invoke the static method java_cup.Main(), passing an array of strings containing options. Assuming a Unix machine, the simplest way to do this is typically to invoke it directly from the command line with a command such as:

    java java_cup.Main options < inputfile

Once running, CUP expects to find a specification file on standard input and produces two Java source files as output.

In addition to the specification file, CUP's behavior can also be changed by passing various options to it. Legal options include:

-package name: Specify that the parser and sym classes are to be placed in the named package. By default, no package specification is put in the generated code (hence the classes default to the special "unnamed" package).
-parser name: Output parser and action code into a file (and class) with the given name instead of the default of "parser".
-symbols name: Output the symbol constant code into a class with the given name instead of the default of "sym".
-nonterms: Place constants for non terminals into the symbol constant class. The parser does not need these symbol constants, so they are not normally output. However, it can be very helpful to refer to these constants when debugging a generated parser.
-expect number: During parser construction the system may detect that an ambiguous situation would occur at runtime. This is called a conflict. In general, the parser may be unable to decide whether to shift (read another symbol) or reduce (replace the recognized right hand side of a production with its left hand side). This is called a shift/reduce conflict. Similarly, the parser may not be able to decide between reduction with two different productions. This is called a reduce/reduce conflict. Normally, if one or more of these conflicts occur, parser generation is aborted. However, in certain carefully considered cases it may be advantageous to arbitrarily break such a conflict. In this case CUP uses YACC convention and resolves shift/reduce conflicts by shifting, and reduce/reduce conflicts using the "highest priority" production (the one declared first in the specification). In order to enable automatic breaking of conflicts the -expect option must be given indicating exactly how many conflicts are expected.
-compact_red: Including this option enables a table compaction optimization involving reductions. In particular, it allows the most common reduce entry in each row of the parse action table to be used as the default for that row. This typically saves considerable room in the tables, which can grow to be very large. This optimization has the effect of replacing all error entries in a row with the default reduce entry. While this may sound dangerous, if not down right incorrect, it turns out that this does not affect the correctness of the parser. In particular, some changes of this type are inherent in LALR parsers (when compared to canonical LR parsers), and the resulting parsers will still never read past the first token at which the error could be detected. The parser can, however, make extra erroneous reduces before detecting the error, so this can degrade the parser's ability to do error recovery. (Refer to reference [2] pp. 244-247 or reference [3] pp. 190-194 for a complete explanation of this compaction technique.)

Special note: at the time of this writing the standard javac compiler had a bug which caused it to produce corrupted class files when very large statically initialized arrays (i.e., large parse tables) are used. Consequently, if you have a large grammar, you may be forced to use this option in order to create tables that are small enough to compile correctly.
-nowarn: This options causes all warning messages (as opposed to error messages) produced by the system to be suppressed.
-nosummary: Normally, the system prints a summary listing such things as the number of terminals, non terminals, parse states, etc. at the end of its run. This option suppresses that summary.
-progress: This option causes the system to print short messages indicating its progress through various parts of the parser generation process.
-dump_grammar
-dump_states
-dump_tables
-dump: These options cause the system to produce a human readable dump of the grammar, the constructed parse states (often needed to resolve parse conflicts), and the parse tables (rarely needed), respectively. The -dump option can be used to produce all of these dumps.
-time: This option adds detailed timing statistics to the normal summary of results. This is normally of great interest only to maintainers of the system itself.
-debug: This option produces voluminous internal debugging information about the system as it runs. This is normally of interest only to maintainers of the system itself.

4. Customizing the Parser

Each generated parser consists of three generated classes. The sym class (which can be renamed using the -symbols option) simply contains a series of int constants, one for each terminal. Non terminals are also include if the -nonterms option is given. The source file for the parser class (which can be renamed using the -parser option) actually contains two class definitions, the public parser class that implements the actual parser, and another non-public class (called CUP$action) which encapsulates all user actions contained in the grammar, as well as code from the action code declaration. In addition to user supplied code, this class contains one method: CUP$do_action which consists of a large switch statement for selecting and executing various fragments of user supplied action code. In general, all names beginning with the prefix of CUP$ are reserved for internal uses by CUP generated code.

The parser class contains the actual generated parser. It is a subclass of java_cup.runtime.lr_parser which implements a general table driven framework for an LR parser. The generated parser class provides a series of tables for use by the general framework. Three tables are provided:

the production table: provides the symbol number of the left hand side non terminal, along with the length of the right hand side, for each production in the grammar,
the action table: indicates what action (shift, reduce, or error) is to be taken on each lookahead symbol when encountered in each state, and
the reduce-goto table: indicates which state to shift to after reduces (under each non-terminal from each state).

(Note that the action and reduce-goto tables are not stored as simple arrays, but use a compacted "list" structure to save a significant amount of space. See comments the runtime system source code for details.)

Beyond the parse tables, generated (or inherited) code provides a series of methods that can be used to customize the generated parser. Some of these methods are supplied by code found in part of the specification and can be customized directly in that fashion. The others are provided by the lr_parser base class and can be overridden with new versions (via the parser code declaration) to customize the system. Methods available for customization include:

public void user_init(): This method is called by the parser prior to asking for the first token from the scanner. The body of this method contains the code from the init with clause of the the specification.
public java_cup.runtime.token scan(): This method encapsulates the scanner and is called each time a new token is needed by the parser. The body of this method is supplied by the scan with clause of the specification.
public void report_error(String message, Object info): This method should be called whenever an error message is to be issued. In the default implementation of this method, the first parameter provides the text of a message which is printed on System.err and the second parameter is simply ignored. It is very typical to override this method in order to provide a more sophisticated error reporting mechanism.
public void report_fatal_error(String message, Object info): This method should be called whenever a non-recoverable error occurs. It responds by calling report_error(), then aborts parsing by calling the parser method done_parsing(), and finally throws an exception. (In general done_parsing() should be called at any point that parsing needs to be terminated early).
public void syntax_error(token cur_token): This method is called by the parser as soon as a syntax error is detected (but before error recovery is attempted). In the default implementation it calls: report_error("Syntax error", null);.
public void unrecovered_syntax_error(token cur_token): This method is called by the parser if it is unable to recover from a syntax error. In the default implementation it calls: report_fatal_error("Couldn't repair and continue parse", null);.
protected int error_sync_size(): This method is called by the parser to determine how many tokens it must successfully parse in order to consider an error recovery successful. The default implementation returns 3. Values below 2 are not recommended. See the section on error recovery for details.

Parsing itself is performed by the method public void parse(). This method starts by getting references to each of the parse tables, then initializes a CUP$action object (by calling protected void init_actions()). Next it calls user_init(), then fetches the first lookahead token with a call to scan(). Finally, it begins parsing. Parsing continues until done_parsing() is called (this is done automatically, for example, when the parser accepts).

In addition to the normal parser, the runtime system also provides a debugging version of the parser. This operates in exactly the same way as the normal parser, but prints debugging messages (by calling public void debug_message(String mess) whose default implementation prints a message to System.err).

Based on these routines, invocation of a CUP parser is typically done with code such as:

      /* create a parsing object */
      parser parse_obj = new parser();

      /* open input files, etc. here */

      try {
        if (do_debug_parse)
          parser_obj.debug_parse();
        else
          parser_obj.parse();
      } catch (Exception e) {
        /* do cleanup here -- possibly rethrow e */
      } finally {
	/* do close out here */
      }

5. Error Recovery

A final important aspect of building parsers with CUP is support for syntactic error recovery. CUP uses the same error recovery mechanisms as YACC. In particular, it supports a special error symbol (denoted simply as error). This symbol plays the role of a special non terminal which, instead of being defined by productions, instead matches an erroneous input sequence.

The error symbol only comes into play if a syntax error is detected. If a syntax error is detected then the parser tries to replace some portion of the input token stream with error and then continue parsing. For example, we might have productions such as:

    stmt ::= expr SEMI | while_stmt SEMI | if_stmt SEMI | ... |
	     error SEMI
	     ;

This indicates that if none of the normal productions for stmt can be matched by the input, then a syntax error should be declared, and recovery should be made by skipping erroneous tokens (equivalent to matching and replacing them with error) up to a point at which the parse can be continued with a semicolon (and additional context that legally follows a statement). An error is considered to be recovered from if and only if a sufficient number of tokens past the error symbol can be successfully parsed. (The number of tokens required is determined by the error_sync_size() method of the parser and defaults to 3).

Specifically, the parser first looks for the closest state to the top of the parse stack that has an outgoing transition under error. This generally corresponds to working from productions that represent more detailed constructs (such as a specific kind of statement) up to productions that represent more general or enclosing constructs (such as the general production for all statements or a production representing a whole section of declarations) until we get to a place where an error recovery production has been provided for. Once the parser is placed into a configuration that has an immediate error recovery (by popping the stack to the first such state), the parser begins skipping tokens to find a point at which the parse can be continued. After discarding each token, the parser attempts to parse ahead in the input (without executing any embedded semantic actions). If the parser can successfully parse past the required number of tokens, then the input is backed up to the point of recovery and the parse is resumed normally (executing all actions). If the parse cannot be continued far enough, then another token is discarded and the parser again tries to parse ahead. If the end of input is reached without making a successful recovery (or there was no suitable error recovery state found on the parse stack to begin with) then error recovery fails.

6. Conclusion

This manual has briefly described the CUP LALR parser generation system. CUP is designed to fill the same role as the well known YACC parser generator system, but is written in and operates entirely with Java code rather than C or C++. Additional details on the operation of the system can be found in the parser generator and runtime source code. See the CUP home page below for access to the API documentation for the system and its runtime.

This document covers the system as it stands at the time of its fifth alpha release (v0.9e). Check the CUP home page: http://www.cc.gatech.edu/gvu/people/Faculty/hudson/java_cup/home.html for the latest release information, instructions for downloading the system, and additional news about the system. Bug reports and other comments for the developers can be sent to java-cup@cc.gatech.edu

CUP was originally written by Scott Hudson, in August of 1995.

References

[1]: S. C. Johnson, "YACC -- Yet Another Compiler Compiler", CS Technical Report #32, Bell Telephone Laboratories, Murray Hill, NJ, 1975.
[2]: A. Aho, R. Sethi, and J. Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley Publishing, Reading, MA, 1986.
[3]: C. Fischer, and R. LeBlanc, Crafting a Compiler with C, Benjamin/Cummings Publishing, Redwood City, CA, 1991.

Appendix A. Grammar for CUP Specification Files

java_cup_spec      ::= package_spec import_list code_part init_code 
		       scan_code symbol_list start_spec production_list
package_spec       ::= PACKAGE multipart_id SEMI | empty
import_list        ::= import_list import_spec | empty
import_spec        ::= IMPORT import_id SEMI
code_part          ::= action_code_part parser_code_part 
action_code_part   ::= ACTION CODE CODE_STRING SEMI | empty
parser_code_part   ::= PARSER CODE CODE_STRING SEMI | empty
init_code          ::= INIT WITH CODE_STRING SEMI | empty
scan_code          ::= SCAN WITH CODE_STRING SEMI | empty
symbol_list        ::= symbol_list symbol | symbol
symbol             ::= TERMINAL type_id term_name_list SEMI | 
                       NON TERMINAL type_id non_term_name_list SEMI
term_name_list     ::= term_name_list COMMA new_term_id | new_term_id
non_term_name_list ::= non_term_name_list COMMA new_non_term_id |
	               new_non_term_id
start_spec         ::= START WITH nt_id SEMI | empty
production_list    ::= production_list production | production
production         ::= nt_id COLON_COLON_EQUALS rhs_list SEMI
rhs_list           ::= rhs_list BAR rhs | rhs
rhs                ::= prod_part_list
prod_part_list     ::= prod_part_list prod_part | empty
prod_part          ::= symbol_id opt_label | CODE_STRING
opt_label          ::= COLON label_id | empty
multipart_id       ::= multipart_id DOT ID | ID
import_id          ::= multipart_id DOT STAR | multipart_id
type_id            ::= multipart_id
new_term_id        ::= ID
new_non_term_id    ::= ID
nt_id              ::= ID
symbol_id          ::= ID
label_id           ::= ID

Appendix B. A Very Simple Example Scanner

// Simple Example Scanner Class

import java_cup.runtime.*;

public class scanner {
  /* single lookahead character */
  protected static int next_char;

  /* advance input by one character */
  protected static void advance()  { next_char = System.in.read(); }

  /* initialize the scanner */
  public static void init() { advance(); }

  /* recognize and return the next complete token */
  public static token next_token()
    {
      for (;;)
        switch (next_char)
	  {
	    case '0': case '1': case '2': case '3': case '4': 
	    case '5': case '6': case '7': case '8': case '9': 
	      /* parse a decimal integer */
	      int i_val = 0;
	      do {
	        i_val = i_val * 10 + (next_char - '0');
	        advance();
	      } while (next_char >= '0' && next_char <= '9');
	    return new int_token(sym.NUMBER, i_val);

	    case ';': advance(); return new token(sym.SEMI);
	    case '+': advance(); return new token(sym.PLUS);
	    case '-': advance(); return new token(sym.MINUS);
	    case '*': advance(); return new token(sym.TIMES);
	    case '/': advance(); return new token(sym.DIVIDE);
	    case '%': advance(); return new token(sym.MOD);
	    case '(': advance(); return new token(sym.LPAREN);
	    case ')': advance(); return new token(sym.RPAREN);

	    case -1: return new token(sym.EOF);

	    default: 
	      /* in this simple scanner we just ignore everything else */
	      advance();
	    break;
	  }
    }
};

Java and HotJava are trademarks of Sun Microsystems, Inc., and refer to Sun's Java programming language and HotJava browser technologies. CUP is not sponsored by or affiliated with Sun Microsystems, Inc.

CUP User's Manual

Scott E. Hudson Graphics Visualization and Usability Center Georgia Institute of Technology March 1996 (v0.9e release)

Table of Contents

1. Introduction and Example

4. Customizing the Parser

5. Error Recovery

Scott E. Hudson
Graphics Visualization and Usability Center
Georgia Institute of Technology
March 1996 (v0.9e release)