A Gobstones lexer's main purpose is to help a parser by converting the characters of a given source into a sequence of tokens. It can also be used by other tools that work on sequences of tokens. There is a generic interface, Lexer, providing uniform access, and an implementation of it, BaseLexer.

A BaseLexer works on SourceInputs. It is also parametrized with a WordsDef that defines the precise tokens to be read from the source. Subclasses define particular versions of the language by fixing the language definition; the class StandardLexer, for example, is a subclass using the Standard domain (defaulting to the English locale).
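Since the exact shape of WordsDef is not shown in this section, the following is only a hypothetical sketch of the kind of information such a definition supplies; every field name and value here is an assumption, not the library's actual interface.

```typescript
// Hypothetical sketch of the information a WordsDef could supply.
// Field names and values are illustrative assumptions based on the
// description in this section, not the library's real interface.
interface WordsDefSketch {
    whitespaceChars: string;             // chars forming whitespace tokens
    symbolChars: string;                 // chars forming symbolic identifiers
    stringDelimiter: string;             // delimiter for string literals
    commentSigils: [string, string][];   // open/close sigil pairs for comments
    pragmaSigils: [string, string][];    // open/close sigil pairs for pragmas
}

// A toy definition mirroring the role of the StandardWordsDefES used in
// the examples below (the concrete values are assumptions).
const toyWordsDefES: WordsDefSketch = {
    whitespaceChars: ' \t\n\r',
    symbolChars: '(){}[],;.:=+-*/<>!&|',
    stringDelimiter: '"',
    commentSigils: [['/*', '*/'], ['//', '\n']],
    pragmaSigils: [['/*@', '@*/']]
};
```

A subclass such as StandardLexer would fix one such definition, so that users do not pass it explicitly.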

During processing, tokens are read in sequence from the source documents. After a token is read, the position advances to the start of the next token.

Also during the processing, some side information is kept:

  • warnings produced during the process (errors that do not interrupt the processing),
  • language options found in the source that have to be passed to the parser, and
  • attributes read from the source but not yet assigned to a token (attributes are not assigned to tokens by the Lexer -- they have to be assigned by the Gobstones parser).

The API of a Gobstones lexer is organized around the notion of Token.

A Token is composed of a sequence of characters. Following the definition of the language given by WordsDef, characters are divided into three groups: whitespace chars, symbol chars, and regular chars. The source is usually divided into maximal groups of characters of the same kind (whitespace chars, symbol chars, regular chars, etc.); there are also munchers, groups of characters that may belong to any group, and whose presence is regulated by groups of symbol chars called muncher sigils. The different groups of characters are further classified, thus defining the Tokens.
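The division of the source into maximal groups of same-kind characters can be sketched as follows. This is a toy illustration, not the library's implementation: the character sets are assumptions, and munchers (pragmas, comments, strings) are omitted for brevity.

```typescript
// Toy sketch of splitting a source into maximal runs of same-kind
// characters. Character sets are assumptions; munchers are omitted.
type CharKind = 'whitespace' | 'symbol' | 'regular';

const WHITESPACE = ' \t\n\r';          // assumed whitespace chars
const SYMBOLS = '(){}[],;.:=+-*/<>';   // assumed symbol chars

function kindOf(ch: string): CharKind {
    if (WHITESPACE.includes(ch)) return 'whitespace';
    if (SYMBOLS.includes(ch)) return 'symbol';
    return 'regular';
}

// Split the input into maximal groups of characters of the same kind.
function maximalGroups(src: string): { kind: CharKind; text: string }[] {
    const groups: { kind: CharKind; text: string }[] = [];
    for (const ch of src) {
        const k = kindOf(ch);
        const last = groups[groups.length - 1];
        if (last && last.kind === k) {
            last.text += ch;          // extend the current maximal group
        } else {
            groups.push({ kind: k, text: ch });  // start a new group
        }
    }
    return groups;
}
```

For instance, under these assumed character sets, 'Poner(c)' splits into the four groups 'Poner', '(', 'c', and ')'.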

The parameter WordsDef defines the exact classification of characters into groups, and the concrete groups that form each of the recognizable tokens of the language. Tokens are defined as:

  • Whitespaces: maximal groups of whitespace chars (given by the words definition parameter). They are used to separate other tokens.
  • Pragmas: munchers indicated by pragma sigils (given by the words definition parameter), which regulate their occurrence and structure. They are used to provide language options, attributes, modifiers, etc., which the lexer processes according to the language definition.
  • Comments: munchers indicated by comment sigils (given by the words definition parameter), used for documentation purposes and for pieces of the source that the parser should ignore.
  • Strings: munchers surrounded by a string delimiter char (given by the words definition parameter), which may contain any character (either directly or via an escape mechanism). They are used as text literals in the language.
  • Symbolic identifiers: maximal groups of symbol chars (given by the words definition parameter). They are used by the parser for punctuation purposes, to organize the structure of a program, and also for some operations (such as infix ones).
  • Numbers: maximal groups of regular characters starting with a digit (given by the words definition parameter). They are used by the parser as number literals.
  • Identifiers: maximal groups of regular characters not starting with a digit. They are used by the parser as keywords and as names of entities.
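The last three rules above can be sketched as a classification of a maximal group of regular characters by its first character. The UPPERID and LOWERID tags match the expected outputs shown later in this section; the NUMBER tag and the exact case rule are assumptions.

```typescript
// Sketch of classifying a maximal group of regular characters,
// following the rules above. UPPERID/LOWERID match the expected
// outputs below; NUMBER and the case rule are assumptions.
function classifyRegularGroup(group: string): 'NUMBER' | 'UPPERID' | 'LOWERID' {
    const first = group[0];
    if (first >= '0' && first <= '9') return 'NUMBER';  // starts with a digit
    if (first === first.toUpperCase()) return 'UPPERID';
    return 'LOWERID';
}
```

Under this sketch, 'Poner2' is an UPPERID, 'c' is a LOWERID, and '42' is a NUMBER.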

Tokens other than whitespaces and munchers cannot contain whitespace, and neither can muncher sigils. Some Tokens are used by the parser to construct the program, while others serve other purposes; thus, Tokens are classified into fillers and GTokens, where fillers are those ignored by the parser. NOTE: an EOF token cannot be a filler token. Consequently, the Gobstones lexer has two different operations to read tokens: one reads only GTokens, and the other reads any token (either a GToken or a filler).
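The two reading modes can be illustrated with a toy filter over an already-produced token list. The token shape and the set of filler tags below are simplifying assumptions; in the real API the skipping happens as tokens are read, via the two operations mentioned above.

```typescript
// Toy illustration of the two reading modes: every token vs. only
// GTokens. Token shape and filler tags are simplifying assumptions.
interface Tok { tag: string; }

// Assumed filler tags; EOF is deliberately not among them, since an
// EOF token cannot be a filler.
const FILLERS = new Set(['WHITESPACES', 'COMMENT', 'PRAGMA']);

function isFiller(t: Tok): boolean {
    return FILLERS.has(t.tag);
}

// What a GToken-only reader yields: all tokens with fillers skipped.
function gTokens(all: Tok[]): Tok[] {
    return all.filter(t => !isFiller(t));
}
```

Applied to a list like PRAGMA, WHITESPACES, PROCEDURE, COMMENT, EOF, only PROCEDURE and EOF survive, mirroring the difference between the two example outputs below.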

One example of use is to generate the complete list of GTokens; another is to generate the complete list of all tokens. Assuming the following input string:

  const program =
      '/*@LANGUAGE_DOMAIN@Standard Board@*/\n' +
      '/*@LANGUAGE_LOCALE@es@*/\n' +
      '\n' +
      '/* Pone dos bolitas del color dado */\n' +
      '/*@ATTRIBUTE@atomic@*/\n' +
      'procedure Poner2(c) {\n' +
      '    Poner(c)\n' +
      '    Poner(c)\n' +
      '}\n' +
      '\n' +
      'program {\n' +
      '    Poner2(Rojo)\n' +
      '    Mover(Este)\n' +
      '    Poner2(Verde)\n' +
      '}\n';

the first example would be

  const lexer = new BaseLexer(program, StandardWordsDefES);
  const tokens: string[] = [];
  const attrs: Record<string, string[]>[] = [];
  while (lexer.hasNextGToken()) {
      tokens.push(lexer.nextGToken().toString());
      attrs.push(lexer.getPendingAttributes());
  }
  const languageMods: Record<string, string[]> = lexer.getLanguageMods();

and the second would be

  const lexer = new BaseLexer(program, StandardWordsDefES);
  const allTokens: string[] = [];
  while (lexer.hasNextToken()) {
      allTokens.push(lexer.nextToken().toString());
  }

The first example produces two arrays, tokens and attrs, and a record, languageMods; their values should be:

  tokens = [
      'PROCEDURE@<6,1>',
      "UPPERID@<6,11>('Poner2')",
      'LPAREN@<6,17>',
      "LOWERID@<6,18>('c')",
      'RPAREN@<6,19>',
      'LBRACE@<6,21>',
      "UPPERID@<7,5>('Poner')",
      'LPAREN@<7,10>',
      "LOWERID@<7,11>('c')",
      'RPAREN@<7,12>',
      "UPPERID@<8,5>('Poner')",
      'LPAREN@<8,10>',
      "LOWERID@<8,11>('c')",
      'RPAREN@<8,12>',
      'RBRACE@<9,1>',
      'PROGRAM@<11,1>',
      'LBRACE@<11,9>',
      "UPPERID@<12,5>('Poner2')",
      'LPAREN@<12,11>',
      "UPPERID@<12,12>('Rojo')",
      'RPAREN@<12,16>',
      "UPPERID@<13,5>('Mover')",
      'LPAREN@<13,10>',
      "UPPERID@<13,11>('Este')",
      'RPAREN@<13,15>',
      "UPPERID@<14,5>('Poner2')",
      'LPAREN@<14,11>',
      "UPPERID@<14,12>('Verde')",
      'RPAREN@<14,17>',
      'RBRACE@<15,1>',
      'EOF@<End of document>'
  ];
  attrs = [
      {
          COMMENT: ['/* Pone dos bolitas del color dado */'],
          atomic: []
      },
      {}, ..., {} // 30 times
  ];
  languageMods = {};

The second example produces a single array, allTokens, with value:

  allTokens = [
      "PRAGMA@<1,1>('/*@LANGUAGE_DOMAIN@Standard Board@*/')",
      'WHITESPACES@<1,37>',
      "PRAGMA@<2,1>('/*@LANGUAGE_LOCALE@es@*/')",
      'WHITESPACES@<2,25>',
      "COMMENT@<4,1>('/* Pone dos bolitas del color dado */')",
      'WHITESPACES@<4,38>',
      "PRAGMA@<5,1>('/*@ATTRIBUTE@atomic@*/')",
      'WHITESPACES@<5,23>',
      'PROCEDURE@<6,1>',
      'WHITESPACES@<6,10>',
      "UPPERID@<6,11>('Poner2')",
      'LPAREN@<6,17>',
      "LOWERID@<6,18>('c')",
      'RPAREN@<6,19>',
      'WHITESPACES@<6,20>',
      'LBRACE@<6,21>',
      'WHITESPACES@<6,22>',
      "UPPERID@<7,5>('Poner')",
      'LPAREN@<7,10>',
      "LOWERID@<7,11>('c')",
      'RPAREN@<7,12>',
      'WHITESPACES@<7,13>',
      "UPPERID@<8,5>('Poner')",
      'LPAREN@<8,10>',
      "LOWERID@<8,11>('c')",
      'RPAREN@<8,12>',
      'WHITESPACES@<8,13>',
      'RBRACE@<9,1>',
      'WHITESPACES@<9,2>',
      'PROGRAM@<11,1>',
      'WHITESPACES@<11,8>',
      'LBRACE@<11,9>',
      'WHITESPACES@<11,10>',
      "UPPERID@<12,5>('Poner2')",
      'LPAREN@<12,11>',
      "UPPERID@<12,12>('Rojo')",
      'RPAREN@<12,16>',
      'WHITESPACES@<12,17>',
      "UPPERID@<13,5>('Mover')",
      'LPAREN@<13,10>',
      "UPPERID@<13,11>('Este')",
      'RPAREN@<13,15>',
      'WHITESPACES@<13,16>',
      "UPPERID@<14,5>('Poner2')",
      'LPAREN@<14,11>',
      "UPPERID@<14,12>('Verde')",
      'RPAREN@<14,17>',
      'WHITESPACES@<14,18>',
      'RBRACE@<15,1>',
      'WHITESPACES@<15,2>',
      'EOF@<End of document>'
  ];

Author

Pablo E. --Fidel-- Martínez López fidel.ml@gmail.com
