The main purpose of a Gobstones lexer is to help a parser by converting the
characters of a given source into a sequence of tokens. It can also be used by other tools
that work on a sequence of tokens.
There is a generic interface, Lexer, that provides uniform access, and an implementation
of it, BaseLexer.
A BaseLexer works on
SourceInputs.
It is also parametrized with a WordsDef that defines the precise tokens to
be read from the source.
Subclasses define particular versions of the language by fixing the language definition.
The class StandardLexer, for example, is a subclass using the Standard Domain
(defaulting to the English locale).
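As a hedged sketch (the declarations and constructor signature below are assumptions for illustration, not the library's confirmed API), fixing the language definition in a subclass might look like this:

```typescript
// Assumed declarations, present here only so the sketch type-checks;
// the real library provides its own.
declare class SourceInput {}
interface WordsDef {}
declare class BaseLexer {
    constructor(input: SourceInput, words: WordsDef);
}
declare const standardEnglishWordsDef: WordsDef; // hypothetical name

// A subclass fixes the WordsDef, so users only supply the source input.
class MyStandardLexer extends BaseLexer {
    constructor(input: SourceInput) {
        super(input, standardEnglishWordsDef);
    }
}
```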
During processing, tokens are read in sequence from the source documents.
After reading a token, the position is advanced to the start of the next token.
Some side information is also kept during processing:
warnings produced along the way (errors that do not interrupt the processing),
language options found in the source that have to be passed to the parser, and
attributes that were read in the source but not yet assigned to a token (attributes
are not assigned to tokens by the Lexer; they have to be
assigned by the Gobstones parser).
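For concreteness, this surface could be pictured along the following lines; the member names are illustrative assumptions, not the library's actual declarations:

```typescript
// Illustrative placeholder types; the library defines its own.
type Token = unknown;
type Warning = unknown;
type Attribute = unknown;

// A sketch of the surface described above (all names are assumptions).
interface LexerSketch {
    nextToken(): Token;                  // read a token, advancing the position
    warnings: Warning[];                 // errors that do not interrupt processing
    languageMods: Record<string, unknown>; // language options to pass to the parser
    pendingAttributes: Attribute[];      // read but not yet assigned to a token
}
```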
A Token is composed of a sequence of characters. Following the definition of the language
given by WordsDef, characters are divided into three groups: whitespace chars,
symbol chars (also called punctuation chars), and regular chars.
The source is usually divided into maximal groups of characters of the same kind
(whitespace chars, symbol chars, or regular chars);
there are also munchers, groups of characters that may belong to any of the groups, and whose
presence is regulated by symbol-char groups called muncher sigils.
Different groups of characters are further classified, thus defining the Tokens.
The parameter WordsDef defines the exact classification of characters into groups, and
the concrete groups that form each of the recognizable tokens in the language.
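The grouping into maximal runs of same-kind characters can be sketched as follows; the predicate names on the words definition are assumptions for illustration:

```typescript
// Assumed shape of a words definition; the real WordsDef may differ.
interface WordsDefSketch {
    isWhitespaceChar(c: string): boolean;
    isSymbolChar(c: string): boolean;
    // Regular chars are those that are neither whitespace nor symbol chars.
}

type CharKind = "whitespace" | "symbol" | "regular";

function kindOf(def: WordsDefSketch, c: string): CharKind {
    if (def.isWhitespaceChar(c)) return "whitespace";
    if (def.isSymbolChar(c)) return "symbol";
    return "regular";
}

// Split the source into maximal groups of characters of the same kind.
// (Munchers, which may cut across kinds, are not handled in this sketch.)
function maximalRuns(def: WordsDefSketch, source: string): string[] {
    const runs: string[] = [];
    let i = 0;
    while (i < source.length) {
        const kind = kindOf(def, source[i]);
        let j = i + 1;
        while (j < source.length && kindOf(def, source[j]) === kind) j++;
        runs.push(source.slice(i, j));
        i = j;
    }
    return runs;
}
```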
Tokens are defined as follows (a summarizing sketch appears after the list):
Whitespaces:
maximal groups of whitespace chars (given by the words definition parameter).
They are used to separate other tokens.
Pragmas:
munchers indicated by pragma sigils (given by the words definition
parameter), which regulate their occurrence and structure.
They are used to provide language options, attributes, modifiers, etc., that the
lexer processes according to the language definition.
Comments:
munchers indicated by comment sigils (given by the words definition
parameter). They are used for documentation purposes and for pieces of the source
that should be ignored by the parser.
Strings:
munchers surrounded by a string delimiter char (given by the words definition
parameter), which may contain any character (either directly or via an escape
mechanism). They are used as text literals in the language.
Symbolic identifiers:
maximal groups of punctuation chars (given by the words definition parameter).
They are used by the parser for punctuation purposes, to organize the
structure of a program, and also for some operations (like infix ones).
Numbers:
maximal groups of digits starting with a digit (given by the words definition
parameter). They are used by the parser as number literals.
Identifiers:
maximal groups of regular characters not starting with a digit.
They are used by the parser as keywords and names of entities.
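The classification above could be summarized as a union of token kinds; this is a sketch for orientation, not the library's actual type:

```typescript
// Illustrative only; the real Token types may be structured differently.
type TokenKind =
    | "whitespace"         // maximal group of whitespace chars (a filler)
    | "pragma"             // muncher indicated by pragma sigils
    | "comment"            // muncher indicated by comment sigils (a filler)
    | "string"             // muncher surrounded by string delimiter chars
    | "symbolicIdentifier" // maximal group of punctuation chars
    | "number"             // maximal group of digits
    | "identifier"         // maximal group of regular chars, not digit-initial
    | "eof";               // end of input
```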
Tokens other than whitespaces and munchers cannot contain whitespace, and muncher sigils cannot
contain whitespace either. Some Tokens are used by the parser to construct the program, while
others serve other purposes; thus, Tokens are classified as fillers and GTokens:
fillers are ignored by the parser.
NOTE: an EOF token cannot be a filler token.
Thus, the Gobstones lexer has two different operations to read tokens: one
that reads only GTokens, and another that reads any token (either a GToken or
a filler).
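The GToken-reading operation can be understood in terms of the any-token one; in this sketch the names nextToken and isFiller are assumptions:

```typescript
// Hedged sketch: reading the next GToken by skipping fillers.
interface AnyTokenLexer {
    nextToken(): { isFiller: boolean }; // assumed shape, for illustration
}

function nextGToken(lexer: AnyTokenLexer): { isFiller: boolean } {
    let tok = lexer.nextToken();
    while (tok.isFiller) {
        tok = lexer.nextToken(); // skip fillers (e.g. whitespaces, comments)
    }
    return tok; // terminates because an EOF token is never a filler
}
```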
One example of use is to generate the complete list of GTokens; another is to
generate the complete list of all tokens. Assuming some input string, both uses
can be sketched as follows.
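In the sketch below, the input string is illustrative, and the declared members (hasNext, nextGToken, nextToken, pendingAttributes) are assumptions rather than the library's confirmed API; languageMods is the record named in the results described after the code:

```typescript
// Assumed API surface, declared here only so the sketch type-checks;
// the real library provides its own declarations.
type Token = unknown;
declare class SourceInput { constructor(source: string); }
declare class StandardLexer {
    constructor(input: SourceInput);
    hasNext(): boolean;
    nextToken(): Token;     // reads any token, fillers included
    nextGToken(): Token;    // reads only GTokens, skipping fillers
    pendingAttributes: unknown[];
    languageMods: Record<string, unknown>;
}

const source = "program { Poner(Rojo) }"; // illustrative input string

// First example: the complete list of GTokens, plus the side information.
const lexer = new StandardLexer(new SourceInput(source));
const tokens: Token[] = [];
while (lexer.hasNext()) {
    tokens.push(lexer.nextGToken());
}
const attrs = lexer.pendingAttributes; // attributes not yet assigned to a token
const languageMods = lexer.languageMods; // options to hand over to the parser

// Second example: the complete list of all tokens, fillers included.
const lexer2 = new StandardLexer(new SourceInput(source));
const allTokens: Token[] = [];
while (lexer2.hasNext()) {
    allTokens.push(lexer2.nextToken());
}
```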
The result of the first example is two arrays, tokens and attrs, and a record,
languageMods; the result of the second is a single array, allTokens, containing
every token read, fillers included.

Author
Pablo E. --Fidel-- Martínez López fidel.ml@gmail.com