Chapter 2. Lexical Structure

Table of Contents

2.1. The Standard Character Set
2.2. The Maximal Match Rule
2.3. Notational Conventions
2.4. Line Terminators
2.5. Comments
2.6. Whitespace
2.7. Separators
2.8. Identifiers
2.9. Reserved Words
2.10. Operator Symbols
2.11. Literals
2.11.1. Integer Literals
2.11.2. Floating-point Literals
2.11.3. String Literals
2.11.4. Character Literals

2.1. The Standard Character Set

Comma programs are written using a subset of the 8-bit character set ISO 8859-1 (Latin-1). This subset is called the standard character set. A character that occurs in the source code but is not present in the standard character set triggers a compile-time error.

The following table associates with each character a name, the hexadecimal value of its encoding, and a brief description. A character's name is the standard glyph used to present that character if it is a graphic character. For characters with no distinct (or visible) graphic representation, a symbolic name is provided.

Table 2.1. Standard character set

Char  Hex  Description                            Char  Hex  Description
a     61   small a                                A     41   capital A
b     62   small b                                B     42   capital B
c     63   small c                                C     43   capital C
d     64   small d                                D     44   capital D
e     65   small e                                E     45   capital E
f     66   small f                                F     46   capital F
g     67   small g                                G     47   capital G
h     68   small h                                H     48   capital H
i     69   small i                                I     49   capital I
j     6A   small j                                J     4A   capital J
k     6B   small k                                K     4B   capital K
l     6C   small l                                L     4C   capital L
m     6D   small m                                M     4D   capital M
n     6E   small n                                N     4E   capital N
o     6F   small o                                O     4F   capital O
p     70   small p                                P     50   capital P
q     71   small q                                Q     51   capital Q
r     72   small r                                R     52   capital R
s     73   small s                                S     53   capital S
t     74   small t                                T     54   capital T
u     75   small u                                U     55   capital U
v     76   small v                                V     56   capital V
w     77   small w                                W     57   capital W
x     78   small x                                X     58   capital X
y     79   small y                                Y     59   capital Y
z     7A   small z                                Z     5A   capital Z
1     31   digit 1                                6     36   digit 6
2     32   digit 2                                7     37   digit 7
3     33   digit 3                                8     38   digit 8
4     34   digit 4                                9     39   digit 9
5     35   digit 5                                0     30   digit 0
!     21   exclamation mark                       $     24   dollar sign
"     22   quotation mark, or double quote        '     27   apostrophe, or single quote
(     28   left parenthesis, or open parenthesis  )     29   right parenthesis, or close parenthesis
,     2C   comma                                  _     5F   low line, or underscore
-     2D   hyphen, or minus                       .     2E   full stop, period, or dot
/     2F   solidus, or slash                      :     3A   colon
;     3B   semicolon                              ?     3F   question mark
+     2B   plus                                   <     3C   less-than
=     3D   equals                                 >     3E   greater-than
#     23   number sign, or sharp                  %     25   percent
&     26   ampersand                              *     2A   asterisk, or star
@     40   commercial at, or at-sign              [     5B   left bracket
\     5C   reverse solidus, or backslash          ]     5D   right bracket
{     7B   left curly bracket, or left brace      |     7C   vertical bar
}     7D   right curly bracket, or right brace    `     60   grave accent, or backquote
^     5E   circumflex accent                      ~     7E   tilde
HT    09   horizontal tab                         VT    0B   vertical tab
CR    0D   carriage return                        LF    0A   line feed
SP    20   space                                  FF    0C   form feed
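
Taken together, the entries of Table 2.1 appear to cover the printable characters 0x20 through 0x7E plus the five control characters HT, LF, VT, FF, and CR. The following Python sketch shows how a tool might locate the first byte that falls outside that set. The names STANDARD_CHARS and first_invalid_char are hypothetical and are not part of any Comma implementation.

```python
# Byte values in the standard character set, as read from Table 2.1:
# printable 0x20..0x7E plus HT, LF, VT, FF, and CR.
STANDARD_CHARS = frozenset(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0B, 0x0C, 0x0D}


def first_invalid_char(source: bytes):
    """Return (offset, byte) for the first byte outside the set, or None."""
    for offset, byte in enumerate(source):
        if byte not in STANDARD_CHARS:
            return offset, byte
    return None
```

A compiler front end would presumably turn a non-None result into the compile-time error described above.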

Note

It is possible that this specification will evolve to include program source written using the UTF-8 encoding, defined by the Unicode character standard. All reserved words will be specified using the current, compatible, standard character set.

2.2. The Maximal Match Rule

Lexical analysis proceeds by obeying the maximal match rule: When a character sequence can be transformed into two or more lexemes, the lexeme with the longest character representation is selected. Thus, although domain is a reserved word, domains is not.
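
As an illustration of the rule, the following Python sketch picks the longest lexeme available at a given position. It is a minimal sketch only: the RESERVED and OPERATORS collections are small illustrative subsets, and the function name next_lexeme is hypothetical.

```python
import re

# Illustrative subsets only; see the reserved word and operator tables.
RESERVED = {"domain", "end", "is"}
OPERATORS = ["*", "**", "/", "/=", "<", "<=", "="]
IDENT = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def next_lexeme(text, pos=0):
    """Return (kind, lexeme) for the longest lexeme starting at pos."""
    match = IDENT.match(text, pos)
    if match:
        # The greedy match consumes the whole word before it is classified,
        # so "domains" is never split into "domain" followed by "s".
        word = match.group()
        kind = "reserved" if word in RESERVED else "identifier"
        return kind, word
    candidates = [op for op in OPERATORS if text.startswith(op, pos)]
    if candidates:
        # Among all operators that fit here, keep the longest one.
        return "operator", max(candidates, key=len)
    return None
```

With this sketch, next_lexeme("domains") yields an identifier and next_lexeme("**2") yields the operator ** rather than two * operators.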

2.3. Notational Conventions

The grammar rules in this chapter make use of the following constructs:

[pattern]

pattern may occur optionally.

{pattern}

pattern may occur zero or more times.

pattern1 | pattern2

Choice of either pattern1 or pattern2.

2.4. Line Terminators

Input programs are scanned and divided into lines. Error messages reported by the compiler and associated tools will make use of the line number to produce useful diagnostic messages. Line terminators also indicate the end of a comment.

Line Terminators
[1] Line_Terminator ::= Linefeed | Carriage-Return | Carriage-Return Linefeed
[2] Input_Character ::= All characters in the standard character set.
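
The following Python sketch shows one way a tool could map a character offset to a line number under production [1]. It assumes that a carriage return immediately followed by a line feed counts as a single terminator, which is what the maximal match rule suggests. The names line_starts and line_number are hypothetical. Python's str.splitlines is deliberately avoided because it also breaks lines at vertical tab and form feed, which are not line terminators here.

```python
import bisect
import re

# Carriage-Return Linefeed is listed first so the pair is taken as one unit.
LINE_TERMINATOR = re.compile(r"\r\n|\r|\n")


def line_starts(source):
    """Return the offsets at which each line of source begins."""
    starts = [0]
    for match in LINE_TERMINATOR.finditer(source):
        starts.append(match.end())
    return starts


def line_number(source, offset):
    """Return the 1-based number of the line containing offset."""
    return bisect.bisect_right(line_starts(source), offset)
```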

2.5. Comments

Comments begin with the two characters --, with no whitespace between them, and continue to the end of the line. The characters -- do not begin a comment when they appear within a character or string literal.

Comments
[3] Comment ::= -- { Input_Character } Line_Terminator

2.6. Whitespace

Whitespace consists of the space, horizontal tab, and form feed characters, as well as line terminators and comments. Whitespace is a proper delimiter for lexemes.

Whitespace
[4] Whitespace ::= White_Char | Line_Terminator | Comment
[5] White_Char ::= Space | Tab | Formfeed
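
A scanner typically skips any run of whitespace between lexemes. The Python sketch below follows productions [3] to [5], with two assumptions that the grammar does not state: a comment that reaches the end of input without a line terminator is accepted, and the function is only called between lexemes, never inside a string or character literal. The name skip_whitespace is hypothetical.

```python
import re

# White_Char | Line_Terminator | Comment, repeated.
# The comment body stops before the line terminator, which is then
# consumed by the next iteration of the repetition.
WHITESPACE = re.compile(r"(?:[ \t\f]|\r\n|\r|\n|--[^\r\n]*)+")


def skip_whitespace(source, pos=0):
    """Return the offset of the first non-whitespace character at or after pos."""
    match = WHITESPACE.match(source, pos)
    return match.end() if match else pos
```

Note that a single hyphen is not consumed, so the minus operator in an expression such as a - b is left for the scanner to handle.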

2.7. Separators

The following characters are the separators (also known as punctuators or delimiters).

( ) : ; , .

2.8. Identifiers

An identifier is a sequence of characters. The initial character of the sequence must be an alphabetic character. Each remaining character can be a lowercase or uppercase alphabetic character, a numeric digit, or the character '_'. An identifier may not contain two consecutive underscore characters.

Two identifiers are considered the same if their respective character sequences are identical. Thus, identifiers are case sensitive.

Example 2.1. Some valid Comma identifiers:

x  integer  INTEGER  Max_Index  x1234  x_4

Identifiers
[6] Identifier ::= Alpha { Identifier_Char }    [ Identifiers and Reserved Words ]
[7] Identifier_Char ::= Alpha | Digit | _
[8] Alpha ::= a | b | ... | z | A | B | ... | Z
[9] Digit ::= 0 | 1 | ... | 9

Identifiers and Reserved Words

An identifier shall not be a reserved word.

2.9. Reserved Words

The following character sequences are reserved words and may not be used as identifiers:

Table 2.2. Reserved Words

abstract   add        and        array      carrier    begin
declare    domain     else       elsif      end        for
function   generic    if         import     in         inj
is         loop       mod        of         out        others
pragma     prj        procedure  range      rem        return
reverse    signature  subtype    then       type       while
with
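
The identifier rules of Section 2.8 and the reserved word list above can be combined into a single check. The Python sketch below is one possible reading: the word list is transcribed from Table 2.2, reserved words are assumed to be compared case-sensitively (the text only states this for identifiers), and the name is_identifier is hypothetical.

```python
import re

# Transcribed from Table 2.2.
RESERVED_WORDS = frozenset("""
    abstract add and array carrier begin
    declare domain else elsif end for
    function generic if import in inj
    is loop mod of out others
    pragma prj procedure range rem return
    reverse signature subtype then type while
    with
""".split())

# Productions [6] to [9]: a letter followed by letters, digits, or underscores.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_identifier(text):
    """Check the shape, the double-underscore rule, and the reserved words."""
    return (
        IDENTIFIER.fullmatch(text) is not None
        and "__" not in text
        and text not in RESERVED_WORDS
    )
```

The double-underscore test is separate because productions [6] and [7] alone would accept a name such as a__b, which the prose of Section 2.8 forbids.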

2.10. Operator Symbols

The following tokens are the operators. These symbols have special productions in the grammar of the language and can be used as the defining identifier of a function declaration.

Table 2.3. Operators

=     /=    <     >     <=    >=
+     -     *     &     /     **
mod   rem

2.11. Literals

Literals are primitive values in Comma programs which have a direct representation in source code. There are literals for integer, floating-point, string, and character values.

2.11.1. Integer Literals

An integer literal may be expressed in decimal (base 10), hexadecimal (base 16), octal (base 8), or binary (base 2).

The underscore character '_' can appear within an integer literal after the first digit. Underscores are ignored and serve only to improve readability.

Integer Literals
[10] Integer ::= Decimal_Literal | Hexadecimal_Literal | Octal_Literal | Binary_Literal
[11] Decimal_Literal ::= Digit { Digit | Uscore }
[12] Hexadecimal_Literal ::= 0X Hexadecimal_Digit { Hexadecimal_Digit | Uscore }
[13] Octal_Literal ::= 0O Octal_Digit { Octal_Digit | Uscore }
[14] Binary_Literal ::= 0B Binary_Digit { Binary_Digit | Uscore }
[15] Hexadecimal_Digit ::= Digit | a | b | ... | f | A | B | ... | F
[16] Octal_Digit ::= 0 | 1 | ... | 7
[17] Binary_Digit ::= 0 | 1
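
The Python sketch below converts an integer literal to its value by following productions [10] to [17]. It assumes the base prefixes are written in uppercase exactly as they appear in the productions (0X, 0O, 0B); whether lowercase prefixes are also accepted is not stated here. The name integer_value is hypothetical.

```python
import re

# One named group per alternative of production [10].
INTEGER = re.compile(
    r"0X(?P<hex>[0-9A-Fa-f][0-9A-Fa-f_]*)"
    r"|0O(?P<oct>[0-7][0-7_]*)"
    r"|0B(?P<bin>[01][01_]*)"
    r"|(?P<dec>[0-9][0-9_]*)"
)
BASES = {"hex": 16, "oct": 8, "bin": 2, "dec": 10}


def integer_value(literal):
    """Return the value of an integer literal, ignoring underscores."""
    match = INTEGER.fullmatch(literal)
    if match is None:
        raise ValueError(f"not an integer literal: {literal!r}")
    kind = match.lastgroup  # name of the alternative that matched
    digits = match.group(kind).replace("_", "")
    return int(digits, BASES[kind])
```

For example, integer_value("1_000_000") and integer_value("0XF_4240") both describe the value one million.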

2.11.2. Floating-point Literals

A floating-point literal can consist of an integer part, a decimal point, a fractional part, and an exponent. The decimal point is represented by the ASCII character '.'. The exponent is introduced by either of the characters 'e' or 'E', followed by an optional '+' or '-' sign, followed by one or more digits. In order to avoid ambiguity with decimal integer literals, a floating-point literal must contain either a decimal point, an exponent, or a float type suffix.

Floating-point Literals
[18] Float_Literal ::= { Digit } . [ { Digit } ] [ Exponent_Part ] |
                       . { Digit } [ Exponent_Part ] |
                       { Digit } [ Exponent_Part ] |
                       { Digit } Exponent_Part
[19] Exponent_Part ::= E [ + | - ] Digit { Digit } | e [ + | - ] Digit { Digit }

2.11.3. String Literals

A string literal is a character sequence delimited by the ASCII " (double quote, code 0x22) character.

String Literals
[20] String_Literal ::= " { String_Character } "
[21] String_Character ::= Input_Character    /* Except ". */

2.11.4. Character Literals

A character literal is a single input character delimited by single quotes (ASCII code 0x27).

Character Literals
[22] Character_Literal ::= ' Input_Character '

Example 2.2. Examples of character literals:

'A'  '_'  '"'  '''
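
The string and character literal productions define no escape sequences, so a scanner can recognise both forms with simple patterns. The Python sketch below reads productions [20] to [22] literally: a string runs to the next double quote, and a character literal is exactly one character between two single quotes, which is why ''' denotes the apostrophe itself. The name scan_quoted is hypothetical, and the sketch assumes the input has already been checked against the standard character set.

```python
import re

# Production [20]: any characters except the double quote, between quotes.
STRING_LITERAL = re.compile(r'"[^"]*"')
# Production [22]: exactly one character between single quotes.
CHARACTER_LITERAL = re.compile(r"'.'", re.DOTALL)


def scan_quoted(source, pos=0):
    """Return (kind, lexeme) for a literal starting at pos, or None."""
    match = STRING_LITERAL.match(source, pos)
    if match:
        return "string", match.group()
    match = CHARACTER_LITERAL.match(source, pos)
    if match:
        return "character", match.group()
    return None
```

Because the string pattern consumes everything up to the closing quote, the characters -- inside a string literal are never seen as the start of a comment.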