Chapter 2. Lexical Structure

Table of Contents

2.1. The Standard Character Set

2.2. The Maximal Match Rule

2.3. Notational Conventions

2.4. Line Terminators

2.5. Comments

2.6. Whitespace

2.7. Separators

2.8. Identifiers

2.9. Reserved Words

2.10. Operator Symbols

2.11. Literals

2.11.1. Integer Literals
2.11.2. Floating-point Literals
2.11.3. String Literals
2.11.4. Character Literals

2.1. The Standard Character Set

Comma programs are written using a subset of the 8-bit character set ISO 8859-1 (Latin-1). This subset is called the standard character set. A character that occurs in the source code but is not present in the standard character set triggers a compile-time error.

The following table associates with each character a name, the hexadecimal value of its encoding, and a brief description. If the character is a graphic character, its name is the standard glyph used to present it. For characters with no distinct (or visible) graphic representation, a symbolic name is provided.

Table 2.1. Standard character set

Char	Hex	Description	Char	Hex	Description
a	61	small a	A	41	capital A
b	62	small b	B	42	capital B
c	63	small c	C	43	capital C
d	64	small d	D	44	capital D
e	65	small e	E	45	capital E
f	66	small f	F	46	capital F
g	67	small g	G	47	capital G
h	68	small h	H	48	capital H
i	69	small i	I	49	capital I
j	6A	small j	J	4A	capital J
k	6B	small k	K	4B	capital K
l	6C	small l	L	4C	capital L
m	6D	small m	M	4D	capital M
n	6E	small n	N	4E	capital N
o	6F	small o	O	4F	capital O
p	70	small p	P	50	capital P
q	71	small q	Q	51	capital Q
r	72	small r	R	52	capital R
s	73	small s	S	53	capital S
t	74	small t	T	54	capital T
u	75	small u	U	55	capital U
v	76	small v	V	56	capital V
w	77	small w	W	57	capital W
x	78	small x	X	58	capital X
y	79	small y	Y	59	capital Y
z	7A	small z	Z	5A	capital Z
1	31	digit 1	6	36	digit 6
2	32	digit 2	7	37	digit 7
3	33	digit 3	8	38	digit 8
4	34	digit 4	9	39	digit 9
5	35	digit 5	0	30	digit 0
!	21	exclamation mark	$	24	dollar sign
"	22	quotation mark, or double quote	'	27	apostrophe, or single quote
(	28	left parenthesis, or open parenthesis	)	29	right parenthesis, or close parenthesis
,	2C	comma	_	5F	low line, or underscore
-	2D	hyphen, or minus	.	2E	full stop, period, or dot
/	2F	solidus, or slash	:	3A	colon
;	3B	semicolon	?	3F	question mark
+	2B	plus	<	3C	less-than
=	3D	equals	>	3E	greater-than
#	23	number sign, or sharp	%	25	percent
&	26	ampersand	*	2A	asterisk, or star
@	40	commercial at, or at-sign	[	5B	left bracket
\	5C	reverse solidus, or backslash	]	5D	right bracket
{	7B	left curly bracket, or left brace	|	7C	vertical bar
}	7D	right curly bracket, or right brace	`	60	grave accent, or backquote
^	5E	circumflex accent	~	7E	tilde
HT	09	horizontal tab	VT	0B	vertical tab
CR	0D	carriage return	LF	0A	line feed
SP	20	space	FF	0C	form feed
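
As an illustration only, the check described above can be sketched in Python. The table covers every code point from 0x20 through 0x7E plus the five control characters 0x09 through 0x0D, so the set can be written as two ranges. The names STANDARD_CHARS and first_invalid_char are hypothetical and are not part of any Comma tool.

```python
# Code points of the standard character set, read off Table 2.1:
# HT, LF, VT, FF, CR (0x09-0x0D) plus SP through tilde (0x20-0x7E).
STANDARD_CHARS = frozenset(range(0x09, 0x0E)) | frozenset(range(0x20, 0x7F))


def first_invalid_char(source: bytes):
    """Return (offset, byte) for the first byte outside the set, or None."""
    for offset, byte in enumerate(source):
        if byte not in STANDARD_CHARS:
            return offset, byte
    return None
```

A compiler would presumably report the offending offset as a compile-time error; here the caller decides what to do with the result.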

Note

It is possible that this specification will evolve to include program source written using the UTF-8 encoding, defined by the Unicode character standard. All reserved words will be specified using the current, compatible, standard character set.

2.2. The Maximal Match Rule

Lexical analysis proceeds by obeying the “maximal match” rule: When a character sequence can be transformed into two or more lexemes, the lexeme with the longest character representation is selected. Thus, although domain is a reserved word, domains is not.
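
The rule can be sketched in Python as follows. This is a minimal illustration, not the Comma lexer: the lexeme classes listed in LEXEME_PATTERNS are a hypothetical and deliberately tiny selection, and the tie-breaking choice (earlier class wins) is an assumption made so that the exact text domain is still classified as reserved.

```python
import re

# Hypothetical, deliberately tiny set of lexeme classes.
LEXEME_PATTERNS = [
    ("reserved", re.compile(r"domain")),
    ("identifier", re.compile(r"[A-Za-z][A-Za-z0-9_]*")),
    ("less", re.compile(r"<")),
    ("less_equal", re.compile(r"<=")),
]


def longest_match(text: str, pos: int = 0):
    """Return (kind, lexeme) for the longest lexeme starting at pos, or None."""
    best = None
    for kind, pattern in LEXEME_PATTERNS:
        m = pattern.match(text, pos)
        # Strict '>' keeps the earlier-listed class on a tie, so the exact
        # text "domain" stays reserved while "domains" becomes an identifier.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (kind, m.group())
    return best
```

With these patterns, longest_match("domains") yields an identifier of seven characters, and longest_match("<= 1") yields the two-character operator rather than the one-character one.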

2.3. Notational Conventions

The grammar rules in this chapter use the following constructs:

[pattern]: pattern may occur optionally.
{pattern}: pattern may occur zero or more times.
pattern1 | pattern2: Choice of either pattern1 or pattern2.

2.4. Line Terminators

Input programs are scanned and divided into lines. The compiler and associated tools use line numbers when reporting diagnostic messages. A line terminator also marks the end of a comment.

Line Terminators

[1]	Line_Terminator	::=	Linefeed | Carriage-Return | Carriage-Return Linefeed
[2]	Input_Character	::=	All characters in the standard character set.
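
A minimal Python sketch of production [1] follows. It assumes that a Carriage-Return immediately followed by a Linefeed counts as a single terminator (in keeping with the maximal match rule); the helper names split_lines and line_number are hypothetical.

```python
import re

# Carriage-Return Linefeed is listed first so that the pair counts as one
# terminator rather than two.
LINE_TERMINATOR = re.compile(r"\r\n|\r|\n")


def split_lines(source: str):
    """Return the lines of source with their terminators removed."""
    return LINE_TERMINATOR.split(source)


def line_number(source: str, offset: int) -> int:
    """Return the 1-based number of the line containing the given offset."""
    return 1 + len(LINE_TERMINATOR.findall(source, 0, offset))
```

A diagnostic tool could use line_number to translate a character offset into the line number shown to the user.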

2.5. Comments

A comment begins with the two adjacent characters -- (no whitespace between them) and continues to the end of the line. The sequence -- inside a character or string literal does not begin a comment.

Comments

[3]	Comment	::=	-- { Input_Character } Line_Terminator

2.6. Whitespace

Whitespace consists of the space, horizontal tab, and form feed characters, as well as line terminators and comments. Whitespace is a proper delimiter for lexemes.

Whitespace

[4]	Whitespace	::=	White_Char | Line_Terminator | Comment
[5]	White_Char	::=	Space | Tab | Formfeed
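
The following Python sketch shows one way a lexer might skip whitespace between lexemes. It is meant to be called only at a lexeme boundary, so it never looks inside a string or character literal. The function name skip_whitespace is hypothetical, and accepting a comment that reaches the end of input without a line terminator is an assumption; production [3] as written requires the terminator.

```python
def skip_whitespace(source: str, pos: int) -> int:
    """Return the index of the first non-whitespace character at or after pos."""
    n = len(source)
    while pos < n:
        if source[pos] in " \t\f\r\n":
            # White_Char or (part of) a Line_Terminator.
            pos += 1
        elif source.startswith("--", pos):
            # Comment: advance to the line terminator, which the next
            # iteration of the outer loop consumes as whitespace.
            while pos < n and source[pos] not in "\r\n":
                pos += 1
        else:
            break
    return pos
```

Note that a single hyphen is not whitespace: in "a - b" the scan stops at the minus operator.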

2.7. Separators

The following characters are the separators (also known as punctuators or delimiters).

(

)

;

2.8. Identifiers

An identifier is a sequence of characters. The initial character must be an alphabetic character. Each remaining character may be a lowercase or uppercase alphabetic character, a numeric digit, or the underscore character '_'. An identifier may not contain two consecutive underscore characters.

Two identifiers are considered the same if their respective character sequences are identical. Thus, identifiers are case sensitive.

Example 2.1. Some valid Comma identifiers:

x integer INTEGER Max_Index x1234 x_4

Identifiers

[6]	Identifier	::=	Alpha { Identifier_Char }	[ Identifiers and Reserved Words ]
[7]	Identifier_Char	::=	Alpha | Digit | _
[8]	Alpha	::=	a | b | ... | z | A | B | ... | Z
[9]	Digit	::=	0 | 1 | ... | 9
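
As a sketch, productions [6] through [9] translate directly into a regular expression. The prose rule against two consecutive underscores is not expressed by the grammar, so the Python below checks it separately. The name is_identifier_shape is hypothetical, and the reserved-word constraint is not checked here.

```python
import re

# Shape from productions [6]-[9]: a letter, then letters, digits, or underscores.
IDENTIFIER_SHAPE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_identifier_shape(text: str) -> bool:
    """True if text has the lexical shape of an identifier."""
    # The grammar alone would accept "x__y"; the prose rule forbids it.
    return IDENTIFIER_SHAPE.fullmatch(text) is not None and "__" not in text
```

Because identifiers are case sensitive, comparing two identifiers is plain string equality: "integer" and "INTEGER" are different identifiers.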

Identifiers and Reserved Words

An identifier shall not be a reserved word.

2.9. Reserved Words

The following character sequences are reserved words and may not be used as identifiers:

Table 2.2. Reserved Words

abstract	add	and	array	begin	carrier
declare	domain	else	elsif	end	for
function	generic	if	import	in	inj
is	loop	mod	of	others	out
pragma	prj	procedure	range	rem	return
reverse	signature	subtype	then	type	while
with
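
A lexer typically scans a word with the identifier shape first and then looks it up in the reserved word table. The Python sketch below illustrates that lookup; the names RESERVED_WORDS and classify_word are hypothetical, and it assumes that only the exact lowercase spellings in Table 2.2 are reserved.

```python
# The 37 reserved words of Table 2.2.
RESERVED_WORDS = frozenset("""
    abstract add and array begin carrier declare domain else elsif end for
    function generic if import in inj is loop mod of others out
    pragma prj procedure range rem return reverse signature subtype then
    type while with
""".split())


def classify_word(word: str) -> str:
    """Classify an already-scanned word as 'reserved' or 'identifier'."""
    return "reserved" if word in RESERVED_WORDS else "identifier"
```

Combined with the maximal match rule, this gives the expected behaviour: domain is reserved, while domains is an ordinary identifier.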

2.10. Operator Symbols

The following tokens are the operators. These symbols have special productions in the grammar of the language and can be used as the defining identifier of a function declaration.

Table 2.3. Operators

=	/=	<	>	<=	>=
+	-	*	&	/	**
mod	rem

2.11. Literals

Literals are primitive values in Comma programs that have a direct representation in source code. There are literals for integer, floating-point, string, and character values.

2.11.1. Integer Literals

An integer literal may be expressed in decimal (base 10), hexadecimal (base 16), octal (base 8), or binary (base 2).

For readability, the underscore character '_' may appear within an integer literal. Underscores are ignored when determining the value of the literal.

Integer Literals

[10]	Integer	::=	Decimal_Literal | Hexadecimal_Literal | Octal_Literal | Binary_Literal
[11]	Decimal_Literal	::=	Digit { Digit | Uscore }
[12]	Hexadecimal_Literal	::=	0X Hexadecimal_Digit { Hexadecimal_Digit | Uscore }
[13]	Octal_Literal	::=	0O Octal_Digit { Octal_Digit | Uscore }
[14]	Binary_Literal	::=	0B Binary_Digit { Binary_Digit | Uscore }
[15]	Hexadecimal_Digit	::=	Digit | a | b | ... | f | A | B | ... | F
[16]	Octal_Digit	::=	0 | 1 | ... | 7
[17]	Binary_Digit	::=	0 | 1
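
The following Python sketch computes the value of an integer literal according to productions [10] through [17]. It is illustrative only. It assumes that Uscore denotes the underscore character, and it accepts the prefixes only as spelled in the grammar (0X, 0O, 0B); whether lowercase prefixes are also permitted is not stated here. The names PREFIXES and integer_value are hypothetical.

```python
# Prefix -> (base, permitted digits), exactly as spelled in [12]-[14].
PREFIXES = {
    "0X": (16, "0123456789abcdefABCDEF"),
    "0O": (8, "01234567"),
    "0B": (2, "01"),
}


def integer_value(literal: str) -> int:
    """Return the value of an integer literal, ignoring underscores."""
    base, digits, body = 10, "0123456789", literal
    if literal[:2] in PREFIXES:
        base, digits = PREFIXES[literal[:2]]
        body = literal[2:]
    # The first character must be a digit of the base; the remaining
    # characters may be digits or underscores.
    if (not body or body[0] not in digits
            or any(c not in digits + "_" for c in body)):
        raise ValueError(f"malformed integer literal: {literal!r}")
    return int(body.replace("_", ""), base)
```

For example, 1_000_000 and 1000000 denote the same value, and 0XFF denotes 255.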

2.11.2. Floating-point Literals

A floating-point literal can consist of an integer part, a decimal point, a fractional part, and an exponent. The decimal point is the ASCII character '.'. The exponent is introduced by the character 'e' or 'E', followed by an optional '+' or '-' sign and then one or more digits. To avoid ambiguity with decimal integer literals, a floating-point literal must contain at least one of a decimal point, an exponent, or a float type suffix.

Floating-point Literals

[18]	Float_Literal	::=	Digit { Digit } . { Digit } [ Exponent_Part ] | . Digit { Digit } [ Exponent_Part ] | Digit { Digit } Exponent_Part
[19]	Exponent_Part	::=	e [ + | - ] Digit { Digit } | E [ + | - ] Digit { Digit }
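
As a rough illustration, the prose description above (digits around a decimal point, and an exponent made of 'e' or 'E', an optional sign, and one or more digits) can be expressed as a Python regular expression. This is a sketch under assumptions: it requires at least one digit next to the decimal point, it accepts forms such as 1. with an empty fractional part, and it does not handle a float type suffix, since none is defined in this section. The names EXPONENT, FLOAT_LITERAL, and is_float_literal are hypothetical.

```python
import re

# Exponent: 'e' or 'E', optional sign, one or more digits.
EXPONENT = r"[eE][+-]?[0-9]+"

FLOAT_LITERAL = re.compile(
    rf"[0-9]+\.[0-9]*(?:{EXPONENT})?"   # 12.   12.5   12.5e3
    rf"|\.[0-9]+(?:{EXPONENT})?"        # .5    .5E-2
    rf"|[0-9]+{EXPONENT}"               # 12e3
)


def is_float_literal(text: str) -> bool:
    """True if text is a floating-point literal under the assumptions above."""
    return FLOAT_LITERAL.fullmatch(text) is not None
```

A plain digit sequence such as 42 is rejected, which keeps floating-point literals distinct from decimal integer literals.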

2.11.3. String Literals

A string literal is a character sequence delimited by the ASCII " (double quote, code 0x22) character.

String Literals

[20]	String_Literal	::=	" { String_Character } "
[21]	String_Character	::=	Input_Character	/* Except ". */
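
Because the grammar defines no escape mechanism, a string literal simply ends at the next double quote. The Python sketch below illustrates this; the name scan_string_literal is hypothetical, and the function does not check that the contents belong to the standard character set.

```python
def scan_string_literal(source: str, pos: int = 0):
    """Return (contents, next_pos) if a string literal starts at pos, else None."""
    if source[pos:pos + 1] != '"':
        return None
    end = source.find('"', pos + 1)
    if end == -1:
        return None  # no closing double quote
    return source[pos + 1:end], end + 1
```

One consequence of production [21] is that a string literal, as specified here, cannot contain a double quote character.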

2.11.4. Character Literals

A character literal is a single input character delimited by single quotes (ASCII code 0x27).

Character Literals

[22]	Character_Literal	::=	' Input_Character '

Example 2.2. Examples of character literals:

'A' '_' '"' '''
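
Since a character literal is always exactly three characters long, it can be recognised by position alone, which is why ''' is a valid literal for the single quote itself. A minimal Python sketch follows; the name scan_character_literal is hypothetical, and membership of the enclosed character in the standard character set is not checked.

```python
def scan_character_literal(source: str, pos: int = 0):
    """Return (char, next_pos) if a character literal starts at pos, else None."""
    # Quote, any single input character, quote.
    if source[pos:pos + 1] == "'" and source[pos + 2:pos + 3] == "'":
        return source[pos + 1], pos + 3
    return None
```

No escape sequence is needed for the quote character under this reading of production [22].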
