Lexical Elements

Character Set

OADL uses the UTF-8 (RFC 3629) input character encoding. UTF-8 keeps most source text compact and is compatible with ASCII, yet still allows the full 21-bit range of Unicode characters in identifiers, string literals, and comments. OADL also uses UTF-8 for text-based output. Although OADL accepts extended Unicode characters as elements of identifiers, only the ASCII digits '0' through '9' are allowed in numeric constants.

Note that illegal UTF-8 sequences in a source file will produce a compile-time error.

OADL is a token-based compiler. It uses a greedy algorithm to scan tokens; this means, for example, that the sequence of characters === is interpreted as the token == followed by the token = (which the compiler would then flag as a syntax error).
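
The greedy rule can be illustrated with a short Python sketch. This is not the OADL scanner itself: the function name scan_punctuation is hypothetical, and only a small subset of the punctuation tokens listed in this chapter is included.

```python
# A small subset of the punctuation tokens listed in this chapter.
TOKENS = ["=", "==", "<", "<<", "<<=", "<<<", "+", "++", "+="]


def scan_punctuation(text):
    """Split text into tokens, always taking the longest match (greedy)."""
    # Try longer tokens first so that "==" wins over "=".
    candidates = sorted(TOKENS, key=len, reverse=True)
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for tok in candidates:
            if text.startswith(tok, pos):
                tokens.append(tok)
                pos += len(tok)
                break
        else:
            raise ValueError(f"unrecognized character {text[pos]!r} at offset {pos}")
    return tokens


print(scan_punctuation("==="))   # ['==', '=']
print(scan_punctuation("<<<="))  # ['<<<', '=']
```

Because the scanner never backtracks to try a shorter token, a sequence such as === can only be split as == followed by =.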

Whitespace (ASCII space, carriage return, line feed, form feed, horizontal tab, and vertical tab) is ignored, except where it separates tokens. Comments (delimited by /* and */) are treated as whitespace and otherwise ignored. End-of-line comments (introduced by //) are also ignored. For compatibility with some UTF-8 editors (e.g. Windows Notepad), the Unicode byte-order mark 0xFEFF (UTF-8 bytes 0xEF, 0xBB, 0xBF) is allowed at the beginning of an OADL source file and is treated as whitespace.

The following ASCII characters are recognized as whitespace:

ASCII code Character name
9 horizontal tab
10 line feed
11 vertical tab
12 form feed
13 carriage return
32 space
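
As a rough illustration of the input rules above, the following Python sketch decodes source bytes as strict UTF-8, drops one leading byte-order mark, and tests characters against the whitespace table. The names decode_source and is_whitespace are hypothetical, and Python's strict decoder is only a stand-in for the checks an OADL compiler performs.

```python
# ASCII codes from the whitespace table: tab, LF, VT, FF, CR, space.
WHITESPACE_CODES = {9, 10, 11, 12, 13, 32}
UTF8_BOM = b"\xef\xbb\xbf"


def decode_source(raw):
    """Decode source bytes as strict UTF-8, dropping one leading BOM."""
    if raw.startswith(UTF8_BOM):
        raw = raw[len(UTF8_BOM):]
    # Strict decoding raises UnicodeDecodeError on an illegal sequence,
    # mirroring the compile-time error described above.
    return raw.decode("utf-8", errors="strict")


def is_whitespace(ch):
    """True only for the six ASCII whitespace characters in the table."""
    return ord(ch) in WHITESPACE_CODES


print(decode_source(b"\xef\xbb\xbfvar x"))  # var x
print(is_whitespace("\v"), is_whitespace("\u00a0"))  # True False
```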

Tokens

Punctuation

An OADL compiler recognizes the following punctuation as single-character tokens:

& | ^ ~ + - * / %
. , : ; ! ( ) { }
[ ] < > = @ ? `

In addition, OADL recognizes the following two-character tokens:

## != == <= >= <<
>> += -= *= /= %=
&= |= ^= ++ -- =>
?= #= && || ~= ?#
:: ** \| \^ \& \<
\> \+ \- \* \/ \%
!- @@ #[ #( ?* ??
-> := #{

OADL also recognizes the following three-character tokens:

<<= >>= <<< >>> ...
\== \#= \!= \<= \>=
\<< \>> \=> \~= \**


Character and String Constants

There are several composite token types. The first is a character constant token, expressed as a single character between two single quote marks, thus: 'x'. A string constant token is lexically similar, expressed as zero or more characters between two double quote marks, thus: "abcdefg".

In both character constant tokens and string constant tokens, the character \ has special significance. Just as in C and C++, it is an "escape" character, which alters the interpretation of one or more of the characters that follow it. Here is the complete list of escapes recognized by OADL:

Escape Meaning
\0 ASCII NUL character
\a ASCII bell character
\b ASCII backspace character
\f ASCII formfeed character
\n ASCII linefeed character
\r ASCII carriage return character
\t ASCII horizontal tab character
\v ASCII vertical tab character
\xHEX Hexadecimal character code (up to 8 digits)
\' Non-terminating single quote
\" Non-terminating double quote
\\ The backslash character
\any The given character

If a character constant or string constant token is immediately preceded by the character L the token is a wide character constant or wide string constant token. If any of the characters inside the token have an encoding greater than 127 (either due to specification via \xHEX or via UTF-8 sequences), then the constant is also considered to be a wide character constant or wide string constant token.
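
The escape table and the wide-constant rule can be modeled with a small Python sketch. The function decode_constant is hypothetical and works on the text between the quote marks. One behavior is an assumption not stated above: a \x that is not followed by any hexadecimal digit falls back to the \any rule and yields a literal x.

```python
SIMPLE_ESCAPES = {
    "0": "\0", "a": "\a", "b": "\b", "f": "\f",
    "n": "\n", "r": "\r", "t": "\t", "v": "\v",
}
HEX_DIGITS = "0123456789abcdefABCDEF"


def decode_constant(body, l_prefix=False):
    """Decode the text between the quotes; return (value, is_wide)."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        i += 1
        if ch != "\\":
            out.append(ch)
            continue
        if i >= len(body):
            raise ValueError("backslash at end of constant")
        esc = body[i]
        i += 1
        if esc == "x":
            start = i
            # \xHEX takes up to 8 hexadecimal digits.
            while i < len(body) and i - start < 8 and body[i] in HEX_DIGITS:
                i += 1
            if i > start:
                out.append(chr(int(body[start:i], 16)))
                continue
            # Assumption: no digits, so treat \x like \any below.
        # Named escapes come from the table; anything else (including
        # \' \" and \\) stands for the character itself.
        out.append(SIMPLE_ESCAPES.get(esc, esc))
    value = "".join(out)
    # Wide if prefixed with L or if any character code exceeds 127.
    is_wide = l_prefix or any(ord(c) > 127 for c in value)
    return value, is_wide


print(decode_constant(r"a\tb"))   # ('a\tb', False)
print(decode_constant(r"\xe9"))   # ('é', True)
```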

Integer Constants

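In OADL, an integer constant token takes one of three forms, each of which appears in the examples at the end of this section: a sequence of decimal digits, the prefix 0x followed by hexadecimal digits, or the prefix 0b followed by binary digits.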

The digits of an integer constant may be separated by an underscore _ for enhanced readability (the underscore must be between digits, not at the beginning or the end of the number). An integer constant may also have one of the following suffixes to give it a type other than Int (the suffixes are case-independent):

Suffix Resulting type
B Byte (base-10 integer constants)
SB Byte (hexadecimal integer constants)
UB Ubyte
S Short
US Ushort
U Uint
L Long
UL Ulong

Here are some examples of integer constant tokens:

Token Description
0x1000 The number 4096, in hex
123 The number 123, in decimal
0x1FFF_FFFF The largest positive integer, in hex
0b111_1111b The largest Byte value, in binary
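
The following Python sketch shows one way the rules in this section could be checked. It is not OADL's scanner (see the Token State Machine for that). The name parse_integer_constant is hypothetical, the lowercase prefixes 0x and 0b are taken from the examples, and as a simplifying assumption every suffix is accepted in every base. No range check against the resulting type is attempted.

```python
import re

SUFFIX_TYPES = {
    "": "Int", "b": "Byte", "sb": "Byte", "ub": "Ubyte", "s": "Short",
    "us": "Ushort", "u": "Uint", "l": "Long", "ul": "Ulong",
}

# (prefix, base, digit class); prefixed forms are tried before decimal.
FORMS = [("0x", 16, "0-9a-fA-F"), ("0b", 2, "01"), ("", 10, "0-9")]


def parse_integer_constant(token):
    """Return (value, type_name), or None if token is not an integer constant."""
    for prefix, base, digits in FORMS:
        # Digit groups joined by single underscores, then an optional
        # alphabetic suffix. Greedy matching lets hex digits absorb a
        # trailing "b", which is why Byte is spelled "sb" in hexadecimal.
        pattern = rf"{prefix}([{digits}]+(?:_[{digits}]+)*)([A-Za-z]*)"
        m = re.fullmatch(pattern, token)
        if m and m.group(2).lower() in SUFFIX_TYPES:
            value = int(m.group(1).replace("_", ""), base)
            return value, SUFFIX_TYPES[m.group(2).lower()]
    return None


print(parse_integer_constant("0x1000"))       # (4096, 'Int')
print(parse_integer_constant("0b111_1111b"))  # (127, 'Byte')
print(parse_integer_constant("0x1b"))         # (27, 'Int')
print(parse_integer_constant("1__2"))         # None
```

The third call shows why a separate SB suffix exists: in a hexadecimal constant a trailing b is simply another digit.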


Floating Point Constants

A floating point constant token in OADL is distinguished from an integer constant token by one of two things: either it has an exponent part (the character e or E followed by a signed integer exponent), or it has a fractional part (the character . followed by the digits comprising the fraction), or it has both. A floating point constant may also have one of the following suffixes to give it a type other than Float (the suffixes are case-independent):

Suffix Resulting type
H Half
D Double

Hexadecimal floating point constants are supported; they are similar to hexadecimal integer constants but are distinguished from them by having an exponent part (the character p or P followed by a signed integer exponent signifying a power of two), or having a fractional part (the character . followed by the hexadecimal digits comprising the fraction), or both. A hexadecimal floating point constant may have one of the following suffixes to give it a type other than Float (the suffixes are case-independent):

Suffix Resulting type
H Half
L Double

Like integer constants, numeric digits of a floating point constant can be separated by the underscore _ (again, the underscore must be between two numeric digits, not at the beginning or the end of the string of digits).

Here are some examples of floating point constant tokens:

Token Description
3.14159_26535_89793_24d A Double approximation of π
1.e38 About the biggest Float representable
1e-38 About the smallest Float representable
.0h Zero as a Half
0x1p-1L The Double 0.5
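
As a hedged illustration of the rules above, this Python sketch classifies a floating point constant and computes its value. The name parse_float_constant is hypothetical, and the lowercase 0x prefix is taken from the examples. The sketch delegates the numeric conversion to Python's float and float.fromhex, which is a convenience rather than a statement about how OADL converts constants. Python floats are doubles, so the returned type name is only a label.

```python
import re


def parse_float_constant(token):
    """Return (value, type_name), or None if token is not a float constant."""
    # Only characters that can occur in a constant, starting with a digit or '.'.
    if not re.fullmatch(r"[0-9.][0-9A-Za-z_.+-]*", token):
        return None
    is_hex = token.startswith("0x")
    if is_hex:
        digit, exp_char, suffixes = "0-9a-fA-F", "p", {"h": "Half", "l": "Double"}
    else:
        digit, exp_char, suffixes = "0-9", "e", {"h": "Half", "d": "Double"}

    text, type_name = token, "Float"
    if text[-1].lower() in suffixes:
        type_name = suffixes[text[-1].lower()]
        text = text[:-1]

    # Every underscore must sit between two digits of the constant's base.
    if re.search(rf"(?<![{digit}])_|_(?![{digit}])", text):
        return None

    body = text.replace("_", "").lower()
    # A fractional part or an exponent part is what makes it a float.
    if "." not in body and exp_char not in body:
        return None
    try:
        value = float.fromhex(body) if is_hex else float(body)
    except ValueError:
        return None
    return value, type_name


print(parse_float_constant("0x1p-1L"))  # (0.5, 'Double')
print(parse_float_constant(".0h"))      # (0.0, 'Half')
print(parse_float_constant("123"))      # None
```

Note that D cannot serve as the Double suffix in the hexadecimal form because it is itself a hexadecimal digit, which is consistent with the L suffix in the second table.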


Identifiers

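OADL identifier tokens consist of a Unicode character with an alphabetic attribute or the character _ or $, followed by zero or more Unicode characters with an alphabetic attribute, Unicode characters with a numeric attribute, or the characters _ and $.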

Unicode characters with any of the following attributes are recognized as alphabetic when part of an identifier:

Attribute Abbreviation
Letter, upper case Lu
Letter, lower case Ll
Letter, title case Lt
Letter, other Lo

Unicode characters with any of the following attributes are recognized as numeric when part of an identifier:

Attribute Abbreviation
Number, decimal digit Nd
Number, letter Nl
Number, other No

Here are some example identifier tokens: Main, $foo, my_house, x1, and bißchen. Note that OADL is case-sensitive; that is, the identifier tokens Main, main, mAIn, and MAIN are all distinct.

OADL reserves several identifiers as keywords for syntactic purposes. This is the complete list:

assert break case catch
class const continue default
do else extern for
forall foreach if match
namespace new operator proc
protected public return static
switch throw try using
var while with __FILE__
__LINE__

Keywords may not be used as user identifiers (variable names, procedure names, etc.) in OADL.
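
The identifier rules can be approximated in Python with the standard unicodedata module, which exposes the same Unicode general categories listed above. The function names are hypothetical. The sketch assumes that characters after the first may be alphabetic, numeric, _ or $, and the Unicode version used by Python may differ from the one an OADL compiler uses.

```python
import unicodedata

ALPHABETIC = {"Lu", "Ll", "Lt", "Lo"}
NUMERIC = {"Nd", "Nl", "No"}
KEYWORDS = {
    "assert", "break", "case", "catch", "class", "const", "continue",
    "default", "do", "else", "extern", "for", "forall", "foreach", "if",
    "match", "namespace", "new", "operator", "proc", "protected", "public",
    "return", "static", "switch", "throw", "try", "using", "var", "while",
    "with", "__FILE__", "__LINE__",
}


def is_start(ch):
    return ch in "_$" or unicodedata.category(ch) in ALPHABETIC


def is_continue(ch):
    return is_start(ch) or unicodedata.category(ch) in NUMERIC


def is_identifier_token(text):
    """True if text has the lexical shape of an identifier."""
    return bool(text) and is_start(text[0]) and all(is_continue(c) for c in text[1:])


def is_user_identifier(text):
    """True if text is identifier-shaped and not a reserved keyword."""
    return is_identifier_token(text) and text not in KEYWORDS


for name in ["Main", "$foo", "my_house", "x1", "bißchen", "1x", "while"]:
    print(name, is_user_identifier(name))
```

The first five names are accepted; 1x is rejected because it starts with a numeric character, and while is rejected because it is a keyword. Since the comparison is case-sensitive, While would be accepted.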

It is non-trivial to distinguish between the various integer constant, floating point constant, and identifier tokens. See the Token State Machine in the OADL Implementation Notes chapter for more information.

Note that the __FILE__ and __LINE__ keywords are special. The __FILE__ keyword is replaced at compile time with a string constant token containing the name of the file currently being compiled. The __LINE__ keyword is replaced at compile time with an integer constant token containing the line number where it is found.

Local Match Arguments

In the context of a match statement, a match argument consisting of the character ? followed by one or more decimal digits may be used. It indicates the nth match result of the pattern:

    match ("123 45") {
    case "([0-9]+) ([0-9]+)" :
        "First: ", ?1, "; second: ", ?2, '\n';
    }
First: 123; second: 45

The number of match arguments present can be found by using the token ?# (note that the zeroth argument is always the entire string matched):

    match ("123 456 789") {
    case "([0-9]*) ([0-9]*) ([0-9]*)" :
        "Found ", ?#, " matches:\n";
        for (var i = 0; i < ?#; i++) {
            "", oadl::matchvec()[i], '\n';
        }
    }
Found 4 matches:
123 456 789
123
456
789
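
For readers more familiar with other regular expression libraries, the numbering of match arguments can be compared with capture groups in Python's re module. This is only an analogy: the helper match_arguments is hypothetical, Python's pattern syntax is not guaranteed to agree with OADL's, and the use of a whole-string match is an assumption.

```python
import re


def match_arguments(pattern, subject):
    """Return [whole match, group 1, group 2, ...], or None if no match."""
    m = re.fullmatch(pattern, subject)
    if m is None:
        return None
    # Index 0 plays the role of ?0; the list length plays the role of ?#.
    return [m.group(0), *m.groups()]


args = match_arguments(r"([0-9]*) ([0-9]*) ([0-9]*)", "123 456 789")
print(len(args))  # 4
print(args)       # ['123 456 789', '123', '456', '789']
```

Three parenthesized groups plus the whole match give four arguments, which is why the example above reports four matches.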

OADL Preprocessing

It is often convenient to break up a program into several source files. To enable this, the include statement is supported:

#include "filename"

Tokens are read from the given filename until it has been exhausted; at that point, lexical analysis returns to the current file. The filename is assumed to have the suffix ".oah", in which case the suffix may be omitted from the #include statement. Included files are not required to have a ".oah" suffix, though.

It is implementation-dependent how many nested levels of includes are supported; however, all implementations will support at least 4 levels.

The OADL preprocessor supports macros. Macros are similar to those in C/C++; however, unlike C/C++, parentheses and braces in a macro must match. Additionally, instead of escaping the end-of-line as in C/C++, multi-token OADL macro definitions are completely enclosed in matching braces. For example:

    #define foo(x) {
        x = x + 1
    }

    var a = 1
    foo(a)
    "a = ", a
a = 2

The matching braces are not included in the expansion of the macro. Unlike C/C++, there are no token pasting or tokenizing capabilities in OADL macros.

Macros may define and undefine other macros; for example:

    #define bar(x) {
        #ifdef(foo)
            #undef foo
        #endif
        #define foo(y) {x + y}
    }
    bar(3)
    foo(4)
7

    bar(5)
    foo(4)
9

Unlike C/C++, macros may not be redefined. Instead they must be undefined using the #undef statement:

    #define foo(a) {a+1}
    foo(1)
2
    #undef foo
    #define foo(a) {a+2}
    foo(1)
3
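
The define/undefine discipline can be modeled with a small Python class. This is a hypothetical sketch of the bookkeeping only, with macro bodies and arguments represented as lists of token strings. Treating #undef of an unknown macro as an error is an assumption made for the sketch.

```python
class MacroTable:
    """Hypothetical model of #define / #undef bookkeeping."""

    def __init__(self):
        self._macros = {}

    def define(self, name, params, body):
        # Redefinition is rejected: the old definition must be removed first.
        if name in self._macros:
            raise ValueError(f"macro {name!r} is already defined; #undef it first")
        self._macros[name] = (list(params), list(body))

    def undef(self, name):
        # Assumption: undefining an unknown macro is an error in this sketch.
        if name not in self._macros:
            raise KeyError(name)
        del self._macros[name]

    def is_defined(self, name):
        return name in self._macros

    def expand(self, name, args):
        params, body = self._macros[name]
        if len(args) != len(params):
            raise ValueError(f"macro {name!r} expects {len(params)} argument(s)")
        bindings = dict(zip(params, args))
        expanded = []
        for token in body:
            # A parameter is replaced by its argument's tokens. The enclosing
            # braces are never stored, so they never appear in the expansion.
            expanded.extend(bindings.get(token, [token]))
        return expanded


table = MacroTable()
table.define("foo", ["a"], ["a", "+", "1"])
print(table.expand("foo", [["1"]]))  # ['1', '+', '1']
table.undef("foo")
table.define("foo", ["a"], ["a", "+", "2"])
print(table.expand("foo", [["1"]]))  # ['1', '+', '2']
```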

OADL also supports conditional compilation. The tokens #if, #ifdef, #else, #elif, and #endif are used for conditional compilation. As in C/C++, OADL conditional compilation statements may be nested. Unlike C/C++, OADL conditional compilation statements do not need to be on individual lines. Because the end of the line therefore cannot mark the end of a condition, the condition must be enclosed in parentheses ( ):

    #ifdef(foo) "This statement is not printed"
    #elif(0) "Neither is this statement"
    #else "This statement is printed" #endif
This statement is printed

OADL also supports the #defined query which may be used in conditional compilation expressions as well as in regular OADL expressions:

    #define foo(a) {a+1}
    "Is foo defined? ", #defined(foo)
Is foo defined? true

    #if(0 || #defined(foo)) "Foo is still defined" #endif
Foo is still defined

Just as in C/C++, other than the conditional compilation tokens #if, #ifdef, #else, #elif, and #endif, tokens inside a non-compiled section of a conditional compilation statement are ignored, including #include, #define, and #undef tokens. Note that the input character stream is still processed as tokens; this can lead to subtle errors with unterminated comments:

    // Nothing is printed due to the unterminated comment inside
    // the #ifdef:
    #ifdef(foo) "This statement is not printed" /*
    #else "Neither is this one." */
    #endif

OADL reserves the following preprocessor keywords for future use:

#arg #args #nargs

OADL Unnamed Value Tokens

For unnamed non-numeric, non-array values, OADL accepts and prints the following special token sequences:

#OBJ(num) Unnamed object number num
#PRC(num) Unnamed proc number num
#PTR(num) System-specific pointer number num

Note that num must be a compile-time decimal integer constant.

For example,

    a = proc() {"Hello a!\n";}
    a
#PRC(7)

    b = #PRC(7)
    b()
Hello a!

    i = 7
    c = #PRC(i)
Decimal integer constant expected

OADL Desk Calculator Tokens

The OADL desk calculator accepts the following tokens for special use:

#classes #consts #defines #edit
#erase #externs #help #intrinsics
#list #load #namespaces #object
#procs #publics #quit #reset
#save #vars
