| .. index:: lexical format |
| .. _text-lexical: |
| |
| Lexical Format |
| -------------- |
| |
| |
| .. index:: ! character, Unicode, ASCII, code point, ! source text |
| pair: text format; character |
| .. _source: |
| .. _text-source: |
| .. _text-char: |
| |
| Characters |
| ~~~~~~~~~~ |
| |
| The text format assigns meaning to *source text*, which consists of a sequence of *characters*. |
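| Characters are assumed to be represented as valid |Unicode|_ (Section 2.4) *code points*, excluding the surrogate range :math:`\unicode{D800}` to :math:`\unicode{DFFF}`, as the grammar below makes precise. |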
| |
| .. math:: |
| \begin{array}{llll} |
| \production{source} & \Tsource &::=& |
| \Tchar^\ast \\ |
| \production{character} & \Tchar &::=& |
| \unicode{00} ~|~ \dots ~|~ \unicode{D7FF} ~|~ \unicode{E000} ~|~ \dots ~|~ \unicode{10FFFF} \\ |
| \end{array} |
| |
| .. note:: |
| While source text may contain any Unicode character in :ref:`comments <text-comment>` or :ref:`string <text-string>` literals, |
| the rest of the grammar is formed exclusively from the characters supported by the 7-bit |ASCII|_ subset of Unicode. |
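As an illustration of the character production, the following minimal Python sketch checks whether a code point, or every character of a string, falls in the two permitted ranges. The names ``is_char`` and ``is_source`` are hypothetical helpers introduced here and are not part of the format.

```python
def is_char(code_point: int) -> bool:
    """True if code_point lies in U+00..U+D7FF or U+E000..U+10FFFF."""
    return 0x00 <= code_point <= 0xD7FF or 0xE000 <= code_point <= 0x10FFFF


def is_source(text: str) -> bool:
    """True if every character of text is a valid char (no surrogates)."""
    return all(is_char(ord(c)) for c in text)
```

A Python string may hold a lone surrogate such as ``"\ud800"``, which this check rejects, while ordinary non-ASCII text is accepted.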
| |
| |
| .. index:: ! token, ! keyword, character, white space, comment, source text |
| single: text format; token |
| .. _text-keyword: |
| .. _text-reserved: |
| .. _text-token: |
| |
| Tokens |
| ~~~~~~ |
| |
| The character stream in the source text is divided, from left to right, into a sequence of *tokens*, as defined by the following grammar. |
| |
| .. math:: |
| \begin{array}{llll} |
| \production{token} & \Ttoken &::=& |
| \Tkeyword ~|~ \TuN ~|~ \TsN ~|~ \TfN ~|~ \Tstring ~|~ \Tid ~|~ |
| \text{(} ~|~ \text{)} ~|~ \Treserved \\ |
| \production{keyword} & \Tkeyword &::=& |
| (\text{a} ~|~ \dots ~|~ \text{z})~\Tidchar^\ast |
| \qquad (\mbox{if occurring as a literal terminal in the grammar}) \\ |
| \production{reserved} & \Treserved &::=& |
| \Tidchar^+ \\ |
| \end{array} |
| |
| Tokens are formed from the input character stream according to the *longest match* rule. |
| That is, the next token always consists of the longest possible sequence of characters that is recognized by the above lexical grammar. |
| Tokens can be separated by :ref:`white space <text-space>`, |
| but except for strings, they cannot themselves contain white space. |
| |
| The set of *keyword* tokens is defined implicitly, by all occurrences of a :ref:`terminal symbol <text-grammar>` in literal form, such as :math:`\text{keyword}`, in a :ref:`syntactic <text-syntactic>` production of this chapter. |
| |
| Any token that does not fall into any of the other categories is considered *reserved*, and cannot occur in source text. |
| |
| .. note:: |
| The effect of defining the set of reserved tokens is that all tokens must be separated by either parentheses or :ref:`white space <text-space>`. |
| For example, :math:`\text{0\$x}` is a single reserved token. |
| Consequently, it is not recognized as two separate tokens :math:`\text{0}` and :math:`\text{\$x}`, but is instead disallowed. |
| This property of tokenization is not affected by the fact that the definition of reserved tokens overlaps with other token classes. |
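The longest match rule can be sketched in Python as follows. This is a simplified illustration, not a full lexer: it handles only parentheses, plain white space characters, and maximal runs of identifier characters, and it rejects strings and comments. The ``IDCHARS`` set is an assumption taken from the definition of identifier characters elsewhere in the specification, and the function names are hypothetical.

```python
import string

# Assumed identifier character set (defined outside this section).
IDCHARS = set(string.ascii_letters + string.digits + "!#$%&'*+-./:<=>?@\\^_`|~")
SPACES = set(" \t\n\r")


def split_runs(source):
    """Split source into parentheses and maximal runs of idchar characters."""
    tokens = []
    i = 0
    while i < len(source):
        c = source[i]
        if c in SPACES:
            i += 1
        elif c in "()":
            tokens.append(c)
            i += 1
        elif c in IDCHARS:
            # Longest match: extend the run as far as idchars continue.
            j = i
            while j < len(source) and source[j] in IDCHARS:
                j += 1
            tokens.append(source[i:j])
            i = j
        else:
            raise ValueError(f"unsupported character {c!r} at offset {i}")
    return tokens


def has_keyword_shape(token):
    """True if token matches (a..z) idchar*; actual keywords are a subset."""
    return (
        bool(token)
        and token[0] in string.ascii_lowercase
        and all(c in IDCHARS for c in token)
    )
```

Under these assumptions, ``split_runs("0$x")`` yields the single run ``0$x``, which a later classification step would have to treat as reserved, whereas ``split_runs("0 $x")`` yields two separate runs.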
| |
| |
| .. index:: ! white space, character, ASCII |
| single: text format; white space |
| .. _text-format: |
| .. _text-space: |
| |
| White Space |
| ~~~~~~~~~~~ |
| |
| *White space* is any sequence of literal space characters, formatting characters, or :ref:`comments <text-comment>`. |
| The allowed formatting characters correspond to a subset of the |ASCII|_ *format effectors*, namely, *horizontal tabulation* (:math:`\unicode{09}`), *line feed* (:math:`\unicode{0A}`), and *carriage return* (:math:`\unicode{0D}`). |
| |
| .. math:: |
| \begin{array}{llclll@{\qquad\qquad}l} |
| \production{white space} & \Tspace &::=& |
| (\text{~~} ~|~ \Tformat ~|~ \Tcomment)^\ast \\ |
| \production{format} & \Tformat &::=& |
| \unicode{09} ~|~ \unicode{0A} ~|~ \unicode{0D} \\ |
| \end{array} |
| |
| The only relevance of white space is to separate :ref:`tokens <text-token>`. It is otherwise ignored. |
| |
| |
| .. index:: ! comment, character |
| single: text format; comment |
| .. _text-comment: |
| |
| Comments |
| ~~~~~~~~ |
| |
| A *comment* can be either a *line comment*, which starts with a double semicolon :math:`\Tcommentd` and extends to the end of the line, |
| or a *block comment*, enclosed in delimiters :math:`\Tcommentl \dots \Tcommentr`. |
| Block comments can be nested. |
| |
| .. math:: |
| \begin{array}{llclll@{\qquad\qquad}l} |
| \production{comment} & \Tcomment &::=& |
| \Tlinecomment ~|~ \Tblockcomment \\ |
| \production{line comment} & \Tlinecomment &::=& |
| \Tcommentd~~\Tlinechar^\ast~~(\unicode{0A} ~|~ \T{eof}) \\ |
| \production{line character} & \Tlinechar &::=& |
| c{:}\Tchar & (\iff c \neq \unicode{0A}) \\ |
| \production{block comment} & \Tblockcomment &::=& |
| \Tcommentl~~\Tblockchar^\ast~~\Tcommentr \\ |
| \production{block character} & \Tblockchar &::=& |
| c{:}\Tchar & (\iff c \neq \text{;} \wedge c \neq \text{(}) \\ &&|& |
| \text{;} & (\iff~\mbox{the next character is not}~\text{)}) \\ &&|& |
| \text{(} & (\iff~\mbox{the next character is not}~\text{;}) \\ &&|& |
| \Tblockcomment \\ |
| \end{array} |
| |
| Here, the pseudo token :math:`\T{eof}` indicates the end of the input. |
| The *look-ahead* restrictions on the productions for |Tblockchar| disambiguate the grammar such that only well-bracketed uses of block comment delimiters are allowed. |
| |
| .. note:: |
| Any formatting and control characters are allowed inside comments. |
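The white space and comment productions can be sketched together in Python. This is a minimal illustration under the assumption that the comment delimiters are ``;;``, ``(;`` and ``;)``; the function names are hypothetical. Nesting of block comments is tracked with a depth counter, which mirrors the look-ahead restrictions on block characters.

```python
def skip_line_comment(src, i):
    """Return the index just past the line comment starting at src[i]."""
    assert src.startswith(";;", i)
    end = src.find("\n", i)
    # The comment ends at the line feed (inclusive) or at end of input.
    return len(src) if end == -1 else end + 1


def skip_block_comment(src, i):
    """Return the index just past the (possibly nested) block comment at src[i]."""
    assert src.startswith("(;", i)
    depth = 0
    while i < len(src):
        if src.startswith("(;", i):
            depth += 1
            i += 2
        elif src.startswith(";)", i):
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise ValueError("unterminated block comment")


def skip_space(src, i=0):
    """Skip spaces, format characters, and comments; return the next index."""
    while i < len(src):
        if src[i] in " \t\n\r":
            i += 1
        elif src.startswith(";;", i):
            i = skip_line_comment(src, i)
        elif src.startswith("(;", i):
            i = skip_block_comment(src, i)
        else:
            break
    return i
```

For instance, applying ``skip_space`` to a source that begins with a line comment followed by a nested block comment returns the index of the first character after both.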