3.2 Lexical Translations

CHAPTER 3: Lexical Structure

A raw Unicode character stream is translated into a sequence of Java tokens, using the following three lexical translation steps, which are applied in turn:

1. A translation of Unicode escapes (§3.3) in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form \uxxxx, where xxxx is a hexadecimal value, represents the Unicode character whose encoding is xxxx. This translation step allows any Java program to be expressed using only ASCII characters.
2. A translation of the Unicode stream resulting from step 1 into a stream of input characters and line terminators (§3.4).
3. A translation of the stream of input characters and line terminators resulting from step 2 into a sequence of Java input elements (§3.5) which, after white space (§3.6) and comments (§3.7) are discarded, comprise the tokens (§3.5) that are the terminal symbols of the syntactic grammar (§2.3) for Java.
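
The first translation step can be illustrated with a minimal sketch. The class name UnicodeEscapeSketch and the methods translateEscapes and hexValue are hypothetical names chosen for this example; they are not part of any Java API and do not show how a real compiler is implemented. The sketch assumes the eligibility rule detailed in the section on Unicode escapes: a backslash begins an escape only when it is preceded by an even number of contiguous backslashes, and it may be followed by one or more u characters before the four hexadecimal digits. A malformed escape is reported here by throwing an exception, where a compiler would report a compile-time error.

```java
public class UnicodeEscapeSketch {

    // Returns the value of a hexadecimal digit, or -1 if ch is not one.
    static int hexValue(char ch) {
        if (ch >= '0' && ch <= '9') return ch - '0';
        if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
        if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
        return -1;
    }

    // Hypothetical helper: replaces each eligible Unicode escape in the raw
    // stream with the single character it denotes.
    static String translateEscapes(String raw) {
        StringBuilder out = new StringBuilder();
        int backslashRun = 0; // contiguous backslashes seen just before index i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            boolean startsEscape = c == '\\' && backslashRun % 2 == 0
                    && i + 1 < raw.length() && raw.charAt(i + 1) == 'u';
            if (!startsEscape) {
                // Ordinary character: copy it and update the backslash count.
                out.append(c);
                backslashRun = (c == '\\') ? backslashRun + 1 : 0;
                i++;
                continue;
            }
            // Skip the run of one or more 'u' characters after the backslash.
            int j = i + 1;
            while (j < raw.length() && raw.charAt(j) == 'u') j++;
            if (j + 4 > raw.length()) {
                throw new IllegalArgumentException("truncated escape at index " + i);
            }
            // Accumulate exactly four hexadecimal digits.
            int code = 0;
            for (int k = j; k < j + 4; k++) {
                int digit = hexValue(raw.charAt(k));
                if (digit < 0) {
                    throw new IllegalArgumentException("malformed escape at index " + i);
                }
                code = code * 16 + digit;
            }
            // The produced character does not take part in further escapes.
            out.append((char) code);
            backslashRun = 0;
            i = j + 4;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The string literal below holds a backslash, 'u', and four hex digits.
        System.out.println(translateEscapes("ch\\u0061r")); // prints char
    }
}
```

Because this step runs before any other lexical processing, the later steps never see an escape sequence, only the character it stands for.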

Java always uses the longest possible translation at each step, even if the result does not ultimately make a correct Java program, while another lexical translation would. Thus the input characters a--b are tokenized (§3.5) as a, --, b, which is not part of any grammatically correct Java program, even though the tokenization a, -, -, b could be part of a grammatically correct Java program.
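
The longest-match rule can be sketched as a toy tokenizer. The class LongestMatchSketch, its tokenize method, and the six-entry operator table are assumptions made for this illustration; the table is far smaller than Java's real operator set, and literals, comments, and keywords are ignored entirely. At each position the sketch picks the longest operator that matches, without looking ahead to see whether a shorter choice would lead to a valid program.

```java
import java.util.ArrayList;
import java.util.List;

public class LongestMatchSketch {

    // A deliberately tiny operator table (an assumption for illustration only).
    private static final String[] OPERATORS = {"-", "--", "-=", "+", "++", "+="};

    // Hypothetical helper: splits the input into identifiers and operators,
    // always taking the longest operator that matches at the current position.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // white space separates tokens and is then discarded
                continue;
            }
            if (Character.isJavaIdentifierStart(c)) {
                // Consume the longest run of identifier characters.
                int j = i + 1;
                while (j < input.length() && Character.isJavaIdentifierPart(input.charAt(j))) {
                    j++;
                }
                tokens.add(input.substring(i, j));
                i = j;
                continue;
            }
            // Choose the longest operator that starts at index i.
            String best = null;
            for (String op : OPERATORS) {
                if (input.startsWith(op, i) && (best == null || op.length() > best.length())) {
                    best = op;
                }
            }
            if (best == null) {
                throw new IllegalArgumentException("unexpected character at index " + i);
            }
            tokens.add(best);
            i += best.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a--b"));   // prints [a, --, b]
        System.out.println(tokenize("a - -b")); // prints [a, -, -, b]
    }
}
```

The second call shows that the alternative tokenization is only reachable when white space separates the two minus signs, since the tokenizer never backtracks to try a shorter match.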