6. Source Text
ECMAScript source text is represented as a sequence of characters
in the Unicode character encoding, version 2.1 or later, using the
UTF-16 transformation format. The text is expected to have been
normalised to Unicode Normalised Form C (canonical composition), as
described in Unicode Technical Report #15. Conforming ECMAScript
implementations are not required to perform any normalisation of
text, or behave as though they were performing normalisation of
text, themselves.
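As an illustrative sketch of the consequence (not part of the normative text), the following compares a precomposed and a decomposed spelling of the same accented letter. The variable names are hypothetical, and the call to `normalize` assumes a host that implements a later edition of the language (ECMAScript 2015 or newer); nothing in this section requires that method to exist.

```javascript
// U+00E9 is the precomposed form; U+0065 followed by U+0301 is the
// canonically equivalent decomposed form.
var composed = "\u00E9";
var decomposed = "e\u0301";

// No normalisation is performed by the implementation, so the two
// strings are compared unit by unit and differ.
var sameWithoutNormalisation = (composed === decomposed);          // false

// Assumption: String.prototype.normalize is available (ES2015+).
// Explicit normalisation to Form C makes the two spellings identical.
var sameAfterNFC = (composed === decomposed.normalize("NFC"));     // true
```

The point of the sketch is that any normalisation has to be applied by the author or tooling before the text reaches the implementation.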
SourceCharacter ::
    any Unicode character
ECMAScript source text can contain any of the Unicode characters.
All Unicode white space characters are treated as white space, and
all Unicode line/paragraph separators are treated as line
separators. Non-Latin Unicode characters are allowed in
identifiers, string literals, regular expression literals and
comments.
Throughout the rest of this document, the phrase "code point"
and the word "character" will be used to refer to a 16-bit unsigned
value used to represent a single 16-bit unit of UTF-16 text. The
phrase "Unicode character" will be used to refer to the abstract
linguistic or typographical unit represented by a single Unicode
scalar value (which may be longer than 16 bits and thus may be
represented by more than one code point). The phrase refers only to
entities represented by single Unicode scalar values: the
components of a combining character sequence are still individual
"Unicode characters," even though a user might think of the whole
sequence as a single character.
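The distinction above can be observed from within a program. The following is a minimal sketch with hypothetical variable names; it uses U+1D11E (MUSICAL SYMBOL G CLEF) as an example of a Unicode character whose scalar value does not fit in 16 bits, written as a pair of \u escapes.

```javascript
// One Unicode character (U+1D11E) occupies two 16-bit values in
// UTF-16, so it counts as two "characters" in the sense of this document.
var clef = "\uD834\uDD1E";
var clefLength = clef.length;            // 2
var firstUnit = clef.charCodeAt(0);      // 0xD834
var secondUnit = clef.charCodeAt(1);     // 0xDD1E

// A combining character sequence: "e" followed by U+0301 COMBINING
// ACUTE ACCENT. A reader sees one accented letter, but it is two
// Unicode characters and two 16-bit values.
var accented = "e\u0301";
var accentedLength = accented.length;    // 2
```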
In string literals, regular expression literals and identifiers,
any character (code point) may also be expressed as a Unicode
escape sequence consisting of six characters, namely \u plus four
hexadecimal digits. Within a comment, such an escape sequence is
effectively ignored as part of the comment. Within a string literal
or regular expression literal, the Unicode escape sequence
contributes one character to the value of the literal. Within an
identifier, the escape sequence contributes one character to the
identifier.
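A short sketch of the three contexts described above, using hypothetical names. Each \u escape stands for exactly one character: \u0041 is "A" and \u0061 is "a".

```javascript
// String literal: the escape contributes the character "A" to the value.
var fromEscape = "\u0041BC";
var stringMatches = (fromEscape === "ABC");   // true

// Identifier: \u0061bc is the same identifier as abc.
var \u0061bc = 1;
var identifierMatches = (abc === 1);          // true

// Regular expression literal: the escape matches the character "A".
var pattern = /^\u0041$/;
var regexMatches = pattern.test("A");         // true
```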
NOTE 1
Although this document sometimes refers to a "transformation"
between a "character" within a "string" and the 16-bit unsigned
integer that is the UTF-16 encoding of that character, there is
actually no transformation because a "character" within a "string"
is actually represented using that 16-bit unsigned value.
NOTE 2
ECMAScript differs from the Java programming language in the
behaviour of Unicode escape sequences. In a Java program, if the
Unicode escape sequence \u000A, for example, occurs within a
single-line comment, it is interpreted as a line terminator
(Unicode character 000A is line feed) and therefore the next
character is not part of the comment. Similarly, if the Unicode
escape sequence \u000A occurs within a string literal in a Java
program, it is likewise interpreted as a line terminator, which is
not allowed within a string literal; one must write \n instead
of \u000A to cause a line feed to be part of the string value of a
string literal. In an ECMAScript program, a Unicode escape sequence
occurring within a comment is never interpreted and therefore
cannot contribute to termination of the comment. Similarly, a
Unicode escape sequence occurring within a string literal in an
ECMAScript program always contributes a character to the string
value of the literal and is never interpreted as a line terminator
or as a quote mark that might terminate the string literal.
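The ECMAScript side of this comparison can be sketched as follows. The variable names are hypothetical; the sketch only illustrates the behaviour described in this note and makes no claim about any particular implementation.

```javascript
var reached = false;
// The escape in this comment is not interpreted: \u000A reached = true;
// so the assignment above remains part of the comment and never runs.

// Inside a string literal, \u000A contributes a line feed character
// to the value, exactly as \n does; it does not end the line.
var lineFeed = "\u000A";
var sameAsNewline = (lineFeed === "\n");      // true

// \u0022 is the quote mark; as an escape it becomes part of the value
// and does not terminate the literal.
var quoted = "a\u0022b";
var quotedLength = quoted.length;             // 3
var middle = quoted.charAt(1);                // the quote mark character
```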