Unicode Escapes In Java

All Unicode characters can be used in comments and in character and string literals in Java. Any of them can also be written as a Unicode escape sequence.

A Unicode escape sequence consists of:

  • a backslash '\' (ASCII character 92, hex 0x5c),
  • a 'u' (ASCII 117, hex 0x75),
  • optionally one or more additional 'u' characters, and
  • four hexadecimal digits (the characters '0' through '9', 'a' through 'f', or 'A' through 'F').

Each such sequence represents one UTF-16 code unit; for example, '\u0061' is equivalent to 'a'. A single escape cannot express a character beyond U+FFFF. To write such a character you have to use a surrogate pair, that is, two consecutive escapes.
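As a small illustration of the surrogate pair case, the sketch below writes U+1F600 (chosen here only as an example of a character beyond U+FFFF) as two escapes. The class and constant names are made up for this note.

```java
class SurrogatePairDemo {
    // U+1F600 lies beyond U+FFFF, so it needs two escapes:
    // a high surrogate followed by a low surrogate.
    static final String GRINNING_FACE = "\uD83D\uDE00";

    public static void main(String[] args) {
        // Two UTF-16 code units...
        System.out.println(GRINNING_FACE.length());
        // ...but a single Unicode code point.
        System.out.println(GRINNING_FACE.codePointCount(0, GRINNING_FACE.length()));
        // The code point value in hex.
        System.out.println(Integer.toHexString(GRINNING_FACE.codePointAt(0)));
    }
}
```

Running it should print 2, 1 and 1f600, one per line.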

A Unicode escape sequence may appear anywhere in a Java source file, including inside identifiers, comments, and string literals. Unicode escapes must always be well formed, even when they appear in comments; otherwise the compiler will report an error. It is legal to place a well-formed Unicode escape in a comment, and programmers sometimes use Unicode escapes in Javadoc comments to generate special characters in the documentation.
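To make this concrete, here is a minimal sketch with an escape inside an identifier and an escape that uses an extra 'u'. The class, field and method names are invented for this example.

```java
class EscapePlacementDemo {
    // The first letter of this field name is written as an escape,
    // so the field is simply called abc.
    static int \u0061bc = 42;

    static int readField() {
        return abc; // refers to the field declared above
    }

    // Additional 'u' characters are allowed, so this is also the letter a.
    static char extraU() {
        return '\uu0061';
    }

    public static void main(String[] args) {
        System.out.println(readField());
        System.out.println(extraU());
    }
}
```

This should print 42 and then a. It compiles, but it also shows why escapes in identifiers are hard to read.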

Consider an example. Predict the output of the below program:

//location: D:\Simple\units

public class Test {
    public static void main(String[] args) {
        System.out.print("Hell");
        System.out.println("o world");
    }
}

Compilation will fail. Unicode escapes must be well formed, even if they appear in comments. The comment //location: D:\Simple\units causes a compilation error because the \u in \units is not followed by four hexadecimal digits. To avoid trouble like this, do not put Windows file names into comments in generated Java source files without first processing them to eliminate backslashes.
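One possible way to do that processing, assuming it is acceptable to show the path with forward slashes, is sketched below. The helper name toCommentSafePath is hypothetical and not part of any library.

```java
class CommentPathSanitizer {
    // Hypothetical helper: swaps each backslash for a forward slash so the
    // path can never start an accidental Unicode escape inside a comment.
    static String toCommentSafePath(String windowsPath) {
        return windowsPath.replace('\\', '/');
    }

    public static void main(String[] args) {
        String path = "D:\\Simple\\units";
        // The generated comment line no longer contains a backslash.
        System.out.println("//location: " + toCommentSafePath(path));
    }
}
```

A source generator could call such a helper on every path before writing it into a comment. Here it should print //location: D:/Simple/units.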

The compiler translates Unicode escapes into the characters they represent before it parses a program into tokens. It also does so before discarding comments and white space. 

Let us consider another example:

public static void main(String[] args) {
    // Note: \u000A is Unicode representation of linefeed (LF)
    char c = 0x000A;
    System.out.println(c);
}

It won't compile. You will get an error like ';' expected. This program contains a single Unicode escape (\u000A), located in its comment. As the comment tells you, this escape represents the linefeed character, and the compiler translates it before discarding the comment. Unfortunately, this linefeed character is the first line terminator after the two slash characters that begin the comment (//) and so terminates the comment. The words following the escape (is Unicode representation of linefeed (LF)) are therefore not part of the comment; nor are they syntactically valid.

So the above example effectively becomes:

public static void main(String[] args) {
    // Note: 
is Unicode representation of linefeed (LF)
    char c = 0x000A;
    System.out.println(c);
}
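A simple way around the problem, sketched here as one option among several, is to describe the character in the comment with the U+ notation instead of an escape. The class and method names are made up for this example.

```java
class LinefeedDemo {
    // Note: U+000A is the Unicode code point of linefeed (LF)
    static char linefeed() {
        return 0x000A;
    }

    public static void main(String[] args) {
        char c = linefeed();
        // Print the numeric value, since the character itself is invisible.
        System.out.println((int) c);
    }
}
```

This version compiles, because the comment no longer contains an escape for the compiler to translate, and it should print 10.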

 

Any character in a program may be written as a Unicode escape, but such programs are not very readable, except by the Java compiler. You can even write a complete program using Unicode escapes. Consider a .java file with the following contents:

\u0070\u0075\u0062\u006c\u0069\u0063 \u0063\u006c\u0061\u0073\u0073\u0020\u0054\u0065\u0073\u0074
\u007b\u007d

This will compile if the file is named Test.java, because the above code is the same as:

public class Test
{}
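To see how such a file could be produced, the following sketch converts ordinary text into escape form. The class UnicodeEscaper and its escape method are hypothetical helpers written for this note.

```java
class UnicodeEscaper {
    // Hypothetical helper: rewrites every UTF-16 code unit of the input
    // as a backslash, a 'u' and four lowercase hexadecimal digits.
    static String escape(String source) {
        StringBuilder out = new StringBuilder();
        for (char ch : source.toCharArray()) {
            out.append(String.format("\\u%04x", (int) ch));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("public"));
    }
}
```

For the input "public" this should print \u0070\u0075\u0062\u006c\u0069\u0063, which matches the start of the escaped file shown above.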

Java provides no special treatment for Unicode escapes within string literals. 

Consider another example. Given that \u0022 is the Unicode escape for the double quote ("), what will the line below print?

System.out.println("a\u0022.length() + \u0022b".length());

It will print 2. The compiler translates Unicode escapes into the characters they represent before it parses the program into tokens, such as string literals. Therefore, the first Unicode escape in the program closes a one-character string literal ("a"), and the second one opens a one-character string literal ("b"). The program prints the value of the expression "a".length() + "b".length(), or 2.

System.out.println("a\u0022.length() + \u0022b".length()); is the same as:

System.out.println("a".length() + "b".length());

If you want to put the two double quote characters into the string literal, you can do it with normal escape sequences. You cannot do it with Unicode escapes, because they are translated before the string literal is recognized. Using the normal escape sequence \" we can write the above as "a\".length() + \"b".length(), which prints 16, the number of characters in the single string literal.
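The two variants can be compared side by side in a small sketch. The class and method names are invented for this example.

```java
class QuoteEscapeDemo {
    // The Unicode escapes become real double quotes before tokenizing,
    // so this is the sum of the lengths of two one-character strings.
    static int withUnicodeEscapes() {
        return "a\u0022.length() + \u0022b".length();
    }

    // The backslash-quote escapes stay inside one string literal,
    // so this is the length of that single 16-character string.
    static int withStringEscapes() {
        return "a\".length() + \"b".length();
    }

    public static void main(String[] args) {
        System.out.println(withUnicodeEscapes());
        System.out.println(withStringEscapes());
    }
}
```

This should print 2 and then 16.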

Avoid Unicode escapes except where they are truly necessary. They are rarely necessary. Unicode escapes are essential when you need to insert characters that can't be represented in any other way into your program. Avoid them in all other cases. Unicode escapes reduce program clarity and increase the potential for errors.
