To get a better idea of what Unicode is, let us first look at what a code page is.
A code page is another name for a character encoding. A character encoding pairs each character from a given repertoire with something else, such as a sequence of natural numbers, octets, or electrical pulses, in order to facilitate the transmission of data (generally numbers and/or text) through telecommunication networks or the storage of text in computers.
Some relevant terms:
Character: a, b, c, A, B,…
Coded character: A=65, B=66,…
Character Set: A set of characters, to be used together (e.g. Latin alphabet)
Code page: A set of coded characters (e.g. ISO-8859-1, Shift-JIS)
Locale: Code page + properties and rules (e.g. isdigit, collation, …)
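The terms above can be made concrete with a short sketch. Python is used here purely for illustration (the text itself names no language); `iso-8859-1` and `shift_jis` are the codec names Python's standard library uses for the two code pages mentioned above, and the helper functions are hypothetical names introduced for this example.

```python
# Coded character: a character paired with a number.
print(ord("A"), ord("B"))

def encode_in_code_page(text: str, code_page: str) -> bytes:
    """Return the bytes that `code_page` assigns to `text`."""
    return text.encode(code_page)

def is_in_code_page(text: str, code_page: str) -> bool:
    """True if every character of `text` belongs to the code page's character set."""
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False

# One byte in the Western European code page.
print(encode_in_code_page("\u00e9", "iso-8859-1"))   # U+00E9, e with acute accent
# Two bytes in the Japanese code page.
print(encode_in_code_page("\u3042", "shift_jis"))    # U+3042, hiragana letter A
# The ISO-8859-1 character set contains no hiragana.
print(is_in_code_page("\u3042", "iso-8859-1"))
```

The last call shows the point of a character set: a code page can only code the characters that belong to its set.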
Why the Need for Unicode
Every standard code page supports only a certain group of languages (e.g. Western European, Eastern European, Japanese).
Within one computer system, only one code page can be supported in a clean way. Therefore, a universal code page that supports all letters, punctuation marks, technical symbols, etc. of all languages is required.
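Why mixing code pages is not clean can be shown with a single byte. The following is a minimal sketch, assuming Python's standard codecs for three ISO-8859 code pages; `decode_with` is a hypothetical helper, not part of any library.

```python
def decode_with(raw: bytes, code_pages: list) -> dict:
    """Decode the same bytes under each code page and collect the results."""
    return {code_page: raw.decode(code_page) for code_page in code_pages}

raw = b"\xe9"  # a single byte; nothing in the data says which code page it uses
results = decode_with(raw, ["iso-8859-1", "iso-8859-5", "iso-8859-7"])
for code_page, text in results.items():
    # Show the character and its Unicode code point for each interpretation.
    print(code_page, text, f"U+{ord(text):04X}")
```

The same byte becomes a Latin, a Cyrillic, or a Greek letter depending on the code page, so text is only readable if sender and receiver agree on one code page.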
Unicode is designed as a superset of the existing character sets. Unicode encodes plain text (no rendering information). It defines characters, not glyphs (semantics, not visual representation). Unicode unifies characters that are shared between scripts (CJK unification; CJK = Chinese, Japanese, Korean).
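The Unicode code space has room for 1,114,112 code points (U+0000 to U+10FFFF), i.e. roughly one million characters. In the 16-bit encoding form, the code points of the Basic Multilingual Plane (U+0000 to U+FFFF, about 64,000 characters once the 2,048 surrogate code points are excluded) are each coded by one 16-bit code unit. All further characters are coded by two 16-bit code units (a surrogate pair).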
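How a character beyond U+FFFF is split into two 16-bit units can be sketched as follows. The arithmetic is the standard UTF-16 surrogate calculation; the function name `to_surrogate_pair` and the sample code point U+1F600 are choices made for this example only.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000       # 20 significant bits remain
    high = 0xD800 + (offset >> 10)      # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go into the low surrogate
    return high, low

# Size of the code space: U+0000 through U+10FFFF.
CODE_SPACE_SIZE = 0x10FFFF + 1

high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")
# Cross-check against Python's own UTF-16 encoder (big-endian, no byte order mark).
print(chr(0x1F600).encode("utf-16-be").hex().upper())
```

Both print statements should show the same four hexadecimal bytes, since the built-in encoder performs the same split.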
Unicode Encoding Forms
UTF-8: byte-based encoding scheme; one character is coded with 1-4 bytes; compatible with 7-bit ASCII.
UTF-16: 16-bit units; frequently used characters occupy one 16-bit unit; further characters are coded with two 16-bit units.
UTF-32: 32-bit units; fixed size for all characters.
Note: All encoding forms support the same number of characters.
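The size differences between UTF-8, UTF-16, and UTF-32 can be checked with a short sketch. It assumes Python's built-in codecs; the `-le` variants are used only so that no byte order mark is counted, and the four sample characters are arbitrary picks, one for each UTF-8 length.

```python
FORMS = ("utf-8", "utf-16-le", "utf-32-le")

def encoded_sizes(text: str) -> dict:
    """Number of bytes `text` occupies in each encoding form."""
    return {form: len(text.encode(form)) for form in FORMS}

def round_trips(text: str) -> bool:
    """Every form covers the same characters, so decoding restores the text."""
    return all(text.encode(form).decode(form) == text for form in FORMS)

# U+0041, U+00E9, U+20AC, U+1F600: need 1, 2, 3 and 4 bytes in UTF-8.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

print(round_trips("A\u00e9\u20ac\U0001F600"))
```

Only the storage size differs between the forms; the round trip succeeds in all three because they encode the same set of characters.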
Unicode in SAP
UTF-8: for external communication (e.g. file, network); no endian problems; minimum average data size; limited backward compatibility to non-Unicode systems.
UTF-16: internal (in memory); best compromise between memory usage and algorithmic complexity; fits the Java and Microsoft environments; best way to migrate existing ABAP and C programs.
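The trade-offs listed above (endianness and data size) can be illustrated outside of SAP with a small sketch. This is plain Python, not SAP or ABAP code; the sample string is made up, and the function names are hypothetical.

```python
def utf16_depends_on_byte_order(text: str) -> bool:
    """UTF-16 serializes differently for little- and big-endian layouts."""
    return text.encode("utf-16-le") != text.encode("utf-16-be")

def size_in(text: str, form: str) -> int:
    """Number of bytes `text` occupies in the given encoding form."""
    return len(text.encode(form))

# Byte order matters for 16-bit units when data leaves the machine...
print("A".encode("utf-16-le"), "A".encode("utf-16-be"))
# ...but UTF-8 is a sequence of single bytes, so there is no order to get wrong.
print("A".encode("utf-8"))

# Mostly-ASCII business text is smaller in UTF-8 than in UTF-16.
sample = "Order 4711: M\u00fcller, 10 \u20ac"
print(size_in(sample, "utf-8"), size_in(sample, "utf-16-le"))
```

In memory, by contrast, almost every character of such text is exactly one 16-bit unit, which keeps indexing and length calculations simple.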