Unicode in SAP

To understand what Unicode is, let us first look at what a code page is.

Code Page

A code page is another name for a character encoding. A character encoding pairs each character from a given repertoire with something else, such as a sequence of natural numbers, octets or electrical pulses, so that data (generally numbers and/or text) can be transmitted over telecommunication networks or stored in computers.

Some relevant terms:

Character: a, b, c, A, B,…

Coded character: A=65, B=66,…

Character Set: A set of characters, to be used together (e.g. Latin alphabet)

Code page: A set of coded characters (e.g. ISO-8859-1, Shift-JIS)

Locale: Code page + properties and rules (e.g. isdigit, collation, …)
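
The terms above can be illustrated with a minimal Python sketch. The helper name `encode_in` is hypothetical and exists only for this example; the codec names `iso-8859-1` and `shift_jis` are Python's standard names for the two code pages mentioned in the list.

```python
# A coded character pairs a character with a number.
print(ord("A"), ord("B"))  # 65 66

def encode_in(text: str, code_page: str) -> bytes:
    """Hypothetical helper: turn text into bytes using the given code page."""
    return text.encode(code_page)

# The same idea (character -> number -> bytes), but each code page
# covers a different character set and uses its own byte values.
latin1_bytes = encode_in("é", "iso-8859-1")  # one byte: 0xE9
sjis_bytes = encode_in("あ", "shift_jis")    # two bytes: 0x82 0xA0

print(latin1_bytes.hex(), sjis_bytes.hex())
```

Plain ASCII letters such as A come out as the same single byte in both code pages, which is why problems usually only show up with language-specific characters.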

Why the Need for Unicode

Every standard code page supports only a certain group of languages (e.g. Western European, Eastern European, Japanese).

Within one computer system only one code page can be supported in a clean way. Therefore a universal code page that supports all letters, punctuation signs, technical symbols etc. of all languages is required.
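
A small Python sketch of this limitation, assuming nothing beyond the standard codecs: the hypothetical helper `can_encode` checks whether a code page can represent a text at all, and the last lines show how the same byte is read as different characters under different code pages.

```python
def can_encode(text: str, code_page: str) -> bool:
    """Hypothetical helper: True if every character exists in the code page."""
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False

german = "Grüße"
japanese = "日本語"

print(can_encode(german, "iso-8859-1"))    # Western European text fits
print(can_encode(japanese, "iso-8859-1"))  # Japanese text does not
print(can_encode(japanese, "shift_jis"))   # it needs a Japanese code page
print(can_encode(german + " " + japanese, "utf-8"))  # Unicode covers both

# The same byte means different characters depending on the code page.
raw = b"\xe9"
as_western = raw.decode("iso-8859-1")   # 'é'
as_cyrillic = raw.decode("iso-8859-5")  # 'щ'
print(as_western, as_cyrillic)
```

Mixing the German and Japanese strings in one field is exactly the case that a single traditional code page cannot handle cleanly.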

Unicode Features

Unicode is designed as a superset of the existing character sets. It encodes plain text only, with no rendering information, and it defines characters rather than glyphs (semantics, not visual representation). Unicode also unifies characters used in different scripts (CJK unification; CJK = Chinese, Japanese, Korean).

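Unicode has room for 1,114,112 code points (U+0000 to U+10FFFF), i.e. roughly one million characters. In UTF-16, the roughly 64,000 characters of the Basic Multilingual Plane are each coded by one 16-bit code unit. Further characters are coded by two 16-bit code units (a surrogate pair).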
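
The surrogate mechanism can be sketched in a few lines of Python. The function name `to_surrogates` is made up for this illustration; the arithmetic follows the standard UTF-16 rule for code points above U+FFFF.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a high and a low surrogate."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits
    return high, low

# 17 planes of 65,536 code points each give the total code space.
total_code_points = 17 * 0x10000
print(total_code_points)  # 1114112

high, low = to_surrogates(0x1F600)  # U+1F600, an emoji outside the BMP
print(hex(high), hex(low))          # 0xd83d 0xde00
```

The result matches what Python's own `utf-16-be` codec produces for the same character, which is a convenient cross-check.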

Figure: Unicode - Detailed view

Unicode Encoding Forms

UTF-8

byte-based encoding scheme; one character is coded with 1-4 bytes; compatible with 7-bit ASCII.

UTF-16

16-bit units; frequently used characters occupy one 16-bit unit; further characters are coded with two 16-bit units.

UTF-32

32-bit units; fixed size for all characters.

Note: All encoding forms can represent the same set of characters; they differ only in how many bytes each character takes.
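
To make the size differences concrete, here is a minimal Python sketch. The helper `encoded_sizes` is hypothetical; the little-endian codec variants are used so that no byte order mark is added to the counts.

```python
FORMS = ("utf-8", "utf-16-le", "utf-32-le")

def encoded_sizes(ch: str) -> dict[str, int]:
    """Hypothetical helper: bytes needed for one character in each form."""
    return {form: len(ch.encode(form)) for form in FORMS}

for ch in ("A", "é", "€", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# Every form can round-trip the same text, so none of them loses characters.
sample = "A é € \U0001F600"
for form in FORMS:
    assert sample.encode(form).decode(form) == sample
```

An ASCII letter takes 1, 2 and 4 bytes respectively, while a character outside the Basic Multilingual Plane takes 4 bytes in all three forms.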

Unicode in SAP

UTF-8

for external communication (e.g. file, network); no endian problems; minimum average data size; limited backward compatibility with non-Unicode systems.

UTF-16

internal (in memory); best compromise between memory usage and algorithmic complexity; fits Java and Microsoft environments; best way to migrate existing ABAP and C programs.
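
The "no endian problems" point can be illustrated with a short Python sketch. This is only a demonstration using Python's standard codecs, not SAP's own implementation; the helper name `encode_all` is an assumption made for the example.

```python
import codecs

def encode_all(text: str) -> dict[str, bytes]:
    """Hypothetical helper: encode text in UTF-8 and both UTF-16 byte orders."""
    return {
        "utf-8": text.encode("utf-8"),
        "utf-16-le": text.encode("utf-16-le"),
        "utf-16-be": text.encode("utf-16-be"),
    }

forms = encode_all("A")
print(forms["utf-8"])      # b'A'      one byte, no byte order involved
print(forms["utf-16-le"])  # b'A\x00'  low byte first
print(forms["utf-16-be"])  # b'\x00A'  high byte first

# When UTF-16 data leaves the machine, a byte order mark (BOM) can tell
# the reader which order was used; UTF-8 does not need one.
payload = codecs.BOM_UTF16_BE + "SAP".encode("utf-16-be")
print(payload.decode("utf-16"))  # SAP

# For mostly-ASCII text, UTF-8 is also the smaller format on the wire.
text = "Unicode in SAP"
print(len(text.encode("utf-8")), len(text.encode("utf-16-le")))  # 14 28
```

Inside a single process the byte order is fixed by the platform, so the endian question only arises once UTF-16 data is written to a file or sent over a network.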

© SAP 2008