CHARACTER ENCODING describes the transformation of textual data from one format to another so that it can be stored, transmitted, and read by a receiving node that may run a system with different specifications. A closely related example is binary-to-text encoding, in which binary data is represented as plain text so that it can travel through text-only channels such as electronic mail (email).
Character encoding is not to be mistaken for encryption, which converts data into a format that is accessible only to people or processes capable of decrypting it. Character encoding uses publicly available schemes and can therefore be reversed very easily. The process relies on published algorithms alone and does not require keys.
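To illustrate the point that encoding is reversible without a key, here is a minimal sketch using Python's standard base64 module. The sample bytes are arbitrary and chosen only for illustration; they do not come from any real message.

```python
import base64

# Arbitrary binary data (illustrative values, not from the article).
raw = bytes([0x00, 0xFF, 0x10, 0x80])

# Binary-to-text: the result contains only ASCII characters, so it can
# pass through a plain-text channel such as email.
encoded = base64.b64encode(raw).decode("ascii")

# Reversal needs no key, only knowledge of the public scheme.
decoded = base64.b64decode(encoded)

print(encoded)          # AP8QgA==
print(decoded == raw)   # True
```

Because anyone can run the second step, Base64 offers no confidentiality; it only makes data safe to carry as text.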
How Character Encoding Works
The source code of any software project is filled with many complexities, but a computer can only process information in the form of ones and zeroes. This is known as binary data. The purpose of encoding (character encoding in particular) is to convert the textual output of high-level programming languages, as well as text provided by end users, into a format that computers can store, process, and transmit.
The American Standard Code for Information Interchange (ASCII) is a long-established standard of character encoding for electronic communications, first published in the 1960s. It is most appropriate for files that consist of plain English text. The ASCII standard represents characters as specific numbers, with each character assigned a number ranging from 0 to 127. In this encoding scheme, each letter, digit, punctuation mark or control character is represented by a binary number that contains 7 bits.
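The mapping from characters to 7-bit numbers can be inspected directly. The sketch below uses Python's built-in ord and str.encode; the helper name to_seven_bits is hypothetical and introduced only for this example.

```python
def to_seven_bits(text):
    """Return (character, code, 7-bit binary string) for each ASCII character.

    Hypothetical helper for illustration; raises UnicodeEncodeError
    if the text contains anything outside the ASCII range 0-127.
    """
    text.encode("ascii")  # validates that every character is ASCII
    return [(ch, ord(ch), format(ord(ch), "07b")) for ch in text]


for ch, code, bits in to_seven_bits("Hi!"):
    print(ch, code, bits)
# H 72 1001000
# i 105 1101001
# ! 33 0100001

# A character outside the 0-127 range cannot be encoded as ASCII.
try:
    to_seven_bits("é")
except UnicodeEncodeError:
    print("not representable in ASCII")
```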
ANSI encoding is an informal name for a family of 8-bit character sets, named after the American National Standards Institute (ANSI), a body that coordinates voluntary standards in the United States. ASCII itself was published by ANSI's predecessor, the American Standards Association. The "ANSI" code pages used on Windows, such as Windows-1252, were based on an ANSI draft but were never formally standardized by ANSI, so the name is historical rather than exact. These code pages are extensions of ASCII: they keep the original 128 codes and add 128 more. So while ASCII characters use 7-bit codes, ANSI characters use 8 bits.
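As a rough illustration of the 8-bit extension, the sketch below assumes Windows-1252 (Python codec name "cp1252") as a representative "ANSI" code page; other code pages assign the upper 128 values differently.

```python
# Assumption: cp1252 (Windows-1252) stands in for "ANSI" here.
CODEC = "cp1252"

# Characters in the ASCII range keep the same single-byte value.
print("A".encode(CODEC))   # b'A'

# Characters in the extended range use byte values from 128 to 255.
print("é".encode(CODEC))   # b'\xe9'
print("€".encode(CODEC))   # b'\x80'

# Every character still occupies exactly one 8-bit byte.
sample = "Aé€"
encoded = sample.encode(CODEC)
print(len(sample), len(encoded))   # 3 3
```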
The Unicode standard defines the encoding of text in the majority of operating systems in use today. It assigns each character a unique code point, or number. Two approaches to character mapping are associated with it: UTF (Unicode Transformation Format), and UCS (Universal Coded Character Set), which is maintained in collaboration with the International Organization for Standardization (ISO). Both character sets are practically identical today, so the following will focus on the UTF encodings, grouped by code unit size:
UTF-8, which uses a code unit of 8 bits, makes use of 1 byte to represent characters that are in the ASCII set, 2 bytes for those in several additional alphabet blocks, 3 bytes for the rest of the Basic Multilingual Plane (BMP), and 4 bytes for characters outside the BMP.
UTF-16 uses 2 bytes for any character in the BMP and 4 bytes (a surrogate pair) for additional characters. The unit for this encoding scheme is 16 bits.
UTF-32 utilizes a 32-bit code unit and 4 bytes for all characters.
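The byte counts for the three UTF encodings can be compared with a short sketch. The sample characters are chosen for illustration, and the little-endian codec variants ("utf-16-le", "utf-32-le") are used so that Python does not prepend a byte order mark to the output.

```python
# Illustrative characters: ASCII, Latin-1 range, elsewhere in the
# Basic Multilingual Plane, and one outside it (an emoji).
SAMPLES = ["A", "é", "€", "😀"]


def byte_lengths(ch):
    """Return the encoded length of ch in UTF-8, UTF-16 and UTF-32."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )


for ch in SAMPLES:
    utf8, utf16, utf32 = byte_lengths(ch)
    print(f"U+{ord(ch):04X}  utf-8={utf8}  utf-16={utf16}  utf-32={utf32}")
# U+0041  utf-8=1  utf-16=2  utf-32=4
# U+00E9  utf-8=2  utf-16=2  utf-32=4
# U+20AC  utf-8=3  utf-16=2  utf-32=4
# U+1F600  utf-8=4  utf-16=4  utf-32=4
```

UTF-8 is the most compact choice for text that is mostly ASCII, while UTF-32 trades space for a fixed width per code point.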
Character Decoding is the exact opposite of the encoding process: a computer converts previously encoded bytes back into human-readable text, such as plain text or HTML markup. The purpose of this process is to undo any encoding that was necessary for transmission, recreating the original textual data on the receiving device or node. Decoding only works correctly when the receiver applies the same scheme that the sender used.
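A small sketch of decoding, including what happens when the receiver assumes the wrong scheme. The sample word and the choice of cp1252 as the mismatched codec are illustrative assumptions.

```python
original = "café"

# Sender encodes the text as UTF-8 for transmission.
wire_bytes = original.encode("utf-8")
print(wire_bytes)                      # b'caf\xc3\xa9'

# Receiver decodes with the same scheme and recovers the text exactly.
decoded_ok = wire_bytes.decode("utf-8")
print(decoded_ok == original)          # True

# Receiver decodes with a different scheme: no error is raised, but the
# two UTF-8 bytes of "é" are misread as two separate characters.
decoded_wrong = wire_bytes.decode("cp1252")
print(decoded_wrong)                   # cafÃ©
```

This kind of garbling is why protocols such as HTTP and email carry a declared character set alongside the content.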
Malicious Attacks in Character Encoding
With websites came a new generation of malicious software, and attackers have used Unicode character sets to disguise such threats. The same is true of Base64, a binary-to-text scheme that represents data using a small set of ASCII characters and can therefore hide code from casual inspection. In fact, Base64 attacks became widespread among PHP applications in the early 2010s, most notably among website themes for the WordPress content management system (CMS). This contributed to the popularity of website security tools and antivirus software developed specifically for web applications.
If you are a webmaster or system administrator, it is important to stay abreast of Base64 attacks, especially if your web property utilizes any form of open source technology. Doing so will not only keep your web assets safe, but will also ensure that the hard work you’ve invested in the content and development of your properties remains intact and in good standing.
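As a starting point for spotting Base64-obfuscated code in PHP files, the following is a rough heuristic sketch, not a complete scanner. It assumes the common pattern of eval wrapped around base64_decode with a quoted literal; the function name find_base64_payloads and the sample string are hypothetical, and real malware often uses variations that this pattern will miss.

```python
import base64
import re

# Heuristic: eval(base64_decode('...')) with a quoted Base64 literal.
PATTERN = re.compile(
    r"eval\s*\(\s*base64_decode\s*\(\s*['\"]([A-Za-z0-9+/=]+)['\"]",
    re.IGNORECASE,
)


def find_base64_payloads(php_source):
    """Return the decoded text of each suspicious Base64 literal found."""
    payloads = []
    for match in PATTERN.finditer(php_source):
        try:
            decoded = base64.b64decode(match.group(1), validate=True)
        except ValueError:
            # binascii.Error is a ValueError: the literal was not valid Base64.
            continue
        payloads.append(decoded.decode("utf-8", errors="replace"))
    return payloads


# Harmless stand-in for an obfuscated theme file (hypothetical sample).
hidden = base64.b64encode(b"echo 'hi';").decode("ascii")
sample = f"<?php eval(base64_decode('{hidden}')); ?>"

print(find_base64_payloads(sample))   # ["echo 'hi';"]
```

Decoding the literal, rather than only flagging it, lets an administrator read what the hidden code would have executed before deciding how to respond.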