III - ASCII and character encoding


The American Standards Association began developing ASCII in response to the need for a standard that worked across typewriters and from country to country. In 1968, the 7-bit ASCII standard was released; it included 32 control characters and 96 printing characters, for a total of 128 possible characters.

The first 32 characters (0-31) are unprintable control codes that were used to run printers, make terminals beep, and interface with the mechanics of 1960s computers in ways that rarely apply today.
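A few of these control codes still do something visible in a modern terminal. Here is a small Python sketch, just an illustration using standard escape sequences rather than anything taken from the standard document itself:

import sys

# Code point 7 is BEL: many terminals still beep or flash the window when it prints.
sys.stdout.write("\a")

# Code point 9 is HT (horizontal tab) and 13 is CR (carriage return),
# both originally instructions to a teleprinter's print head.
sys.stdout.write("col1\tcol2\n")
sys.stdout.write("overwritten\rOVER\n")   # CR sends the cursor back to column 0

print(ord("\a"), ord("\t"), ord("\r"))    # 7 9 13 -- their ASCII code points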

The remaining 96 characters are the printable ones: space, punctuation, digits, and the upper- and lowercase Latin alphabet (shown in the code chart below).
There is also a kind of unofficial extended ASCII table that came later. This added another 128 characters, like €, ©, ¶, and ö. These later 128 were never standardized; they responded to a more global text space and often varied from place to place. So the total number of ASCII characters is 128 or 256, depending on whom you ask.

These individual characters are each encoded as binary numbers. This is because the circuitry of the microprocessor at the heart of a modern computer can fundamentally do only two things: binary arithmetic and Boolean (true or false) logic. So, at the very highest level, ASCII lets you type a binary number made of 1s and 0s and get a character back, through a predefined character map. When a personal computer records the letter 'A' in a file, it does not store an image of the letter 'A' anywhere on the machine. Rather, it records the binary number that stands for 'A' in a character code table. The computer then uses that number to pull the shape of 'A' from the font file entry listed under the same binary number and draw it on the screen. The character 'A' in binary is '01000001', which is 65 in decimal. You can try this yourself:
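Here is a minimal sketch in Python (ord, chr, and format are standard built-ins; the snippet itself is just an illustration):

# The same lookup the paragraph describes, done by hand.
letter = "A"

code_point = ord(letter)               # 65 -- the decimal ASCII code for 'A'
as_binary = format(code_point, "08b")  # '01000001' -- the same number in binary

print(letter, code_point, as_binary)   # A 65 01000001

# And back the other way: from the binary number to the character.
print(chr(0b01000001))                 # A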
This allows us to store complicated “shapes” in very little storage space, because we only record numerical stand-ins and look them up in a table, just like the Chinese Telegraph codes.
US-ASCII Code Chart, February 1972, General Electric Data communication Product Dept., Waynesboro, Virginia. Image from the Wikimedia Commons.
ASCII was adopted by all U.S. computer manufacturers except IBM, which, unwilling to give up the decades of dominance built on its own earlier character codes, developed a proprietary 8-bit character code (2^8 = 256 code points) called EBCDIC [pronounced eb-see-dick], which stands for "Extended Binary Coded Decimal Interchange Code." It was famously incomprehensible, the butt of many jokes, and not widely adopted outside IBM's own machines.

U.S. computer companies were by this point selling in international markets, and ASCII became an “international standard”. Of course, this meant adapting ASCII to individual countries. The International Organization for Standardization in Geneva recommended using the ASCII code as-is, with 10 code points left open for “national variants”. But many languages do not use Latin characters at all, and even those that do may need more than 10 “national variants” to fit into the table. These countries often had their own, or heavily adapted, standards. These were often specific to an individual language, but in general computer makers defined country-specific “code pages” that took the undefined space from 128-255 in the extended ASCII table and mapped it to whatever characters they needed.
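To make the code-page idea concrete, here is a small Python sketch using the interpreter's built-in codecs for a few historical code pages (the byte value 0xE4 is an arbitrary example). The point is only that the same byte above 127 means a different character on each one:

# One byte from the "undefined" 128-255 range of extended ASCII...
raw = bytes([0xE4])

# ...decoded under different code pages, gives entirely different characters.
print(raw.decode("cp437"))      # 'Σ' -- original IBM PC code page
print(raw.decode("latin-1"))    # 'ä' -- Western European (ISO 8859-1)
print(raw.decode("iso8859_7"))  # 'δ' -- Greek (ISO 8859-7)
print(raw.decode("cp1251"))     # 'д' -- Russian/Cyrillic (Windows-1251)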
Extended ASCII Table. Image from Lookuptables.com.
Despite the obvious flaws with this system (more on that later), ASCII was by far the dominant text scheme on the early internet. No conversation around ASCII would be complete without discussing ASCII art, which was a culturally seminal part of the early internet. ASCII art is made of the 96 printable characters, and was used on forums, on BBSes, on Usenet, as email sign-offs, and as graphics in early (and some contemporary!) games. ASCII art arose partly because early printers often lacked graphics capability, so characters were used to represent images instead, a visual language borrowed from typewriter art. It also served as a stand-in when actual images were too big to transmit over low bandwidths.
(From http://www.ascii-art.de/ascii/ab/armadillo.txt)
Despite this prevalence of ASCII from the 1970s through the 1990s, 128 characters plus an additional undefined 128 obviously aren’t enough for the entire world: code pages (lookup tables that defined which number mapped to which letter) varied by country. This works fine in theory, and it certainly was a functional stopgap for internal use, say on an American-made computer at an Iranian company. However, when computers with different code pages exchange data, it truly all goes to hell. If you send an email from a computer using a Greek code page to one using a Russian (Cyrillic) code page, the text that arrives is incomprehensible, because the character mapped to, say, 166 is completely different on each machine.

Although cross-language data transmission had always been a problem, it truly took center stage with the proliferation of the World Wide Web. Early browsers made some attempts to sort this out automatically: schemes that tried to detect the language of a website by counting characters and matching the most common code points to a language (in English, the letters E and T). This was incredibly unreliable, though, and did little to fix the scramble of text. This problem is essentially why ASCII (and the closely related ISO and ANSI encodings) are practically defunct. UTF-8 has been the standard text encoding of the web since the early 2000s, and it is the subject of the next chapter: Unicode, plaintext, and emoji.
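Before moving on, the Greek-to-Russian scramble described above is easy to reproduce with Python's built-in codecs; the sample phrase and the specific code pages here are illustrative choices, not an example taken from the text:

# The sender writes Greek text and stores it under a Greek code page.
message = "Γειά σου"                       # "hello" in Greek
on_the_wire = message.encode("iso8859_7")  # the bytes the Greek machine sends

# The receiver assumes a Russian (Cyrillic) code page and decodes the same bytes.
print(on_the_wire.decode("cp1251"))        # Cyrillic gibberish: every byte above 127 maps to the wrong letter

# Only the original code page recovers the text.
print(on_the_wire.decode("iso8859_7"))     # Γειά σου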