
Characters, Words and Text

Craig L. Scanlan, EdD, RRT, FAARC
Professor, Department of Interdisciplinary Studies
UMDNJ-School of Health Related Professions

 

In the early days of computing, all software and all interfaces were character-based.  Character-based applications treat a display screen as an array of boxes (typically 25 rows by 80 columns) each of which can hold one character. In these systems, everything that appears on the screen, including all letters, numbers, spaces and graphics symbols, is considered a character.

 

In modern graphics-based applications, the term ‘character’ is generally reserved for letters, numbers, and punctuation. Graphics-based programs treat the display screen as an array of millions of pixels, with characters and other objects formed by illuminating patterns of pixels.

Character Coding Systems and Sets

Regardless of display format, in order for text to have ‘meaning’ to a computer, it must be converted into binary machine code (0s and 1s). In most systems, this is done using a standard coding system. Like a shared language, standard character coding systems help different types of computers and communication equipment interchange data and communicate with each other.

 

Unfortunately, there is no single standard character coding system. Older character-based DOS programs use the American Standard Code for Information Interchange (ASCII, pronounced “ask-key”). Microsoft Windows programs use the American National Standards Institute (ANSI) character set. Web browsers use the International Organization for Standardization (ISO) Latin 1 character set (officially named ISO-8859-1). The most ambitious of all is Unicode (ISO-10646), a comprehensive character set designed to cover all the world's languages (living and dead), as well as providing representation of all major scientific symbols. Unicode is used in Windows 98(SE), NT, 2000 and XP. Luckily, the similarities among these character sets are greater than their differences, at least for most simple text-based communication.

ASCII

The ASCII system represents all standard number, English letter and symbol input from the keyboard as a seven-bit code (for example, the capital letter “A” is represented in ASCII as binary 1000001/decimal 65). This provides 128 possible characters (decimal 0 to 127). The first 32 ASCII characters are control characters, used for printing and transmission control (such as Linefeed - #10 and Carriage Return - #13). The next 96 codes cover the digits, the standard English alphabet (lower and upper case/caps), the common punctuation marks, and the blank space character; the last of them, #127, is the Delete control character. Since the normal data ‘chunk’ used by computers is the 8-bit byte [allowing 256 possible characters] and since ASCII uses only 7 bits, the extra bit can be used either for error checking or to create special symbols (see extended ASCII, below).
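The seven-bit coding and the spare eighth bit can be illustrated with a short Python sketch. The function names are invented for this example, and even parity is only one possible error-checking scheme, assumed here for illustration.

```python
def describe(ch: str) -> tuple[int, str]:
    """Return the decimal ASCII code of ch and its 7-bit binary form."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return code, format(code, "07b")


def with_even_parity(code: int) -> int:
    """Set the eighth bit so the total number of 1 bits is even (assumed scheme)."""
    ones = bin(code).count("1")
    return code | 0x80 if ones % 2 else code


print(describe("A"))                        # (65, '1000001')
print(ord("\n"), ord("\r"))                 # 10 13  (Linefeed, Carriage Return)
print(format(with_even_parity(65), "08b"))  # 01000001
print(format(with_even_parity(67), "08b"))  # 11000011
```

The letter “A” already has an even number of 1 bits, so its eighth bit stays 0; “C” (decimal 67) has three, so the eighth bit is set.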

 

The following table specifies the printable ASCII characters (decimal 32-126), followed by the Delete control character (#127). Note that ASCII #32 is the blank space character (a printable character):

 

Dec  Char     Dec  Char     Dec  Char     Dec  Char     Dec  Char     Dec  Char
 32  (space)   48  0         64  @         80  P         96  `        112  p
 33  !         49  1         65  A         81  Q         97  a        113  q
 34  "         50  2         66  B         82  R         98  b        114  r
 35  #         51  3         67  C         83  S         99  c        115  s
 36  $         52  4         68  D         84  T        100  d        116  t
 37  %         53  5         69  E         85  U        101  e        117  u
 38  &         54  6         70  F         86  V        102  f        118  v
 39  '         55  7         71  G         87  W        103  g        119  w
 40  (         56  8         72  H         88  X        104  h        120  x
 41  )         57  9         73  I         89  Y        105  i        121  y
 42  *         58  :         74  J         90  Z        106  j        122  z
 43  +         59  ;         75  K         91  [        107  k        123  {
 44  ,         60  <         76  L         92  \        108  l        124  |
 45  -         61  =         77  M         93  ]        109  m        125  }
 46  .         62  >         78  N         94  ^        110  n        126  ~
 47  /         63  ?         79  O         95  _        111  o        127  DEL

 

Extended ASCII. Extended ASCII (also called ‘high’ ASCII) is the second half of the ASCII character set, characters 128 through 255. It is non-standard or proprietary ASCII, with the characters or symbols defined as needed by the software or hardware vendor. For example, IBM and Microsoft established a set of symbols in extended ASCII for the DOS operating system that can be used to ‘draw’ simple graphics on a text-based screen.

ANSI

The ANSI character set is used by Microsoft Windows 95/98. The ANSI set also includes 256 characters, numbered 0 to 255. Values 0 to 127 are the same as those in the standard ASCII set (above). However, the ‘high’ ANSI characters (128 to 255) are different and include many foreign letters, special punctuation, and business symbols.
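The difference between the two ‘high’ halves can be seen by decoding the same bytes both ways. This is a minimal Python sketch; it assumes code page 437 (`cp437`) as the DOS extended ASCII set and code page 1252 (`cp1252`) as the Windows ANSI set, neither of which is named in the text above.

```python
raw = bytes([0x41, 0xC9])  # 'A' (decimal 65) plus one 'high' byte (decimal 201)

dos_text = raw.decode("cp437")    # assumed DOS extended ASCII code page
ansi_text = raw.decode("cp1252")  # assumed Windows ANSI code page

print(dos_text)   # A╔  (a box-drawing symbol)
print(ansi_text)  # AÉ  (an accented letter)
```

The first byte decodes identically in both sets because it falls in the shared 0 to 127 range; only the high byte changes meaning.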

 

In Windows, you can enter any ANSI character by holding down the Alt key and typing the ANSI decimal code on the numeric keypad. For example, to type the copyright symbol (©) in Windows, you would type Alt+0169 (169 being the decimal numeric code for the copyright symbol). Alternatively, you can select ANSI characters for insertion (and view their code numbers) by accessing the Windows Character Map program. To do so, click:

 

Start -> Programs -> Accessories -> System Tools -> Character Map

ISO Latin 1

Officially named ISO-8859-1, ISO Latin-1 is a superset of the ASCII character set and is very similar (but not identical) to the ANSI character set used in Windows. Both HTTP (the Web's transfer protocol) and HTML (its markup language) are based on ISO Latin-1. Thus, when a Web browser, such as Netscape, formats a Web page on a client system, such as Windows, it maps the ISO Latin-1 characters as best it can into the native character set.

 

To represent non-ASCII characters on a Web page, you need to use the corresponding ISO Latin-1 code. To do so, you use a standard HTML format of an ampersand (&), followed by the pound sign (#), then the ISO decimal code, followed by a semicolon. For example, if you wanted to show a plus or minus symbol (±) on a Web page, you would insert &#177; where you wanted that symbol to appear. Likewise, "a" with grave accent is decimal 224 in ISO Latin-1. Therefore, in HTML "à" is coded as &#224;
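The ampersand, pound sign, decimal code, semicolon pattern is mechanical enough to sketch in a few lines of Python. The helper name `to_numeric_reference` is invented for this example; `html.unescape` is the standard-library function that converts such codes back to characters.

```python
import html


def to_numeric_reference(ch: str) -> str:
    """Build the HTML code for one character: &, #, decimal code, semicolon."""
    return f"&#{ord(ch)};"


print(to_numeric_reference("±"))  # &#177;
print(to_numeric_reference("à"))  # &#224;
print(html.unescape("&#169;"))    # ©
```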

 

Below is a table of some of the more commonly used non-ASCII characters (left column) and the HTML code needed to display them on a Web page:

 

Character   HTML code
¢           &#162;
£           &#163;
©           &#169;
«           &#171;
®           &#174;
°           &#176;
±           &#177;
²           &#178;
³           &#179;
µ           &#181;
¹           &#185;
»           &#187;
¼           &#188;
½           &#189;
¾           &#190;

 

 

For a complete list of all ISO Latin-1 special characters and their number codes, visit the following site:

 

http://hotwired.lycos.com/webmonkey/reference/special_characters/

 

Unicode

Unicode (ISO-10646) was developed in 1993 by Apple, Microsoft, HP, Digital and IBM to be the single international character set standard. As first released, Unicode was a 16-bit character set that coded over 34,000 distinct characters; later versions extend beyond 16 bits and code many more. Luckily, the first 256 Unicode values are identical to the ISO Latin-1 characters (above).
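The overlap between ISO Latin-1 and the first 256 Unicode values can be checked directly. The following Python sketch decodes every possible byte value as Latin-1 and compares each resulting character's Unicode number with the byte it came from; the variable names are illustrative only.

```python
# Decode all 256 byte values as ISO Latin-1, then compare each character's
# Unicode code point with the byte value it was decoded from.
latin1_text = bytes(range(256)).decode("latin-1")
matches = all(ord(ch) == value for value, ch in enumerate(latin1_text))

print(matches)                          # True
print(ord("©"), "©".encode("latin-1"))  # 169 b'\xa9'
```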

 

Unicode covers all the world's major living languages, in addition to scientific symbols and the so-called ‘dead’ languages (of scholarly interest). One of its major advantages is the elimination of the complex multibyte character sets previously used to support Asian languages. Unicode also supports "combining" accent characters, which follow the base character that they are to modify.
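A combining accent can be seen in action with Python's standard `unicodedata` module. In this sketch, the letter "a" followed by the combining grave accent is two code values, while the precomposed "à" from ISO Latin-1 is one; normalizing to the composed form (NFC) shows that they represent the same character.

```python
import unicodedata

combined = "a\u0300"    # 'a' followed by a combining grave accent
precomposed = "\u00e0"  # the single code value for 'à' (decimal 224)

print(unicodedata.name("\u0300"))       # COMBINING GRAVE ACCENT
print(len(combined), len(precomposed))  # 2 1
print(unicodedata.normalize("NFC", combined) == precomposed)  # True
```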

 

If you are using Windows 98 (Second Edition), Windows NT, Windows 2000, Windows XP or higher, you are using Unicode. Even if using the English language version of Windows, you can still insert characters from other languages (including Greek, Cyrillic, Hebrew, Arabic) into your documents using Word. To do so, open a blank document, click on Insert from the main menu, then Symbol, choose a standard font (e.g., Arial) from the left-hand drop-down box, then choose a Subset from the right-hand drop-down box.

Fonts (Character Display Designs)

A font represents a design for a set of characters, i.e., a way to display the characters on an output device (screen or printer). A font combines a particular style of typeface with other qualities, such as size, pitch, and spacing. For example, Times New Roman is a typeface that defines the shape of each character. Within Times New Roman, however, there are many fonts to choose from -- different sizes, italic, bold, and so on. Below are three examples of common font families: Times New Roman, Arial, and Courier:

 

 

When all software and all equipment (screens, printers) were character-based, there was little choice in fonts. Essentially all letters, characters and words appeared in the plain monospaced or fixed-pitch (typewriter-like) Courier font shown above. Nowadays, graphics-oriented operating systems and application programs give us a nearly unlimited choice of fonts. Most graphics-based operating systems come with a wide selection of basic fonts (more than most people will ever use). Additional fonts are often installed when setting up new printers. Specialized fonts can be downloaded or purchased as needed.

 

Computers and their peripheral devices use two methods to represent fonts. In a bit-mapped font, every character is represented by an array of dots. To print a bit-mapped character, a printer simply prints the corresponding dots.

 

The other method used to represent fonts is vector graphics. With vector-based fonts, the shape or outline of each character is defined geometrically. With the font defined by formula, the typeface can be displayed in virtually any size. Because they can be resized (or scaled), vector fonts also are called scalable fonts or outline fonts. Unlike bit-mapped fonts, scalable fonts display and print best at high resolution. The most widely used scalable-font systems are TrueType (Windows) and PostScript (Adobe).
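To make the contrast concrete, here is a small Python sketch of a bit-mapped character and what happens when it is enlarged. The 5-by-7 dot pattern for the letter "A" is made up for this example and is not taken from any real font. Enlarging it can only repeat the existing dots, which is why a resized bitmap looks blocky, whereas an outline font recomputes the shape at each size.

```python
# A hypothetical 5x7 bit-mapped glyph for the letter 'A' (1 = dot on, 0 = dot off).
GLYPH_A = [
    "01110",
    "10001",
    "10001",
    "11111",
    "10001",
    "10001",
    "10001",
]


def render(glyph: list[str]) -> str:
    """Turn rows of 0/1 into printable dots ('#' for on, '.' for off)."""
    return "\n".join(row.replace("1", "#").replace("0", ".") for row in glyph)


def scale(glyph: list[str], factor: int) -> list[str]:
    """Enlarge a bitmap by repeating every dot 'factor' times in each direction."""
    scaled = []
    for row in glyph:
        wide = "".join(bit * factor for bit in row)
        scaled.extend([wide] * factor)
    return scaled


print(render(GLYPH_A))
print()
print(render(scale(GLYPH_A, 2)))
```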

 

Despite the advantages of vector fonts, bit-mapped fonts are still widely used. One reason for this is that small vector fonts do not look very good on low-resolution devices, such as on some display monitors (which are low-resolution when compared with laser printers). Many computer systems, therefore, use bit-mapped fonts for screen displays. These are sometimes called screen fonts.

Document Formats

Traditionally, a document was defined as a file created by a text or word processor, consisting of formatted words (characters and spaces combined in lines, paragraphs, etc). Although modern documents produced by word processors can include many other elements (such as graphics and sounds), the basic definition still applies. Within this framework, there are three basic types of document formats: plain text, HTML and proprietary.

Plain Text

A plain text file is the most basic form of document, consisting solely of the basic 256 characters provided in the character set used to create it (see above). Although often called ASCII files (based on the original ASCII character set), a plain text file may actually be created in one of two basic formats: ASCII (also called ‘MS-DOS text’) or ANSI (called ‘plain text,’ ‘text only’ or just ‘text’). Since both formats use the same basic characters (codes 0 to 127), as long as no special characters are used (above 127), these two formats are equivalent.
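Whether a given file falls into the ‘equivalent’ case can be tested by checking that every byte is in the shared 0 to 127 range. A minimal Python sketch follows; the function name is invented, and `cp1252` is assumed here to stand for the ANSI format.

```python
def uses_only_basic_codes(data: bytes) -> bool:
    """True if every byte is in the shared range 0 to 127."""
    return all(byte < 128 for byte in data)


print(uses_only_basic_codes(b"Hello World"))           # True
print(uses_only_basic_codes("café".encode("cp1252")))  # False
```

A file that passes this check reads the same whether it is opened as MS-DOS text or as ANSI text.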

 

There are word processor programs designed specifically to create plain text files. These programs are typically called text editors, and are used extensively by programmers who must use plain text files to write computer instructions. If you are using Windows, you have a small text editor program already installed on your system, called Notepad. You can use Notepad to create or edit plain text files smaller than 64K.

 

Alternatively, all high-end word processors (Word, WordPerfect) allow creation of text files via the Save As… feature. For example, any MS Word 2000 file can be saved as a text file by following these simple steps:

 

  1. With a document in the document window, click File (main menu), then Save As
  2. In the Save in box, select the folder/directory where you want to save your document
  3. In the Save as type drop-down box select ‘Text only (*.txt)’
  4. In the File name box, edit the file name as desired, but be sure to keep the .txt extension
  5. Click the Save button to save your file as an ANSI text file

 

Unfortunately, MS Word makes matters needlessly complex by giving you six different text format choices (Text only, Text only with line breaks, MS-DOS text only, MS-DOS text only with line breaks, Text with layout, MS-DOS text with layout). KISS it! (Keep It Simple Stupid!) – always select Text only. This saves your document as an ANSI text file without formatting. All section breaks, page breaks, and new line characters are converted to paragraph marks.

HTML

HTML is short for HyperText Markup Language, the standard authoring language used to create documents on the World Wide Web. An HTML document is actually just a simple ISO Latin-1 text file. HTML defines the structure and layout of a Web document by using a variety of tags and attributes. Below is an example of the HTML needed to create a simple Web page that displays the text ‘Hello World’:

 

 

<HTML>
  <HEAD>
    <TITLE>Basic Web Page</TITLE>
  </HEAD>
  <BODY>
    <P>Hello World!</P>
  </BODY>
</HTML>

 

 

Essentially all HTML tags have two pieces – a start tag and an end tag. Start tags are enclosed in angle brackets (e.g., <HTML>), while end tags add a slash to indicate that the tag is complete (e.g., </HTML>).

 

All HTML documents start and end with the HTML start and end tags:

 

<HTML>… </HTML>

 

Within the HTML start and end tags goes the <BODY>…</BODY> of your document, i.e., all the information you'd like to include in your Web page (text, graphics, etc). In our example above, the HTML body consists solely of a single paragraph of text (‘Hello World’), set off with the paragraph tag set <P>…</P>:

 

<P>Hello World!</P>

 

Although not required, well-designed HTML documents also include a header, marked off using a <HEAD>…</HEAD> tag set. The HTML header can include critical information about the document (its title, author, keywords, etc) that does NOT display in the browser’s document window.

 

There are many other tags used to format and lay out the information in a Web page. For instance, <I>…</I> is used to italicize text. Tags also are used to specify the source for any graphics displayed in the page. As a true text file, an HTML document does not actually include any graphics directly. Instead, the file includes tags that specify where the graphic resides on the Web and where and how it should be displayed on the HTML page. Tags also are used to create hypertext links (using either words or images). These links are what allow users to jump to other Web pages with a simple mouse click.
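Because an HTML document is just text, a program can walk through it and pick out the start tags, end tags, and visible text. The sketch below does this for the ‘Hello World’ page using `HTMLParser` from Python's standard library; the `TagLister` class is invented for this example. Note that this parser reports tag names in lower case.

```python
from html.parser import HTMLParser

PAGE = """<HTML>
  <HEAD><TITLE>Basic Web Page</TITLE></HEAD>
  <BODY><P>Hello World!</P></BODY>
</HTML>"""


class TagLister(HTMLParser):
    """Record start tags, end tags, and visible text in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip the whitespace between tags; keep real text only.
        if data.strip():
            self.events.append(("text", data.strip()))


lister = TagLister()
lister.feed(PAGE)
lister.close()
for kind, value in lister.events:
    print(kind, value)
```

Every start tag in the listing is matched by an end tag, and the only text events are the title and the paragraph.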

 

 

For a summary ‘cheatsheet’ of HTML tags, check out the following link:

 

http://hotwired.lycos.com/webmonkey/reference/html_cheatsheet/

 

XML

XML is short for eXtensible Markup Language. XML is a ‘metalanguage’ that lets Web builders define their own markup languages. Rather than specifying the tags themselves, XML specifies only the tag syntax.

 

XML is meant to augment HTML. While HTML describes how to display a document's data, XML actually defines the data's content. For example, an HTML tag such as <I> specifies a certain font characteristic (italics). XML, on the other hand, describes the content that appears within the tags. For example, the following XML tag set defines a category of data called ‘ANIMAL’:

 

<ANIMAL>…</ANIMAL>

 

Within that category, we could place a data element called ‘Jaguar’:

 

<ANIMAL>Jaguar</ANIMAL>

 

By coding the data element ‘Jaguar’ into category ‘ANIMAL,’ we give it special meaning. For instance, searching for ‘Jaguar’ on the Web without XML would turn up thousands of documents about both the cat and the British sports car of the same name. However, using XML, you could easily narrow the search to pages that include information about the animal only.
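The idea of narrowing a search by category can be sketched with Python's standard `xml.etree.ElementTree` module. The small document below is hypothetical: the `CATALOG` and `CAR` tags and the ‘Ocelot’ entry are invented for this example, and only `<ANIMAL>Jaguar</ANIMAL>` comes from the text above.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: CATALOG, CAR, and 'Ocelot' are made up for illustration.
DOC = """<CATALOG>
  <ANIMAL>Jaguar</ANIMAL>
  <CAR>Jaguar</CAR>
  <ANIMAL>Ocelot</ANIMAL>
</CATALOG>"""

root = ET.fromstring(DOC)

# A plain word search finds 'Jaguar' under both categories...
jaguar_tags = [el.tag for el in root.iter() if el.text == "Jaguar"]

# ...while searching by tag returns only the animals.
animals = [el.text for el in root.iter("ANIMAL")]

print(jaguar_tags)  # ['ANIMAL', 'CAR']
print(animals)      # ['Jaguar', 'Ocelot']
```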

 

As in our example, XML currently is being used mainly to help define data elements in database-like applications, thus offering a standard way to exchange data across the Internet. However, there is virtually no limit to how XML can be applied. Web builders can create and apply a set of tags specific to almost any task they wish to accomplish. As we shall see, this includes not only text, but also graphics. For example, XML is now being used as the basis for a standard vector graphics language for the Web (see Vector Graphics).

Proprietary Document Formats

A proprietary document format is one developed and used by a company in a particular application program. The two most common proprietary word processing document formats are those used by Microsoft Word (the *.doc format) and by Corel WordPerfect. Proprietary formatted documents are created and stored as binary (machine coded) files that can only be ‘read’ by the application program that created them. Thus a Word document cannot directly be read (or opened) in WordPerfect, or vice versa.

 

In order to provide for some compatibility between word processors, most software vendors include ‘translation’ utilities that can convert the format of selected ‘external’ files to the needed native proprietary format. MS Word, for example, recognizes and can (automatically) open WordPerfect 5.0+ files. Typically, these conversion utilities are two-way, such that a Word document, for example, also can be saved as a WordPerfect file. Although these built-in conversion utilities generally can retain most of the formatting from the original document, they are not perfect. This means that some formatting may be ‘lost in translation’ and that some reformatting may have to be done manually.

Portable Document Format (PDF)

Although proprietary in nature, the Portable Document Format (PDF) developed by Adobe Systems has become the de facto standard for formatted electronic document distribution worldwide. Adobe PDF is a universal file format that preserves all of the fonts, formatting, colors, and graphics of any source document, regardless of the application and platform used to create it.

 

You can create a PDF document either from scratch or, more commonly, by converting existing documents to PDF format using Adobe Acrobat software. In addition, Acrobat software lets you create bookmarks, cross-document links, Web links, live forms, and security options within a document, as well as add sound and video. Once created, PDF files are compact and can be shared, viewed, navigated, and printed exactly as intended by anyone with a free Adobe Acrobat Reader.

 

PDF files overcome many of the problems commonly encountered in electronic file sharing, as outlined below:

 

Common Problem: Recipients can't open files because they don't have the applications used to create the documents.
PDF Solution: Anyone, anywhere can open a PDF file. All you need is the free Acrobat Reader.

Common Problem: Formatting, fonts, and graphics are lost due to platform, software, and version incompatibilities.
PDF Solution: PDF files always display exactly as created, regardless of fonts, software, and operating systems.

Common Problem: Documents don't print correctly because of software or printer limitations.
PDF Solution: PDF files always print correctly on any printing device.

© Craig L. Scanlan, 2001. Version 2.0 - January 2002. Original version January 2001.
