In the early days of computing, all software and all interfaces were character-based. Character-based applications treat a display screen as an array of boxes (typically 25 rows by 80 columns) each of which can hold one character. In these systems, everything that appears on the screen, including all letters, numbers, spaces and graphics symbols, is considered a character.
In modern graphics-based applications, the term character is generally reserved for letters, numbers, and punctuation. Graphics-based programs treat the display screen as an array of millions of pixels, with characters and other objects formed by illuminating patterns of pixels.
Regardless of display format, in order for text to have meaning to a computer, it must be converted into binary machine code (0s and 1s). In most systems, this is done using a standard coding system. Like a shared language, standard character coding systems help different types of computers and communication equipment interchange data and communicate with each other.
Unfortunately, there is no single standard character coding system. Older character-based DOS programs use the American Standard Code for Information Interchange (ASCII, pronounced ask-key). Microsoft Windows programs use the American National Standards Institute (ANSI) character set. Web browsers use the International Organization for Standardization (ISO) Latin-1 character set (officially named ISO-8859-1). The most ambitious of all is Unicode (ISO-10646), a comprehensive character set designed to cover all the world's languages (living and dead), as well as providing representation of all major scientific symbols. Unicode is used in Windows 98 (SE), NT, 2000 and XP. Luckily, the similarities among these character sets are greater than their differences, at least for most simple text-based communication.
The ASCII system represents all standard number, English letter and symbol input from the keyboard as a seven-bit code (for example, the capital letter A is represented in ASCII as binary 1000001/decimal 65). This provides 128 possible characters (decimal 0 to 127). The first 32 ASCII characters (0 to 31) are control characters, used for printing and transmission control (such as Linefeed - #10 and Carriage Return - #13). The next 95 characters (32 to 126) consist of the standard English alphabet (lower and upper case), the digits, the common punctuation marks, and the blank space character; the last code (127) is the Delete control character. Since the normal data chunk used by computers is the 8-bit byte (allowing 256 possible values) and since ASCII uses only 7 bits, the extra bit can be used either for error checking or to create special symbols (see extended ASCII, below).
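The mapping between characters and seven-bit codes can be checked in a few lines of Python. This is only a minimal sketch; the helper name `ascii_code` is made up for illustration and is not part of any standard.

```python
def ascii_code(ch):
    """Return (decimal code, 7-bit binary string) for one ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return code, format(code, "07b")

print(ascii_code("A"))        # (65, '1000001')
print(ascii_code(" "))        # (32, '0100000')
print(ord("\n"), ord("\r"))   # 10 13 (Linefeed and Carriage Return)
```

Because every ASCII code fits in seven bits, the eighth bit of each byte is left free, which is what makes the extended sets described below possible.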
The following table lists ASCII codes 32 to 127. Codes 32 to 126 are the printable characters (note that #32 is the blank space, which counts as printable); code 127 is the non-printing Delete (DEL) control character:
| Code | Char | Code | Char | Code | Char | Code | Char | Code | Char | Code | Char |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 32 | (space) | 48 | 0 | 64 | @ | 80 | P | 96 | ` | 112 | p |
| 33 | ! | 49 | 1 | 65 | A | 81 | Q | 97 | a | 113 | q |
| 34 | " | 50 | 2 | 66 | B | 82 | R | 98 | b | 114 | r |
| 35 | # | 51 | 3 | 67 | C | 83 | S | 99 | c | 115 | s |
| 36 | $ | 52 | 4 | 68 | D | 84 | T | 100 | d | 116 | t |
| 37 | % | 53 | 5 | 69 | E | 85 | U | 101 | e | 117 | u |
| 38 | & | 54 | 6 | 70 | F | 86 | V | 102 | f | 118 | v |
| 39 | ' | 55 | 7 | 71 | G | 87 | W | 103 | g | 119 | w |
| 40 | ( | 56 | 8 | 72 | H | 88 | X | 104 | h | 120 | x |
| 41 | ) | 57 | 9 | 73 | I | 89 | Y | 105 | i | 121 | y |
| 42 | * | 58 | : | 74 | J | 90 | Z | 106 | j | 122 | z |
| 43 | + | 59 | ; | 75 | K | 91 | [ | 107 | k | 123 | { |
| 44 | , | 60 | < | 76 | L | 92 | \ | 108 | l | 124 | \| (vertical bar) |
| 45 | - | 61 | = | 77 | M | 93 | ] | 109 | m | 125 | } |
| 46 | . | 62 | > | 78 | N | 94 | ^ | 110 | n | 126 | ~ |
| 47 | / | 63 | ? | 79 | O | 95 | _ | 111 | o | 127 | DEL |
Extended ASCII. Extended ASCII (also called high ASCII) is the second half of the ASCII character set, characters 128 through 255. It is non-standard or proprietary ASCII, with the characters or symbols defined as needed by the software or hardware vendor. For example, IBM and Microsoft established a set of symbols in extended ASCII for the DOS operating system that can be used to draw simple graphics on a text-based screen.
The ANSI character set is used by Microsoft Windows 95/98. The ANSI set also includes 256 characters, numbered 0 to 255. Values 0 to 127 are the same as those in the standard ASCII set (above). However, the high ANSI characters (128 to 255) are different and include many foreign letters, special punctuation, and business symbols.
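To see how the same high code can mean different things in the DOS and Windows sets, here is a rough sketch in Python. It assumes that Python's `cp437` codec stands in for the DOS extended ASCII set and `cp1252` for the Windows ANSI set; the helper name `decode_byte` is made up for illustration.

```python
def decode_byte(value, encoding):
    """Decode a single byte value (0-255) using the named character set."""
    return bytes([value]).decode(encoding)

# Assumption: "cp437" approximates DOS extended ASCII,
# and "cp1252" approximates the Windows ANSI set.
for value in (65, 201):
    dos = decode_byte(value, "cp437")
    win = decode_byte(value, "cp1252")
    print(value, repr(dos), repr(win))
```

Code 65 comes out as the letter A in both sets, while code 201 is a box-drawing corner in the DOS set and an accented capital E in the Windows set.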
In Windows, you can enter any ANSI character by holding down the Alt key and typing the ANSI decimal code on the numeric keypad. For example, to type the copyright symbol (©) in Windows, you would type Alt+0169 (169 being the decimal numeric code for the copyright symbol). Alternatively, you can select ANSI characters for insertion (and view their code numbers) by accessing the Windows Character Map program. To do so, click:
Start -> Programs -> Accessories -> System Tools -> Character Map
Officially named ISO-8859-1, ISO Latin-1 is a superset of the ASCII character set and is very similar (but not identical) to the ANSI character set used in Windows. Both HTTP (the Web's transfer protocol) and HTML (its markup language) are based on ISO Latin-1. Thus, when a Web browser, such as Netscape, formats a Web page on a client system, such as Windows, it maps the ISO Latin-1 characters as best it can into the native character set.
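One way to find out where the two sets part company is to decode every byte value under both and list the codes that disagree. The sketch below assumes Python's `cp1252` codec represents the Windows ANSI set and `latin-1` represents ISO Latin-1; `differing_codes` is a made-up helper name.

```python
def differing_codes():
    """List byte values that decode differently in ANSI and ISO Latin-1."""
    diffs = []
    for code in range(256):
        raw = bytes([code])
        latin1 = raw.decode("latin-1")
        try:
            ansi = raw.decode("cp1252")
        except UnicodeDecodeError:
            ansi = None  # a few ANSI codes are left undefined by this codec
        if ansi != latin1:
            diffs.append(code)
    return diffs

diffs = differing_codes()
print(diffs[0], diffs[-1], len(diffs))   # 128 159 32
```

Under these assumptions, all of the differences fall in codes 128 to 159; codes 160 to 255 (the accented letters and most symbols) are the same in both sets.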
To represent non-ASCII characters on a Web page, you need to use the corresponding ISO Latin-1 code. To do so, you use a standard HTML format of an ampersand, followed by the pound sign, then the ISO decimal code, followed by a semicolon. For example, if you wanted to show a plus or minus symbol (±) on a web page, you would insert &#177; where you wanted that symbol to appear. Likewise, "a" with grave accent is decimal 224 in ISO Latin-1. Therefore, in HTML "à" is coded as &#224;
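The ampersand, pound sign, decimal code, semicolon pattern is easy to generate and to reverse. Here is a small Python sketch; `to_numeric_reference` is a made-up helper name, while `html.unescape` comes from Python's standard library.

```python
import html

def to_numeric_reference(ch):
    """Build the ampersand-pound-code-semicolon form for one character."""
    return f"&#{ord(ch)};"

print(to_numeric_reference("\u00b1"))   # plus or minus sign -> &#177;
print(to_numeric_reference("\u00e0"))   # a with grave accent -> &#224;

# Going the other way: turn a numeric code back into the character.
print(html.unescape("&#169; 2001"))     # © 2001
```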
Below is a table of some of the more commonly used non-ASCII characters (left column) and the HTML code needed to display them on a Web page:
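| Character | HTML code |
| --- | --- |
| ¢ | &#162; |
| £ | &#163; |
| © | &#169; |
| « | &#171; |
| ® | &#174; |
| ° | &#176; |
| ± | &#177; |
| ² | &#178; |
| ³ | &#179; |
| µ | &#181; |
| ¹ | &#185; |
| » | &#187; |
| ¼ | &#188; |
| ½ | &#189; |
| ¾ | &#190; |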
For a complete list of all ISO Latin-1 special characters and their number codes, visit the following site:
http://hotwired.lycos.com/webmonkey/reference/special_characters/
Unicode (ISO-10646) was developed in the early 1990s by a consortium of companies including Apple, Microsoft, HP, Digital and IBM to be the single international character set standard. Unicode was originally designed as a 16-bit character set, and its 1993 edition coded over 34,000 distinct characters; later versions extend beyond 16 bits to cover many more. Luckily, the first 256 Unicode values are identical to the ISO Latin-1 characters (above).
Unicode covers all the world's major living languages, in addition to scientific symbols and the so-called dead languages (of scholarly interest). One of its major advantages is the elimination of the complex multibyte character sets previously used to support Asian languages. Unicode also supports "combining" accent characters, which follow the base character that they are to modify.
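Both the overlap with ISO Latin-1 and the "combining" accents can be tried out in Python, whose strings are sequences of Unicode code points. This is a minimal sketch using the standard `unicodedata` module; the variable names are made up for illustration.

```python
import unicodedata

# The first 256 Unicode code points match ISO Latin-1 value for value.
assert all(bytes([i]).decode("latin-1") == chr(i) for i in range(256))

# A base letter followed by a combining grave accent (U+0300)...
decomposed = "a\u0300"
# ...can be normalized into the single precomposed letter U+00E0.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))   # 2 1
print(unicodedata.name(composed))       # LATIN SMALL LETTER A WITH GRAVE
```

Note that the combining accent follows the base letter, just as described above.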
If you are using Windows 98 (Second Edition), Windows NT, Windows 2000, Windows XP or higher, you are using Unicode. Even if using the English language version of Windows, you can still insert characters from other languages (including Greek, Cyrillic, Hebrew, Arabic) into your documents using Word. To do so, open a blank document, click on Insert from the main menu, then Symbol, choose a standard font (e.g., Arial) from the left-hand drop-down box, then choose a Subset from the right-hand drop-down box.
A font represents a design for a set of characters, i.e., a way to display the characters on an output device (screen or printer). A font combines a particular style of typeface with other qualities, such as size, pitch, and spacing. For example, Times New Roman is a typeface that defines the shape of each character. Within Times New Roman, however, there are many fonts to choose from -- different sizes, italic, bold, and so on. Three common font families are Times New Roman, Arial, and Courier.
When all software and all equipment (screens, printers) were character-based, there was little choice in fonts. Essentially all letters, characters and words appeared in a plain monospaced or fixed-pitch (typewriter-like) font such as Courier. Nowadays, graphics-oriented operating systems and application programs give us a nearly unlimited choice of fonts. Most graphics-based operating systems come with a wide selection of basic fonts (more than most people will ever use). Additional fonts are often installed when setting up new printers. Specialized fonts can be downloaded or purchased as needed.
Computers and their peripheral devices use two methods to represent fonts. In a bit-mapped font, every character is represented by an array of dots. To print a bit-mapped character, a printer simply prints the corresponding dots.
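The "array of dots" idea can be sketched in Python. The 8-by-8 pattern below is a made-up glyph for the letter A, not taken from any real font, and `render` is a hypothetical helper that prints a mark wherever a dot is set.

```python
# Hypothetical 8x8 bitmap for the letter "A": one integer per row,
# one bit per dot (1 = dot printed, 0 = blank).
GLYPH_A = [
    0b00011000,
    0b00100100,
    0b01000010,
    0b01000010,
    0b01111110,
    0b01000010,
    0b01000010,
    0b00000000,
]

def render(glyph, width=8, on="#", off="."):
    """Turn each row's bits into a line of on/off marks."""
    lines = []
    for row in glyph:
        bits = format(row, f"0{width}b")
        lines.append("".join(on if b == "1" else off for b in bits))
    return lines

print("\n".join(render(GLYPH_A)))
```

Since the dots are fixed, a bit-mapped font of this kind needs a separate set of patterns for every size.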
The other method used to represent fonts is vector graphics. With vector-based fonts, the shape or outline of each character is defined geometrically. With the font defined by formula, the typeface can be displayed in virtually any size. Because they can be resized (or scaled), vector fonts also are called scalable fonts or outline fonts. Unlike bit-mapped fonts, scalable fonts display and print best at high resolution. The most widely used scalable-font systems are TrueType (Windows) and PostScript (Adobe).
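As a rough illustration of why an outline can be displayed at any size, the Python sketch below stores a shape once as corner points and multiplies them out for each size. The outline for a capital L, the 1000-unit design grid, and the helper name `scale_outline` are all assumptions made for this example; real font formats also describe curves and other details not shown here.

```python
# Assumed design grid: the outline is drawn on a 1000 x 1000 unit square.
UNITS_PER_EM = 1000

# Hypothetical outline for a capital "L", as corner points in design units.
OUTLINE_L = [(0, 0), (600, 0), (600, 200), (200, 200), (200, 1000), (0, 1000)]

def scale_outline(outline, size):
    """Scale design-unit points so the full grid is `size` units tall."""
    return [(x * size / UNITS_PER_EM, y * size / UNITS_PER_EM)
            for x, y in outline]

print(scale_outline(OUTLINE_L, 10))    # the shape at size 10
print(scale_outline(OUTLINE_L, 100))   # the same shape at size 100
```

The same six points serve every size; only the multiplication changes.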
Despite the advantages of vector fonts, bit-mapped fonts are still widely used. One reason for this is that small vector fonts do not look very good on low-resolution devices, such as on some display monitors (which are low-resolution when compared with laser printers). Many computer systems, therefore, use bit-mapped fonts for screen displays. These are sometimes called screen fonts.
Traditionally, a document was defined as a file created by a text or word processor, consisting of formatted words (characters and spaces combined in lines, paragraphs, etc). Although modern documents produced by word processors can include many other elements (such as graphics and sounds), the basic definition still applies. Within this framework, there are three basic types of document formats: plain text, HTML and proprietary.
A plain text file is the most basic form of document, consisting solely of the basic 256 characters provided in the character set used to create it (see above). Although often called ASCII files (based on the original ASCII character set), a plain text file may actually be created in one of two basic formats: ASCII (also called MS-DOS text) or ANSI (called plain text, text only or just text). Since both formats use the same basic characters (codes 0 to 127), as long as no special characters are used (above 127), these two formats are equivalent.
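The equivalence of the two formats for codes 0 to 127 can be checked by encoding the same text both ways and comparing the bytes. This Python sketch assumes the `cp437` codec stands in for MS-DOS text and `cp1252` for ANSI text; `same_bytes_in_both` is a made-up helper name.

```python
def same_bytes_in_both(text):
    """True if `text` is stored as identical bytes in both text formats."""
    return text.encode("cp437") == text.encode("cp1252")

print(same_bytes_in_both("Hello World!"))   # True
print(same_bytes_in_both("caf\u00e9"))      # False: the accented e is above 127
```

In the second case, the accented letter is stored under a different code in each format, so a file saved in one format shows the wrong character when read as the other.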
There are word processor programs designed specifically to create plain text files. These programs are typically called text editors, and are used extensively by programmers who must use plain text files to write computer instructions. If you are using Windows, you have a small text editor program already installed on your system, called Notepad. You can use Notepad to create or edit plain text files smaller than 64K.
Alternatively, all high-end word processors (Word, WordPerfect) allow creation of text files via the Save As feature. For example, any MS Word 2000 file can be saved as a text file by choosing Save As and selecting a text format as the file type.
Unfortunately, MS Word makes matters needlessly complex by giving you six different text format choices (Text only, Text only with line breaks, MS-DOS text only, MS-DOS text only with line breaks, Text with layout, MS-DOS text with layout). KISS it! (Keep It Simple Stupid!) Always select Text only. This saves your document as an ANSI text file without formatting. All section breaks, page breaks, and new line characters are converted to paragraph marks.
HTML is short for HyperText Markup Language, the standard authoring language used to create documents on the World Wide Web. An HTML document is actually just a simple ISO Latin-1 text file. HTML defines the structure and layout of a Web document by using a variety of tags and attributes. Below is an example of the HTML needed to create a simple Web page that displays the text Hello World:
<HTML>
<HEAD> <TITLE>Basic Web Page</TITLE> </HEAD>
<BODY> <P>Hello World!</P> </BODY>
</HTML>
Essentially all HTML tags have two pieces: a start tag and an end tag. Start tags are enclosed in angle brackets (e.g., <HTML>), while end tags add a slash to indicate that the tag is complete (e.g., </HTML>).
All HTML documents start and end with the HTML start and end tags:
<HTML> </HTML>
Within the HTML start and end tags goes the <BODY> </BODY> of your document, i.e., all the information you'd like to include in your Web page (text, graphics, etc). In our example above, the HTML body consists solely of a single paragraph of text (Hello World), set off with the paragraph tag set <P> </P>:
<P>Hello World!</P>
Although not required, well-designed HTML documents also include a header, marked off using a <HEAD> </HEAD> tag set. The HTML header can include critical information about the document (its title, author, keywords, etc) that does NOT display in the browser's document window.
There are hundreds of other tags used to format and lay out the information in a Web page. For instance, <I> </I> is used to italicize text. Tags also are used to specify the source for any graphics displayed in the page. As a true text file, an HTML document does not actually include any graphics directly. Instead, the file includes tags that specify where the graphic resides on the Web and where and how it should be displayed on the HTML page. Tags also are used to create hypertext links (using either words or images). These links are what allow users to jump to other Web pages with a simple mouse click.
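To see how a program picks out start tags, end tags, and the text between them, here is a minimal Python sketch that reads the Hello World page shown earlier. It uses the standard `html.parser` module; the class name `TagLister` and its attributes are made up for illustration, and a real browser does far more than this.

```python
from html.parser import HTMLParser

PAGE = """<HTML>
<HEAD> <TITLE>Basic Web Page</TITLE> </HEAD>
<BODY> <P>Hello World!</P> </BODY>
</HTML>"""

class TagLister(HTMLParser):
    """Record start tags, end tags, and the text found inside <P>."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.end_tags = []
        self.paragraph_text = []
        self._in_paragraph = False

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)   # tag names arrive in lower case
        if tag == "p":
            self._in_paragraph = True

    def handle_endtag(self, tag):
        self.end_tags.append(tag)
        if tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self.paragraph_text.append(data)

lister = TagLister()
lister.feed(PAGE)
lister.close()
print(lister.start_tags)   # ['html', 'head', 'title', 'body', 'p']
print(lister.end_tags)     # ['title', 'head', 'p', 'body', 'html']
print("".join(lister.paragraph_text))   # Hello World!
```

Notice that the end tags close in the reverse order of the start tags, which is how the tag sets nest inside one another.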
For a summary cheatsheet of HTML tags, check out the following link:
http://hotwired.lycos.com/webmonkey/reference/html_cheatsheet/
XML is short for eXtensible Markup Language. XML is a metalanguage that lets Web builders define their own markup languages. Rather than specifying the tags themselves, XML specifies only the tag syntax.
XML is meant to augment HTML. While HTML describes how to display a document's data, XML actually defines the data's content. For example, an HTML tag such as <I> specifies a certain font characteristic (italics). XML, on the other hand, describes the content that appears within the tags. For example, the following XML tag set defines a category of data called ANIMAL:
<ANIMAL> </ANIMAL>
Within that category, we could place a data element called Jaguar:
<ANIMAL>Jaguar</ANIMAL>
By coding the data element Jaguar into category ANIMAL, we give it special meaning. For instance, searching for Jaguar on the Web without XML would turn up thousands of documents about both the cat and the British sports car of the same name. However, using XML, you could easily narrow the search to pages that include information about the animal only.
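The narrowing described above can be sketched in Python with the standard `xml.etree.ElementTree` module. The small catalog below, including the CATALOG, ENTRY and CAR tags, is invented for this example, as is the helper name `find_in_category`; it is not how any particular Web search service works.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: CATALOG, ENTRY and CAR are made-up tags.
DOC = """<CATALOG>
  <ENTRY><ANIMAL>Jaguar</ANIMAL></ENTRY>
  <ENTRY><CAR>Jaguar</CAR></ENTRY>
  <ENTRY><ANIMAL>Ocelot</ANIMAL></ENTRY>
</CATALOG>"""

def find_in_category(xml_text, category, term):
    """Return elements tagged `category` whose text equals `term`."""
    root = ET.fromstring(xml_text)
    return [el for el in root.iter(category) if el.text == term]

matches = find_in_category(DOC, "ANIMAL", "Jaguar")
print(len(matches))   # 1
```

A plain text search for Jaguar would hit two entries here; asking for Jaguar within ANIMAL returns only the cat.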
As in our example, XML currently is being used mainly to help define data elements in database-like applications, thus offering a standard way to exchange data across the Internet. However, there is virtually no limit to how XML can be applied. Web builders can create and apply a set of tags specific to almost any task they wish to accomplish. As we shall see, this includes not only text, but also graphics. For example, XML is now being used as the basis for a standard vector graphics language for the Web (see Vector Graphics).
A proprietary document format is one developed and used by a company in a particular application program. The two most common proprietary word processing document formats are those used by Microsoft Word (the *.doc format) and by Corel WordPerfect. Proprietary formatted documents are created and stored as binary (machine coded) files that can only be read by the application program that created them. Thus a Word document cannot directly be read (or opened) in WordPerfect, or vice versa.
In order to provide for some compatibility between word processors, most software vendors include translation utilities that can convert the format of selected external files to the needed native proprietary format. MS Word, for example, recognizes and can (automatically) open WordPerfect 5.0+ files. Typically, these conversion utilities are two-way, such that a Word document, for example, also can be saved as a WordPerfect file. Although these built-in conversion utilities generally can retain most of the formatting from the original document, they are not perfect. This means that some formatting may be lost in translation and that some reformatting may have to be done manually.
Although proprietary in nature, the Portable Document Format (PDF) developed by Adobe Systems has become the de facto standard for formatted electronic document distribution worldwide. Adobe PDF is a universal file format that preserves all of the fonts, formatting, colors, and graphics of any source document, regardless of the application and platform used to create it.
You can create PDF documents either from scratch, or, more commonly, by converting existing documents to PDF format using Adobe Acrobat software. In addition, Acrobat software lets you create bookmarks, cross-document links, Web links, live forms, and security options within a document, as well as add sound and video. Once created, PDF files are compact and can be shared, viewed, navigated, and printed exactly as intended by anyone with a free Adobe Acrobat Reader.
PDF files overcome many of the problems commonly encountered in electronic file sharing, as outlined below:
| Common Problem | PDF Solution |
| --- | --- |
| Recipients can't open files because they don't have the applications used to create the documents. | Anyone, anywhere can open a PDF file. All you need is the free Acrobat Reader. |
| Formatting, fonts, and graphics are lost due to platform, software, and version incompatibilities. | PDF files always display exactly as created, regardless of fonts, software, and operating systems. |
| Documents don't print correctly because of software or printer limitations. | PDF files always print correctly on any printing device. |
© Craig L. Scanlan, 2001. Version 2.0 - January 2002. Original version January 2001.