HTML Character Sets and Encoding
HTML Character Sets - Part 1
Foreword: In this part of my series, I explain the basics of character sets and character encoding.
By: Chrysanthus Date Published: 31 Jul 2012
Note: If you cannot see the code or if you think anything is missing (broken link, image absent, etc.), just contact me at firstname.lastname@example.org. That is, contact me for the slightest problem you have about what you are reading.
A Character Set
A, B, C, D are four characters; that is a set of characters; that is a character set. a, b, c, d are four lowercase characters; that is a character set different from the previous one. A, B, 1, 2, Z are a set of five characters; that is a character set. A character set is just a set of characters.
Practical character sets are much larger than 4 or 5 characters. In the past, each (developed) country had its own character set. If the keyboard you are using for your computer was manufactured for use in your country, then the characters on the keyboard form the character set your country had in the past. Today, the actual character set used in your country has more characters than are displayed on the keyboard. The extra characters are either new characters in the world, like the euro sign, or characters from other countries.
The character sets we shall look at in this series are the ASCII character set, the ISO character set and the Unicode character set.
Any character set is normally presented as a long list. Each character has an associated integer. The list is arranged so that the integers are in ascending order. These integers are referred to as the Code Positions (also called code points) of the character set.
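To make the idea of code positions concrete, here is a minimal sketch in Python. The function name code_positions is my own, chosen for illustration. It relies on Python's built-in ord(), which returns the Unicode code position of a character; for the letters and digits used here, the values are the same as in ASCII.

```python
# List a few characters next to their code positions.
def code_positions(chars):
    """Return (character, code position) pairs in ascending order of position."""
    return sorted(((c, ord(c)) for c in chars), key=lambda pair: pair[1])

table = code_positions("BaA1Z")
for char, position in table:
    print(f"{position:>3}  {char}")
```

The built-in chr() goes the other way, from a code position back to its character.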
An electronic document (HTML), computer program or other software is typed in a character set. What is typed can be saved as a file on the hard disk or sent through a network (the Internet). The work is not saved or sent exactly as typed. Generally, the characters typed have to be coded as bytes before they can be saved or sent. This is called character encoding. Again, there are different types of encoding schemes.
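As a rough illustration of the step from typed characters to stored bytes, the following Python sketch encodes a two-character text and decodes it again. UTF-8 is used here only as an example scheme; it is my choice for the sketch, not something your system necessarily uses.

```python
text = "A\u20ac"                    # the letter A followed by the euro sign
encoded = text.encode("utf-8")      # characters -> bytes, ready to save or send
decoded = encoded.decode("utf-8")   # bytes -> characters, when read back

print([ord(c) for c in text])       # code positions of the two characters
print(encoded.hex(" "))             # the bytes that would actually be stored
print(decoded == text)              # the round trip preserves the text
```

Note that the two characters become four bytes: the code position and the stored bytes are not the same thing.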
I will not go into the details of encoding in this series. Some names of encoding schemes are ANSI, Unicode, UTF-8 and UTF-16. (Strictly speaking, Unicode is a character set; some software, notably on Windows, uses the name loosely for the UTF-16 encoding.) The ordinary programmer does not need to go into the details of encoding. You can find out from the operating system documentation or from the network engineers which encoding they are using, which they want you to use, or which is best for you to use. In the case where the character set and character encoding have the same name, you do not have a problem.
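The difference between schemes can be seen by encoding one character in several of them. This is only a sketch: the list of schemes is my own selection, and cp1252 (Windows-1252) stands in for "ANSI" on the assumption of a Western European Windows setup.

```python
EURO = "\u20ac"  # the euro sign, code position 8364 in Unicode

# "utf-16-le" is used instead of "utf-16" so that no byte-order mark is added.
schemes = ["utf-8", "utf-16-le", "cp1252"]
encoded = {name: EURO.encode(name) for name in schemes}

for name, data in encoded.items():
    print(f"{name:<10} {len(data)} byte(s): {data.hex(' ')}")
```

The same character yields three different byte sequences, which is why the reader of a file must know which scheme the writer used.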
Note: The naming of character sets is in some cases not clear. In particular, a character set and its encoding may have the same name or slightly different names.
Note: ASCII and ANSI are two different things: ASCII is a character set, while ANSI, as the name is used on Windows, refers to a coding scheme (a Windows code page such as Windows-1252).
Some HTML elements have the charset attribute. The value of this attribute names a character encoding. The following tag shows how you can declare an encoding for the whole page:
<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">
Here the encoding scheme is EUC-JP.
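To show what a program might do with such a declaration, here is a hedged Python sketch. It is not how any particular browser works. The helper name charset_from_content_type and the sample Japanese text are my own, for illustration; the parsing is borrowed from the standard library's email.message.Message class, which understands values of this "type; parameter=value" form.

```python
from email.message import Message

def charset_from_content_type(value):
    """Extract the charset parameter from a Content-Type value (lowercased)."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()

charset = charset_from_content_type("text/html; charset=EUC-JP")

raw = "日本語".encode("euc_jp")   # bytes as they might sit in the saved page
text = raw.decode(charset)        # decode using the declared encoding

print(charset, len(raw), text)
```

If the declared encoding did not match the bytes actually stored, the decode step would either fail or produce the wrong characters.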
Now that you have an overview of character sets and encoding, let us take a break and continue in the next part.