Have you ever wondered how the internet displays text from every language, whether it’s a blog post in English, an Instagram post in Japanese, or a product description in Arabic — all on the same page? It’s something most of us (including me) take for granted, but at the heart of this seamless communication lies a silent hero: UTF-8 encoding.
UTF-8 is so integral to the internet that it’s part of the foundation. It’s the default encoding of HTML5, the standard encoding for JSON, and the unsung enabler of everything from URLs to viral social media posts to marketing copy. Without it, the web wouldn’t be the global, interconnected space we know today.
Before I begin, I recommend familiarizing yourself with the basics of HTML and being ready to explore some light computer science concepts. Let’s unravel the mystery of UTF-8 together.
What is UTF-8?
UTF-8 is an encoding system for Unicode. UTF-8 stands for “Unicode Transformation Format - 8 bits.” It can translate any Unicode character to a matching unique binary string, and can also translate the binary string back to a Unicode character.
To understand everything about UTF-8, I’ll walk you through the basics first.
How Computers Store Information
In order to store information, computers use a binary system. In binary, all data is represented in sequences of 1s and 0s. The most basic unit of binary is a bit, which is just a single 1 or 0. The next largest unit of binary, a byte, consists of 8 bits. An example of a byte is “01101011.”
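As a quick illustration, here is a minimal Python sketch (Python is simply the language I picked for illustration) that reads the example byte above as a number and converts it back to bits:

```python
# The example byte from the text, written as a string of eight bits.
byte_bits = "01101011"

# Interpret the bits as a base-2 number.
value = int(byte_bits, 2)
print(value)  # 107

# Convert the number back into an 8-digit binary string.
round_trip = format(value, "08b")
print(round_trip)  # 01101011
```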
Every digital asset you’ve ever encountered — from software to mobile apps to websites to Instagram stories — is built on this system of bytes, which are strung together in a way that makes sense to computers.
When we refer to file sizes, we’re referencing the number of bytes. For example, a kilobyte is roughly one thousand bytes, and a gigabyte is roughly one billion bytes.
Text is one of many assets that computers store and process. Text is made up of individual characters, each of which is represented in computers by a string of bits. These strings are assembled to form digital words, sentences, paragraphs, romance novels, and so on.
ASCII: Converting Symbols to Binary
The American Standard Code for Information Interchange (ASCII) was an early standardized encoding system for text. Encoding is the process of converting characters in human languages into binary sequences that computers can process.
ASCII’s library includes every upper-case and lower-case letter in the Latin alphabet (A, B, C…), every digit from 0 to 9, and some common symbols (like /, !, and ?). It assigns each of these characters a unique numeric code from 0 to 127 (written with three digits in the table below) and a unique byte.
ASCII Character Table
The table below shows examples of ASCII characters with their associated codes and bytes.
CHARACTER | ASCII CODE | BYTE
A | 065 | 01000001
a | 097 | 01100001
B | 066 | 01000010
b | 098 | 01100010
Z | 090 | 01011010
z | 122 | 01111010
0 | 048 | 00110000
9 | 057 | 00111001
! | 033 | 00100001
? | 063 | 00111111
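If you’d like to reproduce rows like these yourself, the following is a small sketch in Python. The helper name `ascii_row` is my own invention for this example; `ord` and the format codes are standard Python.

```python
# Characters taken from the table above.
characters = ["A", "a", "B", "b", "Z", "z", "0", "9", "!", "?"]


def ascii_row(ch: str) -> str:
    """Return 'character | 3-digit code | 8-bit byte' for one ASCII character."""
    code = ord(ch)  # numeric code of the character
    return f"{ch} | {code:03d} | {code:08b}"


for ch in characters:
    print(ascii_row(ch))
```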
Just as characters combine to form words and sentences in language, their bytes are strung together in text files. So, the sentence “The quick brown fox jumps over the lazy dog.” represented in ASCII binary would be:
01010100 01101000 01100101 00100000 01110001 01110101 01101001 01100011 01101011 00100000 01100010 01110010 01101111 01110111 01101110 00100000 01100110 01101111 01111000 00100000 01101010 01110101 01101101 01110000 01110011 00100000 01101111 01110110 01100101 01110010 00100000 01110100 01101000 01100101 00100000 01101100 01100001 01111010 01111001 00100000 01100100 01101111 01100111 00101110
That doesn’t mean much to us humans, but it’s a computer’s bread and butter.
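To see that the binary really does carry the sentence, here is a hedged Python sketch that encodes the text as ASCII, prints each byte as eight bits, and then decodes the bits back. The variable names are illustrative only.

```python
sentence = "The quick brown fox jumps over the lazy dog."

# Encode the text as ASCII: one byte per character.
ascii_bytes = sentence.encode("ascii")

# Show every byte as a group of eight bits, separated by spaces.
binary = " ".join(format(b, "08b") for b in ascii_bytes)
print(binary)

# Reverse the process: parse each 8-bit group and decode the bytes as ASCII.
decoded = bytes(int(chunk, 2) for chunk in binary.split()).decode("ascii")
print(decoded)
```

The round trip works because each 8-bit group maps to exactly one ASCII character.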
How many ways can a character be represented in ASCII?
The number of characters that ASCII can represent is limited by the number of bits it uses for each character.
Let’s do the math: there are 256 different ways of arranging eight 1s and 0s, so a byte can take 256 different values. ASCII itself uses only seven of those bits, which gives it 128 codes, each stored in a byte with a leading 0.
When ASCII was first published in 1963, this was okay, since developers needed only 128 codes to represent all the English characters and symbols they needed.
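The counts above are just powers of two. A minimal Python sketch that checks them, including a brute-force listing of every possible byte:

```python
from itertools import product

# Each bit has 2 possible values, so n bits give 2 ** n combinations.
values_in_a_byte = 2 ** 8      # 256
values_in_seven_bits = 2 ** 7  # 128

# Brute force: build every string of eight 1s and 0s and count them.
all_bytes = ["".join(bits) for bits in product("01", repeat=8)]

print(values_in_a_byte, values_in_seven_bits, len(all_bytes))
```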
But, as computing expanded globally, computer systems began to store text in languages besides English, many of which used non-ASCII characters.
New systems were created to map other languages to the same set of 256 unique bytes, but having multiple encoding systems was inefficient and confusing. Developers needed a better way to encode all possible characters with one system.
Unicode: A Way to Store Every Symbol, Ever
Enter, Unicode! Unicode is a character standard that solves the space issue of ASCII. Like ASCII, Unicode assigns a unique code, called a code point, to each character.
However, Unicode’s more sophisticated system can produce over a million code points, more than enough to account for every character in any language.
Unicode is now the universal standard for encoding all human languages. And yes, it even includes emojis.
Unicode Character Table
Now, I’ll give you some examples of text characters and their matching code points. Each code point begins with “U+” (the “U” stands for “Unicode”), followed by a hexadecimal number that identifies the character.
CHARACTER | CODE POINT
A | U+0041
a | U+0061
0 | U+0030
9 | U+0039
! | U+0021
Ø | U+00D8
ڃ | U+0683
ಚ | U+0C9A
𠜎 | U+2070E
😁 | U+1F601
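Code points like these can be looked up programmatically. Below is a small Python sketch; the helper name `code_point` is made up for this example, and the characters are written as escape sequences so the listing does not depend on your editor’s font.

```python
def code_point(ch: str) -> str:
    """Format a character's Unicode code point in the usual U+XXXX style."""
    return f"U+{ord(ch):04X}"


# The characters from the table above, in the same order.
samples = [
    "A", "a", "0", "9", "!",
    "\u00d8",      # Latin capital letter O with stroke
    "\u0683",      # an Arabic-script letter
    "\u0c9a",      # a Kannada letter
    "\U0002070e",  # a CJK ideograph outside the first 65,536 code points
    "\U0001f601",  # grinning face emoji
]

for ch in samples:
    print(ch, code_point(ch))
```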
If you want to learn how code points are generated and what they mean in Unicode, check out this in-depth explanation.
So, now with Unicode I have a standardized way of representing every character used by every human language in a single library. This solves the issue of multiple labeling systems for different languages — any computer on Earth can use Unicode.
But Unicode alone doesn’t store words in binary. Computers need a way to translate Unicode into binary so that its characters can be stored in text files.
Here’s where UTF-8 comes in.
UTF-8: The Character Set in Web Development
UTF-8 is the most common character encoding method used on the internet today, and is the default character set for HTML5. Over 98% of all websites — likely including your own — store characters this way.
Additionally, common data formats used to transfer information over the web, like XML and JSON, use UTF-8 as their default encoding.
Since it’s now the standard method for encoding text on the web, all your site pages and databases should use UTF-8.
Pro tip: A content management system or website builder will save your files in UTF-8 format by default, but it’s still worth verifying that you’re following this best practice — especially if you’re in the process of redesigning your website. Redesign projects offer a great opportunity to audit your site’s encoding settings and ensure they align with modern web standards.
How do you indicate UTF-8 in HTML?
Text files encoded with UTF-8 must indicate this to the software processing them. Otherwise, the software won’t properly translate the binary back into characters. In HTML files, you might see a string of code like the following near the top:
<meta charset="UTF-8">
This tells the browser that the HTML file is encoded by UTF-8, so that the browser can translate it back to legible text.
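To see why the declaration matters, here is a hedged Python sketch of what happens when software guesses the wrong encoding. The word “café” and the choice of Latin-1 as the wrong guess are my own assumptions for illustration.

```python
text = "caf\u00e9"  # "café"

# Save the text as UTF-8 bytes; the accented letter becomes two bytes.
utf8_bytes = text.encode("utf-8")

# Reading the bytes with the correct encoding recovers the text.
right = utf8_bytes.decode("utf-8")

# Reading the same bytes as Latin-1 turns each byte into its own character.
wrong = utf8_bytes.decode("latin-1")

print(right)  # café
print(wrong)  # cafÃ©
```

Garbled output like “Ã©” is a common sign that UTF-8 text was read with a different encoding.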
UTF-8 Character Table
Below is the same character table from above, with the UTF-8 character set output added for each. Notice how some characters are represented as just one byte, while others use more.
CHARACTER | CODE POINT | UTF-8 BINARY ENCODING
A | U+0041 | 01000001
a | U+0061 | 01100001
0 | U+0030 | 00110000
9 | U+0039 | 00111001
! | U+0021 | 00100001
Ø | U+00D8 | 11000011 10011000
ڃ | U+0683 | 11011010 10000011
ಚ | U+0C9A | 11100000 10110010 10011010
𠜎 | U+2070E | 11110000 10100000 10011100 10001110
😁 | U+1F601 | 11110000 10011111 10011000 10000001
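The binary column can be checked with a few lines of Python. This is a sketch, and the helper name `utf8_bits` is hypothetical; it relies only on the built-in `str.encode`.

```python
def utf8_bits(ch: str) -> str:
    """Return the UTF-8 bytes of a character as space-separated 8-bit groups."""
    return " ".join(f"{byte:08b}" for byte in ch.encode("utf-8"))


# One example for each byte length, written as escapes for portability.
for ch in ["A", "\u00d8", "\u0c9a", "\U0001f601"]:
    bits = utf8_bits(ch)
    print(f"U+{ord(ch):04X}", bits, f"({len(bits.split())} byte(s))")
```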
Understanding UTF-8 Character Conversion to Bytes
I have demonstrated in the table above how some characters take one byte, whereas others take more. But why would UTF-8 convert some characters to one byte, and others up to four bytes?
To save memory.
By using less space to represent more common characters (i.e., ASCII characters), UTF-8 reduces file size while allowing for a much larger number of less common characters. These less common characters are encoded into two or more bytes, but this is okay if they’re stored sparingly.
Space efficiency is a key advantage of UTF-8 encoding. If, instead, every Unicode character were represented by four bytes, a text file written in English would be four times the size of the same file encoded with UTF-8.
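The four-times figure is easy to check. In this Python sketch I use UTF-32 as the stand-in for a fixed four-bytes-per-character encoding, which is my assumption about what such a scheme would look like; the sample sentence is also just an example.

```python
text = "The quick brown fox jumps over the lazy dog."

# UTF-8: every character here is ASCII, so one byte each.
utf8_size = len(text.encode("utf-8"))

# UTF-32 (big-endian variant, which adds no byte-order mark): four bytes each.
fixed_size = len(text.encode("utf-32-be"))

print(utf8_size)   # 44
print(fixed_size)  # 176
```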
Are there other encoding systems besides UTF-8?
There are other encoding systems for Unicode besides UTF-8, but UTF-8 is unique because it represents characters in one-byte units. Remember that one byte consists of eight bits, hence the “-8” in its name.
More specifically, UTF-8 converts a code point (which represents a single character in Unicode) into a set of one to four bytes. The first 128 characters in the Unicode library — the characters I talked about while explaining ASCII above — are represented as one byte. Characters that appear later in the Unicode library are encoded as two-byte, three-byte, and eventually four-byte binary units.
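To make the one-to-four-byte rule concrete, here is a simplified Python sketch of the conversion. It is written for illustration, not as a replacement for the built-in encoder, and the function name `encode_utf8` is my own.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if cp <= 0x7F:
        # 1 byte: 0xxxxxxx (the ASCII range)
        return bytes([cp])
    if cp <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    if cp <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0b111111),
                  0b10000000 | ((cp >> 6) & 0b111111),
                  0b10000000 | (cp & 0b111111)])


# Compare the sketch with Python's built-in encoder for a few code points.
for cp in (0x41, 0xD8, 0x683, 0xC9A, 0x2070E, 0x1F601):
    print(f"U+{cp:04X}", encode_utf8(cp) == chr(cp).encode("utf-8"))
```

The leading bits of the first byte tell a decoder how many bytes belong to the character, and every following byte starts with 10.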
Difference Between UTF-8 and UTF-16
As I mentioned, UTF-8 is not the only encoding method for Unicode characters — there’s also UTF-16. These methods differ in the number of bytes they need to store a character:
- UTF-8 encodes a character into a binary string of one, two, three, or four bytes.
- UTF-16 encodes a Unicode character into a string of either two or four bytes.
- In UTF-8, the smallest binary representation of a character is one byte, or eight bits.
- In UTF-16, the smallest binary representation of a character is two bytes, or sixteen bits.
Both UTF-8 and UTF-16 can translate Unicode characters into computer-friendly binary and back again. However, they are not compatible with each other: text saved as UTF-16 will not decode correctly if software reads it as UTF-8, and vice versa.
UTF-8 vs. UTF-16 Character Table
UTF-8 and UTF-16 use different algorithms to map code points to binary strings. As shown in the character table below, the binary output for a given character looks different in each (the UTF-16 bytes are shown in little-endian order, with the low byte first):
Character | UTF-8 binary encoding | UTF-16 binary encoding
A | 01000001 | 01000001 00000000
𠜎 | 11110000 10100000 10011100 10001110 | 01000001 11011000 00001110 11011111
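For readers who want to check UTF-16 output themselves, here is a hedged Python sketch. It assumes little-endian byte order (low byte first), and the helper names `utf16le_bits` and `surrogate_pair` are invented for this example.

```python
def utf16le_bits(ch: str) -> str:
    """Return the UTF-16 (little-endian) bytes of a character as 8-bit groups."""
    return " ".join(f"{byte:08b}" for byte in ch.encode("utf-16-le"))


def surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into two 16-bit surrogate values."""
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits of the offset
    return high, low


print(utf16le_bits("A"))           # two bytes
print(utf16le_bits("\U0002070e"))  # four bytes (a surrogate pair)

high, low = surrogate_pair(0x2070E)
print(f"{high:04X} {low:04X}")     # D841 DF0E
```

Characters up to U+FFFF fit in a single 16-bit unit, while anything higher is split into two units, which is why UTF-16 output is always two or four bytes.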
When should I use UTF-8?
UTF-8 encoding is preferable to UTF-16 on the majority of websites because it uses less memory.
Recall that UTF-8 encodes each ASCII character in just one byte. UTF-16 must encode these same characters in two bytes each. This means that an English text file encoded with UTF-16 would be roughly double the size of the same file encoded with UTF-8.
Another benefit of UTF-8 is its backward compatibility with ASCII. The first 128 characters in the Unicode library match those in the ASCII library, and UTF-8 translates these 128 Unicode characters into the same binary strings as ASCII. As a result, UTF-8 can take a text file formatted by ASCII and convert it to human-readable text without issue.
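This compatibility is simple to demonstrate. A minimal Python sketch, using a sample string of my own choosing:

```python
message = "Hello, UTF-8!"

# Encode the same ASCII-only text with both encodings.
ascii_file = message.encode("ascii")
utf8_file = message.encode("utf-8")

# The bytes are identical, so an ASCII file is already a valid UTF-8 file.
print(ascii_file == utf8_file)  # True

# A UTF-8 decoder reads the ASCII bytes without any conversion step.
decoded = ascii_file.decode("utf-8")
print(decoded)  # Hello, UTF-8!
```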
When should I use UTF-16?
UTF-16 is only more efficient than UTF-8 on some non-English websites.
If a website uses a language with characters farther back in the Unicode library, such as Chinese, Japanese, or Korean, UTF-8 will encode many of those characters as three bytes, whereas UTF-16 encodes the same characters as only two bytes. (Characters that need four bytes in UTF-8, like emojis, also need four bytes in UTF-16.)
Pro tip: If your pages are filled with ABCs and 123s, I’d recommend sticking with UTF-8.
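If you are unsure which encoding is smaller for your content, you can measure it. The Python sketch below compares byte counts for three sample strings that I picked for illustration; your own text will give different numbers.

```python
samples = {
    "English": "Hello, world",
    "Japanese": "こんにちは世界",
    "Emoji": "\U0001f601" * 2,  # two grinning-face emojis
}

sizes = {}
for label, text in samples.items():
    utf8_len = len(text.encode("utf-8"))
    # The "-le" variant avoids counting a 2-byte byte-order mark.
    utf16_len = len(text.encode("utf-16-le"))
    sizes[label] = (utf8_len, utf16_len)
    print(f"{label}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

Keep in mind that the HTML tags around your text are ASCII, so a full web page may still be smaller in UTF-8 even when its visible text is not.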
Here's My Summary of Why and How UTF-8 Encoding is Important
Diving into UTF-8 made me realize how essential it is to the seamless digital experiences we enjoy every day.
Here's a summary of everything I went over:
- Computers store data, including text characters, as binary (1s and 0s).
- ASCII was an early way of encoding, or mapping characters to binary code so that computers could store them. However, ASCII did not provide enough room for non-Latin characters and other symbols to be represented in binary.
- Unicode is a solution to this problem. Unicode assigns a unique “code point” to every character in every human language.
- UTF-8 is a Unicode character encoding method.
- UTF-8 takes the code point for a given Unicode character and translates it into a string of binary. It also does the reverse, reading in binary digits and converting them back to characters.
- UTF-8 has become the most widely used encoding method on the internet due to its ability to store text from any character set efficiently.
- UTF-16 is another encoding method, but is less efficient for storing text files (except for those written in certain non-English languages).
Put Your New Knowledge to Work
Working on this article about the UTF-8 character set has been a fascinating journey. Like most people, I’ve always taken for granted that text on the internet just “works,” no matter the language, script, or platform.
Unicode translation isn’t something you need to think about when browsing or designing websites, and that’s exactly the point — to create a seamless text processing system that functions for all languages and web browsers. If it’s working well, you won’t notice it.
If you find your website’s pages are using up an inordinate amount of space, or if your text is littered with ▢s and �s, I recommend putting your new knowledge of UTF-8 to work. Let’s continue building a better, more accessible internet — one UTF-8 character at a time.
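The � symbol is the Unicode replacement character (U+FFFD), which decoders substitute for bytes that are not valid UTF-8. As a rough diagnostic sketch in Python, assuming the bytes were originally saved as Latin-1 (a scenario I made up for this example):

```python
# Bytes for "café" saved in Latin-1, where the accented letter is one byte.
latin1_bytes = "caf\u00e9".encode("latin-1")

# Strict decoding as UTF-8 fails, because that byte is not valid on its own.
try:
    latin1_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Lenient decoding swaps the invalid byte for the replacement character.
lenient = latin1_bytes.decode("utf-8", errors="replace")

print(strict_failed)  # True
print(lenient)        # caf�
```

If you see this symbol on a page, the usual fix is to re-save the source file as UTF-8 rather than to edit the visible text.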
Editor's note: This post was originally published in August 2020 and has been updated for comprehensiveness.