What is UTF-8 Encoding? A Guide for Non-Programmers

Written by: Jamie Juviler

Have you ever wondered how the internet displays text from every language, whether it’s a blog post in English, an Instagram post in Japanese, or a product description in Arabic — all on the same page? It’s something most of us (including me) take for granted, but at the heart of this seamless communication lies a silent hero: UTF-8 encoding.

UTF-8 is so integral to the internet that it's part of the foundation. It's the default character encoding of the modern web, the backbone of HTML, and the unsung enabler of everything from URLs to viral social media posts to marketing copy. Without it, the web wouldn't be the global, interconnected space we know today.

Before I begin, I recommend familiarizing yourself with the basics of HTML and getting ready to explore some light computer science concepts. Let's unravel the mystery of UTF-8 together.

To understand everything about UTF-8, I’ll walk you through the basics first.

How Computers Store Information

In order to store information, computers use a binary system. In binary, all data is represented in sequences of 1s and 0s. The most basic unit of binary is a bit, which is just a single 1 or 0. The next largest unit of binary, a byte, consists of 8 bits. An example of a byte is “01101011.”

Every digital asset you’ve ever encountered — from software to mobile apps to websites to Instagram stories — is built on this system of bytes, which are strung together in a way that makes sense to computers.

When we refer to file sizes, we’re referencing the number of bytes. For example, a kilobyte is roughly one thousand bytes, and a gigabyte is roughly one billion bytes.

Text is one of many assets that computers store and process. Text is made up of individual characters, each of which is represented in computers by a string of bits. These strings are assembled to form digital words, sentences, paragraphs, romance novels, and so on.
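As a quick illustration (a minimal Python sketch using the built-in ord() and format() functions), you can inspect the numeric code and bit string behind any character:

```python
# Show the numeric code and 8-bit binary string for each character.
for ch in "Hi!":
    code = ord(ch)               # the character's numeric code
    bits = format(code, "08b")   # that code as an 8-bit binary string
    print(ch, code, bits)
# H 72 01001000
# i 105 01101001
# ! 33 00100001
```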


    ASCII: Converting Symbols to Binary

    The American Standard Code for Information Interchange (ASCII) was an early standardized encoding system for text. Encoding is the process of converting characters in human languages into binary sequences that computers can process.

    ASCII’s library includes every upper-case and lower-case letter in the Latin alphabet (A, B, C…), every digit from 0 to 9, and some common symbols (like /, !, and ?). It assigns each of these characters a unique three-digit code and a unique byte.

    ASCII Character Table

    The table below shows examples of ASCII characters with their associated codes and bytes.

| CHARACTER | ASCII CODE | BYTE |
|---|---|---|
| A | 065 | 01000001 |
| a | 097 | 01100001 |
| B | 066 | 01000010 |
| b | 098 | 01100010 |
| Z | 090 | 01011010 |
| z | 122 | 01111010 |
| 0 | 048 | 00110000 |
| 9 | 057 | 00111001 |
| ! | 033 | 00100001 |
| ? | 063 | 00111111 |

    Just as characters come together to form words and sentences in language, binary code does so in text files. So, the sentence “The quick brown fox jumps over the lazy dog” represented in ASCII binary would be:

    01010100 01101000 01100101 00100000 01110001 01110101 01101001 01100011 01101011 00100000 01100010 01110010 01101111 01110111 01101110 00100000 01100110 01101111 01111000 00100000 01101010 01110101 01101101 01110000 01110011 00100000 01101111 01110110 01100101 01110010 00100000 01110100 01101000 01100101 00100000 01101100 01100001 01111010 01111001 00100000 01100100 01101111 01100111 00101110

    That doesn’t mean much to us humans, but it’s a computer’s bread and butter.
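You can reproduce that binary string yourself. This Python sketch encodes the sentence as ASCII and prints each byte in binary:

```python
sentence = "The quick brown fox jumps over the lazy dog."
# encode() turns the text into bytes; format each byte as 8 binary digits
as_binary = " ".join(format(byte, "08b") for byte in sentence.encode("ascii"))
print(as_binary[:26])  # first three bytes: 01010100 01101000 01100101 (T, h, e)
```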

    How many ways can a character be represented in ASCII?

    The number of characters that ASCII can represent is limited to the number of unique bytes available, since each character gets one byte.

    Let’s do the math: there are 256 different ways of grouping eight 1s and 0s together. This gives us 256 different bytes, or 256 ways to represent a character in ASCII.

When ASCII was first published in 1963, this was plenty of room: the standard actually defined only 128 codes (7 bits), enough to represent all the English letters, digits, and symbols developers needed at the time.
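The arithmetic is easy to check in Python:

```python
print(2 ** 8)  # 256 -- distinct values one 8-bit byte can take
print(2 ** 7)  # 128 -- codes defined by the original 7-bit ASCII standard
```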

    But, as computing expanded globally, computer systems began to store text in languages besides English, many of which used non-ASCII characters.

    New systems were created to map other languages to the same set of 256 unique bytes, but having multiple encoding systems was inefficient and confusing. Developers needed a better way to encode all possible characters with one system.

    Unicode: A Way to Store Every Symbol, Ever

Enter Unicode. Unicode is a character standard that solves the space issue of ASCII. Like ASCII, Unicode assigns a unique code, called a code point, to each character.

    However, Unicode’s more sophisticated system can produce over a million code points, more than enough to account for every character in any language.

    Unicode is now the universal standard for encoding all human languages. And yes, it even includes emojis.

    Unicode Character Table

Now, I'll give you some examples of text characters and their matching code points. Each code point begins with "U+" (for Unicode), followed by a unique hexadecimal number that identifies the character.

| CHARACTER | CODE POINT |
|---|---|
| A | U+0041 |
| a | U+0061 |
| 0 | U+0030 |
| 9 | U+0039 |
| ! | U+0021 |
| Ø | U+00D8 |
| ڃ | U+0683 |
| ಚ | U+0C9A |
| 𠜎 | U+2070E |
| 😁 | U+1F601 |

    If you want to learn how code points are generated and what they mean in Unicode, check out this in-depth explanation.

So, with Unicode, we now have a standardized way of representing every character used by every human language in a single library. This solves the issue of multiple labeling systems for different languages: any computer on Earth can use Unicode.

    But Unicode alone doesn’t store words in binary. Computers need a way to translate Unicode into binary so that its characters can be stored in text files.
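In Python, the built-ins ord() and chr() expose these code points directly (a small sketch):

```python
print(hex(ord("A")))   # 0x41    -- written U+0041 in Unicode notation
print(hex(ord("😁")))  # 0x1f601 -- written U+1F601
print(chr(0x1F601))    # 😁      -- chr() turns a code point back into a character
```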

    Here’s where UTF-8 comes in.

    UTF-8: The Character Set in Web Development

    UTF-8 is the most common character encoding method used on the internet today, and is the default character set for HTML5. Over 98% of all websites — likely including your own — store characters this way.

    Additionally, common data transfer methods over the web, like XML and JSON, are encoded with UTF-8 standards.

    Since it’s now the standard method for encoding text on the web, all your site pages and databases should use UTF-8.

    Pro tip: A content management system or website builder will save your files in UTF-8 format by default, but it’s still worth verifying that you’re following this best practice — especially if you’re in the process of redesigning your website. Redesign projects offer a great opportunity to audit your site’s encoding settings and ensure they align with modern web standards.


      How do you indicate UTF-8 in HTML?

      Text files encoded with UTF-8 must indicate this to the software processing them. Otherwise, the software won’t properly translate the binary back into characters. In HTML files, you might see a string of code like the following near the top:

<meta charset="UTF-8">

      This tells the browser that the HTML file is encoded by UTF-8, so that the browser can translate it back to legible text.
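The same mismatch can happen anywhere text is saved and reopened. In this Python sketch, the bytes of "café" are decoded once with the right charset and once with the wrong one, producing the classic garbled output:

```python
text = "café"
encoded = text.encode("utf-8")     # the bytes actually stored
print(encoded.decode("utf-8"))     # café  -- decoded with the right charset
print(encoded.decode("latin-1"))   # cafÃ© -- decoded with the wrong charset
```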

      UTF-8 Character Table

      Below is the same character table from above, with the UTF-8 character set output added for each. Notice how some characters are represented as just one byte, while others use more.

| CHARACTER | CODE POINT | UTF-8 BINARY ENCODING |
|---|---|---|
| A | U+0041 | 01000001 |
| a | U+0061 | 01100001 |
| 0 | U+0030 | 00110000 |
| 9 | U+0039 | 00111001 |
| ! | U+0021 | 00100001 |
| Ø | U+00D8 | 11000011 10011000 |
| ڃ | U+0683 | 11011010 10000011 |
| ಚ | U+0C9A | 11100000 10110010 10011010 |
| 𠜎 | U+2070E | 11110000 10100000 10011100 10001110 |
| 😁 | U+1F601 | 11110000 10011111 10011000 10000001 |

      Understanding UTF-8 Character Conversion to Bytes

      I have demonstrated in the table above how some characters take one byte, whereas others take more. But why would UTF-8 convert some characters to one byte, and others up to four bytes?

      To save memory.

      By using less space to represent more common characters (i.e., ASCII characters), UTF-8 reduces file size while allowing for a much larger number of less common characters. These less common characters are encoded into two or more bytes, but this is okay if they’re stored sparingly.

      Spatial efficiency is a key advantage of UTF-8 encoding. If, instead, every Unicode character was represented by four bytes, a text file written in English would be four times the size of the same file encoded with UTF-8.
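Python's encode() method makes the variable lengths easy to see. This sketch includes the Kannada letter ಚ as a three-byte example:

```python
# Characters later in the Unicode library need more UTF-8 bytes.
for ch in ["A", "Ø", "ಚ", "😁"]:
    encoded = ch.encode("utf-8")
    print(ch, "->", len(encoded), "byte(s)")
# A -> 1 byte(s)
# Ø -> 2 byte(s)
# ಚ -> 3 byte(s)
# 😁 -> 4 byte(s)
```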

      Are there other encoding systems besides UTF-8?

There are other encoding systems for Unicode besides UTF-8, but UTF-8 is unique because its code units are single bytes; one byte is eight bits, hence the "-8" in its name.

      More specifically, UTF-8 converts a code point (which represents a single character in Unicode) into a set of one to four bytes. The first 128 characters in the Unicode library — the characters I talked about while explaining ASCII above — are represented as one byte. Characters that appear later in the Unicode library are encoded as two-byte, three-byte, and eventually four-byte binary units.

      Difference Between UTF-8 and UTF-16

      As I mentioned, UTF-8 is not the only encoding method for Unicode characters — there’s also UTF-16. These methods differ in the number of bytes they need to store a character:

      • UTF-8 encodes a character into a binary string of one, two, three, or four bytes.
      • UTF-16 encodes a Unicode character into a string of either two or four bytes.
      • In UTF-8, the smallest binary representation of a character is one byte, or eight bits.
      • In UTF-16, the smallest binary representation of a character is two bytes, or sixteen bits.

      Both UTF-8 and UTF-16 can translate Unicode characters into computer-friendly binary and back again. However, they are not compatible with each other.

      UTF-8 vs. UTF-16 Character Table

UTF-8 and UTF-16 use different algorithms to map code points to binary strings. As the character table below shows, the binary output for a given character differs between the two:

| Character | UTF-8 binary encoding | UTF-16 binary encoding (big-endian) |
|---|---|---|
| A | 01000001 | 00000000 01000001 |
| 𠜎 | 11110000 10100000 10011100 10001110 | 11011000 01000001 11011111 00001110 |
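You can check these encodings yourself with Python (using big-endian UTF-16 here, so no byte-order mark is added):

```python
ch = "𠜎"
print(ch.encode("utf-8").hex(" "))      # f0 a0 9c 8e -- four UTF-8 bytes
print(ch.encode("utf-16-be").hex(" "))  # d8 41 df 0e -- a UTF-16 surrogate pair
```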

      When should I use UTF-8?

      UTF-8 encoding is preferable to UTF-16 on the majority of websites because it uses less memory.

      Recall that UTF-8 encodes each ASCII character in just one byte. UTF-16 must encode these same characters in either two or four bytes. This means that an English text file encoded with UTF-16 would be at least double the size of the same file encoded with UTF-8.

      Another benefit of using UTF-8 character sets is its backward compatibility with ASCII. The first 128 characters in the Unicode library match those in the ASCII library, and UTF-8 translates these 128 Unicode characters into the same binary strings as ASCII. As a result, UTF-8 can take a text file formatted by ASCII and convert it to human-readable text without issue.
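This backward compatibility is easy to confirm in Python: for pure ASCII text, the two encodings produce identical bytes.

```python
text = "The quick brown fox jumps over the lazy dog."
# For ASCII-only text, UTF-8 and ASCII encode to the same bytes.
assert text.encode("utf-8") == text.encode("ascii")
print(len(text.encode("utf-8")), "bytes either way")
```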

      When should I use UTF-16?

      UTF-16 is only more efficient than UTF-8 on some non-English websites.

If a website uses a language whose characters sit farther back in the Unicode library (roughly the range U+0800 to U+FFFF, which covers many Asian scripts), UTF-8 encodes each of those characters as three bytes, whereas UTF-16 encodes them as only two.

      Pro tip: If your pages are filled with ABCs and 123s, I’d recommend sticking with UTF-8.
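For a concrete comparison, here is a short Japanese greeting encoded both ways (a sketch; big-endian UTF-16 is used to avoid adding a byte-order mark):

```python
text = "こんにちは"  # five characters, each in the U+0800..U+FFFF range
print(len(text.encode("utf-8")))      # 15 -- three bytes per character
print(len(text.encode("utf-16-be")))  # 10 -- two bytes per character
```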

      Here's My Summary of Why and How UTF-8 Encoding is Important

      Diving into UTF-8 made me realize how essential it is to the seamless digital experiences we enjoy every day.

      Here's a summary of everything I went over:

      • Computers store data, including text characters, as binary (1s and 0s).
• ASCII was an early way of encoding, or mapping characters to binary code so that computers could store them. However, ASCII did not provide enough room to represent characters from non-Latin alphabets in binary.
      • Unicode is a solution to this problem. Unicode assigns a unique “code point” to every character in every human language.
      • UTF-8 is a Unicode character encoding method.
      • UTF-8 takes the code point for a given Unicode character and translates it into a string of binary. It also does the reverse, reading in binary digits and converting them back to characters.
      • UTF-8 has become the most widely used encoding method on the internet due to its ability to store text from any character set efficiently.
      • UTF-16 is another encoding method, but is less efficient for storing text files (except for those written in certain non-English languages).

      Put Your New Knowledge to Work

Working on this article about the UTF-8 character set has been a fascinating journey. Like most people, I've always taken for granted that text on the internet just "works," no matter the language, script, or platform.

      Unicode translation isn’t something you need to think about when browsing or designing websites, and that’s exactly the point — to create a seamless text processing system that functions for all languages and web browsers. If it’s working well, you won’t notice it.

      If you find your website’s pages are using up an inordinate amount of space, or if your text is littered with ▢s and �s, I recommend putting your new knowledge of UTF-8 to work. Let’s continue building a better, more accessible internet — one UTF-8 character at a time.

      Editor's note: This post was originally published in August 2020 and has been updated for comprehensiveness.
