
What is UTF-8 encoding? A walkthrough for non-programmers

Written by: Darrielle Evans


Early in my career, I worked as a technical consultant on the development team of a global health and wellness brand during a scaled digital transformation, which included an international website rollout for both consumers and their distributors. The project was very complex.


The content had to meet local regulations, support multiple languages, and be delivered carefully through the custom headless CMS (Adobe Experience Manager) that we developed. It quickly became apparent how easily things could break when characters like ñ, ç, or entire Chinese glyphs weren’t appropriately encoded.

Although I had some prior knowledge of encoding, I quickly realized just how foundational UTF-8 is to building websites that work across borders.

In this post, I’ll break down what UTF-8 actually is, why it matters for anyone working on web projects, and how it quietly powers the multilingual, global digital experiences we use daily. Before I get started, I do recommend that you deepen your understanding by reviewing the basics of Unicode, as it’s the standard that makes UTF-8 possible.


To understand everything about UTF-8, I’ll walk you through the basics first.

How Computers Store Information

In order to store information, computers use a binary system. In binary, all data is represented in sequences of 1s and 0s. The most basic unit of binary is a bit, which is just a single 1 or 0. The next largest unit of binary, a byte, consists of 8 bits. An example of a byte is “01101011.”
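To make the example byte concrete, here is a minimal Python sketch that reads the bit string as a number and writes it back out as eight bits. The variable names are only illustrative.

```python
# The example byte from the text, written as a string of bits.
bits = "01101011"

# Interpret the bits as a base-2 number.
value = int(bits, 2)

# Format the number back into an 8-digit binary string.
round_trip = format(value, "08b")

print(value)       # 107
print(round_trip)  # 01101011
```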

Every digital asset you’ve ever encountered — from software to mobile apps to websites to Instagram stories — is built on this system of bytes, which are strung together in a way that makes sense to computers.

When we refer to file sizes, we’re referencing the number of bytes. For example, a kilobyte is roughly one thousand bytes, and a gigabyte is roughly one billion bytes.

Text is one of many assets that computers store and process. Text is made up of individual characters, each of which is represented in computers by a string of bits. These strings are assembled to form digital words, sentences, paragraphs, romance novels, and so on.


    ASCII: Converting Symbols to Binary

    The American Standard Code for Information Interchange (ASCII) was an early standardized encoding system for text. Encoding is the process of converting characters in human languages into binary sequences that computers can process.

    ASCII’s library includes every upper-case and lower-case letter in the Latin alphabet (A, B, C…), every digit from 0 to 9, and some common symbols (like /, !, and ?). It assigns each of these characters a unique numeric code (shown below as three decimal digits) and a matching byte.

    ASCII Character Table

    The table below shows examples of ASCII characters with their associated codes and bytes.

    CHARACTER  ASCII CODE  BYTE
    A          065         01000001
    a          097         01100001
    B          066         01000010
    b          098         01100010
    Z          090         01011010
    z          122         01111010
    0          048         00110000
    9          057         00111001
    !          033         00100001
    ?          063         00111111
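The rows of the table can be regenerated with a few lines of Python. This is only a sketch: `ascii_row` is a hypothetical helper name, and the built-in `ord` supplies the numeric code.

```python
# Characters taken from the table above.
characters = ["A", "a", "B", "b", "Z", "z", "0", "9", "!", "?"]


def ascii_row(char: str) -> tuple[str, str, str]:
    """Return (character, three-digit decimal code, eight-bit byte)."""
    code = ord(char)  # numeric code of the character
    return char, format(code, "03d"), format(code, "08b")


for row in map(ascii_row, characters):
    print(*row)
```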

    Just as characters come together to form words and sentences in language, bytes come together to form text in files. So, the sentence “The quick brown fox jumps over the lazy dog.” represented in ASCII binary would be:

    01010100 01101000 01100101 00100000 01110001 01110101 01101001 01100011 01101011 00100000 01100010 01110010 01101111 01110111 01101110 00100000 01100110 01101111 01111000 00100000 01101010 01110101 01101101 01110000 01110011 00100000 01101111 01110110 01100101 01110010 00100000 01110100 01101000 01100101 00100000 01101100 01100001 01111010 01111001 00100000 01100100 01101111 01100111 00101110

    That doesn’t mean much to us humans, but it’s a computer’s bread and butter.
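For readers who want to try the conversion themselves, the sketch below turns the sentence into bits and back again. It includes the closing period, which is what the final byte 00101110 represents. The helper names `to_bits` and `from_bits` are made up for this example.

```python
sentence = "The quick brown fox jumps over the lazy dog."


def to_bits(text: str) -> str:
    """Encode text as ASCII and render each byte as eight bits."""
    return " ".join(format(byte, "08b") for byte in text.encode("ascii"))


def from_bits(bits: str) -> str:
    """Reverse of to_bits: parse each group of eight bits back into a character."""
    return bytes(int(group, 2) for group in bits.split()).decode("ascii")


encoded = to_bits(sentence)
print(encoded)
print(from_bits(encoded))
```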

    How many ways can a character be represented in ASCII?

    ASCII was originally designed as a 7-bit system, which means it can represent 128 unique characters (values 0–127). That covers the English alphabet, numbers, punctuation, and some control characters like carriage return and line feed.

    A common misconception is that ASCII uses 8 bits (a full byte), which would allow for 256 characters. In reality, standard ASCII only ever defined 128. The “extra” bit in an 8-bit byte was often used for error checking, formatting, or left unused.

    Later, different systems did take advantage of the full 8-bit range to create “extended ASCII” sets with up to 256 characters. But because each system defined those extra slots differently, compatibility issues were common. For example, the byte value 130 displays as “é” under the IBM PC code page 437 but as “‚” (a single low quotation mark) under Windows-1252.
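The sketch below illustrates both points: how many values 7 and 8 bits allow, and how one byte can mean different things under different extended sets. The two code pages used here, `cp437` and `cp1252`, are simply examples that ship with Python's standard codecs.

```python
# 7 bits give 2**7 distinct values; a full 8-bit byte gives 2**8.
ascii_slots = 2 ** 7  # 128
byte_slots = 2 ** 8   # 256

# The same byte value, 130, interpreted under two legacy code pages.
raw = bytes([130])
as_cp437 = raw.decode("cp437")    # IBM PC / DOS code page
as_cp1252 = raw.decode("cp1252")  # Windows Western European code page

print(ascii_slots, byte_slots)
print(repr(as_cp437), repr(as_cp1252))
```

The byte itself never changes. Only the lookup table used to interpret it does, which is exactly why files moved between systems could end up garbled.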

    These kinds of conflicts highlighted the need for a universal standard that could store every symbol, in every language, consistently.

    Unicode: A Way to Store Every Symbol, Ever

    ASCII was fine when we were only thinking about English, but once the internet went global, it just couldn’t keep up. That’s where Unicode comes in. Instead of cramming characters into a limited set of slots, Unicode gives every symbol its own unique identifier, called a code point. Think of it like giving every letter, number, or emoji its own street address — no matter where you are, you’ll always know exactly what it is.

    Unicode has space for over 1.1 million code points, which is more than enough to cover every language, past and present, plus extras like math symbols, currency signs, and emojis. It’s the reason text doesn’t fall apart when you switch between countries or devices.
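For the curious, the exact size of that space can be checked from Python, whose `sys.maxunicode` reports the highest code point it supports. This is a quick sketch, not something you need for day-to-day work.

```python
import sys

# Highest code point available: U+10FFFF.
highest = sys.maxunicode

# Code points run from 0 up to and including the highest value.
total_code_points = highest + 1

print(f"U+{highest:X}", total_code_points)  # U+10FFFF 1114112
```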

    Unicode Character Table

    Now, I’ll give you some examples of text characters and their matching code points. Each code point begins with “U+” (“U” for “Unicode”), followed by the character’s number written in hexadecimal.

    CHARACTER  CODE POINT
    A          U+0041
    a          U+0061
    0          U+0030
    9          U+0039
    !          U+0021
    Ø          U+00D8
    ڃ          U+0683
    ಚ          U+0C9A
    𠜎          U+2070E
    😁          U+1F601
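If you'd like to look up a code point yourself, Python's built-in `ord` returns a character's number and `chr` goes the other way. In this sketch, `code_point` is a hypothetical helper that formats the number in the usual U+ notation.

```python
def code_point(char: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    return f"U+{ord(char):04X}"


for char in ["A", "a", "0", "9", "!", "Ø", "ڃ", "ಚ", "𠜎", "😁"]:
    print(char, code_point(char))

# chr() converts a number back into its character.
print(chr(0x1F601))
```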

    If you want to learn how code points are generated and what they mean in Unicode, check out this in-depth explanation.

    So, now with Unicode I have a standardized way of representing every character used by every human language in a single library. This solves the issue of multiple labeling systems for different languages — any computer on Earth can use Unicode.

    But Unicode alone doesn’t store words in binary. Computers need a way to translate Unicode into binary so that its characters can be stored in text files.

    Here’s where UTF-8 comes in.


      UTF-8: The Character Encoding in Web Development

      UTF-8 is the most common character encoding used on the internet today. Actually, it’s the default for HTML5. Over 98% of all websites (probably including yours) store characters this way.

      You’ll also see UTF-8 show up in common data formats like XML and JSON. While these formats technically can use other encodings, UTF-8 is the standard for web data transfer.

      That’s why I recommend making sure all your site pages and databases are using UTF-8. Most content management systems and website builders will save files in UTF-8 automatically, but it’s still worth double-checking, especially if you’re redesigning your site. A redesign is the perfect time to audit your encoding settings and confirm everything lines up with modern web standards.

      How to Check and Update Your Site’s Encoding Settings

      Making sure your site is using UTF-8 isn’t complicated. Here are a few ways you can confirm or update your settings if necessary.

      1. Check your HTML <head> tag.

      Look for a meta tag like this:

      <meta charset="UTF-8">

      If it’s missing or shows another encoding (like ISO-8859-1), update it to UTF-8.

      2. Review your CMS settings.

      WordPress: UTF-8 is the default. To confirm, check your wp-config.php file for DB_CHARSET, which should be set to utf8 or, on newer installs, utf8mb4. (Older WordPress versions also exposed an encoding field under Settings > Reading.)

      Other platforms (Squarespace, Wix, Shopify, etc.) usually enforce UTF-8 automatically, but it’s still good to review the documentation or encoding settings.

      3. Check your database.

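      If your site pulls from a database (like MySQL), make sure the tables and columns are set to utf8mb4. In MySQL, the older utf8 character set stores at most three bytes per character, so it can’t hold emojis and other four-byte characters. The utf8mb4 character set covers the full UTF-8 range.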

      4. Test your pages.

      To test your pages, you can use Google Chrome. Open your site in Chrome, right-click, and choose View Page Source. If you see UTF-8 in the meta tag and your characters (especially special ones like accents or emojis) display correctly, then you are all set.

      When it comes to HTML, your site needs to tell the browser or software that it’s using UTF-8, or the text won’t render correctly. This is what the <meta charset="UTF-8"> tag does. It signals how to translate the file back into readable characters.
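The checks above can also be scripted. The following is a rough sketch using only Python's standard library. The names `CharsetFinder`, `declared_charset`, `is_valid_utf8`, and `needs_utf8mb4` are hypothetical helpers written for this example, and the sample page is invented. It looks for the meta tag, confirms the bytes really decode as UTF-8, and flags text that contains four-byte characters such as emojis.

```python
from html.parser import HTMLParser
from typing import Optional


class CharsetFinder(HTMLParser):
    """Collects the charset declared in a <meta charset="..."> tag, if any."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Keep only the first declaration found.
        if tag == "meta" and self.charset is None:
            declared = dict(attrs).get("charset")
            if declared:
                self.charset = declared.strip().lower()


def declared_charset(html: str) -> Optional[str]:
    """Return the declared charset in lower case, or None if there is none."""
    finder = CharsetFinder()
    finder.feed(html)
    return finder.charset


def is_valid_utf8(data: bytes) -> bool:
    """True if the raw bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def needs_utf8mb4(text: str) -> bool:
    """True if any character lies above U+FFFF and so takes four bytes in UTF-8."""
    return any(ord(char) > 0xFFFF for char in text)


page = '<html><head><meta charset="UTF-8"><title>Café 😁</title></head></html>'
print(declared_charset(page))
print(is_valid_utf8(page.encode("utf-8")))
print(needs_utf8mb4(page))
```

A declared charset and a clean decode are separate checks on purpose: a page can claim UTF-8 in its meta tag while the file on disk was saved in something else.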

      UTF-8 Character Table

      Below is the same character table from above, with the UTF-8 binary encoding added for each. Notice how some characters are represented as just one byte, while others use more.

      CHARACTER  CODE POINT  UTF-8 BINARY ENCODING
      A          U+0041      01000001
      a          U+0061      01100001
      0          U+0030      00110000
      9          U+0039      00111001
      !          U+0021      00100001
      Ø          U+00D8      11000011 10011000
      ڃ          U+0683      11011010 10000011
      ಚ          U+0C9A      11100000 10110010 10011010
      𠜎          U+2070E     11110000 10100000 10011100 10001110
      😁          U+1F601     11110000 10011111 10011000 10000001
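To see where those bit patterns come from, here is a hand-rolled sketch of the UTF-8 bit layout. It is meant for illustration only: `utf8_encode_code_point` and `as_bits` are made-up names, the function skips the validity checks a real encoder performs (for example, rejecting surrogate code points), and in practice you would just call Python's built-in `str.encode("utf-8")`.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point by hand, following the UTF-8 bit layout."""
    if cp < 0x80:
        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def as_bits(data: bytes) -> str:
    """Render bytes as space-separated groups of eight bits."""
    return " ".join(format(byte, "08b") for byte in data)


for char in ["A", "Ø", "ڃ", "ಚ", "𠜎", "😁"]:
    encoded = utf8_encode_code_point(ord(char))
    # Cross-check the hand-rolled result against Python's built-in encoder.
    assert encoded == char.encode("utf-8")
    print(char, as_bits(encoded))
```

The leading bits of the first byte (0, 110, 1110, or 11110) tell a reader how many bytes belong to the character, and every following byte starts with 10.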

      Understanding UTF-8 Character Conversion to Bytes

      I’ve shown in the table above how some characters take just one byte while others need more. But why does UTF-8 use one byte for some characters and up to four for others? The answer is simple: to save space.

      This space efficiency is one of UTF-8’s biggest advantages. If every Unicode character always used four bytes, a simple English text file would be four times larger than it needs to be.

      Here’s a quick example:

      • “Hello world” → 11 bytes (all single-byte characters)
      • “Bonjour à tous” → 15 bytes (14 characters: the accented “à” takes two bytes while the rest take one)

      UTF-8’s flexibility means you get the best of both worlds: compact file sizes for everyday text, with the ability to represent virtually any character when you need it.
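You can count the bytes yourself with a couple of lines of Python. In this sketch, `utf8_size` is a hypothetical helper, and the "fixed four bytes" figure is just the character count multiplied by four for comparison.

```python
def utf8_size(text: str) -> tuple[int, int]:
    """Return (number of characters, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))


for sample in ["Hello world", "Bonjour à tous"]:
    chars, size = utf8_size(sample)
    print(f"{sample!r}: {chars} characters, {size} bytes in UTF-8, "
          f"{chars * 4} bytes at a fixed four bytes per character")
```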

      Are there other encoding systems besides UTF-8?

      UTF-8 may be the dominant standard today, but it isn’t the only encoding system that exists. A few others you’ll come across, mostly in older files or legacy systems, include:

      • ASCII. The original 7-bit system, supporting just 128 characters (English letters, digits, and basic punctuation).
      • ISO-8859-1 (Latin-1). An extended version of ASCII that added support for Western European characters like ñ or ü. This was the default for early versions of HTML.
      • UTF-16. Another Unicode encoding that uses two bytes for most characters but can extend to four bytes for less common ones. It’s still used internally by some programming languages like Java and C#.
      • UTF-32. A fixed-width encoding where every character takes four bytes. Easy for computers to process but very inefficient in terms of file size, so it’s rarely used for web content.

      These systems paved the way for modern encoding, but they each had limitations. ASCII and ISO-8859-1 couldn’t represent every language. UTF-16 and UTF-32 could, but they need more storage space for mostly Latin text. UTF-8 struck the balance: compact enough for common characters, yet flexible enough to handle every symbol in Unicode.
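The trade-offs in the list above show up quickly if you encode a few short words in each system. This is a small sketch with sample words chosen by me. `encoded_size` is a hypothetical helper that returns `None` when an encoding simply can't represent the text.

```python
from typing import Optional

samples = {
    "English": "Hello",
    "Spanish": "mañana",
    "Chinese": "你好",
}
# The "-le" variants skip the byte-order mark so only the characters are counted.
encodings = ["ascii", "latin-1", "utf-8", "utf-16-le", "utf-32-le"]


def encoded_size(text: str, encoding: str) -> Optional[int]:
    """Return the byte count, or None if the encoding cannot represent the text."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None


for label, text in samples.items():
    sizes = {name: encoded_size(text, name) for name in encodings}
    print(label, sizes)
```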

      Difference Between UTF-8 and UTF-16

      Both UTF-8 and UTF-16 are ways of encoding the same Unicode characters. They differ by how they store them.

      UTF-8 uses a variable-length system where each character takes one to four bytes. Common characters like English letters only need one byte, while less common symbols may take more. This makes UTF-8 efficient for text written mostly in Latin letters, like English, and keeps file sizes small.

      UTF-16 uses two bytes for most characters and four for the rest. This means it can be more compact for languages with lots of non-Latin characters (like Chinese or Hindi), but it also uses about twice as much memory for plain English text compared to UTF-8.

      I’ll never forget when I had to really dig into this difference during a client project. We were pulling text data from an older Windows system, and half the characters were coming through as unreadable boxes. At first, I thought the file was corrupted. In actuality, the source was exporting everything in UTF-16, but our site was expecting UTF-8. That was a long day, but I learned a very valuable lesson about the difference between the two and how it can mess with how content appears if not set up correctly.
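A mismatch like this is easy to reproduce. The sketch below is not the actual project data, just an invented one-word example. It writes text as UTF-16, reads it back as if it were UTF-8, and then reads it correctly.

```python
text = "Café"

# What the exporting system wrote: UTF-16 (Python adds a byte-order mark).
exported = text.encode("utf-16")

# A strict UTF-8 reader rejects these bytes outright.
try:
    exported.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient reader carries on and substitutes replacement characters.
misread = exported.decode("utf-8", errors="replace")

# Decoding with the encoding that was actually used recovers the text.
recovered = exported.decode("utf-16")

print(strict_ok)
print(repr(misread))
print(recovered)
```

The fix in a case like this is usually to convert the export to UTF-8 once, at the boundary between the two systems, rather than patching individual characters afterwards.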

      UTF-8 vs. UTF-16 Character Table

      UTF-8 and UTF-16 use different algorithms to map code points to binary strings. As shown in the character table below, the binary output for a given character differs between the two (the UTF-16 bytes are shown in little-endian order, low byte first):

      Character  UTF-8 binary encoding                 UTF-16 binary encoding
      A          01000001                              01000001 00000000
      𠜎          11110000 10100000 10011100 10001110   01000001 11011000 00001110 11011111
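For anyone wondering how UTF-16 arrives at four bytes for 𠜎, the sketch below works through the split into two 16-bit units (a "surrogate pair") by hand. The helper names are invented for this example, the bytes are written in little-endian order (low byte first), and the result is cross-checked against Python's built-in `utf-16-le` codec.

```python
def utf16_code_units(cp: int) -> list[int]:
    """Split a code point into one or two 16-bit UTF-16 code units."""
    if cp < 0x10000:
        return [cp]                  # fits in a single 16-bit unit
    offset = cp - 0x10000            # 20 bits remain
    high = 0xD800 + (offset >> 10)   # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits -> low surrogate
    return [high, low]


def utf16_le_bits(char: str) -> str:
    """Render a character's UTF-16 bytes, low byte of each unit first."""
    units = utf16_code_units(ord(char))
    data = b"".join(unit.to_bytes(2, "little") for unit in units)
    assert data == char.encode("utf-16-le")  # cross-check with the built-in codec
    return " ".join(format(byte, "08b") for byte in data)


print(utf16_le_bits("A"))
print(utf16_le_bits("𠜎"))
```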

      When should I use UTF-8?

      For most websites, UTF-8 is the clear choice. It keeps memory use low by storing common characters, like English letters and numbers, in a single byte. By comparison, UTF-16 needs two bytes for each of those same characters. That means an English text file saved in UTF-16 would be about twice the size of one saved in UTF-8.

      Another advantage is backward compatibility. The first 128 characters in Unicode line up exactly with ASCII, and UTF-8 stores each of them as the same single byte. Because of that, every ASCII file is already a valid UTF-8 file, so UTF-8 software can read and display older ASCII text without breaking. That makes it an easy fit for the modern web while still honoring the systems that came before it.
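That compatibility is simple to confirm. In this short sketch, the sample sentence is made up. Bytes produced by an ASCII encoder are read back as UTF-8 without any conversion step.

```python
# Bytes written by an ASCII-only system...
legacy_bytes = "Plain ASCII text, 100% compatible!".encode("ascii")

# ...decode unchanged when read as UTF-8.
as_utf8 = legacy_bytes.decode("utf-8")

# For ASCII-only text, both encodings produce identical bytes.
same_bytes = as_utf8.encode("utf-8") == legacy_bytes

# The same holds for every one of the 128 ASCII values.
all_match = all(chr(i).encode("utf-8") == bytes([i]) for i in range(128))

print(as_utf8, same_bytes, all_match)
```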

      When should I use UTF-16?

      UTF-16 makes sense in a smaller set of cases, mainly for sites or systems that use languages filled with non-Latin characters. In those situations, UTF-8 typically needs three bytes per character, while UTF-16 can often get away with just two. That difference can make it more efficient for certain scripts, like Chinese or Hindi, where multi-byte characters are the norm.
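Here is a quick way to compare the two on your own content. This is only a sketch: `compare` is a hypothetical helper, the sample words are mine, and the UTF-16 count leaves out the optional two-byte byte-order mark.

```python
def compare(text: str) -> dict[str, int]:
    """Byte counts for the same text in UTF-8 and UTF-16 (no byte-order mark)."""
    return {
        "characters": len(text),
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


for sample in ["Hello", "你好世界", "नमस्ते"]:
    print(sample, compare(sample))
```

Keep in mind that real web pages also contain HTML tags, attribute names, and URLs, which are plain ASCII, so the saving on a whole page is smaller than on the visible text alone.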

      For the vast majority of websites though, especially those centered on English or other Latin-based languages, UTF-8 is still the most practical and efficient option.

      Pro tip: If your pages are filled with ABCs and 123s, I’d recommend sticking with UTF-8.

      Here’s my summary of why and how UTF-8 encoding is important.

      The more I’ve learned about UTF-8, the more I see it as one of those invisible details that quietly makes the Internet feel seamless. Most of the time, I don’t even think about it while I’m coding — until something breaks. A garbled character or a bloated file size is usually my reminder that encoding isn’t set up correctly.

      That’s why I make it a habit to confirm my projects are using UTF-8 from the start. It doesn’t take long, but it saves me from headaches later and ensures that my work is accessible to anyone, anywhere. I think of UTF-8 as one of those quiet essentials: You don’t notice it when it’s right, but you’ll definitely notice when it’s wrong.

      Editor's note: This post was originally published in August 2020 and has been updated for comprehensiveness.
