Let's start with a hypothetical scenario. We work for a commercial software company, and our support team has received feedback from Customer A and Customer B. Both customers would like to see a different feature added to our platform. With limited development resources, how do we choose which feature to pursue first?
We need more data to determine priorities. Checking our point of sales system, we can now see that Customer A has subscribed to a more expensive tier of our product than Customer B. However, we can see from our renewals database that Customer A has elected not to renew their subscription while Customer B is renewing.
By combining relevant data from different sources across our organization, we have a larger understanding of the issue and can now redefine the original question: do we attempt to win Customer A back by solving their pain points? Or do we prioritize Customer B's needs since we know they will continue using the product? The answer will be determined by our organization's business priorities, but we can see now how data-driven context can frame our decisions to be more strategic.
If we don't have a centralized source for all of this information, then it becomes that much harder to make informed decisions. That's why organizations have turned to ETL to provide the context needed to understand their data and set priorities that are supported by the statistics they value.
What is ETL?
ETL stands for “extract, transform, load.” It is a standard model for organizations seeking to integrate data from multiple sources into one centralized data repository.
ETL's main benefits are:
- Quality: ETL improves data quality by transforming data from different databases, applications, and systems to meet internal and external compliance requirements. This consolidation provides historical context since all relevant data is cataloged for discovery, thus removing blind spots in decision making.
- Consistency: ETL simplifies analysis by transforming data to follow a universal standard. ETL also improves the accuracy of calculations and predictions when all data is archived and searchable.
- Speed: ETL improves the speed of decision making by removing the need to query multiple data sources which all may have varying response times to build a complete picture.
For an alternative explanation of ETL and its benefits, take a look at this video:
So, how does ETL work? Let's break the process down.
Extract, Transform, Load: The ETL Process
As the acronym implies, ETL consists of three separate phases: extract, transform, load. We'll examine each process individually.
Most businesses rely on multiple streams of data from different sources and formats. We can think back to the hypothetical example of Customer A and Customer B and the data points pulled on each customer. Before that data can be analyzed, it must first be located, copied, and moved to the central datastore. This is the extract phase of the ETL process.
It's worth noting that data can come from various sources, not just traditional databases. Raw data can be extracted from unstructured sources like documents, emails, and images, or structured sources like databases. Extraction is how all this disparate data is located and copied so that we can format it to meet our needs in the next phase.
Now that the data has been collected, it needs to be processed. Since this information is coming from various systems in various formats, we need to manipulate it to maintain data integrity and make it queryable. Following predefined rules, the transformation process will clean, standardize, deduplicate, verify, map, and sort the data to ensure it's ready to load to the datastore in the next phase.
Now, the transformed data is loaded into the datastore. There are two common methods for uploading data: full loading and incremental loading.
With full loading, every datapoint collected during the extract and transform phases is populated into new, unique records in the datastore. While this may sound ideal, this method can lead to datasets that increase exponentially over time and become unmanageable.
A more resource-friendly approach is the incremental loading method, where incoming data is compared with existing data and unique information is populated as new records. Though less comprehensive than the full loading approach, incremental loading requires smaller data warehouses that require less maintenance.
The loading process itself can also take place at different time intervals. Full loading and incremental loading typically happen at scheduled intervals. Some organizations have completely synchronized their process so that new data is extracted, transformed, and loaded as soon as it's recorded at its origin, providing real-time visibility. However, this requires tight integration between data sources, ETL tools, and the datastore and may not be suitable in all scenarios.
This video provides another examination of the ETL process:
The ETL Model vs. the ELT Model
ETL is how organizations first captured all of their data and gained greater visibility into their operations. However, with the rise of cloud computing, some businesses are now turning to the extract, load, transform (ELT) model instead. We'll compare both, starting with the ETL model.
The Traditional ETL Model
In the traditional ETL model, data is extracted from data sources, transformed to meet validation requirements, and loaded into an enterprise data warehouse. Data warehouses have strict schema, metadata, and data validation rules to ensure all data is structured and queryable according to the organization's requirements. This, in turn, means that all data must be transformed to meet those requirements in a staging area before reaching the datastore.
One drawback of this approach is that the data processing happens before it reaches its final destination, and if the dataset is large or complex, this can delay its delivery. ELT is a new approach organizations are taking to solve this issue.
The Cloud-Enabled ELT Model
In the ELT model, data is extracted, loaded to the datastore, and then transformed to meet validation rules. The main advantage of this approach is that the processing happens in the datastore itself, not the ELT tool that collects the data. With the widespread adoption of cloud computing, companies are leveraging the computing power of the cloud service provider's (CSP's) infrastructure, which will lead to rapid data delivery and faster processing times.
Another advantage of the ELT model is that it is usually paired with a data lake as the target datastore. Data lakes differ from data warehouses because they do not need the data to be structured according to schemas prior to intake. This also allows tools that can work with unstructured data, such as AI, to start analyzing and acting on the data sooner while it's being transformed "at rest."
The differences between the ETL and ELT models are also explained in the video below:
Now that we have a greater understanding of ETL and its process, how is it implemented in the real world?
ETL Use Cases
ETL can be used in multiple ways to move businesses forward, including (but not limited to) the following scenarios.
1. Data Warehousing
A core tenet of ETL is to centralize information into data warehouses, and that remains its primary use case. Collecting, standardizing, and storing data enables the other use cases we'll explore next.
2. Business Intelligence
Returning to our hypothetical example, business intelligence helped us reframe the Customer A and Customer B dilemma. ETL provides a single source of truth that allows businesses to analyze data from across the organization. Armed with this data, analysts can provide hard statistics to leaders and stakeholders that inform business decisions.
3. Machine Learning and AI
Machine learning and AI are fast-emerging technologies, and high-quality, comprehensive data is critical to their function. Machine learning is the process through which software learns and refines its functions without explicit rules to define behavior, and these algorithms are driven by statistical models to determine patterns and draw inferences. These models are built on an organization's data, making it crucial to ensure these tools have access to the best available information to power their decision making.
4. Cloud Migration
Organizations are taking advantage of the cloud to host their data and mission-critical applications, mainly high availability. To ensure all their data makes it to the CSP, these businesses use ETL or ELT to collect the data from their various on-premise databases or cloud-hosted business applications and load it to their cloud datastore.
5. Marketing Data Integration
In the digital world, customers have more ways to interact with a business than ever before. Customer behavior data can come from digital sources like eCommerce platforms, social networks, websites, and mobile applications. Data can also be collected from business applications like CRMs to track lead quality through call data, purchasing behavior, support tickets, and other metrics.
This presents a challenge for marketers who want to track all of these touchpoints to measure the success of their campaigns and the quality of their leads. ETL consolidates this data so that marketing leaders and operations professionals can record these metrics. ETL also enables these stakeholders to query the datasets to discover new insights into what is driving prospects to convert. In other words, ETL powers database-driven marketing.
ETL Supports Modern Data Strategies
The volume of data available to organizations is ballooning with the adoption of digital technologies. In addition, new methods are emerging for interacting with data in transit, such as machine learning and AI. ETL supports these modern data strategies and remains a cornerstone of data management. ETL provides a single source of truth for an organization to analyze its performance and operations, make informed decisions about its goals, and move the mission forward.