To understand data pipelines, let's start with a tasty metaphor. A pizza delivery service prepares your pizza at a restaurant, transports it with a driver, then delivers it to you. A data pipeline serves the same role with data: it collects the data from a source, transports it through the pipeline, and delivers it to a destination.
This is not a perfect metaphor because many data pipelines will transform the data in transit. But it does highlight the primary purpose of data pipelines: to move data as efficiently as possible, like the pizza delivery service. The goal is to produce business intelligence by ensuring data is available for analysis.
Moving data from one location to another may sound like a simple task, but as the volume of data produced increases every year and new sources of data are introduced — such as the Internet of Things — collecting all this information to create actionable insights is more challenging than ever. That's why organizations build data pipelines to achieve data integration.
What is a data pipeline?
A data pipeline is a combination of actions and tools that move data between a source and a destination. Pipelines often include steps that transform and process the data between its origin and its target to ensure data quality and enable deeper analysis. A single pipeline can draw from one or multiple data sources.
Before continuing, it's worth noting that data pipelines are sometimes referred to as data connectors. Data pipelines may also have the same source and destination. In this scenario, the pipeline's role is to transform and process the data before returning it for storage.
The extract, transform, load (ETL) process is a common type of data pipeline within organizations. It is used to extract data from disparate sources across the enterprise and load it into a central data warehouse. Before it's ingested, the data is transformed and processed to ensure it meets quality standards and is formatted to expedite analysis.
ETL's main benefits are that it breaks down data silos and provides a single source of truth for the operations of the business. Additional pipelines can then share relevant data from the central datastore with business operations systems to support their functions.
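To make the three ETL stages concrete, here is a minimal sketch in Python. It is an illustration only: the `CRM_ROWS` records, the `customers` table, and the function names are hypothetical, and an in-memory SQLite database stands in for the central data warehouse.

```python
import sqlite3

# Hypothetical source records; in practice these would come from a CRM, an API, etc.
CRM_ROWS = [
    {"email": " Ada@Example.com ", "plan": "pro"},
    {"email": "bob@example.com", "plan": "FREE"},
    {"email": "", "plan": "pro"},  # fails validation: no email
]


def extract():
    """Extract: pull raw records from the source."""
    return list(CRM_ROWS)


def transform(rows):
    """Transform: drop invalid rows and normalize formats before loading."""
    cleaned = []
    for row in rows:
        email = row["email"].strip().lower()
        if not email:
            continue  # skip records that do not meet the quality standard
        cleaned.append((email, row["plan"].lower()))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows to the central warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers (email TEXT PRIMARY KEY, plan TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO customers VALUES (?, ?)", rows)
    conn.commit()


def run_etl(conn):
    load(transform(extract()), conn)


warehouse = sqlite3.connect(":memory:")  # stand-in for a data warehouse
run_etl(warehouse)
loaded = warehouse.execute(
    "SELECT email, plan FROM customers ORDER BY email"
).fetchall()
print(loaded)
```

Keeping each stage in its own function makes it easier to swap a source or a destination without touching the transformation logic.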
A second type of data pipeline commonly used within organizations is an analytics pipeline. Where ETL is focused on data integration, the analytics pipeline focuses on cleansing and processing incoming data to deliver actionable insights to the destination system. In summary, an analytics pipeline is built to, well, analyze data.
Benefits of Building Data Pipelines
Data pipelines have many benefits for a business beyond accomplishing their central task of moving data. These advantages are reviewed below.
1. Maximize IT resources.
Data pipelines automate the flow of data so that the engineering team can spend less time moving data manually and invest more time into optimizing the pipeline and streamlining analysis workflows. The smooth transfer of data from source to target means that analysts and stakeholders will always have the best information to draw conclusions and make decisions.
2. Reduce human error.
Given the various sources and formats of data an engineer may have to account for, manual data integration can be a complex task. The more complexity, the greater the risk of human error. Data pipelines reduce this risk with pre-defined workflows and tools to continuously manage the flow of data.
3. Increase visibility.
Data at its source only offers a partial view of an organization's health. Combining this data with other relevant sources provides a complete picture.
For example, your CRM may record a download for a product info sheet, but your website analytics tool shows that this user received a 404 error when trying to view the PDF. Without this additional context, you may not realize the problem until you receive feedback from a prospect, and who knows how many downloads will have bounced before someone speaks up?
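The CRM and web analytics scenario above can be sketched as a simple join between two sources. The record layouts, the field names, and the `downloads_with_status` helper below are assumptions made for illustration; real CRM and analytics exports will look different.

```python
# Hypothetical CRM export: one record per recorded download.
crm_downloads = [
    {"user": "u1", "asset": "info-sheet.pdf"},
    {"user": "u2", "asset": "info-sheet.pdf"},
]

# Hypothetical web analytics export: one record per HTTP request.
web_requests = [
    {"user": "u1", "path": "/files/info-sheet.pdf", "status": 200},
    {"user": "u2", "path": "/files/info-sheet.pdf", "status": 404},
]


def downloads_with_status(downloads, requests):
    """Attach the HTTP status from web analytics to each CRM download."""
    status_by_key = {
        (r["user"], r["path"].rsplit("/", 1)[-1]): r["status"] for r in requests
    }
    return [
        {**d, "status": status_by_key.get((d["user"], d["asset"]))}
        for d in downloads
    ]


combined = downloads_with_status(crm_downloads, web_requests)
broken = [d for d in combined if d["status"] == 404]
print(broken)
```

Neither source reveals the problem on its own; the failed download only shows up once the two are combined.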
Now that you understand what data pipelines are and why they matter, let's examine the process for moving the data from point A to point B.
Data Pipeline Process
Data pipelines are architected with a starting point — the source — and an endpoint — the destination. In most cases, they will also have built-in steps along the way for transforming and processing the data. Let's examine this process in more detail below.
Sources are the origin of data. They can be internal or external and in various formats. Common sources for data pipelines include relational databases, CRMs, SaaS tools, web apps, and IoT sensors.
Source data may be collected through API calls, push mechanisms, or webhooks, and these extractions may be performed at scheduled intervals or in real time.
During transformation, the source data is manipulated and changed to meet requirements and standards before it's loaded to the target system.
This manipulation usually involves several substeps, including sorting, validating, verifying, mapping, and cleansing the data. These actions can be classified as constructive, destructive, or aesthetic depending on whether they add, delete, or reformat data.
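As a rough illustration of the three categories, the sketch below applies one transformation of each kind to a single record. The record fields and function bodies are hypothetical examples, not a standard API.

```python
def constructive(record):
    """Constructive: add data, here a derived full_name field."""
    return {**record, "full_name": f"{record['first']} {record['last']}"}


def destructive(record):
    """Destructive: delete data, here an internal-only field."""
    return {k: v for k, v in record.items() if k != "internal_note"}


def aesthetic(record):
    """Aesthetic: reformat data without changing what it means."""
    return {**record, "country": record["country"].strip().upper()}


raw = {"first": "Ada", "last": "Lovelace", "country": " gb ", "internal_note": "test"}
result = aesthetic(destructive(constructive(raw)))
print(result)
```

Each function returns a new dictionary rather than editing the input, so the raw record stays available if a later step needs it.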
Processing refers to the approach taken to data transformation, specifically around frequency and resource allocation.
Batch processing is the more traditional method, which takes place at scheduled intervals and requires the data to be downloaded before transformation. Stream processing is a newer approach that transforms the data in transit, which supports real-time analysis and does not require downloading the data. These concepts will be revisited in a later section.
A destination is the endpoint of the pipeline after the data has been processed. Two common types of datastores are data warehouses, which require structured data, and data lakes, which accept structured or unstructured data. The destination may also be referred to as a sink.
Note that some pipelines will transform and process the data after it has reached its destination, such as with ETL's companion model extract, load, transform (ELT). Other data pipeline tools may simply move data from point A to point B without performing any operations on the data.
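The ELT ordering can be sketched as follows: raw data is loaded first, and the transformation then runs inside the destination. This is a minimal sketch that assumes an in-memory SQLite database as the destination; the `raw_orders` and `orders_by_customer` tables and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the destination datastore

# Load: raw records land in the destination untouched.
conn.execute("CREATE TABLE raw_orders (customer TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [("acme", "20.50"), ("ACME ", "4.50"), ("globex", "42.00")],
)

# Transform: runs inside the destination, after loading.
conn.execute(
    """
    CREATE TABLE orders_by_customer AS
    SELECT LOWER(TRIM(customer)) AS customer,
           ROUND(SUM(CAST(amount AS REAL)), 2) AS total
    FROM raw_orders
    GROUP BY LOWER(TRIM(customer))
    """
)

totals = conn.execute(
    "SELECT customer, total FROM orders_by_customer ORDER BY customer"
).fetchall()
print(totals)
```

Because the raw table is kept, a new transformation can be written later without extracting the data again.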
Next, let's examine three main architectures that organizations use to design their data pipelines.
Data Pipeline Architecture
No matter which architecture you select, the result should have the same characteristics:
- High availability: your data should always be available without interruptions or losses.
- Elasticity: your pipeline should have the resources needed to withstand increased demands or unexpected events without impacting data integrity or availability.
- Self-service options: your pipeline should offer dashboards, visualizations, and tools so that all employees can access data and query datasets for new insights, not just data analysts and engineers.
- Democracy of data: the more your data pipelines can share information with different parts of the organization, the better individual employees and teams can inform their daily decisions.
There are three main designs to architect data pipelines, starting with batch processing.
1. Batch Processing
Batch processing is the traditional method of data processing. In this approach, data is extracted from sources and then downloaded for transformation before loading it to the target datastore, where it can then be queried and analyzed.
Batch processing is typically scheduled because the resources needed for transformation would not scale to support a constant flow of data, which is where stream processing shines.
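A batch run might look something like the sketch below, where records accumulate in a staging file and are processed together in one pass. The `STAGED_CSV` contents and the `run_batch` function are hypothetical; in practice a scheduler such as cron would trigger the run at each interval.

```python
import csv
import io

# Hypothetical staging file: records accumulate here between scheduled runs.
STAGED_CSV = """order_id,amount
1,10.00
2,oops
3,5.50
"""


def run_batch(staged_file):
    """Process everything staged since the last run in a single pass."""
    valid, rejected = [], []
    for row in csv.DictReader(staged_file):
        try:
            valid.append(
                {"order_id": int(row["order_id"]), "amount": float(row["amount"])}
            )
        except ValueError:
            rejected.append(row)  # keep bad rows for later inspection
    return valid, rejected


valid_rows, rejected_rows = run_batch(io.StringIO(STAGED_CSV))
print(len(valid_rows), "loaded,", len(rejected_rows), "rejected")
```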
2. Stream Processing
Stream processing is a newer approach which takes advantage of processing power — mainly the cloud — to transform data in real time. Stream processors receive data as it is generated at its source, process it, and distribute it to the destination in one continuous flow.
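The continuous flow described above can be approximated with Python generators, which hand each event to the next stage as soon as it is produced. This is a simplified sketch: `sensor_source`, the event fields, and the list used as a destination are assumptions, and a production system would typically use a dedicated streaming platform instead.

```python
import json
from typing import Iterable, Iterator


def sensor_source() -> Iterator[str]:
    """Hypothetical source: yields one JSON event at a time as it is generated."""
    for celsius in (20.0, 25.0, 85.0):
        yield json.dumps({"sensor": "s1", "celsius": celsius})


def process(events: Iterable[str]) -> Iterator[dict]:
    """Transform each event in transit, without collecting the full dataset first."""
    for raw in events:
        event = json.loads(raw)
        event["fahrenheit"] = event["celsius"] * 9 / 5 + 32
        event["alert"] = event["celsius"] > 80
        yield event


def deliver(events: Iterable[dict], sink: list) -> None:
    """Send each processed event to the destination as soon as it is ready."""
    for event in events:
        sink.append(event)


destination: list = []
deliver(process(sensor_source()), destination)
print(destination)
```

No stage waits for the full dataset: each event is extracted, transformed, and delivered before the next one is read.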
Stream processing supports real-time analysis and business intelligence. However, its shortcoming is that it does not usually capture historical context, and it's not always possible to build data streams with legacy technologies such as mainframes. This is why businesses have combined both approaches.
3. Lambda Architecture
The Lambda Architecture joins the batch and stream processing methods to support real-time streaming and historical analysis. This approach also encourages the storage of raw data so that future pipelines can transform the data to support new queries and analyses that provide deeper insights into historical and current data.
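A toy version of this idea is sketched below: a batch view is recomputed from all stored raw events, a speed layer counts events that arrived since the last batch run, and a query merges the two. The page-view events, the `SpeedLayer` class, and the `query` function are hypothetical names chosen for illustration.

```python
from collections import Counter

# Raw events are stored as-is so future batch runs can reprocess them.
master_dataset = [
    {"page": "/pricing", "ts": 1},
    {"page": "/pricing", "ts": 2},
    {"page": "/blog", "ts": 3},
]


def build_batch_view(events):
    """Batch layer: recompute page-view counts from all stored raw events."""
    return Counter(e["page"] for e in events)


class SpeedLayer:
    """Speed layer: incrementally counts events that arrived after the last batch run."""

    def __init__(self):
        self.realtime_view = Counter()

    def ingest(self, event):
        self.realtime_view[event["page"]] += 1


def query(page, batch_view, speed_layer):
    """Serving step: merge the historical and real-time views to answer a query."""
    return batch_view[page] + speed_layer.realtime_view[page]


batch_view = build_batch_view(master_dataset)
speed = SpeedLayer()

for event in [{"page": "/pricing", "ts": 4}, {"page": "/docs", "ts": 5}]:
    master_dataset.append(event)  # raw data is still stored for the next batch run
    speed.ingest(event)

print(query("/pricing", batch_view, speed))
```

After the next batch run absorbs the new events, the speed layer's view would be reset so that nothing is counted twice.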
Data Pipeline Tools
Below is a selection of the tools available to build data pipelines.
Let's examine each in more detail.
1. AWS Data Pipeline
Price: Free with paid plans available
AWS Data Pipeline is a web service focused on building and automating data pipelines. The service integrates with the full AWS ecosystem to enable storage, processing, and reporting. AWS Data Pipeline is fault tolerant, repeatable, and highly available, and it supports data pipelines from on-premises sources to the cloud and the reverse, ensuring your data is always available when and where you need it.
2. Apache Airflow
Apache Airflow is a platform to build, schedule, and monitor data pipeline workflows. Airflow is scalable, dynamic, and extensible, allowing you to extend its libraries to fulfill your organization's unique use cases. The platform offers a robust web application and lets you define workflows in Python code rather than on the command line. In addition, Airflow integrates with multiple cloud services, including Google Cloud Platform, Microsoft Azure, and AWS.
3. Integrate.io
Price: Free trial with paid plans available
Integrate.io is a data integration platform built from four established data tools: Xplenty, DreamFactory, FlyData, and Intermix.io. The platform offers ETL, ELT, and reverse ETL pipeline capabilities; API creation to support data consumption in applications and systems; and analytics on your data warehouse's metadata for deeper insights into available data for querying and analysis.
4. Fivetran
Price: Free trial with paid plans available
Fivetran is a data integration platform providing a fully managed ELT architecture for organizations looking to store data in the cloud. It offers maintenance-free pipelines and ready-to-query datasets to provide instant access to the data needed to drive decisions. The platform also provides robust data security and supports user-provided custom code to query datasets and build workflows, offering full control of your data pipelines.
Data Pipeline Examples
The use cases for data pipelines are as broad as the use cases for data itself. Any time an individual, team, or organization needs to extract or interact with data outside of its source, a data pipeline is needed.
1. User Groups
Customer data is invaluable for an organization to inform marketing and sales activity. If a company can identify its core user group and what attributes define that group, the marketing team can judge leads against that user persona to rank prospects and better guide the sales team's outreach.
Building data pipelines from point of sale (POS) systems, CRMs, application monitoring tools, and other sources will help the organization better understand its core users' needs, wants, habits, and pain points.
2. Ad Analytics
A digital ad may be served on a social platform such as LinkedIn, which then points back to a landing page on your website. If the prospect converts, they are taken to a POS system to complete the transaction.
To measure the performance of this ad, you need data pipelines to extract data from these three applications and aggregate it. Not only will this show how much revenue your ad campaign is driving, but engagement metrics may also show where prospects are encountering friction and dropping out so you can optimize the user experience and improve the conversion rate.
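Once the data from the three applications has been extracted, the aggregation step could look roughly like this. The sketch assumes that all three systems share a `click_id` that ties a purchase back to the original ad click; that identifier, the sample records, and the `funnel_report` function are hypothetical.

```python
# Hypothetical exports from the three applications, keyed by a shared click ID.
ad_clicks = [{"click_id": "c1"}, {"click_id": "c2"}, {"click_id": "c3"}, {"click_id": "c4"}]
landing_page_visits = [{"click_id": "c1"}, {"click_id": "c2"}, {"click_id": "c3"}]
pos_orders = [
    {"click_id": "c1", "revenue": 40.0},
    {"click_id": "c3", "revenue": 60.0},
]


def funnel_report(clicks, visits, orders):
    """Aggregate the three sources into one view of the ad's funnel."""
    clicked = {c["click_id"] for c in clicks}
    visited = {v["click_id"] for v in visits} & clicked
    purchased = [o for o in orders if o["click_id"] in visited]
    return {
        "clicks": len(clicked),
        "landing_page_visits": len(visited),
        "orders": len(purchased),
        "revenue": sum(o["revenue"] for o in purchased),
        "conversion_rate": len(purchased) / len(clicked) if clicked else 0.0,
    }


report = funnel_report(ad_clicks, landing_page_visits, pos_orders)
print(report)
```

Comparing the counts at each stage shows where prospects drop out: here, one click never reached the landing page and one visitor did not complete a purchase.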
3. Microservices
A newer trend in the development space is microservices. Microservices are lean applications that serve very specific use cases, which simplifies debugging and speeds up task execution.
However, this means that data that would normally be shared between components in a platform must now be shared between multiple separate applications, increasing the number of dependencies in a system. This complexity requires optimized pipelines that efficiently move data between microservices to ensure productivity does not suffer.
Data Pipelines Provide Deeper Insights
Data pipelines are a key component of a modern data strategy. They connect data from across the enterprise with the stakeholders that need it. Efficient movement of data supports deep analysis to discover patterns and uncover new insights that support both strategic planning and daily decisions.
There are multiple architectures to design your workflows and numerous tools to build your pipeline. The most important step is to realize the value of the data your organization possesses and start finding new ways to leverage it to move your business forward.