Recently, I looked up at-home COVID tests and saw that they were in stock at a local pharmacy. I walked to the first location and was informed they were sold out. Then, I looked up another location and saw they had tests in stock, only to find out after visiting that store that they were also out of stock.
What happened here was a breakdown in the flow of information. The product page provided helpful insight regarding which locations had the test in stock, which was a great feature. But the stock information itself was not up to date, meaning that this seemingly handy feature had actually led me astray.
If this retailer had implemented a data stream, then the availability of these tests could be updated instantly based on stock and purchase data. This way, customers could make informed decisions and avoid the same frustration that I experienced.
This scenario speaks to the challenge of timeliness with data streams, which will be examined later. Let's start by defining data streaming.
What is data streaming?
Data streaming is the process of continuously collecting data as it's generated and moving it to a destination. This data is usually handled by stream processing software that analyzes, stores, and acts on it. Data streaming combined with stream processing produces real-time intelligence.
Data streams can be created from various sources in any format and in any volume. The most powerful data streams aggregate multiple sources together to form a complete picture of different operations, processes, and more.
For example, network, server, and application data can be combined to monitor the health of your website and detect performance drops or outages for quick remediation.
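As an illustrative sketch of that example, the snippet below combines three metric streams (network latency, server CPU, and application error rate) into a single health verdict. The function name, thresholds, and sample readings are all invented for illustration, not taken from any real monitoring product:

```python
# Hypothetical sketch: combining network, server, and application
# metrics into one health signal. Thresholds are invented for
# illustration only.

def health_status(network_latency_ms, cpu_percent, error_rate):
    """Combine three metric streams into a single health verdict."""
    if error_rate > 0.05 or network_latency_ms > 500:
        return "outage-risk"
    if cpu_percent > 85 or network_latency_ms > 200:
        return "degraded"
    return "healthy"

# Each tuple is one tick of the combined stream:
# (network latency in ms, server CPU %, application error rate)
ticks = [(40, 55, 0.001), (220, 70, 0.01), (600, 90, 0.08)]
statuses = [health_status(*t) for t in ticks]
print(statuses)  # ['healthy', 'degraded', 'outage-risk']
```

The point is the aggregation: no single source tells the whole story, but together they flag a performance drop as soon as it appears in the stream.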
This video reviews the concept of data streaming and also provides an introduction to batch processing, which will be examined later in this section:
Streaming the data is only half the battle. You also need to process that data to derive insights.
Stream processing software is configured to ingest the continual flow of data down the pipeline and analyze it for patterns and trends. Stream processing may also include data visualization for dashboards and other interfaces so that data teams can monitor these streams.
Data streams and stream processing are combined to produce real-time or near real-time insights. To accomplish this, stream processors need to offer low latency so that analysis happens as quickly as data is received. A drop in performance by the stream processor can lead to a backlog or data points being missed, threatening data integrity.
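A minimal sketch of that per-event model is below: each event updates a running aggregate the moment it arrives, rather than waiting for a complete batch. The generator here is a stand-in for a real feed such as a message-queue consumer, and the price events are made up:

```python
# Minimal sketch of a stream processor: events are consumed one at a
# time and a running aggregate is updated immediately on arrival.

def event_stream():
    """Stand-in for a live feed (e.g. a message-queue consumer)."""
    for price in [101.0, 102.5, 99.8, 103.2]:
        yield {"symbol": "XYZ", "price": price}

def process(stream):
    """Update a running average the moment each event is received."""
    count, total = 0, 0.0
    for event in stream:
        count += 1
        total += event["price"]
        running_avg = total / count  # fresh answer after every event
    return running_avg

print(process(event_stream()))  # 101.625
```

Because the aggregate is recomputed per event, any slowdown in this loop shows up directly as latency, which is why a backlogged processor threatens data integrity.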
Stream processing software also needs to scale to meet expected and unexpected computing demand. If there's a spike in traffic to your website, you don't want to lose user behavior data because your processor was only configured to handle the average level of interactions at a given time.
Stream processors should also be highly available, meaning they can continue to perform tasks even if components fail. If processors don't have redundancies built in to handle failures, they will inevitably encounter a situation where a single error crashes the entire system. This reduces your data quality, since the stream goes unanalyzed for as long as the outage persists.
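One common redundancy pattern can be sketched as a failover: if the primary processing path raises an error, a standby path takes over so the stream keeps flowing instead of the whole pipeline crashing. Both "processors" below are toy functions invented for the example:

```python
# Illustrative failover sketch: a standby path catches events the
# primary path cannot handle, so one bad event never stops the stream.

def primary(event):
    if event.get("malformed"):
        raise ValueError("primary cannot parse event")
    return ("primary", event["value"])

def standby(event):
    # A more defensive fallback path.
    return ("standby", event.get("value", 0))

def process_with_failover(events):
    results = []
    for event in events:
        try:
            results.append(primary(event))
        except ValueError:
            results.append(standby(event))  # redundancy: no data lost
    return results

events = [{"value": 1}, {"value": 2, "malformed": True}, {"value": 3}]
print(process_with_failover(events))
```

Real systems achieve this with replicated workers and checkpointing rather than a try/except, but the principle is the same: a component failure degrades the path, not the pipeline.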
Benefits of Data Streaming
The main benefit of data streaming is real-time insight. In the Information Age, new data is constantly being created. The best organizations will take advantage of the latest information from internal and external assets to inform their decisions, both in day-to-day operations and in overall strategy.
Let's examine a few more benefits of data streaming.
Gain a Competitive Edge
The ability to quickly collect, analyze, and act on current data will give companies a competitive edge in their marketplace. Real-time intelligence makes organizations more responsive to market trends, customer needs, and business opportunities. As the pace of business increases with digitalization, this responsiveness can be a distinguishing feature.
Increase Customer Satisfaction
Customer feedback is a valuable litmus test for what an organization is doing right and where it can improve. The faster a company can respond to customer complaints and provide resolution, the better its reputation will be. This speed pays dividends when it comes to word-of-mouth advertising and online reviews that can be the deciding factor for attracting new prospects and converting them to customers.
Prevent Losses
Not only does data streaming support customer retention, but it prevents other losses as well. Real-time intelligence can provide warnings of impending issues such as system outages, financial downturns, data breaches, and other issues that negatively affect business outcomes. With this information, companies can prevent or at least mitigate the impact of these events.
Next, let's review the differences between stream processing and traditional batch processing.
Batch Processing vs. Stream Processing
Batch processing requires data to be collected and downloaded before it is analyzed and stored. In contrast, stream processing continuously ingests and analyzes data. Batch processing collects and processes data in discrete increments, whereas stream processing happens continuously as data arrives.
Stream processing is the preferred method where speed is a major factor. Batch processing is implemented in scenarios where real-time intelligence is not necessary or the data cannot be converted into a data stream for immediate analysis, such as when working with legacy technologies like mainframes.
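The contrast can be sketched in a few lines: the batch version produces one answer only after all the data is in, while the streaming version maintains an up-to-date answer at every step. The readings are arbitrary example values:

```python
# Side-by-side sketch of the two models over the same data.

readings = [3, 5, 2, 8, 4]

# Batch: accumulate everything first, then analyze once.
def batch_total(data):
    collected = list(data)       # download/accumulate the full set
    return sum(collected)        # single analysis pass at the end

# Stream: fold each reading in as it arrives.
def stream_totals(data):
    total, snapshots = 0, []
    for value in data:
        total += value
        snapshots.append(total)  # a current answer at every step
    return snapshots

print(batch_total(readings))    # 22, available only after all data is in
print(stream_totals(readings))  # [3, 8, 10, 18, 22], updated as data arrives
```

Both end at the same total; the difference is when an answer is available, which is exactly why streaming wins where speed matters.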
This video takes an in-depth look at these two concepts and their use cases:
Data Stream Examples
Data streams can be built to capture data of all types. The key is to identify data that's critical to track on a real-time basis. Examples include location data, stock prices, IT system monitoring, fraud detection, retail inventory, sales, customer activity, and more.
The following companies use some of these data types to power their business activity.
Lyft is a ride-sharing app that requires up-to-the-second data to accurately match riders with drivers. When the rider first opens the app and inputs their destination, Lyft displays the current availability of vehicles and prices for different levels of service based on distance, demand, and traffic conditions. These factors can all change in seconds, meaning that Lyft needs to have that data available instantly to set accurate expectations with the user.
Once the rider has selected a service level, Lyft then aggregates data on available vehicles in this category and considers distance to the rider, whether the driver is free or conducting another dropoff, and expected time of arrival to match the best driver to the rider. These metrics are powered by additional GPS and traffic data.
Finally, when the ride is underway, location data is reported from the driver's phone so Lyft can track the driver's progress and location, match the driver with other ride requests, and gain another view into traffic conditions. Lyft has fine-tuned its processors to accept all these streams of data and aggregate them to provide the best possible experience for its customers.
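The core of that matching step can be sketched as picking the nearest free driver from a live location stream. Lyft's real matching is proprietary and far more sophisticated (ETA models, traffic, in-progress dropoffs), so everything below, including the driver records and straight-line distance, is a simplified illustration:

```python
# Toy illustration of rider-driver matching from live location data.
# All data structures and values are invented for the example.

import math

def distance(a, b):
    """Straight-line distance between two coordinate pairs."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def match_driver(rider_pos, drivers):
    """Pick the closest driver whose status is 'free'."""
    free = [d for d in drivers if d["status"] == "free"]
    if not free:
        return None
    return min(free, key=lambda d: distance(rider_pos, d["pos"]))

drivers = [
    {"id": "d1", "pos": (0.0, 3.0), "status": "free"},
    {"id": "d2", "pos": (1.0, 1.0), "status": "dropoff"},  # excluded
    {"id": "d3", "pos": (2.0, 2.0), "status": "free"},
]
print(match_driver((0.0, 0.0), drivers)["id"])  # d3
```

Note that d2 is nearest by distance but busy with a dropoff, so the stream's status field, not position alone, drives the match.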
According to Statista, more than 500 hours of video are uploaded to YouTube every minute. That's a massive stream of data being processed and stored every hour of the day.
Given the large file size of videos, YouTube's infrastructure needs to be configured for high availability to support its creators' content. Then, of course, it needs to deliver that data in the opposite direction to those consuming the content, in addition to tracking and displaying view counts, comments, subscribers, and other metrics in real time.
YouTube also supports live videos where content creators and viewers can interact with each other through a real-time video feed and chat, making instant data transfer even more critical to ensure the conversation continues without disruption.
Speaking of YouTube, the presenter in this video walks through how to create an example data stream using PowerShell and Power BI:
Data Stream Challenges to Consider
Data streaming opens a world of possibilities, but it also comes with challenges to keep in mind as you incorporate real-time data into your applications.
Not only does data need to be accessed at the time it's recorded, but it also needs to be logged in a datastore that will retain this information for historical context. It's great that a customer has renewed their subscription, but if you can't view previous subscription periods, then you won't have the full picture of their purchase history and could miss opportunities to offer other products or services that are valuable to the user.
Data from streams goes stale quickly, so it's critical that your application stays current with the latest information and updates its state accordingly. For example, you don't want a user to add items to a cart in one tab only to find the cart empty when they open the site in another tab.
The volume of the data stream can be massive, so it's important to ensure your storage and processing tools are ready to perform. You don't want to lose valuable data because a temporary spike in volume or system outage led to your infrastructure becoming overwhelmed. This means that it's critical to build failsafes into your system to provision extra computing and storage resources to handle surges.
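One simple failsafe against volume spikes can be sketched as a bounded buffer in front of the processor: when arrivals outpace processing, events queue instead of being dropped, and a watermark signals when extra capacity should be provisioned. The capacity and threshold values here are illustrative:

```python
# Sketch of a spike failsafe: a bounded buffer with a scale-up
# watermark. Capacity and threshold values are invented examples.

from collections import deque

class SpikeBuffer:
    def __init__(self, capacity=1000, scale_up_at=0.8):
        self.queue = deque()
        self.capacity = capacity
        self.scale_up_at = scale_up_at
        self.dropped = 0

    def offer(self, event):
        """Accept an event, or count it as dropped if truly full."""
        if len(self.queue) >= self.capacity:
            self.dropped += 1     # last resort; ideally never reached
            return False
        self.queue.append(event)
        return True

    def needs_more_capacity(self):
        """Watermark check: time to provision extra resources?"""
        return len(self.queue) >= self.capacity * self.scale_up_at

buf = SpikeBuffer(capacity=10, scale_up_at=0.8)
for i in range(9):                # simulate a burst of events
    buf.offer(i)
print(buf.needs_more_capacity())  # True: scale out before data is lost
```

Production systems get this behavior from message brokers and autoscaling rather than an in-process deque, but the idea is the same: absorb the surge, and grow capacity before the buffer overflows.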
With data constantly being collected, it can be easy to only prioritize the latest data. However, historical context is important. Recording a sequence of customer interactions in your CRM, for example, offers deeper insights than seeing that a person visited one web page. Instead, you see that they've visited a product web page after downloading two related eBooks and viewing a demo of the product. Now, their interest in the product is much clearer to you and your sales team.
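That CRM example can be sketched as an append-only history log: each streamed interaction is recorded as it happens, so the full ordered journey survives rather than just the latest event. The customer ID, action names, and structure are invented for illustration:

```python
# Sketch of retaining historical context: append every streamed
# interaction so the whole sequence can be reconstructed later.

history = []

def record(customer_id, action):
    """Append one streamed interaction to the history log."""
    history.append({"customer": customer_id, "action": action})

def journey(customer_id):
    """Reconstruct the ordered sequence of one customer's actions."""
    return [e["action"] for e in history if e["customer"] == customer_id]

record("c42", "downloaded_ebook_1")
record("c42", "downloaded_ebook_2")
record("c42", "viewed_demo")
record("c42", "visited_product_page")
print(journey("c42"))
```

A lone "visited_product_page" event says little, but the reconstructed sequence shows a warming lead, which is exactly the insight the latest data point alone would miss.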
Power modern businesses with data streaming.
Data streaming is a crucial piece of modern businesses, providing real-time intelligence to guide decision making and allowing the organization to respond to changing conditions. As digitalization increases the pace of business and data volumes expand, the best companies will position themselves to take advantage of these opportunities with data streams and deliver new insights at scale.