The Data Journey

During the last few years I've observed that many organizations are in the middle of a three-step process that typically takes years: the Data Journey.

During the last few years, I've had the chance to talk to many retailers and other large tech companies dealing with data.

I've observed the same pattern again and again: these organizations are in the middle of a three-step process that usually takes more than five years:

1. Data Consolidation

Migrating a platform to a microservices ecosystem means you'll generate tons of information silos: every service has its own private repository.

Most analytical use cases require merging information from multiple services, and this usually gets challenging given that each service may use a different database technology for its data repository. The difficulty of solving analytical use cases grows quickly as the number of services increases.

That's why the first step of this process is consolidating that data in a data lake, ideally in the cloud for scalability and flexibility.
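To make the consolidation step concrete, here's a minimal sketch of the pattern: a scheduled job reads the relevant tables from each service's private database and lands them as Parquet files in a cloud data lake. The connection strings, table names, and bucket below are hypothetical placeholders.

```python
# Minimal sketch: copy each microservice's tables into a cloud data lake.
# DSNs, table names, and the bucket path are hypothetical placeholders.
import pandas as pd
from sqlalchemy import create_engine

# One entry per service silo; each service may use a different database technology.
SERVICE_SOURCES = {
    "orders": ("postgresql://orders-db/orders", ["orders", "order_items"]),
    "catalog": ("mysql+pymysql://catalog-db/catalog", ["products"]),
}

LAKE_ROOT = "s3://my-data-lake/raw"  # hypothetical bucket


def export_service_tables() -> None:
    """Copy each service's tables into the data lake, partitioned by service."""
    for service, (dsn, tables) in SERVICE_SOURCES.items():
        engine = create_engine(dsn)
        for table in tables:
            df = pd.read_sql_table(table, engine)
            # Writes one Parquet file per table (requires pyarrow + s3fs).
            df.to_parquet(f"{LAKE_ROOT}/{service}/{table}.parquet", index=False)


if __name__ == "__main__":
    export_service_tables()
```

In practice this is usually handled by an orchestrator or a managed ingestion tool, but the shape of the job is the same: read from the silo, write to the lake.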

2. Data Modeling

Once the information is all together in a data lake, organizations need to model and normalize it so that analysts can get insights and run queries.

The same entity may be referenced using different codes across services, along with other non-standard conventions.
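As a toy illustration of that problem, here's a sketch of resolving service-specific codes to a single canonical entity ID; the codes and mapping are made up for the example.

```python
# Toy sketch: the same product carries different codes in different services,
# so modeling maps them onto one canonical ID. Codes below are hypothetical.
CANONICAL_PRODUCT_ID = {
    ("orders", "SKU-0042"): "prod_42",
    ("catalog", "42"): "prod_42",
    ("reviews", "item-42"): "prod_42",
}


def canonical_id(service: str, raw_code: str) -> str:
    """Resolve a service-specific code to the canonical entity ID."""
    try:
        return CANONICAL_PRODUCT_ID[(service, raw_code)]
    except KeyError:
        raise ValueError(f"No canonical mapping for {service}:{raw_code}")


print(canonical_id("catalog", "42"))  # -> "prod_42"
```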

This modeling work is ideally organized by domain and entity, where every vertical domain is accountable for modeling the entities it owns on behalf of the rest of the company. This approach fits quite well with the Data Mesh principles.

Ideally, teams will have tools that guarantee a single source of truth for definitions and logic, plus some level of automation and observability, such as dbt. Given these premises, it's easy to see why this tool is growing in popularity.
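For instance, a sketch of wiring those version-controlled models into a scheduled job might look like the following; the project path and model selector are hypothetical, while `dbt run --select` is the standard dbt CLI.

```python
# Minimal sketch: trigger a dbt build from a scheduled job so the shared
# models stay the single source of truth. Path and selector are hypothetical.
import subprocess


def run_core_models() -> None:
    """Build the shared models owned by the domain teams."""
    subprocess.run(
        ["dbt", "run", "--select", "tag:core"],
        cwd="/opt/analytics/dbt_project",  # hypothetical project location
        check=True,
    )


if __name__ == "__main__":
    run_core_models()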

Completing this step enables analytics and ML teams, among others, providing huge value. Now your organization has a single source of truth and can run on-demand queries with "the power of the cloud". That's a huge leap forward.
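An on-demand query against the modeled data is then a few lines of code; here's a sketch using the Snowflake Python connector as an example warehouse client, with hypothetical credentials and table names.

```python
# Minimal sketch: an analyst running an on-demand query against the modeled
# data. Account, credentials, and table names are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="analyst",
    password="***",
    warehouse="ANALYTICS_WH",
    database="ANALYTICS",
    schema="CORE",
)

cur = conn.cursor()
cur.execute(
    """
    SELECT order_date, SUM(amount) AS revenue
    FROM fct_orders
    GROUP BY order_date
    ORDER BY order_date DESC
    LIMIT 30
    """
)
for order_date, revenue in cur.fetchall():
    print(order_date, revenue)

cur.close()
conn.close()
```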

3. Publication

Most companies are already somewhere between steps one and two, but they really struggle with this last one: "OK, now you have your data modeled on, say, Snowflake; how do I build a dashboard on top of that? How do I consume these metrics or insights in near real-time?"

The first attempt is typically building a REST data service on top of Snowflake, BigQuery, Redshift, etc. I've seen this countless times, always with the same poor results. These products are great for analytical, long-running queries, but they're just not meant for interactive use cases: they don't offer low latency and high concurrency by design; they're built for something else.
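To see why that pattern breaks down, here's a sketch of the anti-pattern: every HTTP request runs a synchronous warehouse query, so response time and throughput are capped by the warehouse. Endpoint, credentials, and table names are hypothetical.

```python
# Sketch of the "REST service straight on top of the warehouse" anti-pattern:
# each request pays full warehouse latency, and concurrent requests quickly
# exhaust the warehouse's concurrency slots. Names are hypothetical.
import snowflake.connector
from flask import Flask, jsonify

app = Flask(__name__)


def query_warehouse(sql: str):
    conn = snowflake.connector.connect(
        account="my_account", user="api_service", password="***",
        warehouse="ANALYTICS_WH", database="ANALYTICS", schema="CORE",
    )
    try:
        cur = conn.cursor()
        cur.execute(sql)
        return cur.fetchall()
    finally:
        conn.close()


@app.route("/metrics/top-products")
def top_products():
    # A synchronous analytical query per request: often seconds of latency.
    rows = query_warehouse(
        "SELECT product_id, SUM(amount) FROM fct_orders "
        "GROUP BY product_id ORDER BY 2 DESC LIMIT 10"
    )
    return jsonify([{"product_id": r[0], "revenue": float(r[1])} for r in rows])
```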

And that's the exact situation where a product like Tinybird provides the most value: making the information you've already built available in real time, with the concurrency and latency you need.
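Consuming a published Tinybird pipe then looks like calling any low-latency JSON API; in the sketch below the pipe name, token, and parameter are hypothetical, while the `https://api.tinybird.co/v0/pipes/<pipe>.json` URL follows Tinybird's published-pipe API format.

```python
# Minimal sketch: consuming a published Tinybird pipe as a low-latency JSON
# API. Pipe name, token, and parameters are hypothetical.
import requests

TB_TOKEN = "***"  # hypothetical read token

resp = requests.get(
    "https://api.tinybird.co/v0/pipes/top_products.json",
    params={"token": TB_TOKEN, "date_from": "2023-11-24"},
    timeout=2,  # interactive use case: fail fast instead of queueing
)
resp.raise_for_status()
for row in resp.json()["data"]:
    print(row)
```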

Operational Analytics

The ultimate benefit you get as an organization when you complete these three steps is enabling real-time Operational Analytics.

This means you can operate your business in real time, based on actual facts and insights from your data platform.

For example, if you're a retailer running a big promotion during Black Friday, you'll be able to re-arrange the items on your website in real time depending on their performance, or hide out-of-stock products. Meeting demand in an optimal way during a time-limited event is huge for increasing revenue.
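A toy sketch of that logic: hide anything out of stock and re-rank the rest by recent conversion. The metric fields are hypothetical and would come from the real-time analytics endpoint.

```python
# Toy sketch of the Black Friday use case: hide out-of-stock items and rank
# the rest by real-time conversion. Metric fields are hypothetical.
from dataclasses import dataclass


@dataclass
class ProductMetrics:
    product_id: str
    views_last_5m: int
    purchases_last_5m: int
    in_stock: bool


def rank_for_homepage(metrics: list[ProductMetrics]) -> list[str]:
    """Hide out-of-stock items and order the rest by recent conversion rate."""
    available = [m for m in metrics if m.in_stock]
    available.sort(
        key=lambda m: m.purchases_last_5m / max(m.views_last_5m, 1),
        reverse=True,
    )
    return [m.product_id for m in available]
```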

Another common use case that gets unlocked is real-time personalization of the user experience, where the website adapts to each user as they interact with it.

A note on Streaming Analytics

There's general agreement that distributed systems increase complexity dramatically. So does asynchronous communication between services.

Once you've shifted to an event-driven paradigm, you'll typically want to ingest all your events through a common hub, for example a Kafka cluster. But unlocking the value and getting insights in near real-time from the information you already have on the platform is challenging.
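The producing side of that hub is straightforward; here's a sketch of a service publishing its domain events to a shared Kafka topic using the kafka-python client, with a hypothetical broker, topic, and event shape.

```python
# Minimal sketch: a service publishing domain events to a shared Kafka hub.
# Broker address, topic, and event payload are hypothetical.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",  # hypothetical broker
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# Every service publishes its events here; analytics consumes them downstream.
producer.send("orders.events", {"type": "order_placed", "order_id": "o-123", "amount": 59.9})
producer.flush()
```

The hard part is the consuming side: turning that stream into fresh, queryable insights.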

There are already a few products for streaming analytics, such as Rockset, ksqlDB, Apache Druid, or Imply. They work great for some streaming use cases, but they all fall a bit short when it comes to high volume and concurrency, complex logic, or multiple joins.

That's because they don't leverage a full OLAP database like ClickHouse, which enables arbitrary time spans (vs. window functions), advanced joins for complex use cases, managed materialized views for rollups, and many other benefits.
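As an example of the rollups mentioned above, here's a sketch of a ClickHouse materialized view that pre-aggregates events as they arrive and can then be queried over any ad-hoc time span; the host, table, and column names are hypothetical, and the snippet uses the clickhouse-driver client.

```python
# Sketch: a ClickHouse materialized view that rolls up raw events on ingest,
# queryable over arbitrary time spans. Host/table/column names are hypothetical.
from clickhouse_driver import Client

client = Client(host="clickhouse")  # hypothetical host

# Pre-aggregate raw order events per product and minute as they're ingested.
client.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS sales_per_minute
    ENGINE = SummingMergeTree
    ORDER BY (product_id, minute)
    AS
    SELECT product_id,
           toStartOfMinute(event_time) AS minute,
           sum(amount) AS revenue
    FROM order_events
    GROUP BY product_id, minute
""")

# Query any ad-hoc time span at interactive latency.
rows = client.execute("""
    SELECT product_id, sum(revenue) AS revenue
    FROM sales_per_minute
    WHERE minute >= now() - INTERVAL 45 MINUTE
    GROUP BY product_id
    ORDER BY revenue DESC
    LIMIT 10
""")
for product_id, revenue in rows:
    print(product_id, revenue)
```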
