Proudly written by Wannes Rosiers, Data Area Manager @ DPG Media!
Updated on: Jan 11, 2021
Domain-Driven Design or Data De-Decentralisation, it’s leading us towards a Data Mesh
About a year ago, I spoke at the Leuven Data Science meetup about Domain-Driven Design (DDD). DS Leuven organizes monthly meetups to bring stories from the industry and scientific research to a lively group of about 100 enthusiasts. It’s worth mentioning that they still succeed in reaching such an audience in the virtual era imposed by corona. You can expect in-depth mathematical explanations of machine learning models at the meetups, for example. And of course, that’s what these data nerds (I am one of them; hence, I can use the word) love the most in their jobs, yet they can all tell you that data preparation is actually 80% of their job, and it’s that 80% I focused on in my talk.
Domain-Driven Design in a Data Science context
Most data scientists have a scientific background (like mathematics, physics, or psychology) and start writing code (Python, Scala, R, for example) without any proper software engineering skills. So DDD was new to them and, up until recently, to me as well. Domain-Driven Design comes from software engineering, yet it is not a technique or methodology. It’s a goal to achieve, and it can be accomplished through different techniques. In DDD, we want our software solutions to be well designed for the business domain problem.
Although the term DDD is new to data scientists, one of its core concepts, namely the ubiquitous language, solves an issue well known to all of them. A data scientist (the numbers guy), a data engineer (the code guy), and a business person (the cool guy?) speak three different languages. Hence one common language spoken by every single one of them is beneficial for all.
Next to that, most generations of data warehouses or data lakes try and struggle to find unique data definitions. After the meetup, I was chatting with people from other companies, and one of them gave the example that if you ask throughout the company what the word “train” means, people will have multiple explanations. How hard can it be, right? We face the same issue; for example, the word “page” can mean a physical page in the newspaper or a webpage. Here the concept of a bounded context might help: it’s the linguistic boundary of a language as well as the recognition that there are different languages and different problems at play.
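To make the bounded context idea a bit more tangible, here is a minimal Python sketch. The context names ("print_edition", "web") and all fields are hypothetical, picked only to illustrate how the word “page” can carry its own definition inside each context; this is not how our systems actually model it.

```python
from dataclasses import dataclass


# Hypothetical bounded context: the printed newspaper.
@dataclass(frozen=True)
class PrintPage:
    edition_date: str  # e.g. "2021-01-11"
    page_number: int   # physical page in the newspaper


# Hypothetical bounded context: the website.
@dataclass(frozen=True)
class WebPage:
    url: str  # address of the webpage


# The ubiquitous language is scoped per context: "page" resolves to a
# different definition depending on the context in which it is used.
GLOSSARY = {
    ("print_edition", "page"): PrintPage,
    ("web", "page"): WebPage,
}


def resolve(context: str, term: str) -> type:
    """Look up what a term means inside a given bounded context."""
    return GLOSSARY[(context, term)]
```

Asking for "page" without naming a context is simply not possible here, which is the point: the term only has a meaning inside its linguistic boundary.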
When applying this insight to software engineering, one easily sees that ideally, one service is coded in one (business context) language. In such a way, DDD is a bridge to the usage of microservices. However, even when software engineers start breaking down monolithic applications, data engineers ingest that data in one central data lake. By doing so, they create the biggest monolith of them all…
An excellent article on Martin Fowler’s website (written by Zhamak Dehghani) states that from 30,000 ft, the data lake is that gigantic monolith, and looking a bit closer, it’s divided into ingest, process, and serve. In our case, this looks as follows:
The ambition of the raw data layer is to contain all DPG Media data. Historically it grew, and we realized that the path to an object in the data lake starts with DPG Media in Belgium, the Netherlands, or Denmark. Of course, there are good reasons why we got to this situation, but datasets are not grouped by the problem they solve. For example, the way users interact with our products leads to the creation of user clickstreams. Although it is nice and useful to filter on country or brand, all clickstreams inherently answer the same business question. We can convert them into multiple valuable assets for business: user profiles, input for recommender engines, and more. And guess what, by grouping this data logically together, the data scientist’s life gets a lot easier: reading one file in your preferred tool to create a collaborative filter model reduces that 80% data preparation time.
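As an illustration of that regrouping, the sketch below rewrites object keys from a country-first layout into a layout that starts with the dataset, while keeping country and brand as filterable partitions. The key format, brand names, and file names are assumptions made up for this example; our actual data lake paths look different.

```python
from collections import defaultdict

# Hypothetical object keys in a country-first layout (illustrative names only).
COUNTRY_FIRST_KEYS = [
    "be/brand_a/clickstream/2021-01-11.parquet",
    "nl/brand_b/clickstream/2021-01-11.parquet",
    "be/brand_a/subscriptions/2021-01-11.parquet",
]


def to_domain_first(key: str) -> str:
    """Rewrite 'country/brand/dataset/file' as 'dataset/country=../brand=../file'."""
    country, brand, dataset, filename = key.split("/")
    return f"{dataset}/country={country}/brand={brand}/{filename}"


def group_by_dataset(keys):
    """Group the rewritten keys by the dataset (business question) they belong to."""
    grouped = defaultdict(list)
    for key in keys:
        new_key = to_domain_first(key)
        grouped[new_key.split("/")[0]].append(new_key)
    return dict(grouped)
```

With such a layout, all clickstreams sit under one prefix, and a data scientist can still narrow down to a single country or brand when needed.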
This has led us towards the concept of source-oriented as well as consumer-oriented datasets. Yet before diving into those, I would like to tell you about the recent shift we’re making regarding the responsibility to create those data sets or, rather, data products.
In comes the Data Mesh
The data mesh is a relatively new concept that has been adopted by larger companies such as Zalando. Within the Benelux, we are most likely one of the biggest early adopters. A Data Mesh is not the technological successor of the Enterprise Data Warehouse or the Data Lake. It’s more an architectural paradigm, shifting the responsibility to create, maintain, and monitor data products.
Within DPG Media, we have 30 data engineers among 500 IT professionals and 5000 employees. The IT organization is divided into 19 areas, where an area consists of multiple teams working on one product or platform. As we try to apply Domain-Driven Design, these areas all solve their specific domain problem. Some examples are a Media Platform area, the Customer Services area, the Marketing area, and many more. It is simply too much to ask of the data engineers to deeply understand the domain knowledge and speak the ubiquitous language of every area. This is one of the main reasons why we place the responsibility to create source-oriented data products within the teams having specific domain knowledge.
Next to placing the ownership of a data set within the corresponding domain, the data set itself should be considered a product. In IT terms: a data set in our data mesh must adhere to principles similar to an API contract. Data scientists are familiar with the fact that their workflows break whenever data sources change. Hence this contract prevents such breaking changes, again reducing the 80% data preparation time.
It’s important to mention that source-oriented and consumer-oriented data products behave inherently differently. Source-oriented sets represent the facts and reality of the business and change less often. In contrast, consumer-oriented sets structurally go through more change, aim to satisfy a closely related group of use cases, and come in whatever format suits best. For example, the user profile set should be offered in a search-optimized database with an API on top to interact with the operational system, and as a flat file for the data scientists.
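A minimal sketch of what such a contract check could look like is given below. The column names and the `contract_violations` helper are hypothetical and serve only to illustrate the idea; this is not our actual contract format. The sketch assumes that extra columns are allowed (an additive change) and that only missing or retyped columns count as breaking.

```python
# Hypothetical contract for a source-oriented clickstream data product:
# column name -> expected Python type. Names are illustrative only.
CLICKSTREAM_CONTRACT = {
    "user_id": str,
    "url": str,
    "timestamp": str,
}


def contract_violations(record: dict, contract: dict) -> list:
    """Return human-readable violations; an empty list means the record complies."""
    violations = []
    for column, expected_type in contract.items():
        if column not in record:
            violations.append(f"missing column: {column}")
        elif not isinstance(record[column], expected_type):
            violations.append(
                f"wrong type for {column}: expected {expected_type.__name__}"
            )
    return violations
```

A producing team could run a check like this before publishing a new version of its data set, so that a breaking change surfaces on the producer side instead of in the data scientist’s workflow.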
DDD and LeSS: a perfect match
DDD is not the only thing we are happily inheriting from our IT colleagues; we also strive to work agile via LeSS principles. Like the data lake structure, within a few years, we grew organizationally to a suboptimal situation for understandable reasons. We had a bunch of data science teams and four data engineering teams: an ingestion team (guess what they do), a technology-specific team, a domain-specific team, and a country-specific one. And all of them had their own AWS account and data lake. We did introduce the concept of Decentralized Data.
It also meant that we hosted multiple copies of the same dataset. We had no overview of all transformations that were performed on data before they reached the data scientist. We could not guarantee that data definitions were respected, let alone adhere to the correct bounded context.
Within the setting of LeSS, which allows us to adhere more to the goal of DDD, we decided to introduce a data area where all data engineers built on “one single data platform.”
Hence no ingestion squad anymore and a fresh start on a new AWS account. At first, we thought that in an ideal world, we would need to migrate all data sets to this new data lake (and structure it with DDD in mind). That would take us years: we have too many data sources. Fortunately, we did not go down that road, since it conflicts with the lessons learned regarding the data mesh. We now have the ambition to let data products live in the AWS account of their domain. Luckily we did introduce the UDC or Universal Data Catalog, which also covers some of the DATSIS principles of a data mesh (Discoverable, Addressable, Trustworthy, Self-Describing, Interoperable, and Secure). This UDC is a sync between all data catalogs in all related AWS accounts.
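Conceptually, such a sync can be pictured as merging the catalogs of the individual accounts into one lookup in which every data product has a unique address. The sketch below uses plain Python data structures and makes no AWS calls; the `CatalogEntry` class, the `sync_catalogs` function, and the account aliases are all hypothetical and do not describe the actual UDC implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    account: str      # hypothetical alias of the AWS account owning the data product
    database: str
    table: str
    description: str  # supports the "self-describing" principle

    @property
    def address(self) -> str:
        """A unique address for the entry (the "addressable" principle)."""
        return f"{self.account}/{self.database}/{self.table}"


def sync_catalogs(account_catalogs: dict) -> dict:
    """Merge per-account catalogs into one lookup keyed by address."""
    unified = {}
    for account, tables in account_catalogs.items():
        for database, table, description in tables:
            entry = CatalogEntry(account, database, table, description)
            unified[entry.address] = entry
    return unified
```

Because the account is part of the address, two domains can both own a table with the same name without clashing, while consumers still search in one place.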
De-Decentralized Data is born
And suddenly, DDD stands for De-Decentralized Data as well. This approach allows us to keep an overview of all the data we have, reduce the number of transformations, and make both the source-oriented and the consumer-oriented datasets shareable across the entire organization and with all our data scientists.
We can now say that we adhere to the true meaning of DDD and comply with the article on Martin Fowler’s website. We do not have an ingestion team anymore. The entire data area is operationally responsible for all data they ingest. The same holds for teams that ingest data into their data mesh and register it in the UDC. We are growing towards a situation in which our 500 IT colleagues register their data in the data mesh and are operationally responsible for it. Our data engineers can now focus on what they do best: processing a lot of data. And our data scientists? Well, they also get to do what they do best: building the cool machine learning models!
Sounds cool? You can join our data area as we are looking for a data engineer.