How supercharged CI/CD & Data Vault ensure data quality and development agility

In today's data-driven world, the quality of data can be the difference between success and failure for a business. At the same time, focusing on data quality and governance in the wrong way can paralyse an organisation. In this insightful session, Hung Dang, CEO and founder of Y42, explores how combining simplified use of powerful CI/CD with the right approach to data modelling provides an answer to balancing quality and agility. Hung has spent more than eight years working in the data space and has founded two successful data analytics companies. Frustrated by the number of tools needed to build out a proper data infrastructure, he was inspired to build a product that enables everyone to achieve maximum efficiency through a deep analytical understanding of their business. Follow our socials to keep up to date: https://www.linkedin.com/company/business-thinking-limited/ https://www.youtube.com/@UCdhSsZWTwkX1-v8EX84WLYg

In today's session I want to talk about how supercharged CI/CD and Data Vault ensure data quality and development agility, but before getting into that, a very quick introduction from my side. I've been working in the data industry for about a decade now. I studied statistics, and in the first two years of my career I operated as a data consultant. In 2016 I built my first data company, where I helped live events, anything from Ed Sheeran and Justin Bieber concerts to big festivals and the O2 Arena in London, analyse their users. At some point roughly one third of all live events in Europe were using our software, and I sold that company at the end of 2019 to a German Fortune 100 company. At the time I was really obsessed with data: I love data modelling, and as a consultant I used every tool under the sun, from Alteryx to writing my own transformations in Python, Stata and R.

Having worked with data for such a long time, I've seen a big shift, and that's where I'd like to start today: the world of data has changed a lot in the last ten years. I want to highlight the key points that are forcing us to adopt methodologies such as Data Vault 2.0 and practices such as CI/CD. Out of that change comes a new world order, one that requires data teams to deploy faster while maintaining rigorous quality and freshness SLAs. As I said, that is enabled by CI/CD and by specific methodologies like Data Vault, which can really help your team streamline how they deploy, much more effectively and much faster.

So what is the change we're seeing? When I look back roughly ten years to when I first started, there weren't that many data sources to pull from: usually your transactional database, maybe Salesforce, maybe some marketing data. You set up one-off, ad-hoc pipelines, whether with drag-and-drop tools like Alteryx or your own Python code running on a cron schedule on Google Cloud, and that data was consumed by a very limited set of downstream use cases, mostly decision making and BI, though at the time I was using it for ML use cases as well. What has really changed in recent years is that data is being leveraged not just for BI but also in applications and key automated decision making, and it now comes not from a couple of sources but from hundreds. We have to centralise that data and build complex data pipelines that power ML, BI, AI and embedded analytics use cases. The biggest change is what we now ask of data teams: (a) cover many more data use cases, and (b) meet SLAs for data quality and freshness that are as rigorous as what a software product would have.
So the data products, the mart tables being consumed by the end consumer, need to meet those rigorous requirements, and the only way data teams can do that (sorry, this slide got cut off here) is by working like software engineers. That's the red line on the slide: if you don't employ the best practices that software engineers pioneered long ago, you will never be able to deploy fast and maintain this high rigour of data quality and freshness SLAs. We call those best practices DataOps; software engineers pioneered them as DevOps.

So what are these practices, and what happens when we don't employ them? That's the graph, my mental model. At first, if you don't have Git or CI/CD, the complexity is of course lower, and if you only have a couple of use cases you might be better off quickly using a drag-and-drop tool like Alteryx, writing a bit of simple code, or using dbt for very simple pipelines. But over time, if you don't employ these best practices, the complexity skyrockets, and because of that complexity you pay much more in engineering hours; it costs you a lot. Whereas if you adopt DataOps best practices right off the bat, you can contain that complexity: you stay agile as a team, deploy frequently and with confidence, apply governance and data quality, and run on scalable data infrastructure and processes. That's why the complexity line flattens in this case, and you don't pay as much as the number of use cases grows, whether that's AI, ML or BI, while keeping data quality and freshness SLAs north of 99.9%. We all know that for LLMs the world has shifted from model-first to data-first; a data-first approach can be up to 16% more accurate than simply trying out different models. Data quality is what matters, for your BI use cases, your key decision making, your ML.

So those are the two changes I really want to talk about today: how you can deploy faster and how you can maintain data quality and freshness SLAs, all of that using CI/CD and using Data Vault.
Let's get started with the first one. It is now expected that data teams deploy faster, so what does that mean? The first thing: we're at a Data Vault meetup, so I'm not going to walk you through how Data Vault works in detail; I'll just highlight a couple of key concepts that I think most of you already know, and in case you don't, there are many great workshops coming up, as announced earlier. The point is that using a scalable data modelling methodology like Data Vault helps a team address agility and flexibility and deploy much faster.

So what is Data Vault? It is a detail-oriented, historical-tracking and uniquely linked set of normalised tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between third normal form and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise. It is a data model architected specifically to meet the needs of today's enterprise data warehouses. That's the formal definition as written by its inventor, Dan Linstedt, and there's a nice picture of it on the Databricks website. At the end of the day you have Hubs, Links and Satellites as the key elements of Data Vault, and with those guard rails you can onboard new developers and engineers who follow clear guidelines, help them become much more productive, understand the setup, and contribute in ways that really benefit your organisation. As I said, I'm not going to go too deep into it, because I think we have a more unique angle to talk about, how you can deploy even faster and ensure data quality with CI/CD, but to highlight it again: Data Vault is one way to really help you deploy faster.
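To make the Hub, Link and Satellite idea concrete, here is a minimal sketch of what those tables might look like for a customer/order example. The table and column names are illustrative, not taken from the talk, and real Data Vault 2.0 implementations add further conventions that are only hinted at here.

```sql
-- Hub: one row per unique business key (the customer), nothing else.
CREATE TABLE hub_customer (
    customer_hk    CHAR(32)      NOT NULL,  -- hash of the business key
    customer_id    VARCHAR(50)   NOT NULL,  -- the business key itself
    load_date      TIMESTAMP     NOT NULL,
    record_source  VARCHAR(100)  NOT NULL,
    PRIMARY KEY (customer_hk)
);

-- Link: relationships between business keys (customer placed order).
CREATE TABLE link_customer_order (
    customer_order_hk  CHAR(32)      NOT NULL,  -- hash over both keys
    customer_hk        CHAR(32)      NOT NULL,
    order_hk           CHAR(32)      NOT NULL,
    load_date          TIMESTAMP     NOT NULL,
    record_source      VARCHAR(100)  NOT NULL,
    PRIMARY KEY (customer_order_hk)
);

-- Satellite: descriptive attributes, historised over time.
CREATE TABLE sat_customer_details (
    customer_hk    CHAR(32)      NOT NULL,
    load_date      TIMESTAMP     NOT NULL,
    email          VARCHAR(255),
    full_name      VARCHAR(255),
    hash_diff      CHAR(32),               -- detects attribute changes
    record_source  VARCHAR(100)  NOT NULL,
    PRIMARY KEY (customer_hk, load_date)
);
```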
Another thing we've seen is the adoption of Git. Why Git? We're still on the use case of helping data teams deploy faster with confidence. I assume everybody on this call knows what Git is, so I'll only touch on it briefly: Git is used to version control changes in your code and configurations and helps you collaborate. You can branch off your code, merge it back together, trace back and audit everything you've committed. It's extremely efficient because Git uses a pointer system to track commits and branches, so rolling back or merging changes between branches is essentially instantaneous; Git just swaps a pointer, and you can recover and roll back very easily. Usually, when you define and model your data, you have some SQL definition (like the random snippet on the slide), and you use that definition to materialise a table, view or materialised view inside the data warehouse.

So one key element data teams need to adopt in order to become as effective as a software engineering team is definitely Git. With Git we can version control our code, but it's very important that we don't just version control the code: we need some way to version control our data as well. The way the world currently does that is to use the same code configuration to generate a production dataset or schema in, say, Snowflake, then create a staging dataset or schema, and then, for every feature branch, create a new data environment. What are these data environments? At the end of the day, you're a data team, you want to experiment and make code changes to your data pipeline, but you want that to happen in an isolated data environment rather than overwriting production right away. That's the key goal we want to accomplish as a team: branch off, make our experiment, and merge the changes back successfully, preferably in an automated manner. And because we can branch off and work securely in our isolated code and data environment, a lot of people can work on the pipeline in parallel to build out more use cases, and then merge their changes back with confidence that everything will work on the production environment once it's merged.

Now, the reality is that it's not as simple as it sounds.
Right now, the way we build out those environments is this: when we create a new feature branch in the code, we also have to create a completely new dataset inside the data warehouse by cloning the production dataset into that isolated environment. You can leverage features like zero-copy clone in Snowflake or table clone in BigQuery, but effectively you still have to clone your entire production dataset into an isolated environment so you don't overwrite production. Then, when you make a change in that isolated environment, you apply the code change, materialise the new table in the warehouse in that environment, and of course rebuild everything that depends on it downstream. When you want to merge that change to production, you usually open a pull request, the CI/CD pipeline kicks in, and you have to rebuild the very same table you already built in your isolated environment, plus all of its downstream dependencies, on a staging environment and retest it. After the tests pass you merge to production, and then you rebuild the same tables and all their downstream dependencies yet again, and retest again.

The way we build environments today reminds me of the picture on the right: we end up with you working in "v_final" and me working in "v_final_v2". It's hard to collaborate, and it's not DRY; it violates the engineering mantra of don't repeat yourself. In the picture it's just a simple PowerPoint, ten or twenty megabytes to compute and store, but when we keep recomputing and re-storing the same tables over and over again for environments, those data pipelines are gigabytes or terabytes, and it's just very resource inefficient.
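As a concrete illustration of the approach described above (not a recommendation), this is roughly what spinning up a per-branch environment looks like with warehouse cloning. The schema and table names are hypothetical, and the clone syntax shown is Snowflake's:

```sql
-- Create an isolated environment for the feature branch by cloning prod.
-- Zero-copy clone is cheap at creation time, but every table the branch
-- touches is recomputed and stored again inside this schema.
CREATE SCHEMA analytics_feature_payments
  CLONE analytics_prod;

-- After a code change, the modified model and everything downstream of it
-- must be rebuilt inside the branch schema...
CREATE OR REPLACE TABLE analytics_feature_payments.stg_payments AS
SELECT *
FROM raw.payments
WHERE payment_method IS NOT NULL;

-- ...and the same rebuild is repeated later on staging and again on prod
-- once the pull request is merged.
```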
In a nutshell, the way we version control data right now, with environments powered by hacky CI/CD scripts, is very resource inefficient, because it requires you to store and recompute the same data over and over again for different environments. It's also very time consuming, so your team is unproductive while waiting for CI/CD pipelines: when you merge a code change you wait for the recomputation on the staging environment and then on the production environment, and when you want to roll something back you first roll back the code and then recompute the same data, and all of its downstream dependencies, all over again.

Now compare that with how Git version controls code. Git is extremely efficient because it only stores new versions when a file is modified; unchanged files are not duplicated. Unlike the way we version control data, Git does not duplicate anything when it versions code (or files, really), and as I said, it uses a pointer system to track commits and branches. Because of that, merging changes between branches or rolling back is instantaneous: Git just swaps the branch pointer to a different commit. If you really think about it, without the efficiency of Git the entire open source software movement wouldn't exist; we wouldn't have the software we have today, and our economy wouldn't be where it is.

Now imagine we were able to version control data as effectively as we version control code with Git. There are so many use cases where that would help us deploy much faster as a team, not waiting for endless CI/CD pipelines to rebuild environments, and it would really streamline the way a team stays agile: you just merge your code and your data, and it works.
That's going to be a game changer for the way data teams work and deploy together, and that's exactly what Y42 has been working on and has been able to deliver: the first data orchestrator that offers foundational version control for both code and data using just Git. It means that when you branch off, you don't just branch your code, you also branch your data; when you merge your changes, you don't just merge your code, you also merge your data; and when you roll back, you don't just roll back your code, you roll back your data with it, all using the classic Git commands. Because this is as efficient as the way Git handles files, we can build out a lot of different asset versions very cheaply, and we can embed quality gates around those asset versions: assertion tests, anomaly detection, governance, data diffing, unit tests, pull requests and automated CI health checks. Our intelligent, asset-based orchestrator doesn't just build the data pipeline; it also determines which of the versions actually passes the audit gates and can go live and be consumed by a downstream asset or a downstream AI, ML or BI tool. So we are really the only solution that can actively govern both your code changes across environments and your regular, hourly source updates, and prevent bad data from ever going live. I'm going to talk about that in the next section.
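The talk doesn't show Y42's internals, but the general "swap a pointer instead of recomputing" idea can be illustrated in plain SQL: if consumers read through a stable view, publishing a new, audited version becomes a cheap metadata operation. This is only a conceptual sketch with made-up object names, not Y42's implementation:

```sql
-- Consumers always read from a stable view...
CREATE OR REPLACE VIEW analytics.payments AS
SELECT * FROM analytics.payments__v41;   -- currently published version

-- A new version is built and audited off to the side, without touching
-- what is live.
CREATE TABLE analytics.payments__v42 AS
SELECT * FROM analytics.stg_payments;

-- Only if the audits pass is the "pointer" swapped, which is instant
-- compared with recomputing the table and its downstream dependencies.
CREATE OR REPLACE VIEW analytics.payments AS
SELECT * FROM analytics.payments__v42;
```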
As I said before, the expectation for data teams now is not just to deploy faster. You might only have one or two people on the data team, it's not 2021 anymore, but you still have to cover all of these data use cases and model your data in a way that is scalable and that other people can contribute to. You don't just need to deploy more frequently; you also need to ensure data quality and freshness SLAs, and that's the tricky part. We have this problem of bad data; everybody talks about bad data, and I want to zoom out one step and ask: what is bad data, why does it matter, and where does it actually come from?

The definition I always give is that bad data is ungoverned data, data in the wrong hands, and malformed data, data which does not model the world correctly: it's inaccurate, incomplete, inconsistent or out of date. Inaccurate means wrong data, say you typed something in wrong when filling out Salesforce. Incomplete means something is missing, say a last name. Inconsistent means, for example, duplicated emails for the same person. Out of date means that person already changed their last name because they got married, or changed their email. That's what we describe as bad data, and of course one result of bad data is lost productivity.

At this point someone in the audience asked: "Excuse me, does bad data also have something to do with it being morally not good?" Yes, absolutely, it has ethical implications as well, if that's what you're referring to. There is a societal and ethical impact, for example through invalid AI model inputs, and a lot of issues with bad data in general. Unfortunately we only have a bit of time left and I'm still trying to get through a big part of this, so let's take questions at the end, but thanks for chiming in.
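The four flavours of bad data described above (inaccurate, incomplete, inconsistent, out of date) can all be expressed as simple checks. A minimal sketch against a hypothetical customers table; the table and column names are illustrative, not from the talk, and date arithmetic syntax varies by warehouse:

```sql
-- Incomplete: required attributes are missing.
SELECT COUNT(*) AS missing_last_name
FROM raw.customers
WHERE last_name IS NULL OR last_name = '';

-- Inconsistent: the same email appears on more than one customer record.
SELECT email, COUNT(*) AS occurrences
FROM raw.customers
GROUP BY email
HAVING COUNT(*) > 1;

-- Out of date: rows not refreshed within the freshness SLA.
SELECT COUNT(*) AS stale_rows
FROM raw.customers
WHERE updated_at < CURRENT_TIMESTAMP - INTERVAL '1' DAY;

-- Inaccurate: values outside the set the business considers valid.
SELECT COUNT(*) AS invalid_country
FROM raw.customers
WHERE country_code NOT IN ('DE', 'GB', 'US');
```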
What we see right now with bad data is that data volume has increased exponentially in recent years, and our data investment has increased with it. We invest so much money as organisations into data infrastructure, but at the same time data incidents and data downtime keep rising. Gartner estimates the financial impact of bad data at around 30 million per organisation, and it adds up to a trillion-dollar bad data problem across the entire economy.

We also don't believe that data quality tools, as they exist today, are the answer to this data quality problem. When you set up a new data stack, you purchase a data warehouse such as Snowflake, BigQuery or Databricks, and then you need an array of tools to make use of that compute and storage layer: a tool like Fivetran to ingest your source data from Salesforce, from your transactional database, from MySQL and so on; a tool like dbt to transform that data; Airflow to orchestrate the pipeline; and after you've built the pipeline, an observability tool such as Monte Carlo on top. But Monte Carlo cannot prevent issues from happening; it can only report bad news once the bad news has already happened, which in our view is weak. Don't take our word for it, this is Monte Carlo's own website: "Data breaks. Monte Carlo ensures your team is the first to know." For a normal enterprise paying north of 500,000 a month for this stack, excluding the compute and storage cost of the cloud data warehouse, that is a rather weak value proposition. Working with data teams, we've seen that 53% of data teams identify data quality as their top priority and 78% intend to invest in data quality solutions, but we don't believe in a paradigm where problems need to happen in order to be found.
Let's also go back and think through where bad data actually comes from. Take a classic data pipeline sitting on top of a data warehouse like Snowflake or BigQuery: at the raw layer I ingest data from transactional databases, from Shopify and so on; we transform that data; and the transformed data is then used by operational apps and downstream BI, ML and whatnot. That's a very basic data pipeline. Now assume we've built this pipeline and it works perfectly well; there are no data quality issues, everything is fine.

The first place bad data can appear is when I start modifying that pipeline. And we have to modify it, because the world is constantly changing, and a data pipeline is nothing more than our attempt to model the world; data is nothing more than us trying to find patterns in the world. Because the world constantly changes, the pipeline has to constantly change and adapt to the business rules of the world, and when we make a change to a fully functional data pipeline, that is one major source of bad data: we can easily introduce a logical flaw, add or delete a transformation step, and so on.

People make a big problem out of bad data, as if it were terribly complex, but at a fundamental level bad data has exactly two sources. If everything else stays constant, the only other place bad data can come from is the source itself. Say you use a tool like Fivetran to pull data from Salesforce once an hour, and suddenly Fivetran appends or upserts new data into an existing live table that contains a duplicated ID or email. That bad data comes from the source; the person responsible in this case is the growth team that maintains Salesforce, not the data team. But to really highlight it: there are only two places bad data can come from, a change you make to your data pipeline, or new source data arriving into your data pipeline.
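For the second source, a duplicate slipping in through an hourly sync, the failure is easy to state as a query. A minimal sketch, assuming a hypothetical raw.salesforce_contacts table kept up to date by an incremental append/upsert:

```sql
-- If the hourly sync upserts a record whose email already exists under a
-- different ID, this audit returns rows; an empty result means the batch
-- is consistent and safe to publish.
SELECT
    email,
    COUNT(DISTINCT contact_id) AS distinct_ids
FROM raw.salesforce_contacts
GROUP BY email
HAVING COUNT(DISTINCT contact_id) > 1;
```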
Currently, the way we handle these updates is that we wait for problems to happen before we can find them, as I said before. Whether it's a code change or an hourly source update, the two cases I just showed you, we tend to write and publish directly into the live dataset. In the example I gave, I pull data from Salesforce once an hour using Fivetran; Fivetran appends or upserts the new incremental data into my live dataset, and if it's bad data, well, it goes live, there's nothing you can do about it. That bad data then cascades downstream through your transformation steps, through your Data Vault modelling or your star schema or whatever, all the way down to your BI, ML and AI tools. Only once the bad data is live do we start auditing it, with Great Expectations, dbt tests, anomaly detection like Monte Carlo, and then we resolve the issue. We don't believe this is the right approach; we want to move to a world where we prevent bad data from happening in the first place.

We believe in a new paradigm, a bit like Gandalf standing in front of the Balrog and telling bad data: you shall not pass. Whether it's a pipeline update or a source update, we write a new version of the data into an offline environment, we don't deploy it, we audit it there first with anomaly detection, data tests and so on, and only if it passes that audit does it go online and get published. That should happen both for code updates and for regular source refreshes.

In very technical terms, this is a CD/CI problem. Every time there is a source update or a code update, we do a continuous deployment: we create a new version, but we don't let that version go live yet. Then the CI pipeline kicks in, and continuous integration runs the tests, anomaly detection, assertion tests, unit tests and so on. Only after it passes does the version actually get deployed. So it's really a CD/CI/CD loop we need to embrace in order to guarantee that bad data never goes live.
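In warehouse terms, that CD/CI/CD loop is essentially a write-audit-publish pattern. A minimal sketch, assuming hypothetical table names and a simple swap-based publish; the audit step reuses the kind of checks shown earlier:

```sql
-- 1. Continuous deployment: write the new version off to the side, not live.
CREATE OR REPLACE TABLE analytics.orders__candidate AS
SELECT * FROM staging.orders_latest_load;

-- 2. Continuous integration: audit the candidate before anyone can read it.
--    The pipeline only continues if this query returns zero rows.
SELECT order_id
FROM analytics.orders__candidate
WHERE order_id IS NULL
   OR total_amount < 0;

-- 3. Publish: only after the audit passes does the candidate replace the
--    live table (SWAP WITH is Snowflake syntax; other warehouses achieve
--    the same with a view or a table rename).
ALTER TABLE analytics.orders SWAP WITH analytics.orders__candidate;
```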
And this, as I said up front, is exactly what Y42 has been able to deliver, because we version control the code and the data together so effectively. That again maps to the two fundamental changes in the world: (a) helping teams deploy much faster and more frequently, and (b) embedding quality gates and audit gates that determine which version can really go live and be consumed by downstream assets and AI, BI and ML tooling. There is going to be a new masterclass next week with Rob from our side and the Data Vault User Group to really show you how that works, but for now I'm just going to quickly show you a couple of these concepts live within Y42 for the next five minutes.

I'm already logged into Y42 here, in the demo org, and you can see that I'm in the product space. The way to think about spaces is that within an organisation, the sales team, the marketing team and the product team each have their own dedicated space to build out their data pipelines. This product space is connected to a data warehouse, such as Google BigQuery or Snowflake, a Git repository, such as GitHub or GitLab, and storage, such as Google Cloud Storage, S3 or Azure Blob Storage. For today's demo I've already set up a very simple data pipeline: I pull data in from Google Sheets (Y42 has hundreds of pre-built connectors), then I use dbt SQL to transform it. It's not a Data Vault 2.0 model, because it's a very simple example; I just use classic staging and mart modelling here. As you can see, we're on the main branch right now, and this data has already been materialised in the data warehouse; it's the data that sits in BigQuery right now.
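For orientation, a staging-plus-mart setup of the kind used in the demo might look like the two hypothetical SQL models below. The source, column names and the coupon filter are reconstructions for illustration, not the actual demo code, and dbt would wrap the model reference in ref(), omitted here for brevity:

```sql
-- stg_payments: light cleanup of the raw sheet, one row per payment.
-- Note the filter that excludes coupon payments; deleting a line like this
-- is the kind of change the demo makes on a feature branch later on.
SELECT
    id                    AS payment_id,
    order_id,
    LOWER(payment_method) AS payment_method,
    amount
FROM raw.payments
WHERE payment_method <> 'coupon';

-- mart_orders: the mart consumed downstream, aggregating payments per order.
SELECT
    order_id,
    SUM(amount) AS total_paid,
    COUNT(*)    AS payment_count
FROM stg_payments
GROUP BY order_id;
```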
Within Y42, every single interaction you make actually uses Git in the background. Let me show you: in Y42 you can add a SQL model, a Python model, or one of our hundreds of pre-built connectors, and I'll demonstrate it with a simple Google Sheet. I set up a secret, which I've prepared beforehand, and with that secret I can pull data from a couple of sheets; I'll do it for raw customers and raw orders. What happens here is that Y42 generates everything as code. We built the fastest Git engine in the browser using WebAssembly, so every single thing you do in the UI generates code at the end of the day, and we need that code definition for our CI/CD pipelines to work. Because it's all code, you get all of the Git functionality natively in here: you can create new branches and so on using just the UI, or, if you're more technical, use the terminal. You can code with Y42 or use the UI; there's a VS Code editor embedded as well, so you can write everything as code instead of clicking through the UI.

But that's not what I want to show you. I want to show you the CI/CD magic, this version control of code and data that really helps a team deploy so much faster and ensures bad data never goes live. For that I'm going to discard this local code change so you don't see it anymore. So here's the data pipeline, and for today's demo I'm going to create a new branch and call it "demo Data Vault".
When I create this new branch, I don't just create a branch of the code; I also branch the data at the same time, so I instantly get all the data available in this isolated "demo Data Vault" environment. I don't have to clone any data; it's essentially a pointer system, exactly like Git internals. So I'm working in a completely isolated code and data environment. Now I make a code change: I delete one line of code, the filter on payment method equals coupon, preview the data to check that it still works, and it does. When I push this change, Y42 takes this exact code definition and builds the asset in this completely isolated environment. Let's trigger that build; one second, it's running right now with the new code.

And suddenly we see it's red. What's happening is that one column-level test failed: the accepted-values test. I can look at the failed rows and see there's a value called coupon that failed the data test. I had set up beforehand that payment method can only be bank transfer, credit card or gift card, but not coupon, and because of the one code change we made, coupon now appears in the data. When we preview the data with the new code we do indeed have one more row with coupon, but coupon does not pass the test, so Y42 automatically rolls back to the previous version, the one that passes, and coupon is not live. It means the data consumed by a downstream asset, like this Python script, won't contain coupon; we don't even let coupon go online in this isolated data environment.
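The failing check in the demo is an accepted-values test on the payment method column. In dbt such a test would be declared in YAML, but it compiles down to an assertion query along these lines (column, table and literal values are assumed from the demo):

```sql
-- The test passes only if this query returns zero rows; after the filter
-- was deleted, the 'coupon' row shows up here and the build is marked red.
SELECT payment_method
FROM stg_payments
WHERE payment_method NOT IN ('bank transfer', 'credit card', 'gift card');
```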
Now, when we try to open a pull request to deploy these changes from our feature branch to main, it won't let us. Let me quickly do that: I create a simple pull request. At this point, conventional tools would need a CI/CD pipeline to rebuild the changed asset and all of its downstream dependencies on a staging environment and retest them, but because Y42 version controls the code and the data together, it's basically instant. There's a CI health check that tells you immediately: hey, this payments table has a problem, and these downstream dependencies depend on it, so please go and fix it before you can merge the change. This CI check costs you nothing and it's instant, so it really streamlines the way your team can deploy changes, and deploy them confidently. What you see here is that with our CI/CD approach we build out a new version, we audit it, and first of all we stop a bad version from going live; we stop you from merging that bad code from the feature branch into main.

And because we version control code and data together, if you revert the code changes you revert the data with them. I've just reverted the code, so within a split second you'll see it become green again, and in the build history you don't even see that red job; we effectively version control the time dimension as well. And if we had made a valid code change instead, built it out in this isolated environment and then merged it to main, it would merge the data with it as well, along with all of its downstream dependencies. I'm not going to go much deeper than that; I just wanted to show you the core concepts of the presentation live.
So, in a nutshell, here is what we believe. The world is changing, the use cases for data have exploded in recent years, and we need to work like software engineers in order to deploy much faster and ensure high data quality for our downstream use cases. The way to do that is, first, to use a good methodology to model your data so everybody on the data team is well aligned, such as Data Vault 2.0; and second, to introduce software engineering best practices like Git and build out versions, but not the current way, which is very inefficient and resource intensive, waiting for CI/CD pipelines and recomputing the same tables over and over again. Instead, and of course I have to plug Y42 here, use a solution that streamlines your change management so that you never deploy bad changes; it cuts down your data warehouse cost by 30 to 60% as well as your waiting time. And finally, don't bolt data quality tools on top of your data pipeline; use them inside the CI pipeline, so you can prevent bad data from ever going live.

One last sentence on Y42: at its core, Y42 is a declarative, asset-based orchestrator. We run best-of-breed open source tools such as Airbyte, Python and dbt, we have best-in-class GitOps to version control your data and code, and on top of that a lot of observability and governance: data catalogue, performance analytics and so on. No need to show all of it to you now, but please attend the next masterclass, where my colleague Rob will go in depth into how to build a Data Vault 2.0 with all of these bells and whistles of CI/CD and preventing bad data from ever going live. Thank you so much; I'm now open for any questions you have.
