
CUPED Explained: Why You Must Know It in A/B Testing

Understanding and Applying CUPED in Experimentation: An In-depth Discussion

In this video, Kenneth, a statistician at Meta, gives an in-depth explanation of CUPED (Controlled-experiment Using Pre-Experiment Data) and how it is used in experimentation. CUPED is presented as an effective technique for reducing variance in experiments without sacrificing anything. The concept is made clear through examples and applications in different scenarios, in particular how it extends to ratio metrics and how it helps you reach statistically significant results sooner. The interview also discusses the importance of keeping an experiment trustworthy and the limits of variance reduction through predictive modeling.

00:00 Introduction and Background
00:19 Understanding CUPED and its Importance
00:48 Exploring the Practical Application of CUPED
02:25 Deep Dive into the Methodology of CUPED
03:29 Exploring the Use of CUPED in Real-Life Scenarios
05:31 Understanding the Unbiased Nature of CUPED
16:38 Applying CUPED to Ratios and Conversion Rates
20:47 Final Thoughts and Conclusion

More reading:
- Craig's blog to explain CUPED: https://www.statsig.com/blog/cuped
- Kenneth's recent work on CUPED: https://arxiv.org/abs/2311.17858
- The paper by Winston Lin on regression adjustment: https://projecteuclid.org/journals/annals-of-applied-statistics/volume-7/issue-1/Agnostic-notes-on-regression-adjustments-to-experimental-data--Reexamining/10.1214/12-AOAS583.full
- The paper by Alex Deng on CUPED: https://www.exp-platform.com/Documents/2013-02-CUPED-ImprovingSensitivityOfControlledExperiments.pdf
- Applying ML to CUPED: https://dominiccoey.github.io/assets/papers/variance_reduction.pdf

Pragmatic Data Scientists


If you didn't see my previous episode with Kenneth: Kenneth is a statistician at Meta. He did his PhD in mathematics at UC Berkeley, and he has been researching how to run experiments at Meta for four and a half years. Yeah, thanks for the intro. Nice. Today, let's talk about CUPED. CUPED is actually a magical technique in experimentation because it's so widely applicable. There are not many techniques that are both theoretically sound and applicable in real life. It is a great technique that everyone who does experimentation should understand, but unfortunately, it's hard to find a good tutorial online. So we want to do that tutorial with you.

Yeah, of course, happy to explain a little bit more about how CUPED works. CUPED is kind of like the trade name for this technique; if you talk to someone more on the academic side, it usually goes by regression adjustment. It has a really long history. Some of the really big names in statistics, like Freedman, started looking into this, but it wasn't really until 2013 that Winston Lin gave it a more solid theoretical foundation. That also happens to be the same year a bunch of Microsoft folks wrote a paper on how to apply it in a pragmatic way. So, like Yuzheng was saying, it's one of those rare magical techniques: it's not often that you manage to improve something without making any sort of sacrifice. In statistics, a lot of the time we talk about the bias-variance trade-off. There's none of that here: you just straight out get something that is unbiased, but you manage to cut the variance. So it's very, very rare and super powerful if you manage to use it. Basically, you don't sacrifice anything and you get your stat-sig results quicker. Right. Reducing variance can turn into a lot of things. If you are already happy with your level of uncertainty, it means you can use fewer users in your experiment and run more experiments at the same time.
Or, if you feel like all your decisions are a little bit too uncertain, you can use this to reduce your uncertainty, get more stat-sig results, and get narrower confidence intervals. So, how do I do it? Right, so the idea is that in your experiment, there are a lot of things that your treatment actually doesn't affect. For example, I might be running an experiment on the front page of my website. Great, but this treatment is very, very unlikely to affect how these users performed, say their time spent or retention, before the experiment. Yeah, unless we have a time machine, of course. If we have a time machine, maybe your experiment can affect something in the past. But the idea is that there are quantities that we know are not affected by this experiment, and yet we do see some differences between your test group and your control group, just by chance. So by doing regression adjustment, or CUPED, you are able to leverage these slight differences in metrics that are definitely not affected by the treatment to reduce your variance.
Basically, you know something for certain and you use that information to do the adjustment. Exactly, yep. One example of this: suppose I have a magical pill that makes everyone grow taller, and I want to test how strong an effect it has. You kind of have two approaches. You grab a bunch of high-school-age kids and randomize them: half of them get the pill, the other half do not, and then you compare the heights afterwards. That's going to give you some signal on whether this pill works or not. But you can also think about it this way: what if I know there are some differences between these two groups of students before the experiment even starts? Maybe I didn't do my randomization right, or I just got unlucky and this half turns out to be a little bit taller than the other half. Then what a lot of people would do is say, okay, maybe we should adjust for that. But how do we adjust for that? There are a few ways you can do it.

The oldest way would be something called a diff-in-diff. Mm hmm. Instead of comparing the heights of these students after the experiment, you compare the changes in their heights during the experiment, and then you compare those two groups and get your result. So that is one method of adjusting for any differences you may have between your test group and your control group. But we might be in a scenario where these are high school children we're talking about, and they are in their growth spurts: everyone is probably gaining something like 10 percent in height per year. So taking the differences isn't the most efficient thing you can do. You may say, since I know everyone is going to gain 10 percent in height, I should really be subtracting 1.1 times their beginning height, because that is more of a predicted height for where they would be at the end of the experiment, instead of just the height at the beginning of the experiment.
Now, if we can do a better prediction of what happens at the end, we should certainly use that. Why use the height when you can use the predicted height? And that's basically the idea of CUPED: why use heights when you have predicted heights? Can you use that example to help us illustrate why both methods are unbiased, but one has lower variance? Of course. Yeah. Referring to the assumption that we made earlier that we don't have a time machine: we cannot go back and affect what happened before the beginning of the experiment. Any difference in heights before the experiment is purely statistical noise, which means the expected difference in heights before the experiment is exactly zero. All we're doing here is taking the difference in the original heights between the two groups, which in expectation is also going to be zero. So once you take the average, you're really just adding an extra term to your estimator that has expectation zero, but that is going to drastically shrink the numbers: instead of having all the numbers on a scale of, like, five feet, all your numbers are on a scale of inches, which is way smaller than before. Now, the idea of using 1.1 times the height takes this one step further. We know the difference in 1.1 times the original height is also going to be just noise, meaning that if I compute 1.1 times height in both my test group and my control group and take the difference, on average that is always going to be zero; if it's not exactly zero in a given experiment, that's purely noise. I see. So when you see some kind of error term, you need to figure out whether it comes from the treatment effect or from the noise. And for the noise you need certain estimates, but in this case some of the noise is known. So you just take out the known noise, and then you know how much is treatment effect versus randomness. Right. Yeah.
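To make this concrete, here is a minimal simulation sketch of the height example. It is not from the video, and all of the numbers in it are made up for illustration; it just checks that the adjusted estimator stays centered on the true lift while its spread shrinks.

```python
import numpy as np

# Minimal sketch of the height example (illustrative numbers only).
# Subtracting theta * (pre - mean(pre)) leaves the estimated lift unbiased
# but makes it much less noisy.
rng = np.random.default_rng(0)
n_students, true_lift, n_sims = 2_000, 0.5, 1_000

naive, cuped = [], []
for _ in range(n_sims):
    pre = rng.normal(60.0, 4.0, size=n_students)    # height before the experiment (inches)
    pill = rng.integers(0, 2, size=n_students)      # 1 = magic pill, 0 = placebo
    post = 1.1 * pre + rng.normal(0.0, 2.0, size=n_students) + true_lift * pill

    # Naive estimator: simple difference in post-experiment means.
    naive.append(post[pill == 1].mean() - post[pill == 0].mean())

    # CUPED: subtract theta * (pre - mean(pre)); theta from regressing post on pre.
    theta = np.cov(pre, post, ddof=1)[0, 1] / np.var(pre, ddof=1)
    adj = post - theta * (pre - pre.mean())
    cuped.append(adj[pill == 1].mean() - adj[pill == 0].mean())

print(f"naive: mean={np.mean(naive):.3f}, std={np.std(naive):.3f}")
print(f"cuped: mean={np.mean(cuped):.3f}, std={np.std(cuped):.3f}")
# Both means sit near the true lift of 0.5; the CUPED std is several times smaller.
```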
And that's actually one perspective we can take to generalize CUPED a little bit more. A lot of the time when people use CUPED, they default to metrics or numbers that they know are from before the beginning of the experiment and are therefore safe to adjust for. Part of the point of the poster I presented at MIT CODE is that, well, you really should also be using other numbers that you are sure are not affected by your experiment. Oh, I see. So not only the outcome variable of interest, but also taking advantage of, for example, gender or the school district, or whatever, right? Right, yeah, exactly. Unless your experiment does something to the school district, in which case maybe you don't want to use that. Yeah, I think the general assumption that needs to hold is that the experiment doesn't impact your input variables. Right, exactly. The things that you want to adjust for should be things that your experiment is not going to touch, and certainly those are things we should check for as well. When you design an experiment, these are also things that stakeholders, people who have insight into what your experiment is doing, should voice their opinion on. There are things that will get affected without you realizing it.

Oh, that's true. Yeah. Well, I get why you emphasize this: in your example it's so obvious, right? I don't have a time machine. But in reality, when you do this in practice, there are often unpredictable things you didn't consider that your experiment can actually impact. Yeah. For example, one silly way to think about this: suppose I run two different web pages and I do an experiment on one of them, but users actually show up to both. You may think, okay, great, I'll also adjust for people's consumption of the other webpage, because I'm running the experiment over here, so how is that going to affect the other website? Well, it can. If these two websites are on exactly the same topic, say they are two different YouTube channels, for example, maybe people's engagement with one leads them to be more engaged with the other channel, and that violates the assumption.
Okay. So now we are doing a regression analysis. Are there many ways of doing the prediction? Right. So the number 1.1 is going to be something you need to figure out through a regression. Generally, if you look at the literature, they call it beta, and we figure out what this beta is. One approach is to split this into two different prediction problems. You first look at, within your test group, how well your pre-experiment metric predicts your post-experiment metric, and you get a linear model out of that. Then you do the same thing for your control group: you figure out how well the pre-experiment outcomes predict the post-experiment outcome there. Now, after doing both of these, you have two linear models. And remember, one of the fundamental questions we're trying to answer here is, for each of your users, what is the effect on them, right? Of course, we have the problem that we only get to see each user under control or under test, but not both. But now you have a predictive model, so you can predict what happens in the other arm. For a test user, we know their test outcome, but we do not know their control outcome. Mm hmm. With our predictive model, the linear model coming out of the control group, we can predict what the control outcome would have been, and take the difference. That's basically your estimate of the treatment effect on this particular user. Now you repeat that for all of the users and take the average. That is how you adjust for these differences and make good use of these predictions.
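As a sketch rather than anything shown in the video, the "two predictive models" view above can be written out like this, using simulated data and plain least squares:

```python
import numpy as np

# Fit one linear model per arm, predict each user's missing counterfactual,
# and average the per-user differences (illustrative data only).
rng = np.random.default_rng(1)
n_users, true_lift = 5_000, 0.5
pre = rng.normal(60.0, 4.0, size=n_users)             # pre-experiment metric
test = rng.integers(0, 2, size=n_users)               # 1 = test, 0 = control
post = 1.1 * pre + rng.normal(0.0, 2.0, size=n_users) + true_lift * test

def fit_line(x, y):
    """Least-squares fit of y ~ a + b * x within one arm; returns (a, b)."""
    b, a = np.polyfit(x, y, deg=1)
    return a, b

a_t, b_t = fit_line(pre[test == 1], post[test == 1])  # model learned on the test arm
a_c, b_c = fit_line(pre[test == 0], post[test == 0])  # model learned on the control arm

# Test users: observed test outcome minus *predicted* control outcome.
# Control users: *predicted* test outcome minus observed control outcome.
effects_test = post[test == 1] - (a_c + b_c * pre[test == 1])
effects_ctrl = (a_t + b_t * pre[test == 0]) - post[test == 0]

lift_estimate = np.concatenate([effects_test, effects_ctrl]).mean()
print(f"estimated lift: {lift_estimate:.3f}")         # lands near the true lift of 0.5
```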
I'm trying to see the difference from diff-in-diff, because I did a lot of synthetic control work, which uses the control sample and figures out weights, most of the time using regression, to build a synthetic control for the treatment group, and then uses the evolution of that synthetic control as the predicted counterfactual for treatment. Is that the same? That is actually no different from this, I think. There is a more general theme of where these predictive models can come from. In the example I gave, we just use a very simple linear model and no external data: we only look at, say, the test group, build a predictive model from the test group, and do the same thing for the control group. But if you want to make this more general, there's nothing stopping you from, say, doing synthetic control, looking at other experiments, or using data that isn't even part of this experiment, just data from somewhere else, to come up with these predictive models.

I see. Yeah, but my opinion is that when you have the possibility of many models, you run back into researcher degrees of freedom: it's easy to manipulate your results that way. So if you are running your own experiment and you're looking for the truth, feel free to use whatever predictive model you see as the best fit. But if you are running an organization and people might want to game the experiments, then it's best to stick with the simplest form. Right, or at least have some discipline about how you do things. Yeah. So, when we say CUPED is unbiased, there is an asterisk there. The earlier examples, doing the diff-in-diff or using 1.1 times the heights before the experiment, assume that the model itself doesn't come from the data. But in reality, the data does get used in coming up with these predictive models, even just in coming up with the linear coefficients.
Which means we have kind of double-dipped the data: we use the data to come up with a model, use its predictions, and then use the data again to analyze the experiment. Which is a little bit iffy, right? How bad of a problem is that? So, it's actually not too big of a problem, unless you start using a lot of complicated models. And by complicated models I mean you throw 10 different things into this linear model. Or a non-linear model. Or non-linear models, right, yeah. All of those cause a little bit more trouble. So certainly, practicing a little bit of self-control over how many things you throw at it is important. But if your predictive model is not too tailored and you're not really overfitting, it generally shouldn't be a problem. So can I get your endorsement, as a math PhD who has published many papers on experimentation, that if I do CUPED with just a simple linear regression model, most of the time I should be fine? Right. Some cases where it may fail are when you have heavy tails: for example, running an experiment on income, not even log income, just straight-up income. Those things are really, really heavy-tailed, so you may run into some issues. But generally speaking, this is a technique that is guaranteed to be unbiased.

Nice. I think that is a great explanation of the methodology. Now I'll challenge you to do this. Suppose I'm a new data scientist who wants to do CUPED. I may or may not have a tool for it; I think at least in Deltoid and in Statsig, CUPED is built in, so there I don't even need to worry about it. But suppose I want to at least write down the formula and maybe try to run a simulation. Just the formula: which formulas do I need to remember, and how should I estimate the coefficients and do the estimation? Yeah, of course. So in practice, we don't actually want to go through the whole predictive exercise, because it's going to be a lot of work. There's an equivalent way to think about this.
I'm going to refer to Winston Lin's very famous paper on how to do this. Suppose we have a variable Y, which is the outcome we care about; a variable X, which is the thing we want to adjust for; and a variable, let's call it T, for treatment, which is a binary variable for whether a unit is in test or control. This is your end height, your beginning height, and your magic pill. Exactly. Or a placebo, or the pill. Right, magic or placebo, exactly. Then what you really want to do is take the following steps, and these are extremely concrete steps if you are an R or Python user; they are super straightforward in R or pandas. First, you center your Xs: subtract the mean of X from X. Then you run a regression of Y on the following things: of course there's going to be an intercept term, then you regress on the centered version of X, on T, and on the centered version of X times T. All right, so four terms in total. I know it's a little bit much, but if you refer to Winston Lin's paper, you'll be able to see why we want these four terms. After you regress on these four terms, all you need to do is look at the regression coefficient on T. That's going to be your effect-size estimate; that's going to be your lift estimate.
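Here is one way to write those steps down. This is my own illustration on simulated data, using the widely available statsmodels package rather than any particular platform's implementation:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# y = outcome, x = pre-experiment covariate, t = 1 for test, 0 for control
# (all simulated for illustration).
rng = np.random.default_rng(2)
n, true_lift = 10_000, 0.5
df = pd.DataFrame({"x": rng.normal(60.0, 4.0, size=n),
                   "t": rng.integers(0, 2, size=n)})
df["y"] = 1.1 * df["x"] + rng.normal(0.0, 2.0, size=n) + true_lift * df["t"]

# Step 1: center X.
df["x_c"] = df["x"] - df["x"].mean()

# Step 2: regress Y on an intercept, centered X, T, and centered X times T.
# The formula "x_c * t" expands to x_c + t + x_c:t, and the intercept is added
# automatically, so this is exactly the four-term regression described above.
fit = smf.ols("y ~ x_c * t", data=df).fit()

# Step 3: the coefficient on T is the effect-size (lift) estimate.
print(fit.params["t"], fit.bse["t"])   # estimate and its standard error
```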
Nice. All right, one more question, about applying CUPED to ratios, because a lot of experiments are on conversion rates. How do you do that? Yeah, so over there it's going to be a little bit tricky. The high-level idea is that you are taking a ratio, which means you have a numerator and a denominator, and a lot of the time each of these is itself a mean. In the case of a conversion rate, your denominator is going to be the average number of impressions and your numerator is going to be the average number of conversions. Both of these are means. So you may start thinking, okay, I want to estimate each of these means a little bit better, so maybe I'll just do regression adjustment on them individually. Well, that would be valid: you can do that, and it would be correct, but it would not be optimal. Reducing the variance in the denominator does not necessarily mean reducing the variance of the ratio as a whole; there might be some weird covariance structure there that causes that not to be the case. You want to think about the adjustment more holistically. Think of it as: I have a ratio, and I'm going to add or subtract a term. That term has to be somewhat predictive of this ratio, of course; yes, ideally correlated with it. And we also want to make sure that this term has mean zero; that was the argument we made earlier about why this is unbiased. Then there may be some coefficients in there that you can adjust, and you change those coefficients to minimize your overall variance. Over here I'm referring a little bit to Alex Deng's paper on this: he had a very nice framing for why regression adjustment works. It kind of comes down to: you're still using your original estimator, you're just adding or subtracting a term that is mean zero and very correlated with your estimator, so we take that form as the way to actually figure out the covariance. I think that's going to require some deeper thinking in terms of the delta method and such. So the term that you subtract, is that just an error term with mean zero? Yeah, it would be the error term.
The term that you would subtract would be something that looks like X minus the mean of X, which is going to have mean zero. And if you carefully choose the beta that goes in front of this term, you are able to minimize the overall variance. So you choose the beta to minimize the overall variance, and in that minimization problem you only have beta to solve for, so you can solve for beta. Right. Okay. All right, so subtract the error term and choose the beta to minimize the overall variance.
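In symbols (my notation, not the video's): write $\hat\Delta$ for the unadjusted test-minus-control lift and $\hat\Delta_X$ for the test-minus-control difference in the mean-zero covariate term. The quantity being minimized is

$$\operatorname{Var}\!\big(\hat\Delta - \beta\,\hat\Delta_X\big) \;=\; \operatorname{Var}(\hat\Delta) \;-\; 2\beta\,\operatorname{Cov}(\hat\Delta,\hat\Delta_X) \;+\; \beta^{2}\,\operatorname{Var}(\hat\Delta_X),$$

a parabola in $\beta$ with its minimum at $\beta^{*} = \operatorname{Cov}(\hat\Delta,\hat\Delta_X)\,/\,\operatorname{Var}(\hat\Delta_X)$.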
We want to minimize the variance of this new estimator with the added-on term. We first need to figure out the variance of the ratio, and that's going to require us to do a bit more math in terms of the delta method. And then we also need to figure out the covariance of this estimator with X minus mean of X, so we need to do a little bit more math over there. Yeah. But fundamentally, you can always fit your denominator, your numerator, and your X-minus-mean-of-X term into a three-by-three covariance matrix, which you get from the delta method, and then minimize from there.
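The following is a rough NumPy sketch of that recipe as I understand it from the discussion; it is not the Statsig or Meta implementation, and the fake data at the bottom is purely illustrative. The `grad @ sigma @ grad.T` step is the handful of matrix multiplications mentioned below.

```python
import numpy as np

# Per user we observe conversions, impressions, and a pre-experiment covariate x.
# The metric is sum(conversions) / sum(impressions) = mean(conv) / mean(imp).

def ratio_stats(conv, imp, x):
    """Delta-method variance of the ratio and its covariance with mean(x)."""
    n = len(conv)
    num_bar, den_bar, x_bar = conv.mean(), imp.mean(), x.mean()
    ratio = num_bar / den_bar
    sigma = np.cov(np.vstack([conv, imp, x]))      # the three-by-three covariance matrix
    # Gradient of (num_bar / den_bar, x_bar) with respect to (num_bar, den_bar, x_bar).
    grad = np.array([[1.0 / den_bar, -num_bar / den_bar**2, 0.0],
                     [0.0,            0.0,                  1.0]])
    cov = grad @ sigma @ grad.T / n                # [[Var(ratio), Cov(ratio, x_bar)],
                                                   #  [Cov(ratio, x_bar), Var(x_bar)]]
    return ratio, x_bar, cov

def cuped_ratio_lift(test_arm, control_arm):
    """Adjusted lift: (ratio_t - ratio_c) - beta * (x_bar_t - x_bar_c)."""
    r_t, xb_t, cov_t = ratio_stats(*test_arm)
    r_c, xb_c, cov_c = ratio_stats(*control_arm)
    var_ratio = cov_t[0, 0] + cov_c[0, 0]          # arms are independent, so variances add
    cov_rx = cov_t[0, 1] + cov_c[0, 1]
    var_x = cov_t[1, 1] + cov_c[1, 1]
    beta = cov_rx / var_x                          # the variance-minimizing coefficient
    lift = (r_t - r_c) - beta * (xb_t - xb_c)
    se = np.sqrt(var_ratio - 2.0 * beta * cov_rx + beta**2 * var_x)
    return lift, se

# Purely illustrative fake data: users with more pre-period activity convert more.
rng = np.random.default_rng(3)
def fake_arm(n, lift=0.0):
    x = rng.poisson(5, size=n).astype(float)
    imp = rng.poisson(10, size=n) + 1
    p = np.clip(0.05 + 0.02 * x + lift, 0.0, 1.0)
    conv = rng.binomial(imp, p).astype(float)
    return conv, imp.astype(float), x

print(cuped_ratio_lift(fake_arm(50_000, lift=0.02), fake_arm(50_000)))
```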
Okay. So we should probably link the papers on the video so people know what we are referring to. Last question: is there a package that solves this? I don't think anyone has really written a package; it comes down to using NumPy to do, I want to say, three or four matrix multiplications. Do you plan to write a package? Probably not, but it's available on Statsig, and you can check our methods to make sure they're scientific. So yeah, if there's anything that's publicly available, I would be glad to comment on it. Actually, in our warehouse-native version we make our SQL transparent to our customers, so if you get a demo of warehouse native, you can check exactly how we do the math. Finally, I guess if you are already using CUPED, some of the more advanced stuff we have mentioned, like predictive modeling, would certainly be helpful there. My team at Meta actually has a paper out on this, I want to say from two years ago, on how to do general predictive modeling to enhance the variance reduction technique that is CUPED. But at the same time, there are going to be some boundaries to how far you can push this. We also have a recent poster at MIT CODE on where we think those boundaries are: there are limits on how much variance you can keep reducing by doing better and better predictive modeling.

It's not just about the techniques being fancy; it is about the techniques being applicable and trustworthy. Trustworthiness is the most important thing in experimentation: if you use experimentation to manipulate results, you lose the whole point of doing it. So it's better to know the boundaries well. I guess that would be my advice. Yeah, and certainly it's a great solution, but it can only do so much. There are not a lot of techniques like this out there. Yeah. So if you have it, great; you probably won't be able to do much more than that. If you don't have it, it's certainly something worth trying. Nice. Thank you. Thank you. Hope it's helpful. See you next time.

Comments


@user-oq8ny6gi8z

Thank you once again for the summary and the interview with the gurus.