
Building a Good Model

The Zen of Recast Model-Building

If you are using Recast as an operator, it means that you want to build media mix models on behalf of your clients.

If you’ve built media mix models using other software packages (both closed- and open-source), you have probably seen all of the ways media mix models can go wrong. The major issues that most agencies or consultants face when delivering media mix models for their clients look something like this:

  1. The first media mix model is delivered with great fanfare and the results are shared out

    1. But some of the results don’t look right, and so the agency starts running multiple versions of the MMM to try to get to results that “look right”

    2. But now the results are changing – was the first model right or was the new model right? How does the client know they can trust the results?

  2. Eventually a “final” model is delivered and now the client says: okay, great! Now it’s been 6 weeks since we collected the data for this initial model build. Let’s refresh the model with the most up-to-date data!

    1. Oops, if we use the same model, now all of the results have changed! 

    2. Do we … go back to the drawing board and try different model configurations again? Do we … “freeze” the coefficients to force the results to match?

    3. Or do we show the raw results to the client and have them lose trust in the model?

  3. Now the client wants to really validate the model results, so they run an incrementality test and want to compare the results. The results don’t match

    1. Does it mean that the MMM is wrong? Or is the experiment wrong? How do we make a decision?

  4. The MMM says that some channel is really out-performing. You recommend shifting budget into that channel. But now sales go down, not up. 

    1. The client is pissed. You are probably getting fired.

    2. In the best case scenario you don’t get fired, but now the client has tons of doubts about the MMM and will absolutely not be using it for budgeting going forward.

Recast is designed not to make building MMMs easy, but to help you avoid the problems modelers face when building and actually using MMMs in practice. This means that in many ways the Recast platform is more difficult to use than other platforms, since there are more checks to run and more “validation” steps required to get to a good model. But all of those checks are there for a reason: to help you avoid the most common pitfalls of MMM and incrementality engagements.

To summarize, we have built the Recast platform to solve the following problem that model-builders and marketing analytics agencies have traditionally faced when building MMMs:

  1. The model’s estimates are wrong, and so

    1. The results do not line up with deliberate experiments

    2. The models can’t be used for forecasting or planning

  2. Models are not robust, so small changes to the data or assumptions yield different results

  3. There are no objective measures of model quality, so analysts iterate endlessly without a “north star” for quality

Given all of this, what are the best practices for model-building using the Recast platform?

Recast Model Quality Checks

You want to build a model that passes all three core Recast modeling checks. Those checks are:

  1. Parameter recovery check

  2. Out of sample forecast accuracy check

  3. Model stability check

While there are no guarantees in media mix modeling, we’ve found that a model that passes all three checks will usually give good results that line up with experimental evidence and generate marketing budgets that actually generate positive returns for the client.

Unfortunately, despite our best intentions, a model practically never passes all three checks on the first try. That means practical model building is an iterative process: you set a model configuration and then test it by running the checks above, generally in the order listed: first the parameter recovery check, then the out of sample forecast accuracy check, and finally the model stability check.

If you make changes to the model configuration, you generally want to go back and run all of the checks from the beginning, because any change you make could improve one of the checks at the cost of the others.

This is important enough to repeat:

If you make changes to the model configuration, you generally want to go back and run all of the checks from the beginning, because any change you make could improve one of the checks at the cost of the others.

So what should you be looking for when running each of the checks? In the remainder of this document we will go through each check in detail, highlight the most important numbers and graphs to look at, and cover the most common fixes we’ve identified for certain failures.

Parameter Recovery Check

Interpreting the Parameter Recovery Check Dashboard

The parameter recovery check sometimes seems like the most obscure model check since it is run using simulated data, but it’s actually the most important check for identifying core issues in the model and getting this check right will make all of the other checks much smoother. 

For more details on the theoretical underpinnings of the parameter recovery check, review the docs here, but the basic flow is:

  1. Run a prior-only version of the model. In Bayesian statistics, the model is prior x likelihood; a prior-only run ignores the data (the likelihood) and just obtains results from the prior.

  2. Select a single draw (a random sample) from the prior and use it as ground truth.

  3. Fit the full model (prior x likelihood) on simulated data, where the predictor data is real, but the dependent variable was simulated from the prior draw we chose.

  4. Compare the simulated parameters to the estimated posterior parameters to see if our model is doing well at estimating the true values.
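To make this flow concrete, here is a minimal, self-contained sketch of a parameter recovery check on a toy one-channel model. The linear model form, the priors, the noise level, and the grid-based “fit” are illustrative assumptions for this example only, not Recast’s actual specification.

```python
# Toy parameter recovery check: simulate from the prior, refit, compare.
import numpy as np

rng = np.random.default_rng(0)
n_days = 200
spend = rng.gamma(shape=2.0, scale=500.0, size=n_days)  # stand-in for real spend data

# 1. "Prior-only" run: draw parameters from the priors, ignoring any observed KPI.
prior_roi = rng.lognormal(mean=0.0, sigma=0.5, size=4000)
prior_intercept = rng.normal(loc=1000.0, scale=200.0, size=4000)

# 2. Select a single prior draw and treat it as ground truth.
true_roi, true_intercept = prior_roi[0], prior_intercept[0]

# 3. Simulate the dependent variable from that draw, keeping the (real) spend data.
noise_sd = 500.0
kpi_sim = true_intercept + true_roi * spend + rng.normal(0.0, noise_sd, n_days)

# 4. Fit the full model to the simulated KPI (a crude grid approximation of the
#    posterior here) and compare prior, posterior, and truth.
roi_grid = np.linspace(0.1, 5.0, 150)
intercept_grid = np.linspace(200.0, 1800.0, 150)
R, I = np.meshgrid(roi_grid, intercept_grid)
resid = kpi_sim - (I[..., None] + R[..., None] * spend)            # (grid, grid, days)
log_post = -0.5 * (resid ** 2).sum(axis=-1) / noise_sd ** 2        # Gaussian log-likelihood
log_post += -np.log(R) - 0.5 * (np.log(R) / 0.5) ** 2              # log-normal ROI prior
log_post += -0.5 * ((I - 1000.0) / 200.0) ** 2                     # normal intercept prior
post = np.exp(log_post - log_post.max())
post /= post.sum()

roi_posterior = post.sum(axis=0)                                   # marginalize the intercept
post_lo, post_hi = np.interp([0.025, 0.975], np.cumsum(roi_posterior), roi_grid)
prior_lo, prior_hi = np.quantile(prior_roi, [0.025, 0.975])
print(f"true ROI: {true_roi:.2f}")
print(f"prior 95% interval:     [{prior_lo:.2f}, {prior_hi:.2f}]")
print(f"posterior 95% interval: [{post_lo:.2f}, {post_hi:.2f}]")
# Good recovery: the posterior interval is much narrower than the prior's
# and still contains the true ROI.
```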

The parameter recovery dashboard helps us know how well we're estimating the true parameters when the true parameters are known. Most graphs will compare three things:

  • The prior (quantiles of all draws from the prior only run)

  • The posterior (the quantiles of all draws from the run with the simulated dependent variable)

  • The truth (the value of the parameter that was used to produce the simulated dependent variable)

What we want to see is that the posterior values are narrower than the prior (indicating the model learned something) and that they include the truth (indicating the model learned the right thing).

At the top we report CRPS statistics. The CRPS can be thought of as a Bayesian version of MAPE that takes into account the uncertainty distribution in addition to the point estimate (if two predictions have the same point estimate, the one with less uncertainty will have lower CRPS). The CRPS ratios reported here are the ratio of the CRPS scores on the ROI recovery from the posterior divided by the CRPS scores from the prior. Lower scores indicate better parameter recovery, and any score over one indicates that the model is fatally flawed as the posteriors are worse than the prior.
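As a concrete illustration, here is how a CRPS ratio like this can be computed with the standard sample-based CRPS estimator. The draws and the “true” ROI below are synthetic stand-ins; Recast computes this for you on the dashboard.

```python
# Sample-based CRPS: E|X - y| - 0.5 * E|X - X'| for draws X and observed value y.
import numpy as np

def crps(draws: np.ndarray, truth: float) -> float:
    draws = np.asarray(draws, dtype=float)
    term1 = np.mean(np.abs(draws - truth))
    term2 = 0.5 * np.mean(np.abs(draws[:, None] - draws[None, :]))
    return term1 - term2

rng = np.random.default_rng(1)
true_roi = 1.8
prior_draws = rng.lognormal(mean=0.0, sigma=0.5, size=2000)      # wide prior
posterior_draws = rng.normal(loc=1.75, scale=0.15, size=2000)    # narrower, near the truth

ratio = crps(posterior_draws, true_roi) / crps(prior_draws, true_roi)
print(f"CRPS ratio: {ratio:.2f}")  # well below 1: the posterior beats the prior
```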

Additionally, we’ve added a “classification score” that indicates how good the parameter recovery is compared to what we normally see.

Right below the summary we show the results of a “prior predictive check”. The prior predictive check is a standard part of a Bayesian model development workflow and is best thought of as a simple “sense check” to make sure that the priors are compatible with the data in the most basic sense. Here we can see that the black line (actuals) falls well within the range of values the priors consider plausible (blue shaded region), so we say that this model configuration passes the prior predictive check.

The next section shows the prior vs posterior checks for all of the most important parameters in the model. Here we can see the simulated “true” ROI in black, the prior in blue, and the posterior estimates from the model in red. What we want to see is that the black line consistently falls within the red posterior estimate.

Of course, we expect the model to recover the ROI parameters best for the channels that have the most spend and the most variation in spend, so we expect parameter recovery to get worse as we look at progressively smaller channels.

Here’s an example of poor parameter recovery for a small channel. Comfortingly, the true value does fall within the prior. Not so comfortingly, the posterior is basically identical to the prior.

The dashboard repeats all of these same graphs for all of the most important parameters in the model including:

  • Intercept

  • ROIs

  • Time Shifts

  • Spend-response (aka “saturation”) curves

  • Saturation parameters

  • Lower funnel betas/intercept

  • Context variable effects

Improving Parameter Recovery Checks

Poor parameter recovery may indicate that the model is fundamentally unidentifiable or that you simply do not have enough data or variation in the data to answer all of the questions that you might like to. 

These are some of the options we consider when trying to improve parameter recovery:

1. Review highly correlated channels

Highly correlated channels present issues for parameter recovery because the model will be unable to differentiate between the effects of the two highly correlated channels. For example, if two channels like TV and radio are highly correlated, the model might be able to estimate their joint effect with a lot of precision but not be able to know which of the two is better or worse.

In cases of high correlation, combining the channels will improve the parameter recovery score but that often isn’t the best solution. Instead you might:

  1. Check to see if the model can recover the joint effect, but just not the individual effects, and that might be fine for many practical business purposes

  2. See if you can convince the business to more intentionally vary spend in the future – at that point the model will likely become identifiable and you’ll be able to differentiate between those channels
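A quick diagnostic you can run on your own spend data is to flag channel pairs whose spend series are highly correlated and therefore hard for any MMM to separate. This is a rough sketch with made-up data and an assumed 0.8 threshold, not a Recast feature:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 365
tv = rng.gamma(2.0, 400.0, n)
spend = pd.DataFrame({
    "tv": tv,
    "radio": 0.5 * tv + rng.normal(0.0, 30.0, n),  # always ramped alongside TV
    "search": rng.gamma(2.0, 300.0, n),
})

corr = spend.corr()
threshold = 0.8
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if abs(corr.loc[a, b]) >= threshold:
            print(f"{a} vs {b}: correlation {corr.loc[a, b]:.2f} -- "
                  "consider linking, combining, or varying spend more deliberately")
```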

2. Combine smaller channels

Small channels making up <5% of the overall marketing budget often present difficulties in recovery because they simply don’t have a large enough impact on the KPI for the model to consistently identify their effects.

Combining smaller channels into a single larger channel can improve recovery, though it comes with the cost of reduced interpretability and can potentially cause instability if, for example, the combined channels actually have very different saturation points or time-shifts.
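If you do decide to combine, the idea is simple to sketch as a pre-processing step (made-up data and an assumed 5% share threshold):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
spend = pd.DataFrame({
    "facebook": rng.gamma(2.0, 800.0, 365),
    "tv": rng.gamma(2.0, 600.0, 365),
    "podcasts": rng.gamma(2.0, 20.0, 365),
    "out_of_home": rng.gamma(2.0, 15.0, 365),
})

share = spend.sum() / spend.sum().sum()
small = share[share < 0.05].index            # channels under 5% of total spend
if len(small) > 1:
    spend["combined_small_channels"] = spend[small].sum(axis=1)
    spend = spend.drop(columns=small)
print(spend.columns.tolist())
```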

3. Assess placement of saturation spikes

Saturation spikes reduce the channel saturation in the region of the spike, making spend more effective. Figuring out where to place the spikes to get good recovery may take some trial and error.

Spike placement can be a real art, but we have some great documentation on placing spikes available at this link.

4. Make small channels non-moving

Reducing the number of parameters in the model can improve recovery. If you have many channels with small spend, you won't have enough data to inform how the beta changes over time. By setting them to non-moving, you estimate a single beta for the entire time series (ROI will still change based on saturation levels and contextual variables), which cuts down on the number of parameters the model has to estimate.

This can also be a good strategy for short-lived channels that are only “on” for a few months.

Important note: to some extent this is “cheating” by simply making the parameter recovery exercise easier. Non-moving ROIs are easier to recover, so you would expect to improve the recovery of any individual channel just by making it non-moving.

5. Adjust the spend levels for particular channels

Spend levels control how fast a channel will saturate: lower spend levels mean quicker saturation, while higher spend levels result in less saturation. Adjusting these values may make the model easier to identify.
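To illustrate the intuition (this is a generic Hill-style curve, not Recast’s actual functional form), treat the “spend level” as the point where the channel is half-saturated:

```python
import numpy as np

def saturated_effect(spend: np.ndarray, spend_level: float, max_effect: float = 1.0) -> np.ndarray:
    """Effect rises with spend and flattens out around `spend_level`."""
    return max_effect * spend / (spend + spend_level)

spend = np.linspace(0, 10_000, 5)
for level in (1_000, 5_000):
    print(f"spend level {level}:", np.round(saturated_effect(spend, level), 2))
# The low spend level is already near its ceiling at moderate spend (quick saturation),
# while the high spend level still has plenty of headroom (less saturation).
```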

6. Increase the number of days of data

If you have less than 2 years of data, getting more data often improves parameter recovery.

7. Link channels by betas, saturation, or timeshift

Linking is a strategy that is similar to consolidating channels but less extreme. Instead of merging the channels, the model links their estimates hierarchically so that they share information with each other. The betas, the saturation, and the timeshifts can all be linked.
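Conceptually, linking is a form of partial pooling: each channel’s estimate is shrunk toward a shared group-level value, with noisier channels shrunk harder. The sketch below applies the classic shrinkage formula to toy per-channel estimates; it illustrates the idea and is not Recast’s implementation:

```python
import numpy as np

channel_betas = np.array([2.4, 0.3, 1.1, 1.5])  # noisy per-channel estimates
channel_se = np.array([0.9, 0.8, 0.3, 0.4])     # their standard errors
tau = 0.5                                       # assumed between-channel spread

# Precision-weighted group mean shared by the linked channels.
group_mean = np.average(channel_betas, weights=1.0 / channel_se**2)

# Each estimate is pulled toward the group mean; noisier channels are pulled harder.
weight = tau**2 / (tau**2 + channel_se**2)
pooled = weight * channel_betas + (1.0 - weight) * group_mean
print(np.round(pooled, 2))
```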

Out of Sample Forecast Accuracy Check

The out of sample forecast accuracy check (sometimes just called the “backwards holdout” check) is the most intuitive. We simply want to check how well the model does at predicting the KPI on time periods the model hasn’t seen before. This emulates the scenario you’ll be in when you share your model with a client and they immediately ask you “what does this mean about how many sales we’ll have next quarter?”. It is important that you feel confident when you answer this question!

When running these accuracy checks, Recast handles a lot of the tricky aspects for you:

  1. It is a “true” holdout: we ensure that the model can’t see any of the future data during the training process (unlike some other open-source tools that allow the model to “cheat” when estimating the seasonality component of the model)

  2. When running the forecast, Recast also doesn’t let the model see “lower funnel” channels like branded search or affiliates, which would make the forecasting task “too easy” because of their special connection to the KPI.

Both of these features make the forecasting task more difficult, but because of that they more realistically mimic the true forecasting problem the business will face once the model results are actually being used for planning purposes.
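Recast handles the mechanics for you, but the principle of a “true” holdout is easy to sketch: cut the data at a date, train only on the earlier rows, and withhold the lower-funnel series from the forecast inputs. The column names and layout below are assumptions for illustration:

```python
import pandas as pd

def make_holdout(df: pd.DataFrame, cutoff: str, lower_funnel: list[str]):
    train = df[df["date"] < cutoff]                                # the model only sees this
    future = df[df["date"] >= cutoff].drop(columns=lower_funnel)   # no lower-funnel "hints"
    return train, future

dates = pd.date_range("2023-01-01", periods=500, freq="D")
df = pd.DataFrame({"date": dates, "kpi": 1.0, "facebook": 2.0, "branded_search": 3.0})
train, future = make_holdout(df, cutoff="2024-03-01", lower_funnel=["branded_search"])
print(len(train), len(future), list(future.columns))
```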

When you launch an out of sample forecast accuracy run, you’ll have the option to set the length of the holdouts. We recommend focusing most on 60-, 120-, and 180-day holdouts. The 180-day holdouts are most useful as checks on far-off yearly seasonal trends.

Interpreting the Out of Sample Forecast Accuracy Check Dashboard

The statistics at the top quantify the quality of the holdouts. A cumulative result within the 25th-75th percentile range of the forecast is considered great, while a result within the 2.5th-97.5th percentile range is considered good.
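The underlying check is straightforward to sketch: find the percentile of the actual cumulative KPI within the forecast draws. The draws below are synthetic stand-ins:

```python
import numpy as np

rng = np.random.default_rng(4)
cumulative_forecast_draws = rng.normal(loc=1_000_000, scale=50_000, size=4000)
cumulative_actual = 1_030_000

pct = (cumulative_forecast_draws < cumulative_actual).mean() * 100
if 25 <= pct <= 75:
    verdict = "great"
elif 2.5 <= pct <= 97.5:
    verdict = "good"
else:
    verdict = "miss"
print(f"actual lands at the {pct:.0f}th percentile of the forecast -> {verdict}")
```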

When it comes to determining whether a model “passes” or not, we mostly focus on the results for the cumulative forecast rather than the daily results, since the cumulative number is what the client will care about when it comes to hitting their goals (e.g., their Q1 revenue goal).

However, the daily holdouts can be very useful for diagnosing why the cumulative checks might be missing. For example:

They ramped Facebook here; the prediction missed high because the dependent variable didn’t increase. Facebook is either overcredited or not set to saturate fast enough.

or

The holdout misses here; they ramped Facebook, which probably explains the increase in the dependent variable, so our model is probably undercrediting Facebook.

CRPS scores can be especially useful when comparing across models, as a lower holdout CRPS generally indicates a better model.

The daily holdouts can be useful for identifying specific “misses” in the model that can indicate either missing or erroneous data or a misplaced or missing “spike”.

The daily graph shows the actual KPI with our confidence interval (IQR in darker color, 95% in lighter). This can be useful in assessing fit on individual days, particularly around big spikes.

The cumulative forecast shows the cumulative sum of the dependent variable and the model’s predictions. These are the most important graphs to focus on!

Stability Check

Most brands using Recast re-estimate the model with fresh data at least once per week. With Recast, that means a full re-train of the model and estimating every single parameter in the model “from scratch” without any knowledge of prior model estimates.

If every week you refresh the model and the core results change, that would be very bad. Not only would that indicate that the model is not robust and is likely wrong, but it would also cause your clients to seriously mistrust the MMM results (rightly, in our opinion!).

One pathology that is indicative of underlying model issues is parameter instability. If we only slightly change the underlying data and the estimated parameters swing wildly, it probably means that there is some sort of gross model misspecification that is leading the model to bounce between different possible (but inconsistent) parameter sets. We want to identify this before a customer makes any decisions off of the model’s results.

So, to avoid that scenario, you can run the Recast model stability checks to confirm that the model results are stable when simulating a week-over-week model update cadence.

When you launch a model stability check, Recast will run the model “going back in time” as if it were 7, 14, 21, or 28 days ago (or longer!) so that you can evaluate how stable the model’s parameter estimates are over that period.
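Conceptually, the loop just re-fits the model as of a series of past cutoff dates and then compares the resulting estimates. The dates here are purely illustrative:

```python
from datetime import date, timedelta

today = date(2024, 6, 1)
cutoffs = [today - timedelta(days=7 * k) for k in range(5)]  # 0, 7, 14, 21, 28 days back
for cutoff in cutoffs:
    print(f"re-fit the model using only data observed up to {cutoff}")
```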

Interpreting the Stability Check Dashboard

When you run a stability check, the Recast platform will produce a lot of dashboards. It will produce a “comparison” dashboard for every week of the stability run, comparing that model’s estimates with the week prior, and one overall “summary” dashboard that summarizes the total amount of movement in the parameter estimates over the full range of stability checks.

The individual run comparison dashboards are covered elsewhere so in this section we will focus on how to interpret the summary dashboard.

At the top of the dashboard we see some overall statistics summarizing the stability over the whole loop and highlighting the stability for each week-over-week check. 

The stability percentages are broken down to show the stability for the main “families” of parameters including marketing performance (saturated_shifted_spend), the intercept, spikes, and lower funnel effects.

The percentages represent the overlap in the posterior distributions of the parameter estimates. We generally consider anything over 80% “passing”: that level of movement is generally acceptable to clients and expected, since marketing channel performance actually does change over time.
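One simple way to compute an overlap percentage between two runs’ posterior draws is the overlap coefficient of their histograms; Recast’s exact metric may differ, and the draws below are synthetic:

```python
import numpy as np

def overlap_pct(draws_a: np.ndarray, draws_b: np.ndarray, bins: int = 100) -> float:
    """Percentage overlap between two sets of draws (overlap coefficient of histograms)."""
    lo = min(draws_a.min(), draws_b.min())
    hi = max(draws_a.max(), draws_b.max())
    hist_a, edges = np.histogram(draws_a, bins=bins, range=(lo, hi), density=True)
    hist_b, _ = np.histogram(draws_b, bins=bins, range=(lo, hi), density=True)
    width = edges[1] - edges[0]
    return float(np.minimum(hist_a, hist_b).sum() * width * 100)

rng = np.random.default_rng(5)
last_week = rng.normal(1.50, 0.20, 4000)   # a channel's ROI draws, last week's model
this_week = rng.normal(1.55, 0.20, 4000)   # the same channel, this week's model
print(f"overlap: {overlap_pct(last_week, this_week):.0f}%")  # above 80% passes the rule of thumb
```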

Important note: 80% is a good rule of thumb, but there are situations in which these stability metrics look poor while the underlying stability is not that concerning. For example, the model can be very certain about something (usually the intercept, because of a period of zero spend across all channels) and then revise the estimate slightly; that yields essentially no interval overlap and poor stability metrics, even though the practical stability may actually be fine.

Next the dashboard shows channels that show up as “red flags” in any of the runs that make up the stability loop. The identification of red flag channels is generally overly conservative, so this doesn’t necessarily mean that something is wrong, but it is usually worth taking a look to verify that before moving on.

Here we can see that there are a few channels that had less than 85% in-period-effect overlap across a few different runs. Facebook prospecting is the only channel here that’s really potentially concerning, since it makes up a large proportion of total spend (over 60%) and shows estimates that move around across multiple weeks in the loop.

Next we show the 30-day holdouts for each run in the stability loop. These are useful for identifying if the model has any big misses for any of these weeks. In general, when the model misses a forecast, that will cause at least some revisions (since the model was “wrong” before, that means something needs to change!). These charts can help you pinpoint if there is potentially a missing spike that is leading to revisions.

Next the model shows breakdowns of the stability for “in period effect” (the total effect of marketing) as well as the intercept (the “baseline” or “organic” effect). Subsequently, similar plots are shown for all “contextual” variables (if there are any).

At the very end, the dashboard will show week-over-week change plots for any channels that are “hyperactive” (i.e., moving around in multiple weeks). 

Here we can see the plots for Facebook prospecting:

Visually, it doesn’t look like there's much need for alarm here, as the differences in overlap are quite small. However, sometimes we see charts where there is substantial disagreement across runs, which would be cause for alarm and a deeper investigation.

Improving Out of Sample Forecast Accuracy and Stability Checks

Problems with stability and out of sample forecast accuracy are generally linked. The reason is pretty straightforward: if the model misses a forecast, that is strong evidence that something in the model was wrong, and so the model revises! So often, but not always, fixing stability also fixes forecast accuracy (and vice versa).

Review spikes

Missing spikes can cause the error term in the model to blow up, which has all sorts of knock-on effects. It's very important to check the placement and the fit of spikes!

There are a few main points to keep in mind:

  • Spikes are greedy

  • Spikes should be placed on the peak of the spike in the data

  • Spikes should only be placed when a visible spike occurs in the data

Spike placement can be a real art, but we have some great documentation on placing spikes available at this link.

Review the intercept priors and ROI priors

Check to see if the priors are constraining the model too much – if the posterior estimates, especially for the intercept, are consistently “pushing against” the bounds of the prior, this could indicate that the prior is too tight and conflicts with the real world data. In such cases, loosening the prior can help the model better fit the observed data.

If your priors are very wide (i.e. uninformative), try running sensitivity checks with progressively more informative priors. Monitor how these changes affect model stability and holdout performance.

Important caveat: when running parameter recovery, if the simulated channel ROIs or intercept push against their prior bounds, this is less of a concern, especially if the model can still recover the true parameters well.

Check in-sample dependent variable fit

While in-sample fit is not a good measure for validating a model, it can help to identify areas where the model is missing. If there is one specific time period where the model is not fitting well, it could mean that you are missing data or important context during that period.

Adjust the spend levels

Spend levels determine the prior on the saturation point. Launch a run with spend levels doubled and another with them cut in half. Review the analysis dashboards to see if this helps the model “settle” into a consistent estimate of channel performance. 

But make sure the implied priors on the saturation make sense for the channel!

Check the lower-funnel fit

If the model has lower-funnel channels, then poor fit for the lower-funnel channels can throw off the whole model and cause grave stability issues.

This is a very important check that is very easy to forget since lower-funnel channels are often not the main focus of the model.

Check for multi-modality in channel performance

In the run analysis dashboard there are spaghetti plots for “red flag” channels that show all of the draws for the two different runs. If these plots show multi-modality, where there are basically two sets of draws, one estimating a high ROI and another estimating a low ROI, then tighter ROI priors or different spend level settings can often help shepherd the model into a single mode.

Check for channels “trading off”

Sometimes it’s the case that the instability in a model is totally driven by two channels “trading off” which channel is getting credit. In some weeks, channel A gets all the credit with a high ROI, in other weeks channel B gets all the credit with a high ROI and channel A looks terrible.

This is often due to unidentifiability in the model, and you can address the issue with more informative priors, with additional deliberate spend variation, or by including a structured incrementality experiment.
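One quick way to spot this pattern within a single run is to look at the correlation between the two channels’ ROI draws: a strongly negative correlation means the model has only pinned down their combined credit. The draws below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(6)
joint_effect = rng.normal(3.0, 0.1, 4000)  # combined credit is well identified
split = rng.uniform(0.2, 0.8, 4000)        # ...but how it is split is not
roi_tv = joint_effect * split
roi_radio = joint_effect * (1.0 - split)

corr = np.corrcoef(roi_tv, roi_radio)[0, 1]
print(f"posterior correlation between channel ROIs: {corr:.2f}")
if corr < -0.7:
    print("likely trading off -- consider tighter priors, linking, or an incrementality test")
```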

More generally, there's a broad principle of "conservation of credit" in the model. If the intercept is unstable, the credit is also going to show up somewhere else, and if possible, modelers should track down all implicated model components. 

(This can be spikes too! It's often spikes!)

Channels trading off against each other is the most common case, but tradeoffs between channels, spikes, and the intercept are common as well, so you must be vigilant.

Review contextual variables

Contextual variables are very “powerful” and can have wide-ranging impacts on the model. Check that they're configured properly and that the readouts make sense.

  • Don't include too many of these variables: with great power comes great responsibility.

  • Be careful with brand surveys and custom step-function context variables you throw together yourself, especially if you have multiple in the same model that look similar, since that can cause unidentifiability and stability issues.

  • When using contextual variables, think carefully about what each variable actually means and whether the resulting fit makes sense; most problems here can be avoided that way!

Check out more information about contextual variable configuration in the docs.

Check if instability is “reasonable”

Sometimes instability can be “reasonable” if there is good reason for the model to have “learned” a lot over the time period being examined. For example, if the brand is running a “go dark” quasi-experiment, suddenly turning off spend in a channel, we expect the model to learn a lot from observing that change in spend! So we would expect at least some amount of “instability” around the time of the experiment, representing how much the model has learned.

Keep in mind that a perfectly stable model can often be a bad sign since it indicates that the model isn’t learning anything from new data!

Check if forecast misses are “reasonable”

Sometimes it’s okay that a forecast misses if something happened that was truly unpredictable (at least from the model’s point of view). For example, if there was a truly unique exogenous event (like a hurricane or an earthquake) that impacted the business KPI, then we would expect the model to miss the forecast! In this case, it might be a good idea to check the forecast in other periods that don’t include this large exogenous event to check the forecast accuracy in those periods.

Similarly, if the business made some intentional change that had never been observed before (like a price change) then we would expect the model to miss the forecast. In this case we often want to check how quickly the model adapts to the forecast miss and new regime, and how good the forecasts are before and after the regime change.
