Measuring the impact of marketing
What happens when companies try to figure out what they're really getting from their digital marketing?
After two decades of search dominance, Google’s immense revenue streams no longer spark wonder. We forget to be awed by a company that rakes in over 250 billion dollars each year. Just as remarkably, these riches derive predominantly from advertising: Google is a conduit between people and the goods and services they ultimately purchase. The flip side of this abundance is that the rest of the world’s companies are buying those ads, spending small fortunes to vie for prominent visibility in front of Google’s users. And Google is only the first among many ad platforms.
Digital marketing is a key function at modern companies. It can be one of the largest costs, intended to drive similar or larger sums of revenue. Even if marketing seems remote to other employees, many companies could be accurately described as advertising organizations with a product arm attached. Measuring the effectiveness of advertising is a huge field, a constant endeavor both in-house at large companies and through a medley of third-party services. It’s a tricky problem, and it is fairly normal to err by an order of magnitude.
Framing returns
An advertiser is a business that buys ads on ad platforms. If your company sells services or products to end users, it is probably an advertiser. Thanks to tracking, advertisers know which customers came from which sources, how much revenue those customers generated, and how much the advertiser paid for each visitor.1 Advertisers care about the return on their spending. There is even an industry term for that: Return on Ad Spend (ROAS), which is advertising-attributed revenue divided by advertising spend.
Google was (and still is) the most prominent search engine, and Google AdWords (now Google Ads) is built on an auction system: advertisers bid for listings shown above the non-paid (“organic”) search results. Higher bids should win ad positions more often, which should lead to more clicks and ultimately more revenue for the advertiser, along with a higher cost paid out to the ad platform.2
Importantly, we care about causal, or incremental, revenue to the business. Google charges per click, so the value of advertising depends on the difference between incremental revenue and the costs charged for users who click on ads. The distinction between incremental and non-incremental revenue matters because some users click on ads out of convenience rather than discovery. In the counterfactual world without the ad, some of those users would still see the advertiser listed in the organic search results. It is common on Google Search to see a website listed in the first ad slot and also in the first organic slot.
Incremental revenue is how much more revenue the advertiser earns by running ad campaigns, which is not the same as how much revenue the advertiser earned from users who clicked their ads. It is not a truly known number. We either show ads or we do not, and we never observe both outcomes simultaneously. Usually ROAS refers to non-incremental calculations of revenue, and iROAS is a best estimate of incremental returns. A similar concept is lift, which is the causal percentage change in the metric of interest.
It might help to show all those definitions in one place.
Incremental revenue: (revenue in the presence of marketing) - (revenue that would have been earned without marketing)
ROAS: revenue / spend
iROAS: (incremental revenue) / spend
Lift: (incremental revenue) / (revenue without advertising)
Incrementality: (incremental revenue) / revenue
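To make the gap between these quantities concrete, here is a minimal Python sketch. All the numbers are made up for illustration; they are not from any real advertiser.

```python
# Toy numbers, purely illustrative -- not from any real advertiser.
spend = 1_000_000                 # advertising spend
revenue_with_ads = 5_000_000      # revenue observed while running ads
revenue_without_ads = 4_800_000   # counterfactual revenue (never directly observable)

incremental_revenue = revenue_with_ads - revenue_without_ads  # 200,000

roas = revenue_with_ads / spend                           # 5.0, looks great
iroas = incremental_revenue / spend                       # 0.2, loses money
lift = incremental_revenue / revenue_without_ads          # ~4.2% causal change
incrementality = incremental_revenue / revenue_with_ads   # 4% of revenue is causal

print(f"ROAS={roas:.1f}, iROAS={iroas:.1f}, "
      f"lift={lift:.1%}, incrementality={incrementality:.1%}")
```

In this made-up scenario the advertiser sees a ROAS of 5, which looks excellent, yet the campaign loses money once the counterfactual is accounted for.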
Incremental and non-incremental revenue are very different
In 2013 eBay released a bombshell of a paper, with the very non-bombshell title Consumer Heterogeneity and Paid Search Effectiveness: A Large-Scale Field Experiment [1].3
eBay ran experiments in 2012 where they turned off advertising, either entirely or in select geographies.
A picture is worth a thousand words, and this picture is worth a lot more than that.
These first experiments were for brand marketing, where eBay bids on search keywords containing its own name so that competitors don’t show up in the ad slots even when users are directly searching for eBay. The figures show how, when eBay turned off brand marketing on MSN (everywhere) and Google (in Europe), users simply clicked on the organic eBay listings instead. Total visitors from those platforms were approximately unchanged. eBay discovered that spending money on brand marketing merely shifted clicks from organic listings to paid listings, leaving eBay with the same number of users while making the search companies quite a bit richer.
More specifically, in the MSN (Microsoft’s search engine, since renamed Bing) experiment they saw MSN clicks drop 5.6%, but that was during a season of lower volume from other engines too. Using the other search engines as a control in a difference-in-differences regression, they measured MSN clicks as reduced by only 0.5%, despite so much of their prior incoming volume arriving from paid clicks. The difference-in-differences approach assumes that the relative gap between Google and MSN would have stayed constant (the parallel trends assumption), so we judge by whether MSN clicks decreased more than Google clicks. For the Google experiment there was no such control, since eBay didn’t advertise on other search engines in Europe, but even there clicks from Google dropped by only 3%.
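The difference-in-differences calculation itself is straightforward. Here is a minimal sketch using hypothetical click counts, not eBay’s actual data:

```python
import numpy as np

# Hypothetical weekly click totals -- not eBay's actual data.
# "Treated" is the engine where brand ads were paused (MSN in eBay's case);
# "control" is the set of engines where nothing changed.
treated_before, treated_after = 100_000, 94_400   # a raw -5.6% drop
control_before, control_after = 500_000, 474_500  # a seasonal -5.1% drop

# Difference-in-differences on log clicks: subtract the control group's
# seasonal trend from the treated group's change. This is only valid under
# the parallel trends assumption described above.
did = (np.log(treated_after) - np.log(treated_before)) - (
    np.log(control_after) - np.log(control_before)
)
print(f"Estimated causal effect of pausing ads on clicks: {did:.1%}")  # about -0.5%
```

The raw 5.6% drop shrinks to roughly 0.5% once the control group’s seasonal decline is subtracted out.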
eBay also ran an experiment on Google advertising in the US, using geographic targeting to turn off non-brand advertising for some geographies while leaving it on for others. They kept brand marketing enabled. In this experiment they estimated that all Google non-brand advertising increased sales by only 0.44%.
The scale of these results is striking. These had been massive advertising programs for eBay. It’s not that incremental revenue is a small adjustment to revenue; rather, the incremental portion of measured revenue might be only a tiny percentage.
Modeling causal revenue without randomization
eBay’s paper clearly didn’t put an end to the digital advertising industry. eBay was one of the largest retailers on the internet; maybe its brand recognition and large marketplace meant it didn’t need to advertise. The experiment might not generalize to smaller retailers or different situations.
Researchers hunt for methods to measure the impact of advertising without reducing spend. The first idea that might come to mind is a regression, with advertising spend as a feature to predict total revenue, using time or geography as the unit of observation. If advertising spend has no effect on total revenue, you might expect it to have a coefficient near 0.
A big problem with this approach is that spend is correlated with revenue even when it isn’t causal. We might advertise more to users who would spend anyways. The eBay study actually tested this regression (except with logarithms for both variables, for a percent change interpretation), saying “a simple OLS yields unrealistic returns suggesting that every 10 percent increase in spending raises revenues by 9 percent” [1].
If, for example, advertisers use market population size as a heuristic to guide advertising spend, and market size also creates revenue through more potential visitors, then spend and revenue could be highly correlated even though neither directly causes the other. This is an example of omitted variable bias, where some other factor generates both the spend and the revenue in a dynamic system.
We can extend the regression by including more features, reducing the chance of omitting causal ones. When eBay added geographic and time features to their model, the coefficient on spend dropped substantially. But it was still far higher than the estimates from an unbiased method (instrumental variables, IV) or from their randomized controlled trial (RCT).
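This omitted variable bias is easy to reproduce in simulation. Below is a sketch in which ad spend has zero causal effect on revenue, yet a naive log-log regression finds a large elasticity. Controlling for the confounder fixes it here only because, unlike in reality, the confounder is fully observed in the simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Simulated markets where spend has ZERO causal effect on revenue:
# market size drives both the spend heuristic and the revenue.
log_size = rng.normal(10.0, 1.0, n)
log_spend = 0.8 * log_size + rng.normal(0.0, 0.5, n)
log_revenue = 1.0 * log_size + rng.normal(0.0, 0.5, n)  # note: no spend term

# Naive log-log OLS of revenue on spend alone.
X_naive = np.column_stack([np.ones(n), log_spend])
naive = np.linalg.lstsq(X_naive, log_revenue, rcond=None)[0][1]

# Adding the confounder as a feature recovers a coefficient near zero.
X_full = np.column_stack([np.ones(n), log_spend, log_size])
controlled = np.linalg.lstsq(X_full, log_revenue, rcond=None)[0][1]

print(f"naive elasticity: {naive:.2f}")            # ~0.9, entirely spurious
print(f"controlled elasticity: {controlled:.2f}")  # ~0.0, the truth
```

The naive coefficient lands near 0.9, echoing the eBay paper’s “unrealistic” finding that a 10 percent increase in spending appears to raise revenues by 9 percent. The catch is that real confounders are rarely as observable as the log_size variable is in this toy setup.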
Other methods try matching treated users with untreated users who are the same in all other respects, so that the only difference between them is that one user saw an ad while the other didn’t. Finding exact matches over a wide variety of features can be hard. An extension of the exact matching concept is to match on the probability of treatment instead of on the specific features, which is known as propensity score matching.
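Here is a minimal sketch of propensity score matching on simulated data. It recovers the truth only because the single confounder is fully observed, which is precisely the assumption that fails in practice:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 20_000

# One observed covariate drives both ad exposure and baseline conversion.
x = rng.normal(size=n)
treated = rng.random(n) < 1.0 / (1.0 + np.exp(-1.5 * x))  # targeting favors high-x users
true_effect = 0.02
converted = rng.random(n) < np.clip(0.05 + 0.03 * x + true_effect * treated, 0.0, 1.0)

# Step 1: model the probability of treatment from observed covariates.
features = x.reshape(-1, 1)
ps = LogisticRegression().fit(features, treated).predict_proba(features)[:, 1]

# Step 2: pair each treated user with a control user of similar propensity
# (crude sorted-insertion matching; real implementations use calipers etc.).
t_idx, c_idx = np.flatnonzero(treated), np.flatnonzero(~treated)
order = np.argsort(ps[c_idx])
pos = np.searchsorted(ps[c_idx][order], ps[t_idx]).clip(0, len(c_idx) - 1)
matched_controls = c_idx[order][pos]

naive = converted[t_idx].mean() - converted[c_idx].mean()
matched = converted[t_idx].mean() - converted[matched_controls].mean()
print(f"naive diff: {naive:.3f}, matched estimate: {matched:.3f}, truth: {true_effect}")
```

The naive treated-versus-untreated comparison overstates the effect because targeting selected users who were more likely to convert anyway; matching removes only the portion of that selection which shows up in observed features.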
It is hard to make strong claims about how well these, and other, methods perform in general, unless you happen to work at an ad platform and can look across a variety of advertisers. One platform came along and did exactly that.
Insights from within a large ad platform
The reason advertisers have to experiment with variation in time or geography is that they are one step removed from the advertising treatment. Advertisers only get to see the users who actually arrive on their site; the advertising platform won’t let them directly control or observe what happens before, nor will it pass along user-level data on users who didn’t click the advertiser’s links. Randomizing on users is possible for an ad platform, but it isn’t possible for advertisers. Facebook, among others, eventually added some such functionality for user-level experiments.
Facebook published a study analyzing 15 such experiments run on their platform in 2015, covering a variety of industries and campaign sizes [2]. Notably, lift varies heavily between the campaigns. Looking at checkout (purchases), many of them have lift in the 1-2% range, while at the upper end four of the studies have lift over 25%, including the highest at 153.2%.4 These numbers are hard to compare, because some advertisers rely more on Facebook ads than others do. It is even possible to have infinite lift in the extreme scenario of no brand awareness, no user retention, and no other way of finding users.
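Once a platform has randomized users, the Intent to Treat lift calculation itself is simple; the hard part is running the experiment, not the math. A sketch with made-up conversion rates:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Simulated user-level RCT with made-up rates. Because assignment is random,
# a straight comparison of conversion rates is an unbiased lift estimate.
assigned_ads = rng.random(n) < 0.5
base_rate, ad_effect = 0.020, 0.004
converted = rng.random(n) < base_rate + ad_effect * assigned_ads

p_treatment = converted[assigned_ads].mean()
p_control = converted[~assigned_ads].mean()
itt_lift = (p_treatment - p_control) / p_control
print(f"ITT lift: {itt_lift:.0%}")  # ~20% with these made-up rates
```

Note how small the absolute difference in conversion rates is even here; with lower base rates or smaller effects, the confidence interval around the lift estimate widens quickly, which is part of what makes these experiments demanding.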
In the absence of full data on advertiser revenue, Facebook can’t calculate the financial returns to these advertising campaigns.5 What they can do is test the accuracy of the observational methods, trying 13 variations of exact matching, propensity score matching, regression, and stratified regression. Facebook is in ideal circumstances to apply those methods because they have large samples and lots of data on each user. In their own words, “First, we observe an unusually rich set of user-level, user-time-level, and user-time-campaign-level covariates. Second, our campaigns have large sample sizes (from 2 million to 140 million users), giving us both statistical power and means to achieve covariate balance. Third, whereas most advertising data are collected at the level of a web browser cookie, our data are captured at the user level, regardless of the user’s device or browser, ensuring our covariates are measured at the same unit of observation as the treatment and outcome”.
Despite all of that data, in most of the studies all of the observational methods overestimate the amount of lift. Not by a little bit, either: estimates often come in 100% or even 1000% higher than the RCT results. Only 2 of their 23 outcomes (across the 15 studies, since some studies measure outcomes beyond checkout) are consistently underestimated, and those are underestimated by a smaller degree than the other studies’ effects are overestimated.
Some reasons we cannot rely on observational methods
A key obstacle highlighted by Facebook is that “because the exposed-unexposed comparison represents the combined treatment and selection effect—given the nature of selection in this industry—the estimate is always strongly biased up”. The omitted variable bias is a severe problem, and they acknowledge that “observational methods would require additional covariates that exceed considerably our combined observables’ explanatory power. This suggests that eliminating bias from observational methods would be hard, even for industry insiders with access to additional data”.
Like the regression method, matching and propensity-based methods can still be biased if unobserved factors that affect conversion differ between the groups. For exact matching, if two users have the same observed features, the targeting system may well make the same advertising decision for both, leaving no pairs that differ only in treatment. Likewise for propensity methods, there is a risk that propensities gravitate towards 0 or 1, since “sophisticated ad-targeting systems aim for ad exposure that is deterministic and based on a machine-learning algorithm. In the limit, such ad-targeting systems would completely eliminate any random variation in exposure”. While Facebook does identify variation in treatment that is plausibly uncorrelated with conversion likelihood, their empirical results suggest that bias remains: there is still a selection effect where the treated population is intrinsically more likely to convert than the untreated population.
Google has a useful guide in which they also highlight that selection bias “perhaps represents the largest hurdle to [these types of models] providing valid estimates of advertising effectiveness” [3]. They cover a few types of selection bias, including “ad targeting as the ads are targeting a segment of the population which has already shown an interest in the product” such that the “underlying interest or demand of the targeted population is not observable and not included in the model”.
The rest of us are in a much worse position than Facebook was for their study. We do not have the same accuracy of user tracking or depth of user-level features. Omitted variable bias could be much more severe, and if our only data variation is time periods then we have few data points relative to the high number of features we would ideally like to include.
In summary
In this post I wanted to get two points across. First, incremental revenue from advertising campaigns is likely substantially lower than non-incremental revenue, which makes incrementality or lift measurement an essential problem for advertisers. Second, observational measurement is wildly inaccurate relative to randomized controlled trials.
Now that I’ve laid the groundwork, my next post will cover some technical challenges that persist even when we can run randomized trials, and the post after will discuss organizational or strategic challenges that further complicate decision-making.
[1] Blake, T., Nosko, C., & Tadelis, S. (2015). Consumer Heterogeneity and Paid Search Effectiveness: A Large-Scale Field Experiment. Econometrica, 83(1), 155-174.
[2] Gordon, B.R., Zettelmeyer, F., Bhargava, N., & Chapsky, D. (2018). A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook. Managerial Marketing eJournal.
[3] Chan, D., & Perry, M. (2017). Challenges and Opportunities in Media Mix Modeling.
We will set aside the case of users arriving after multiple advertisements. Most commonly we attribute the entire revenue to the last advertisement the user interacted with or saw, a method known as last-touch attribution.
Bidding higher is one way to increase program size. Another is to bid under more conditions, since platforms typically support keyword, geography, and demographic conditions. In a way, bidding more widely is a special case of bidding higher, by increasing bids from 0 to some positive amount.
Shout-out to Dimitriy Masterov for teaching me (and many others!) about these experiments.
This is lift across the entire Intent to Treat (ITT) population, as opposed to lift purely on the actually treated population. The paper includes both types of lift estimates.
Frankly, I wonder if the results would be favourable enough for them to want to share them even if they could.