Launching an ad test as described used to mean resigning yourself to endless daily data pulls and Excel pivots to track results. The tools have evolved to make the tracking portion much easier. There are two ways that I recommend doing this.
The first is by using Google’s experiment tool. I’ve written about some of the benefits Experiments provide previously; in addition, they allow you to layer flexibility across a campaign. So, if only some of your ads within a campaign are participating in a test, Experiments may be a good fit. They let you control a 50/50 split (or any other split you may want) on those ads while retaining other ad rotation settings at the campaign level, such as optimizing for conversions on non-participating ads.
The downside of Experiments is that there are a few extra steps involved in setup AND Google has now limited the number of participating objects to 1,000. So, if you are running a large ad test, this may not be an option.
The second option is to use labels. If you are sure to properly label all participating ads as control or experiment, tracking results is as simple as peeking at your ad labels in the dimensions tab. Typically, you are working with a control ad that has been running previously. To get a clean A/B split in your impressions, it’s helpful to use ad scheduling tools so that your test ads launch on the day you begin tracking (scheduled end dates are also helpful for sticking to your test interval if you have one). In any case, I also like to note the date the test started in my label. It’s a great reminder to yourself (and other team members) of the date ranges to pull data for a given test.
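One way to keep the panel and the start date in every label is to build labels from a fixed pattern. The sketch below is only an illustration: the `name | panel | date` format and the helper names `make_label` and `parse_label` are assumptions made for this example, not an AdWords convention.

```python
from datetime import date
from typing import Tuple

# Hypothetical panel names, following the "control or experiment" wording above.
PANELS = ("control", "experiment")


def make_label(test_name: str, panel: str, start: date) -> str:
    """Build a label such as 'save-today | control | 2014-03-03'."""
    if panel not in PANELS:
        raise ValueError(f"panel must be one of {PANELS}")
    return f"{test_name} | {panel} | {start.isoformat()}"


def parse_label(label: str) -> Tuple[str, str, date]:
    """Split a label back into test name, panel, and start date."""
    name, panel, start = (part.strip() for part in label.split("|"))
    return name, panel, date.fromisoformat(start)


label = make_label("save-today", "experiment", date(2014, 3, 3))
print(label)
print(parse_label(label))
```

Because the start date travels with the label, anyone pulling a report later can read the date range straight off the label.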
Your “winning” metric needs to be determined based on your business goals. Is it visits, leads, revenue?
The simplest ad tests are looking for CTR improvements. That said, it’s pretty easy to lift CTR by making bombastic claims – “Vladimir Putin Nude!” “Claim Your Free Money Bucket!” – but they won’t turn into revenue for you. For some advertisers, though, the primary goal is visits and measuring CTR lifts only is valid.
On the flip side, there are valid reasons to look only at conversion rate or revenue deltas, particularly if you are testing qualifiers in your ads (don’t click here if…) and are running a lead gen campaign. Additionally, if you are in ecommerce and testing messaging primarily meant to drive better AOV (“Save 20% when you purchase 3”) or LTV, revenue deltas alone can be a valid success metric.
That said, the majority of advertisers are going to want to look at both CTR and conversion/revenue influence. Ideally there is a rise in both, but that is rarely the case. Most frequently you’ll see an inverse relationship. Great CTRs often come with some degree of overpromising, which drives more unqualified clicks and so decreases conversion rate. If your conversion rate spikes via use of very exclusionary messaging, your CTR typically goes down as you are pre-qualifying those clicks. To measure the overall influence you need a new metric; we typically use CTR*Conversion Rate*1000 (the 1000 just makes the number easier to look at). If revenue matters to you, then CTR*CVR*Rev makes the number less awkward AND takes AOV/Revs into consideration as a success metric:
The example above shows our test increasing CTR and decreasing CVR. Looking at the Index (CTR*CVR*1000) shows that the CVR dip is worth it here as overall the test is working 18% better on the combined CTR and CVR. If you factor in the lower revenue on the test panel, though (CTR*CVR*Rev), our ad is still winning, but by a much smaller degree – one that is probably not statistically significant at 3.2%. That doesn’t make it inactionable, though! Let’s talk more about results…
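For readers who want to compute the two indexes themselves, here is a minimal Python sketch. The panel figures are hypothetical and are not the numbers from the example above, and "Rev" is interpreted here as revenue per conversion (AOV), which is an assumption about the formula.

```python
from dataclasses import dataclass


@dataclass
class Panel:
    """Totals for one panel of a test. All figures used below are hypothetical."""
    impressions: int
    clicks: int
    conversions: int
    revenue: float

    @property
    def ctr(self) -> float:
        return self.clicks / self.impressions

    @property
    def cvr(self) -> float:
        return self.conversions / self.clicks

    @property
    def rev_per_conversion(self) -> float:
        # Assumption: "Rev" in CTR*CVR*Rev means revenue per conversion (AOV).
        return self.revenue / self.conversions


def index(panel: Panel) -> float:
    """CTR * CVR * 1000."""
    return panel.ctr * panel.cvr * 1000


def revenue_index(panel: Panel) -> float:
    """CTR * CVR * Rev."""
    return panel.ctr * panel.cvr * panel.rev_per_conversion


def lift(test_value: float, control_value: float) -> float:
    """Relative change of the test panel versus the control panel."""
    return (test_value - control_value) / control_value


control = Panel(impressions=100_000, clicks=2_000, conversions=100, revenue=5_000.0)
test = Panel(impressions=100_000, clicks=2_500, conversions=115, revenue=5_290.0)

print(f"Index lift: {lift(index(test), index(control)):.1%}")
print(f"Revenue index lift: {lift(revenue_index(test), revenue_index(control)):.1%}")
```

Note that CTR*CVR reduces to conversions per impression, so the index is effectively conversions per 1,000 impressions.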
I started by talking about how easily you can track your tests via labels and then implored you to use custom-calculated metrics which throw that out the window. So, first, Dear Google, a custom-calculated metric column would be a handy UI improvement! Back to my point: the dimension tab in the UI is a great place to peek in and make daily (or whatever fairly frequent interval) checks. You are looking for a couple of things:
A relatively even impression split. If your setup is good, you should have pretty similar impressions for each panel. If not, something is wrong and you need to QA your setup, fix it, and re-launch a clean test.
You haven’t launched a nosedive. Any new element or test risks totally tanking results as it’s an “unknown” – you want to check that you haven’t just launched the kind of stinker that could keep you from meeting your overall program goals for the month. If the early numbers are really bad, cut the test early to stop the bleeding.
The labels are still handy for the more rigorous test analysis too, though, as you just have to pull a quick download (or schedule a report) and you don’t need any pivots, just a quick calculated column. To determine a result, you ideally will reach statistical significance, meaning there are enough “actions” (impressions, clicks, conversions) to determine a winner AND the delta is strong enough to draw a conclusion. Ideally this will happen within 4 weeks, as seasonal and competitive auction changes can start to roller-coaster the results in less than meaningful ways. Run the results through an online T-Test if you are unsure. The test designs I previously outlined should help you get there by aggregating findings at a meaningful level, and you should have a nice stat sig finding that you can act on. Cut the loser, launch the winner, define the narrative:
“The Save Today! Messaging contributed to a 20% lift in performance via a stronger call to action and the use of an exclamation point” – tell the whole marketing and design department and the CEO because this kind of finding can influence better messaging throughout the whole business (my example is a “winning” narrative, but losing tests have just as important findings in them). Pat yourself on the back, and then design the next test.
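If you would rather check significance in a script than in an online calculator, the sketch below uses a two-proportion z-test, which is a common choice for comparing rates such as CTR or conversion rate. It is swapped in here for the online T-Test mentioned above and is not necessarily the same calculation those tools perform. The counts are hypothetical.

```python
import math
from typing import Tuple


def two_proportion_z_test(successes_a: int, trials_a: int,
                          successes_b: int, trials_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference between two rates.

    For CTR, successes are clicks and trials are impressions.
    For conversion rate, successes are conversions and trials are clicks.
    """
    rate_a = successes_a / trials_a
    rate_b = successes_b / trials_b
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    if std_err == 0:
        raise ValueError("rates are all 0% or all 100%; the test is undefined")
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical CTR comparison: control (a) versus test (b).
z, p = two_proportion_z_test(2_000, 100_000, 2_500, 100_000)
print(f"z = {z:.2f}, p = {p:.4g}")
```

A p-value below your chosen threshold (0.05 is a common convention) suggests the difference is unlikely to be noise. The same function with far fewer impressions will often fail to reach that threshold, which is the small-sample situation discussed next.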
Back down to earth for a moment: the reality is that stat sig still isn’t always there. What to do then? A gut check. As data-driven as this industry is, there is still a huge art-not-science component, and the gut of a good SEM is a valuable tool. If you’ve proven that you haven’t screwed up performance but can’t call stat sig and you just prefer the test messaging... launch it! Or, perhaps your conversion pool is just too small because of the nature of your business (B2B, I’m looking at you). Same advice: check your gut and if it “seems” like it’s working (or not), call a result within a reasonable time period and move to the next test.
There are valid reasons to make the determination to keep a test running for another time cycle and continue gathering data. However, be cautioned that running an endless, inconclusive test is getting in the way of running the next *hopefully* meaningful and conclusive one. So, don’t be scared of using your gut; just be aware that is what you are doing and archive the result as such.
This is part 2 of a 3 part series. Read the first part here, and look out for the third and final article to be released next week!
Originally published Apr 10, 2014 3:58:18 AM, updated June 28 2019