A/B testing is an important part of the conversion optimization process. It makes it possible to test whether an intended change to the website or webshop has a positive, negative, or neutral effect on the goals.
Introduction
There is a lot to consider when setting up an A/B test. For example, a test must work flawlessly on a wide range of browsers, devices, and screen sizes. It is therefore important to check whether any errors have crept into a test. This process is also called quality assurance.
Relevance
Unfortunately, quality assurance is an area that often receives little attention or is even skipped entirely. After all, it seems to take a lot of time and yield little. The reasoning is often that only minor changes are being tested, so "what could possibly go wrong?". Tight planning can also cause quality assurance to be cut short.
However, skipping quality assurance can result in tests going live that do not function properly (for every visitor). It is therefore important that A/B tests are checked thoroughly before they go live. This can provide the following benefits.
Fewer frustrated visitors
Almost nothing is as frustrating for a visitor as a website or webshop that simply does not work well: buttons that do not respond, forms with invisible error messages, or elements that overlap each other. All of these issues can be the result of A/B tests that have not been properly checked.
Fewer test restarts
In some cases, teams only discover later that something is going wrong with a test, for example after complaints from visitors reach the support department or after employees themselves raise the alarm. The test must then be stopped, cloned, repaired, checked, and set live again. This restart requires time and attention from the (often very busy) CRO team, and it also means that fewer tests can be performed per year. After all, while the broken test was live (and its data is now being discarded), another test could have been running that might have led to valuable insights.
More reliable and valid tests
In other cases, organizations never find out that a broken test has been live. If decisions are then made based on that test, this can lead to problems: because the test did not function (completely) correctly, the resulting data is neither reliable nor valid.
Less revenue loss
Broken tests also simply cost revenue: if a visitor cannot (easily) achieve his or her goals on the website, fewer products or services will be purchased. This makes A/B testing an unnecessarily expensive activity, because on top of the costs of testing itself, there is the loss of revenue that broken A/B tests can cause.
Points to consider
The points below are not an exhaustive list, but provide an overview of the most important areas of attention. Every organization will have its own unique points of attention, want to place its own emphasis, and have access to a different mix of employees, budgets, tools, and so on. These factors will determine how quality assurance can best be organized for maximum efficiency.
Naming
Good naming is important to ensure smooth communication about A/B testing. Without it, there are guaranteed to be meetings about "that experiment on the homepage" or "the one with the modified text". Personally, I am in favor of a setup in which each experiment is given an ID, for example EXA-0001, EXA-0002, EXA-0003, etc. for a company called Example Inc. As you can see, the ID combines a letter prefix with a number, which makes it easy for a CRO freelancer or CRO agency to keep the experiments of different customers apart. Leading zeros are used to avoid the sorting problems you would get on some systems with, for example, EXA-8 and EXA-10. Preferably, this ID is assigned to an experiment as early as possible; a good moment could be when the experiment is added to the roadmap.
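To make the sorting point concrete, here is a minimal sketch; the EXA prefix, the padding width, and the function name are example choices, not a fixed convention.

```typescript
// Generate zero-padded experiment IDs so that string sorting matches the
// numeric sequence. The "EXA" prefix and padding width are example choices.
function experimentId(prefix: string, sequence: number, width = 4): string {
  return `${prefix}-${String(sequence).padStart(width, "0")}`;
}

// Without padding, plain string sorting puts EXA-10 before EXA-8:
console.log(["EXA-10", "EXA-8"].sort()); // ["EXA-10", "EXA-8"]

// With padding, the order matches the numeric sequence:
console.log([experimentId("EXA", 10), experimentId("EXA", 8)].sort());
// ["EXA-0008", "EXA-0010"]
```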
It may also be wise to use a clear naming structure for variants. Personally, I usually name the original 'Control' and give the variant a name of two or three words, for example 'With banner' or 'No reviews'. My advice would be to use English names even on a completely non-English website, because you never know whether international team members will be involved in the future.
Planning
Check whether multiple experiments are running simultaneously. Are they compatible with each other from a technical perspective? Has consideration been given to possible statistical interactions between the variants? Also determine whether there are campaigns or promotions that could influence the experiments. If there is a huge sale during the planned duration of the experiment, for example, this could affect the results. And if a specific voucher code is mentioned in an experiment, has it been verified that the code is valid for the entire duration of the experiment?
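The campaign check can be automated in a simple way. Below is a minimal sketch, not tied to any specific tool, that flags experiments whose runtime overlaps with a planned campaign; the names and dates are made up for illustration.

```typescript
// Flag experiments whose runtime overlaps with a planned campaign or promotion.
interface DateRange {
  name: string;
  start: Date;
  end: Date;
}

function overlaps(a: DateRange, b: DateRange): boolean {
  return a.start.getTime() <= b.end.getTime() && b.start.getTime() <= a.end.getTime();
}

const experiment: DateRange = {
  name: "EXA-0004 'No reviews'", // example experiment following the ID scheme above
  start: new Date("2024-03-01"),
  end: new Date("2024-03-28"),
};

const campaigns: DateRange[] = [
  { name: "Spring sale", start: new Date("2024-03-15"), end: new Date("2024-03-22") },
];

campaigns
  .filter((campaign) => overlaps(experiment, campaign))
  .forEach((campaign) =>
    console.log(`Warning: ${experiment.name} overlaps with ${campaign.name}`)
  );
```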
In any case, ensure that planned experiments are coordinated with the relevant stakeholders, so that they do not encounter unexpected 'strange things' on the pages under their management. A short email when launching an experiment also helps: if you briefly explain matters such as the hypothesis, variants, duration, and pages, stakeholders have a useful reference if questions arise.
Traffic
Experiments need sufficient traffic to achieve statistically valid and reliable results. You should not only consider sufficient visitors to the relevant pages, but also sufficient conversions on the KPIs on which the experiment focuses. It can make a huge difference in the duration whether an experiment focuses on the click through from a product page (PDP) to a shopping cart, or on an actual purchase.
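To illustrate how much the chosen KPI affects the required traffic, here is a rough sketch of a per-variant sample size estimate based on the common normal approximation for comparing two conversion rates; the rates and uplift below are made-up examples, and a dedicated calculator or statistician should have the final word.

```typescript
// Rough per-variant sample size estimate for comparing two conversion rates
// (normal approximation, two-sided alpha = 0.05, power = 0.80 by default).
function sampleSizePerVariant(
  baselineRate: number, // e.g. 0.03 for a 3% conversion rate
  expectedRate: number, // e.g. 0.036 for a hoped-for 20% relative uplift
  zAlpha = 1.96,        // two-sided 95% confidence
  zBeta = 0.84          // 80% power
): number {
  const variance =
    baselineRate * (1 - baselineRate) + expectedRate * (1 - expectedRate);
  const effect = expectedRate - baselineRate;
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / effect ** 2);
}

// A purchase KPI (3% -> 3.6%) needs far more visitors per variant than a
// PDP-to-cart click KPI (30% -> 33%) for the same relative uplift:
console.log(sampleSizePerVariant(0.03, 0.036)); // roughly 13,900 per variant
console.log(sampleSizePerVariant(0.3, 0.33));   // roughly 3,800 per variant
```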
When checking the traffic, also pay attention to whether the traffic allocation is set correctly. This is the distribution of traffic between the different variants of the experiment. In most cases it is advisable to divide the traffic equally between the variants (i.e. 50% to control and 50% to the variant when working with one variant).
Targeting
On which pages, devices, and browsers should this experiment run? Things such as URL parameters (such as ?gclid or ?page) must also be taken into account. Can all the required pages be captured with simple targeting, or are regular expressions or even additional tagging on the pages themselves needed? Then there are exceptions such as trailing slashes or websites that work on both www and non-www, which also need to be covered.
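As a sketch of how such targeting pitfalls can be handled, the following hypothetical matcher tolerates a www prefix, a trailing slash, and extra URL parameters such as ?gclid; real testing tools have their own targeting syntax, so this only illustrates the idea.

```typescript
// Match product pages regardless of www prefix, trailing slash, or extra
// URL parameters. Domain and path pattern are assumptions for illustration.
function matchesProductPage(rawUrl: string): boolean {
  const url = new URL(rawUrl);
  const host = url.hostname.replace(/^www\./, "");      // www and non-www
  const path = url.pathname.replace(/\/+$/, "") || "/"; // trailing slash
  // Query parameters (?gclid, ?page, ...) are deliberately ignored.
  return host === "example.com" && /^\/products\/[^/]+$/.test(path);
}

console.log(matchesProductPage("https://www.example.com/products/shoe-123/?gclid=abc")); // true
console.log(matchesProductPage("https://example.com/products/shoe-123"));                // true
console.log(matchesProductPage("https://example.com/cart"));                             // false
```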
Segments
In addition to pages, devices, and browsers, an experiment can also target specific segments. A well-known example is an experiment that should only be shown to logged-in visitors. Other options are, for example, that an experiment should only be shown to visitors who come from a certain website or who have seen a certain promotion.
Also keep in mind that the preview mode of some testing tools takes limited account of such 'audience targeting' criteria.
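As an illustration, here is a hedged sketch of a custom activation condition that only includes logged-in visitors who arrived from a specific referrer; the cookie name and referrer domain are assumptions, and most testing tools offer their own way to plug in such audience conditions.

```typescript
// Only include visitors who are logged in and arrived from a specific referrer.
function isInTargetSegment(): boolean {
  const isLoggedIn = document.cookie
    .split("; ")
    .some((cookie) => cookie.startsWith("session_id="));                      // assumed cookie name
  const cameFromPartner = document.referrer.includes("partner-site.example"); // assumed referrer
  return isLoggedIn && cameFromPartner;
}

// The experiment code would only run when the segment condition holds:
if (isInTargetSegment()) {
  console.log("Visitor is in the target segment: activate the experiment variant here");
}
```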
Goals
The goals on which an A/B test is assessed can be measured in the A/B testing tool itself, or in a separate analysis tool such as Google Analytics or Adobe Analytics. In any case, it must be determined for each experiment whether the relevant goals are correctly recorded in the chosen tool. If, in addition to the primary KPIs (such as transactions or revenue), secondary KPIs (such as clicks on a button) are also used, these may need to be recorded separately.
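For secondary KPIs that are not tracked out of the box, a small event push is often enough. Below is a minimal sketch using a Google Tag Manager dataLayer push; the button selector, event name, and experiment ID are assumptions for illustration.

```typescript
// Record clicks on a specific button as a secondary KPI event via a Google Tag
// Manager dataLayer push. Selector, event name, and experiment ID are assumptions.
const w = window as unknown as { dataLayer?: Record<string, unknown>[] };

document.querySelector("#add-to-cart")?.addEventListener("click", () => {
  w.dataLayer = w.dataLayer || [];
  w.dataLayer.push({
    event: "experiment_secondary_kpi",
    experiment_id: "EXA-0001", // assumed experiment ID
    kpi: "add_to_cart_click",
  });
});
```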
Tracking
Another important part of quality assurance is to check whether the relevant tools are loaded correctly. For example, if the A/B testing tool is not loaded on some pages on which the experiment should run, this has a major impact on the results of the experiment. The same applies to failure to (correctly) load the web analysis tool and any visitor behavior analysis tool such as Hotjar or Clarity.
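A quick way to verify this during quality assurance is a console check for the relevant globals. The sketch below is a hedged example; the global names differ per tool and setup, and the testing tool's global in particular is an assumption.

```typescript
// Quick console check for whether the relevant tools are present on the page.
// The global names are assumptions and differ per tool and setup.
const w = window as unknown as Record<string, unknown>;

const checks: Record<string, boolean> = {
  "A/B testing tool": typeof w["abTestingTool"] !== "undefined", // assumed global
  "Google Analytics (gtag)": typeof w["gtag"] === "function",
  "Microsoft Clarity": typeof w["clarity"] === "function",
};

for (const [tool, loaded] of Object.entries(checks)) {
  console.log(`${tool}: ${loaded ? "loaded" : "NOT loaded"}`);
}
```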
Compatibility
Should the experiment still work for visitors using Internet Explorer? What about visitors on the screen size of an iPhone SE? On most websites these are relatively small groups of visitors, but sometimes they are still responsible for a significant percentage of revenue. During quality assurance, make sure that the experiment also works well for these visitors, or that they are deliberately excluded from the experiment.
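If the decision is made to exclude such a group rather than support it, that exclusion can be made explicit in the activation logic. The sketch below leaves very small viewports out of the experiment; the breakpoint is an assumption, and excluding a segment like this should itself be a conscious, documented decision.

```typescript
// Leave very small viewports out of the experiment when the variant has not
// been built (or checked) for them. The 374px breakpoint is an assumption.
const isSmallViewport = window.matchMedia("(max-width: 374px)").matches;

if (!isSmallViewport) {
  console.log("Viewport is large enough: activate the experiment variant here");
} else {
  console.log("Small viewport: visitor keeps the original page and is excluded");
}
```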
KPIs
Has consideration been given to which KPIs the result of the experiment will be assessed on? This is necessary, for example, to make a good estimate of the duration of the experiment. You also need this decision to avoid discussions afterwards when, for example, one metric (such as the click-through rate) has increased while another (such as the number of transactions) has decreased.
Functional
Is everything working properly? For example, are there JavaScript error messages visible in the browser console? Or are there possibly parts of the website that no longer work properly? Such problems can sometimes be difficult to diagnose. Therefore, make sure that you save JavaScript error messages in Google Analytics, for example (and set custom alerts for when their number increases). It can also be particularly rewarding to stay in good contact with the customer support department. They are often the first to hear when there are problems on the website that you may have overlooked during quality assurance.
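As a sketch of the error-logging suggestion above: the snippet below forwards JavaScript errors to Google Analytics 4 as exception events, assuming gtag.js is already loaded on the page; how you set alerts on top of this depends on your analytics setup.

```typescript
// Forward JavaScript errors to Google Analytics 4 as exception events, so that
// a spike in errors during an experiment shows up in reporting. Assumes gtag.js
// is already loaded on the page.
const w = window as unknown as {
  gtag?: (command: "event", eventName: string, params: Record<string, unknown>) => void;
};

window.addEventListener("error", (event) => {
  w.gtag?.("event", "exception", {
    description: `${event.message} (${event.filename}:${event.lineno})`,
    fatal: false,
  });
});

window.addEventListener("unhandledrejection", (event) => {
  w.gtag?.("event", "exception", {
    description: `Unhandled promise rejection: ${String(event.reason)}`,
    fatal: false,
  });
});
```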
Conclusion
Performing good quality assurance can make the difference between a useless experiment and one that produces reliable and valid results. The benefits described in this article show what the added value can be for an organization. To ensure that as few problems as possible arise from A/B testing, it is important to always perform quality assurance: the more accurately it is carried out, the smaller the chance that an experiment causes problems.