Basic A/B Testing with Mixpanel
In a previous article, we explored how statistical analysis of historical data can help us identify which factors influence conversions the most. We found that:
a. the time it takes a user to onboard during the free trial and
b. the number of other colleagues they invite,
may each be affecting the chance of converting to a paid subscription, one more strongly than the other. The relationship of these factors to conversion was measured to be statistically significant (using p-values), suggesting that this wasn’t coincidental.
However, since these conclusions were based on historical data, the best we can conclude is that there is evidence suggesting a potential statistical association between these factors and conversions. If we were to observe the behaviour of two new users tomorrow, we could not take for granted that these two factors would certainly influence whether they convert. With so many factors at play in whether a user chooses to buy our product (often many we aren’t even aware of), we cannot rely solely on what the statistics on historical data show.
While understanding any associations in historical data is still essential, more empirical evidence is needed to guide decision-making confidently. The next step, then, is to progress from mere statistical association/correlation on historical data to real-life experimentation, where we can observe the impact of such factors on real users in near-real time. By conducting controlled experiments, we can more effectively test our hypotheses and better inform our decisions on what to develop next.
What is an A/B Test?
An A/B Test is an experiment setup that allows us to observe what happens when different groups of users are provided with different versions of our app. Each group receives a version of the app that differs in one or more aspects, which we call variables. A test where the versions we provide users with differ in only one variable is the simplest to implement.
For this example use case, we will only focus on changing one variable: the first group of users in our experiment will land on our regular signup form as it currently stands (it has seven input fields). In contrast, the second group will land on a variant of the same form with fewer input fields (only three essential fields). We call the first group, which lands on the regular form without any changes, the “control” group, denoted by the letter “A”. The second group, which lands on a modified version of the form, is the “experiment” or “B” group.
A/B test setup best practices
Formulate a clear hypothesis
Remember our fictional company CurioCity from the previous articles? Let’s imagine our experiment is taking place during CurioCity’s first quarter of the year, where the focus has fallen heavily on improving signups. Together with the rest of the product team, we’ve been evaluating how users sign up for our app, and one of the areas of contention has been the signup form.
Our form includes the necessary email and password input fields required to log our users in and five fields our sales and marketing teams have used to collect information about users/prospects in our CRM. Data collected include their country of origin and where they’ve heard about us.
Given our renewed focus on increasing conversions and the increased bounce rate, we’ve been contemplating whether the form should be drastically minimised to include only the necessary info (email, password and terms agreement). The assumption is that any CRM-specific information could be captured in other ways after signup.
With this in mind, we’ve formulated the following Null Hypothesis: “Presenting the user with a signup form with only the necessary three input fields will not affect conversion.” Refer to a previous article on the role and usefulness of the Null and Alternate Hypotheses and how to formulate them best.
Split users randomly into appropriate experiment groups
One of the critical factors for ensuring the validity of an A/B test is the method of splitting users into different experiment groups. Ensuring the users are randomly assigned to either the control or experiment group helps avoid bias and confounding variables, leading to more accurate and reliable results.
Randomisation is essential to maintain the integrity of your experiment. If users are not assigned randomly to experiment groups, the risk of selection bias arises. For example, if you’re running a global website, you would want to avoid a group dominated by users from a particular geographic location like France. Such a skewed distribution could distort your experiment results and fail to generalise across your entire user base.
There are numerous tools available that can help assign user traffic randomly to pre-specified experiment groups. Major hosting providers such as AWS, Google Cloud, and Netlify provide services for A/B testing, including mechanisms for random user assignment. These tools, together with our engineering teams, can help ensure that each user gets a property indicating the test group to which they have been assigned, allowing us to distinguish the groups in the experiment results.
This user property will indicate which variant of our signup form they have been given access to. In our case, when users visit our website, we will assign them to one of two groups: the Control group, which lands on the regular form, or the Experiment group, which lands on the variant of the form with the reduced fields. We will name these groups “Control” and “B: Reduced Fields,” respectively. The property is attached to the user’s session or account and ensures they consistently see the same variant whenever they access our app during the test.
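As a minimal sketch of what such an assignment could look like, assuming a server-side setup in Python and the official mixpanel client package, we could hash the user’s ID to bucket them deterministically and store the result as a Mixpanel user property. The project token, experiment name and helper functions below are illustrative, not something prescribed by Mixpanel.

```python
import hashlib

from mixpanel import Mixpanel  # official Mixpanel Python client

mp = Mixpanel("YOUR_PROJECT_TOKEN")  # hypothetical project token

VARIANTS = ["Control", "B: Reduced Fields"]


def assign_variant(user_id: str, experiment: str = "signup-form-test") -> str:
    """Deterministically bucket a user into a variant.

    Hashing the user ID together with the experiment name gives a stable,
    effectively random 50/50 split: the same user always lands in the same
    group for the lifetime of the test.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return VARIANTS[int(digest, 16) % len(VARIANTS)]


def tag_user(user_id: str) -> str:
    """Assign a variant and store it as the "experiment group" user property."""
    variant = assign_variant(user_id)
    mp.people_set(user_id, {"experiment group": variant})
    return variant
```

Whether the assignment happens in our own code like this or through a hosting provider’s A/B testing service, the important part is that it is random, deterministic per user, and recorded as a property we can break down by later.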
Decide on an appropriate timeframe to run the test
The duration of an A/B test can depend on numerous factors that will be context-specific, including:
- Traffic Volume: If an app has a lot of traffic, we can gather data faster; therefore, A/B tests can be completed more quickly. Conversely, if our app has low traffic, we might need to run the test for longer to gather sufficient data. If traffic is too low, the test may have to run for so long that it becomes less worthwhile in the first place.
- Effect Size: Depending on the size of the difference we are trying to detect, our testing time might need to be adjusted. Detecting a small effect requires a larger sample size, which may take more time to collect (a rough calculation is sketched after this list).
- Baseline Conversion Rate: The conversion rate before implementing the A/B test can also impact how long we need to run it. If our baseline conversion rate is low, we might need to run the test for longer. As a general rule of thumb in the industry, an A/B test is only deemed effective if the sample size is in the thousands and the observed conversions in the hundreds (in other words, roughly a 10% conversion rate or higher).
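To make this concrete, here is a minimal sketch of how such a sample-size and duration estimate could be made with the statsmodels package. The baseline rate, the lift we want to detect and the daily traffic figure are assumptions for illustration, not numbers from our experiment.

```python
# Rough sample-size / duration estimate for a two-group A/B test.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.20   # assumed current conversion rate
expected = 0.25   # smallest lift we would care to act on
effect = abs(proportion_effectsize(baseline, expected))  # Cohen's h

# Users needed per group for 80% power at a 5% significance level.
per_group = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)

daily_form_opens = 150  # assumed traffic, split across both groups
days_needed = (2 * per_group) / daily_form_opens
print(f"~{per_group:.0f} users per group, roughly {days_needed:.0f} days of traffic")
```

Plugging in our own traffic and conversion figures gives a first approximation of whether a two-week test is realistic before we commit to running it.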
Generally, a timeframe of two weeks to a month is often used for A/B tests. This is typically enough time to account for day-of-week effects (for example, usage might be higher on weekdays compared to weekends) and to gather enough data for the results to be statistically significant.
Running a test for longer isn’t necessarily better than running it for a shorter period. If a test runs too long, other factors could influence the results. For example, changes in user behaviour over time or external factors like a marketing campaign or a holiday season could skew our results. Also, if one version turns out to perform significantly worse than the other, we risk exposing users to a suboptimal experience for longer, which might in turn result in detrimental side effects (such as higher, unplanned churn, lower conversion or disengagement).
Implementing a basic A/B test in Mixpanel
We’ve tagged the two groups of users (“Control” and group “B”), and we are already automatically sending each event they generate in our app (such as opening a form or signing up) to Mixpanel. We can now analyse their behaviour and overlay it with other events (such as conversions).
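For completeness, here is a hedged sketch, again in Python with the official mixpanel client, of how those events could be sent with the assigned variant attached. The user ID, token and helper name are illustrative; in practice these events could just as well be tracked from the browser with Mixpanel’s JavaScript SDK.

```python
from mixpanel import Mixpanel  # official Mixpanel Python client

mp = Mixpanel("YOUR_PROJECT_TOKEN")  # hypothetical project token


def track_event(user_id: str, event: str, variant: str) -> None:
    """Send an event to Mixpanel with the assigned variant as an event property."""
    mp.track(user_id, event, {"experiment group": variant})


# Example: the two funnel steps analysed in this article.
track_event("user-123", "Form Open", "B: Reduced Fields")
track_event("user-123", "Sign Up", "B: Reduced Fields")
```

Because the user property was already set at assignment time, Mixpanel can break these events down by it; attaching the variant as an event property as well is optional but can make ad-hoc queries easier.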
First, let’s visualise the two groups in Mixpanel through their generated events. We can do this by heading to Reports and creating an Insight. From there we select the “Form Open” event and add a breakdown with the User property “experiment group”.
We can then choose to visualise our data with a Bar chart from the dropdown in the top right corner. Doing so gives us an indication of the distribution of users between the two groups.
We can see that an equal number of users who opened the signup form were assigned to each of our two groups (Control and Group B).
From here, we can use Funnels to understand how our users are moving from one step to the next and how many are dropping off versus completing both steps. We can visualise this by creating a Funnel report instead of an Insight and selecting a second event.
Once we transition our view to a Funnel (by selecting the second button on the nav bar on top of the event selection), we will select “Sign Up” as the second event and maintain the same breakdown (i.e. by the User property “experiment group”). We also need to select “Uniques” and the appropriate timeframe in the Date filter on top of the bars (our test has been running for two weeks).
The Funnel visualisation shows how many of the users who have opened the form have also signed up. The conversion rate is the percentage of users who have completed both steps. We can see that the conversion rate of Group B (which landed on the form with the reduced fields) is higher by ~6% compared to that of the Control group (the users who landed on our current form with 7 fields).
Could this increase have been coincidental? Or is it really showing that users who got a reduced fields form found it easier to convert? Mixpanel can help us to an extent with this question by giving us a Significance score between 0 and 1.
Since the significance is higher than 0.95, we can conclude that the result was unlikely to be due to random chance or coincidence and that something might indeed be affecting users in Group B, leading them to convert more than the Control. We discuss statistical significance in further depth in Part 2, and Mixpanel also has a great article on the subject.
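If we wanted to sanity-check Mixpanel’s score outside the tool, a two-proportion z-test is one common way to do it. The counts below are hypothetical stand-ins for the numbers read off the Funnels report, and the rough equivalence between a p-value under 0.05 and a significance score above 0.95 is an approximation.

```python
from statsmodels.stats.proportion import proportions_ztest

conversions = [230, 290]    # signed-up users: Control, B: Reduced Fields (illustrative)
form_opens = [1000, 1000]   # users who opened the form in each group (illustrative)

z_stat, p_value = proportions_ztest(count=conversions, nobs=form_opens)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# A p-value below 0.05 corresponds roughly to a significance score above 0.95.
```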
Making decisions based on A/B test results
Our results showed a 6% increase in conversions with the new reduced signup form variant. This sounds compelling. But can we know what led to this change? Remember that all a hypothesis test like ours can tell us is that the result was unlikely to be coincidental.
By running an experiment on real, current conditions we can certainly feel more confident that there might be a causal relationship between the length of the form and conversion. Still, we cannot take this for granted. Before we make any decisions we need to lean on additional tools and go beyond the data.
In product development, we should endeavour to have a healthy mix of quantitative and qualitative insights before making a decision. A great next step would be to qualitatively analyse user behaviour through surveys, heatmaps, or user session recordings to understand how users interact with the form and their motivations. We can use tools like HotJar for this. We can also interview recently converted users or lost prospects from both groups to understand how they perceived our sign-up flow and how much the signup friction influenced their decision.
Evaluate Business Impact: A 6% increase in conversions could be significant if you think about it in terms of lost revenue, but what does it mean in the context of the overall business, strategy and priorities? We need to weigh the impact of implementing the new form variant against the costs. Costs can include development time, opportunity cost, and other potential downsides, such as hindering our sales team’s ability to learn as much as they need about the users who attempt to sign up to our app.
These costs might turn out to be greater than the benefits of having ~6% additional conversions today. Our sales team could be using the more elaborate data captured in our current form, such as Name and Phone Number, to proactively reach out to all users who have attempted to sign up but not converted. In a B2B context, for example, with a high ACV (Average Contract Value) and multiple price points, empowering our sales team with such information might turn out to be worth more than the small percentage increase in conversions (especially if this increase is mostly observed for our most basic pricing tier). Before we consider minimising the data we collect on the form, we need to work with our Sales team to also analyse and quantify the impact they can have by using the additional data.
Iterate and Optimize: A/B testing doesn’t have to be a one-time thing, especially when we learn new information. Since product development is a process of continuous improvement, we can use our findings to formulate new hypotheses and run new or different tests.
Communicate the Results: If we decide to implement the successful variant, we need to clearly communicate the results and the reasoning behind our decision to all relevant stakeholders. In an upcoming article, we’ll discuss how to document decisions and their evolution using standard issue management tools such as JIRA and Notion and take an experiment-driven approach to defining PRDs.
While A/B testing can be powerful, it should not be done in isolation or have its results taken for granted. We should always consider our findings within the overall product strategy, business context and user experience.