Hypothesis testing in product development
How to write an effective hypothesis
Consider an assumption you hold about how one factor could affect another in the context of your product. For example, for a fictional product manager at the imaginary CurioCity SaaS, an assumption might be that:
“adding generative-AI capability in our app will increase the engagement of our customer base.”
However, unlike a simple assumption, a hypothesis should be a testable proposition that is formulated based on existing knowledge, theory, or observations. Hypotheses need to be specific, measurable, and defined in a way that allows them to be tested through empirical research — meaning by experience and active observation.
One of the reasons the above assumption is not a hypothesis is that it isn’t specific enough. While generative AI sounds excellent, the assumption doesn’t specify what exactly we are trying to observe and therefore measure. Is it the number of prompts a user sends to the AI, or the time spent on other AI-enabled features?
A compelling hypothesis should involve at least two specific variables to observe and measure. Consider the following one:
“During their free trial, the total number of additional colleagues that a signed-up user invites to the app significantly affects their chance of converting to a paid subscription.”
The variables in this case are:
a) the number of colleagues a user invites
b) the conversion event.
Both can be measured and observed. In hypothesis testing, the first variable (the number of invited colleagues) is called the independent variable, and the second (the conversion event) is the dependent variable, the one affected by a change in the independent variable.
We’ve phrased this hypothesis in a specific enough way to allow us to prepare our experiments and investigate further. I call a hypothesis with this level of specificity a Level-1 type of hypothesis.
Refining hypotheses during product discovery with data analysis and other inputs
Now let’s consider a more complex hypothesis about the same variables:
“For every additional colleague a user invites to their free trial, the likelihood of conversion to a paid plan increases by 12%”.
Our hypothesis has been refined and is now even more specific than before. Previously we hypothesised that the number of invited colleagues affects conversions, but we had yet to specify whether it did so positively or negatively. In this refined hypothesis, we are being very explicit about the direction of the relationship between the two variables: we expect the conversion probability to increase as the number of invited colleagues increases. We are also being specific about the amount of change we expect to observe. As opposed to the basic Level-1 hypothesis we saw earlier, I call hypotheses with this additional level of specificity Level-2 hypotheses.
But how can we get this specific, and how do we know whether such inferences are justified in the first place? Why do we expect conversions to increase by 12% and not by some other amount?
While we could make an educated guess based on our and our colleagues’ experience of our users and products, these numbers weren’t chosen at random. As discussed above, defining hypotheses must follow a scientific approach informed by existing knowledge, theory, or observations. In this case, suppose that our product team has made specific inferences after observing patterns in our historical dataset and statistically analysing the impact of three factors on conversion (one being the number of invited colleagues). Based on these observations, we could hypothesise that the likelihood of conversion to a paid plan increases by ~12% for every additional invited user. You can read more about how we came up with this number in Part I.
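To make this more concrete, here is a minimal sketch of the kind of analysis a product team might run on historical trial data to arrive at a figure like “~12% per additional invite”. The column names, the CSV file, and the use of logistic regression are illustrative assumptions, not the exact analysis from Part I.

```python
# A minimal sketch: estimating how the number of invited colleagues relates
# to conversion, controlling for two other factors mentioned in this series.
# Column names and file are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# One row per signed-up trial user:
#   conversion          1 if the user upgraded to a paid plan, else 0
#   invited_colleagues  number of colleagues invited during the trial
#   onboarding_minutes  time taken to complete onboarding
#   channel             acquisition channel ("paid" or "organic")
df = pd.read_csv("trial_users.csv")

model = smf.logit(
    "conversion ~ invited_colleagues + onboarding_minutes + C(channel)",
    data=df,
).fit()

print(model.summary())

# exp(coefficient) is an odds ratio: a value of roughly 1.12 for
# invited_colleagues would correspond to about a 12% increase in the odds
# of conversion for every additional invited colleague.
odds_ratios = np.exp(model.params)
print(odds_ratios["invited_colleagues"])
```

A fitted coefficient like this is exactly the kind of observation that lets us phrase a Level-2 hypothesis with a concrete expected effect size rather than a guess.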
Reaching Level-2 specificity helps us assess whether the hypothesis is worth exploring in the first place, especially when compared to other priorities we might have. Depending on the business context, a 12% increase in conversions might be more or less important than a 9% reduction in churn rates proposed by another hypothesis on our backlog.
Does this mean we can’t propose a hypothesis until this sort of analysis is in place? Not at all. Even a basic Level-1 hypothesis, as long as it’s well-framed, can be more than enough to help us investigate. Refining it further to Level-2 will inevitably need to be part of our product discovery flow. The more specific we can make a hypothesis, the easier it will be to compare against others, ultimately helping us prioritise amongst multiple, potentially equally interesting pursuits.
Before you start defining Level-1 hypotheses, there is another type of hypothesis you should always formulate first. I could call it a Level-0 hypothesis, but I don’t need to: hypothesis testing already has the concept of the null hypothesis, and it can help us better prioritise our focus. Let’s discuss why.
Why Product Managers should first formulate a Null Hypothesis
A Null Hypothesis simply proposes that there will be no change or effect between the factors we are interested in. Framing the previous example as a Null Hypothesis looks like this:
“The number of additional colleagues a user invites during their free trial has no effect on conversion.”
Suppose you accept this hypothesis as true (i.e., no effect exists). In that case, if you test it (which you should, since all hypotheses need to be testable), any differences or effects you observe must be due only to chance or random variation rather than a real relationship between the two variables, invited users and conversions.
The null hypothesis typically acts as the default assumption until shown otherwise, and the goal of testing it is to determine whether there is enough evidence to indicate an effect beyond random chance or variation.
An alternate hypothesis contradicts the null hypothesis. The hypotheses we’ve discussed so far have all been alternate hypotheses because they stipulated that a significant (i.e. not coincidental) relationship between the two variables must exist.
Since the alternate and null are two sides of the same coin (one expecting no effect and the other some effect), you might be tempted to think that the null hypothesis is obsolete in the presence of an alternate. Why would you establish a null hypothesis when you could formulate an alternate one and try to validate that one straight away? Here’s why:
Let’s say you define a null hypothesis, test it, and find evidence supporting it; for example, that an increase in invited users doesn’t affect conversions. This information is very valuable because it confirms that the current way things work in our app is not a concern (i.e. as far as this variable is concerned, there is no effect on conversions, whether negative or positive).
In other words, we can decide to allocate resources elsewhere if a proposed change isn’t statistically predicted to be better than the default situation (represented by the null hypothesis). So where we would otherwise be planning, designing, and developing tactics to increase the number of invited users, we can instead shift our time and attention towards exploring other ways to impact conversions.
As an example, in Part I, we couldn’t validate that the marketing channel (“paid” or “organic”) via which users landed on our app had any predicted effect on whether users eventually purchased our product (converted). We didn’t need to investigate an alternate hypothesis further for that variable because the null hypothesis was accepted, and that was enough. So instead of spending more time investigating the effect of this variable on conversions, we focused on other variables: the time it takes for users to onboard and the number of invited users, changes in which seemed to be more statistically associated with the odds of conversion.
However, let’s say that we did observe an effect between independent variable X and conversions. How can we know that an effect we observe during testing is not down to random chance or coincidence? This is a critical question: if the effect turns out to be coincidental, we need to accept the null hypothesis (that no statistical effect exists), while if it turns out not to be coincidental, the independent variable does seem to affect conversions and we can accept the alternate hypothesis (that independent variable X affects conversions). To answer this, we can use a hypothesis test; let’s discuss one particular type of hypothesis testing, the Z-Test.
Simple product development hypothesis testing using a Z-test
There are a few statistical hypothesis tests we could implement. A common one is the Z-Test. It allows us to take a data sample and check whether the observed difference deviates from what we would expect under the hypothesis. Let’s look at an example:
From past data, you know that the average conversion rate of your newsletter’s signup form has been 69%. Suppose you’ve recently made an improvement to the form, and in the last couple of months you’ve observed conversions increase to 71%. How can you know whether this increase was coincidental or whether your improvement actually affected conversions? All we need to figure this out is a null hypothesis to test with a Z-Test:
Null hypothesis:
“The specific change has no effect on the conversion rate.”
Z-Test:
Z-Score = (X - μ) / σ
Let’s break the formula down. The Greek letter μ denotes the expected population average under the hypothesis we have proposed; in our case, this is the historical average conversion rate of 69%. X, on the other hand, is the observed average, in this case the 71% conversion rate we have observed recently.
Lastly, σ stands for the standard error, which measures the variability of our measurements. Calculating it requires knowing the standard deviation and the sample size (for example, the total number of form submissions, which could be 50). Scribbr has a great article on calculating the standard error, but for this article we will assume we have already calculated a standard error of 0.75. Now let’s plug everything into the formula:
Z-Score = (71 - 69) / 0.75
The z-score comes out at roughly 2.7, indicating that the observed difference in conversions is 2.7 standard deviations away from what we’d expect under the null hypothesis. OK, but what does this mean? How do we know whether 2.7 is a large or a small score? How do we know whether being 2.7 standard deviations away from the expected average suggests the change in conversion was coincidence or the result of our work? We can’t infer that directly from the z-score. However, once we have it, we can use software to find the corresponding p-value, which will tell us whether the deviation from the expected mean was significant or just coincidental.
A p-value represents the probability of getting a z-score as extreme as or more extreme than 2.7, given that the null hypothesis is true (i.e. that the specific change has no effect on the conversion rate). Suppose the p-value we get from the software is below 0.05. A p-value below a chosen threshold (often 0.05 or 0.01) means that our observation (the 2-percentage-point increase in conversions) is unlikely to have arisen by chance alone. In other words, if the null hypothesis were true, there would be less than a 5% chance of observing a z-score this high or higher. Therefore, it’s more likely that the independent variable caused the change than that the change happened coincidentally.
To calculate the p-value, we can rely on a range of tools, from Excel to BI software, depending on the complexity of the use case. In our simple scenario, the z-score of 2.7 results in a p-value of about 0.003, much less than the threshold of 0.05 (you can use this online calculator to calculate it as well). Now that we have run a Z-Test and obtained a p-value for the z-score, we can more confidently say that the change we made is statistically associated with the increased conversions of the last couple of months.
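If you’d rather script this than reach for a calculator, here is a minimal Python sketch of the same calculation. The 69%, 71%, and 0.75 standard error come from the example above; the library choice and the one-sided framing are illustrative assumptions.

```python
# A minimal sketch of the Z-Test above, using scipy.
from scipy.stats import norm

observed_mean = 71.0   # conversion rate observed after the change (%)
expected_mean = 69.0   # historical conversion rate under the null hypothesis (%)
standard_error = 0.75  # taken as given here; roughly std deviation / sqrt(sample size)

z_score = (observed_mean - expected_mean) / standard_error
p_value = norm.sf(z_score)  # one-sided: P(Z >= z) if the null hypothesis is true

print(f"z-score: {z_score:.2f}")  # ~2.67, rounded to 2.7 in the text
print(f"p-value: {p_value:.4f}")  # ~0.004 (~0.003 if z is rounded to 2.7)

alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: the change is statistically significant.")
else:
    print("Fail to reject the null hypothesis.")
```

Either way, the conclusion is the same: the p-value sits well below the 0.05 threshold.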
Other ways to test hypotheses
The Z-Test is only one of a few statistical methods we can use to validate hypotheses. We will need to choose different tests depending on the research question and factors like the size of the sample population. In the previous example, the type of research question and information we had available made using a Z-test more applicable.
A Z-test typically suits a research question comparing two means (for example, whether the average conversion rate after our changes significantly deviates from the historical average). As a rule of thumb, a Z-test also requires a large sample size of over 30 observations and a known standard deviation (remember that the standard deviation was part of calculating the standard error in the denominator, which we didn’t cover here). Check this article to learn more about the various testing methods and when to use each one.
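As a quick illustration of how that choice plays out, here is a minimal sketch of a common small-sample alternative, a one-sample t-test, which is often preferred when the standard deviation is unknown and there are few observations. The daily figures below are made up purely for illustration.

```python
# A minimal sketch: small sample, unknown standard deviation -> one-sample t-test.
from scipy.stats import ttest_1samp

# Hypothetical daily conversion rates (%) for the two weeks after the change
daily_conversion = [70.1, 72.3, 68.9, 71.5, 73.0, 69.8, 70.7,
                    72.1, 71.9, 68.5, 70.4, 73.2, 71.1, 72.6]

# Null hypothesis: the mean conversion rate is still 69%.
# alternative="greater" tests for an increase specifically.
result = ttest_1samp(daily_conversion, popmean=69.0, alternative="greater")
print(f"t-statistic: {result.statistic:.2f}, p-value: {result.pvalue:.4f}")
```

The research question is the same; only the test changes to match the data we actually have.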
Incorporating hypothesis testing in product development flows
Setting up effective experiments is the cornerstone of our data-driven decision-making as Product Managers. Hypothesis testing provides a vital framework, empowering us to form clear assumptions and rigorously validate them through observation and measurement. In practice, this requires thinking about product development as a set of “experiments”.
However, instead of experiments, we often juggle multiple responsibilities, from organising product requirement documents (PRDs) to prioritising backlogs. In an upcoming article, we will explore how to frame experiments in practice, seamlessly integrating them into our product development flow.
In the meantime, remember that beyond a hypothesis, experiments require a test. The following article will walk through a practical implementation of an A/B test, from setup to observation, so you can gain hands-on experience conducting experiments.
By embracing experimentation in practice, we can enhance our ability to make informed choices and optimise our products through continuous learning and improvement.