Photo by Carlos Muza on Unsplash

Analysing SaaS conversions with Logistic Regression —Part 1: basic statistical analysis for Product Managers

Stephanos Theodotou


Welcome to CurioCity (CC Inc), an innovative B2B SaaS company at the forefront of product analytics. As a newly appointed product manager, you’ll be responsible for the web analytics product.

CurioCity’s product suite includes tracking scripts and SDKs for comprehensive client and server-side data capture. Lately, the company has been gearing up for product-led growth, emphasising awareness, acquisition, and activation goals. The marketing efforts have been driving even more prospects into the inbound funnel, flooding the free trial of web analytics, which is your responsibility.

Once users enter the app, they’re guided through an onboarding walkthrough designed to help them understand the value they can get from the product. Every interaction and minute spent on the platform may influence their purchase of a Pro License. Your mission as the new product manager is to ensure a smooth onboarding journey that delivers value to users and drives conversions for CC.

The onboarding process sets the tone for the entire user journey, making it a critical component of the growth funnel. Assisting you in this is Jessica, a seasoned product analyst who can help identify usage patterns and optimise this essential aspect of our growth strategy.

Analysis setup

One of the first things you want to understand is how different factors affect the user’s conversion after the onboarding journey. There is a suite of analyses that Jessica could implement. For this article, we will specifically discuss Regression, one of the analyses best suited to this type of research question.

As Product Managers, we often can’t be as well versed or hands-on in statistical analysis as our analysts, but we need to understand the most appropriate tools available to us and our team so that we can interpret the results and inform decision-making. In this article, we will examine at a high level what Regression is, why it can help, and how to carry out a simple regression analysis in Excel (though your analytics team can use more powerful statistical tools). Finally, we will use the findings in subsequent articles to inform our next product development decision.

Regression analysis will help us identify and quantify the relationship that multiple factors might have with a particular aspect of our product, in this case, conversion. For our purposes, we will use factors such as:

  • Time to Onboard: T2O measures the time it takes a user from the time they’ve signed up to the time they complete what we have defined as the “onboarding journey” — in our case, the semi-guided onboarding walkthrough that takes users through the various fundamental steps to make the most of our app.
  • Invited Users: When users sign up to our app, they get their own organisation and become org admins. This allows them to invite more users to the app by going to their settings, then organisation settings and team management. We count how many additional users they invite.
  • Channel: This is the source from which the user landed on our signup page. We’ve categorised this into “Paid”, i.e. the user landed on our app via a paid advertisement (such as a Google Ad), and “Organic” (not paid).

Introduction to Logistic Regression

There is more than one type of regression analysis we can use when trying to understand how certain factors relate to an outcome, such as conversion in our product. The type of data we want to assess determines whether we should opt for a logistic regression or a linear regression.

Since we are trying to understand the effect of various factors on conversion, we will name conversion the outcome variable of our analysis. When the outcome variable is binary, meaning it can only ever be one of two values (such as either a “Yes or No” or “Converted” and “Not Converted”), then we can use the Logistic type of regression.

Logistic regression is a statistical method that allows us to model a binary outcome (like whether a user converts or not) as a function of one or more variables. In our case, there are 3 variables: the Time to Onboard (T2O), the number of additional Invited Users and the Channel from which users landed on our app. Simply put, this analysis will help us establish whether a change in these variables seems to predict the probability that someone will convert, and to what extent.

The math of logistic regression

Logistic regression comes down to an equation we can evaluate for each of the records in our dataset:
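
In the notation used throughout this article (Y for the predicted outcome, a for the constant, b1 to b3 for the coefficients and X1 to X3 for the factors), the standard logistic regression equation can be written as below. The second line plugs in the example factor values discussed in the following paragraphs (35 minutes, 4 invited users, paid channel encoded as 1), leaving a and the b's as symbols for now:

```latex
Y = \frac{1}{1 + e^{-(a + b_1 X_1 + b_2 X_2 + b_3 X_3)}}

Y = \frac{1}{1 + e^{-(a + b_1 \cdot 35 + b_2 \cdot 4 + b_3 \cdot 1)}}
```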

Don’t be discouraged by how abstract the formula looks now. We won’t do the math by hand when we implement this later; dedicated tools will do it for us, but we need to understand how the formula is composed at a high level. As you can see, the equation is made up of algebraic variables, which we need to replace with the factors we want to investigate.

Y is the conversion outcome we are trying to predict (expressed as a probability between 0 and 1) for a given set of factor values. The “Xs” are replaced by the factors of interest; in our case, X1 is replaced with a Time to Onboard of 35 minutes, X2 with 4 Invited Users and X3 with the Channel (since this is a binary option of whether the source was “paid” or “organic”, we can encode it as 1s and 0s, so we use 1 for paid).
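
To make the substitution concrete, here is a minimal Python sketch of the logistic formula. The function name `conversion_probability` and the constant and coefficient values are hypothetical and chosen purely for illustration; only the factor values (35 minutes, 4 invited users, paid channel) come from the example above.

```python
import math

def conversion_probability(a, coefficients, factors):
    """Logistic formula: 1 / (1 + e^-(a + b1*X1 + b2*X2 + ...))."""
    linear_part = a + sum(b * x for b, x in zip(coefficients, factors))
    return 1.0 / (1.0 + math.exp(-linear_part))

# Hypothetical constant and coefficients, for illustration only.
a = -0.5
b = (-0.02, 0.2, -0.05)  # b1 (time), b2 (invited users), b3 (channel)

# Factor values from the example: 35 minutes, 4 invited users, paid channel (1).
x = (35, 4, 1)

p = conversion_probability(a, b, x)
print(f"Predicted probability of conversion: {p:.3f}")
```

Whatever values go in, the result always lands between 0 and 1, which is what makes this formula suitable for a yes/no outcome.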

The sister of Logistic regression is Linear regression. We would use that when the outcome variable is continuous rather than binary, such as when predicting a customer’s spending (which could be any value within a range).

In fact, Logistic regression is partly based on linear regression. The negative exponent of the e in our logistic equation above is the Linear regression formula:
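
In the same notation (a for the constant, b1 to b3 for the coefficients, X1 to X3 for the factors), the linear regression formula is simply the expression that sits inside the exponent of the logistic equation:

```latex
Y = a + b_1 X_1 + b_2 X_2 + b_3 X_3
```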

Let’s pause with theory now and go ahead and implement a logistic regression analysis on our dataset using Excel.

Implementing a simple Logistic Regression in Excel with RegressIt

While we could use a few other tools or write some code, we can still get a decent result just by using Excel. We will run a basic logistic regression using the RegressIt Logistic add-in for Excel (only available for PCs, however), which will be more than enough for this article.

Sample dataset

Our dummy dataset includes historical data about 1000+ users and the time it has taken them to complete their onboarding in minutes (time_to_onboard). It also includes a binary value for conversion (where 1 means the user has purchased a subscription and 0 means they have not) and a binary value for the channel (where 1 indicates “paid” and 0 indicates the channel was “organic”). Lastly, invited_users shows the number of additional colleagues a new user has successfully invited to their free trial.
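
If you want a similar dummy dataset to experiment with, the sketch below generates one in Python. The column names follow the description above, but the value ranges and the rule that decides who converts are assumptions invented for this example, not the data behind the article's results.

```python
import csv
import io
import math
import random

random.seed(42)  # fixed seed so the dummy data is reproducible

def make_user():
    """Generate one made-up trial user (the generating rules are assumptions)."""
    time_to_onboard = random.randint(5, 90)  # minutes
    invited_users = random.randint(0, 8)     # colleagues invited
    channel = random.randint(0, 1)           # 1 = paid, 0 = organic
    # Arbitrary rule: faster onboarding and more invites convert more often.
    z = 0.5 - 0.03 * time_to_onboard + 0.2 * invited_users
    p = 1.0 / (1.0 + math.exp(-z))
    conversion = 1 if random.random() < p else 0
    return {
        "time_to_onboard": time_to_onboard,
        "invited_users": invited_users,
        "channel": channel,
        "conversion": conversion,
    }

rows = [make_user() for _ in range(1000)]

# Write the rows as CSV text that could be saved and opened in Excel.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

print(csv_text.splitlines()[0])  # header row
```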

In Excel, having added the Logistic RegressIt plugin, we have more options under the RegressIt menu on the Excel ribbon. The first thing we want to do is select all cells with data and then hit the Create Names command in the menu. This will help us configure the next steps in our logistic regression analysis.

Having created names, we can go ahead and click on Logistic Regression on the left side of the RegressIt ribbon menu.

In the pop-up that opens we are asked to configure our analysis. There are a few things we should check here:

  • Make sure the Confidence level is set at 0.95 and the cut-off at 0.5
  • In the analysis section, select a combination of Logit & Exponentiated tables and use coefficients with p-values
  • Finally, you should be able to see a list of our variables (these were added after selecting “Create Names” earlier). Make sure to select the variables you want to test (time_to_onboard, channel and invited_users in our case) under the Independent Variables. Conversion should not be checked here. Instead, select Conversion as the dependent variable from the dropdown above the Independent variables.
  • Now go ahead and click “Run”
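
For readers curious about what a tool does when we click “Run”, the sketch below fits a logistic regression in plain Python using gradient ascent on the log-likelihood. This is a stand-in technique chosen for readability, not a description of how RegressIt computes its results, and the data is synthetic with made-up "true" effects. The function names are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, learning_rate=0.5, iterations=500):
    """Estimate the constant and coefficients by gradient ascent."""
    n, k = len(X), len(X[0])
    # Standardise each column so one learning rate suits all of them.
    means = [sum(row[j] for row in X) / n for j in range(k)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n) or 1.0
            for j in range(k)]
    Z = [[(row[j] - means[j]) / stds[j] for j in range(k)] for row in X]

    a, b = 0.0, [0.0] * k
    for _ in range(iterations):
        grad_a, grad_b = 0.0, [0.0] * k
        for z_row, target in zip(Z, y):
            error = target - sigmoid(a + sum(bj * zj for bj, zj in zip(b, z_row)))
            grad_a += error
            for j in range(k):
                grad_b[j] += error * z_row[j]
        a += learning_rate * grad_a / n
        b = [bj + learning_rate * gj / n for bj, gj in zip(b, grad_b)]

    # Convert back to the original (unstandardised) units.
    coefs = [bj / sj for bj, sj in zip(b, stds)]
    intercept = a - sum(c * m for c, m in zip(coefs, means))
    return intercept, coefs

# Synthetic data: time hurts conversion, invites help, channel has no effect.
random.seed(0)
X, y = [], []
for _ in range(500):
    t, inv, ch = random.randint(5, 90), random.randint(0, 8), random.randint(0, 1)
    X.append([t, inv, ch])
    y.append(1 if random.random() < sigmoid(0.5 - 0.03 * t + 0.2 * inv) else 0)

intercept, (b_time, b_invited, b_channel) = fit_logistic(X, y)
print(f"a={intercept:.3f}, time={b_time:.3f}, "
      f"invited={b_invited:.3f}, channel={b_channel:.3f}")
```

A dedicated statistics package would also report standard errors and p-values, which this sketch leaves out.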

Evaluating results: Statistical significance of time, channel and invited users and their relationship to conversion

We are finally here. The logistic regression analysis has provided valuable insights into the potential effect of Time to Onboard, Invited Users and Channel on customer conversions. Hold your excitement; we still need to interpret the results. RegressIt makes quite a few outputs available to us, so let’s look at the most important ones for our simple example: the coefficients table and the exponentiated coefficients table.

Let’s start with the coefficients table.

From here, we only need the coefficient and p-value for each of our three factors. Here’s what these two data points say:

Time to Onboard:

The negative coefficient for this variable shows a negative relationship with conversion (our outcome variable). This means that as Time to Onboard increases, we can expect conversion probability to decrease. At this point, you might ask: is the value of the coefficient showing me the probability of conversion? It isn’t. The coefficient is the expected change in the log of the odds of conversion for each one-unit increase in the variable. In this form it is hard to interpret directly, so for now it is only helpful in showing us the direction, i.e. whether a variable predicts conversion positively or negatively.

Now let’s check the p-value for Time to Onboard. This is under 0.05, which, as a rule of thumb, means the result is statistically significant. So for this article, we can treat this result as meaningful enough to consider further. The p-values in this table come from a hypothesis test that RegressIt has run for us automatically. Each p-value indicates how likely we would be to see an effect at least this strong purely through random chance if the predictor had no real relationship with conversion; a small p-value suggests the coefficient reflects more than coincidence.
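
One common way to obtain such a p-value is a Wald test, which divides the coefficient by its standard error and looks the result up on a normal curve. The sketch below assumes that approach; the article does not state which test RegressIt runs, and the coefficient and standard error used here are hypothetical rather than taken from the RegressIt output.

```python
import math

def wald_p_value(coefficient, standard_error):
    """Two-sided p-value for a coefficient, using the normal approximation."""
    z = coefficient / standard_error
    # Standard normal CDF written with the error function.
    cdf = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    return 2.0 * (1.0 - cdf)

# Hypothetical coefficient and standard error, for illustration only.
p = wald_p_value(coefficient=-0.03, standard_error=0.01)
print(f"p-value: {p:.4f}")
print("significant at 0.05" if p < 0.05 else "not significant at 0.05")
```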

You can find out more about p-values and their significance in hypothesis testing in this article.

Invited Users:

If we now apply the same thinking to the coefficient of Invited Users, we can conclude that since this is a positive coefficient (we aren’t interested in the actual value yet, only whether it’s positive or negative), every time the number of Invited Users increases, we can expect the odds of conversion to increase too.

The p-value for invited users is also under 0.05, meaning the result is statistically significant enough to consider.

Channel:

Starting from the p-value for the channel, we can see that, in this case, it’s well over 0.05. So we will not consider it a statistically significant predictor of conversions. This makes empirical sense, assuming that we have a well-targeted marketing campaign that brings in leads that would be interested in signing up for our product.

In my experience, however, the channel has in the past been related to failed conversions. As we discovered from further tests, this was due to marketing campaigns that did in fact attract prospects looking for a product like ours, but who seemed to expect more advanced functionality from it. It seems our website failed to explain the limitations of our product. As a result, users would sign up but realise right after the product onboarding that this was not what they needed, and therefore didn’t purchase.

We’ve established the direction of the relationship between our factors and conversion (i.e. whether they seem to be positively or negatively associated with it) and their significance (i.e. whether the association is statistically significant enough to consider them further).

As we’ve deemed the Channel not to be significant enough for further consideration, we will next proceed to quantify the extent to which Time to Onboard and Invited Users appear to predict conversion. We will do this by looking at yet another coefficient — the exponentiated coefficient. This coefficient will help us answer the everlasting product management question of whether we should focus on improving one or the other first. Before we get to that, let’s discuss the mathematics behind how these coefficients affect the extent of the predicted odds of conversion.

The role of coefficients

If you’ve read the article from the beginning, you would have already come across these coefficients. Remember the algebraic variables b1-b3 we left behind and didn’t substitute with anything? These are the coefficients that our tool has calculated for us. They are the key to answering the question that logistic regression is poised to answer: “Assuming all else being equal, what is the probability that a customer will convert if Time to Onboard is 55 minutes?” But, as we mentioned, the value of the coefficient (for example, -0.047 for Time to Onboard) isn’t in a helpful enough format to derive the odds of conversion, but it will help us get to it. You can start seeing how by replacing the values of the coefficients with the values we got in our analysis for each of the three variables we are examining in the logistic regression formula:

Here, b1 is the coefficient for Time to Onboard, b2 is the Invited Users coefficient, and b3 is the coefficient for the source (1 representing “paid” and 0 representing “organic”). As you can see, each coefficient is multiplied by the value of the factor we want to test at any given time. For example, given the coefficients we got from our tool, the above formula can look like this for a Time to Onboard of 35, 4 Invited Users and a paid source, each with their respective coefficients (-0.047, -0.954, -0.067).

Since each of the coefficients is multiplied by the given value of one of our factors each time, both their magnitude and direction (negative or positive) determine the extent to which these variables affect the odds of conversion.

Our tool also calculates and provides us with the value of a (the constant, or intercept), but we won’t discuss it in this article. So all that is left now is the constant e. This is a standard mathematical constant (roughly 2.718) that we won’t get into here, but you can see that whatever values we have above will always end up in the exponent of e. This is why the coefficient in its original form isn’t helping us yet. We still need to exponentiate each coefficient b to see how the odds of conversion change for a given change in Time to Onboard, Invited Users or Source.

Understanding the extent of the predicted effect of “time to onboard” and “invited users” on conversion

RegressIt has also done this for us in the Exponentiated Coefficient table (an exponentiated coefficient is the same as the odds ratio). The Exponentiated Coefficients give us a better sense of how the odds of conversion are predicted to change for each one-unit change in our variables.

Let’s interpret what this means for T2O and Invited Users:

  • Time to Onboard: The odds ratio of 0.976 for “time_to_onboard” means that for every unit increase in “time_to_onboard” (one extra minute), the odds of the outcome being ‘1’ (e.g., conversion happening) decrease by 2.4% (since 0.976 is a 2.4% decrease from 1), assuming all other variables in the model are held constant. In other words, a longer Time to Onboard is associated with lower odds of conversion, which is expected given the make-or-break nature of product onboardings in our industry.
  • Invited Users: The exponentiated coefficient of 1.21 for Invited Users means that for every unit increase in ‘invited_users’, the odds of the outcome being ‘1’ (e.g., conversion happening) increase by 21%, assuming all other variables in the model are held constant.
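
The percentages above follow directly from the odds ratios, as this small Python sketch shows. The helper name `percent_change_in_odds` is hypothetical; the odds ratios 0.976 and 1.21 are the ones quoted above. The sketch also shows that changes of several units multiply rather than add up.

```python
def percent_change_in_odds(odds_ratio, units=1):
    """Percentage change in the odds after `units` one-unit increases."""
    return (odds_ratio ** units - 1.0) * 100.0

# Odds ratios quoted in the bullets above.
time_or = 0.976
invited_or = 1.21

print(round(percent_change_in_odds(time_or), 1))      # one extra minute
print(round(percent_change_in_odds(invited_or), 1))   # one extra invited user
print(round(percent_change_in_odds(time_or, 10), 1))  # ten extra minutes
```

Because the effect compounds, ten extra minutes of onboarding corresponds to roughly a 21.6% drop in the odds rather than ten times 2.4%.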

Association vs Causation

We can see a predicted association between our factors and conversion, but an association may not indicate causation. In other words, just because a relationship seems to exist in our data, it doesn’t mean that, in reality, one factor explicitly causes the other (i.e., that more invited users will cause more conversions). Models such as this one can only ever consider a limited set of factors. They cannot be regarded as a definitive interpretation of the real world. However, they can help us experiment further by forming a hypothesis about real life.

Forming a hypothesis allows us to set focused, testable experiments and drive our product development initiatives based on data-driven insights. In fact, when we first decided to analyse historical data about these three specific factors, we already had some hypotheses in mind we wanted to explore, which we didn’t cover in detail in this article. When we then checked the p-values of each of our variables, we were in practice doing a hypothesis test to check whether the observed effects on conversions had been the result of random chance or something more interesting. Check the following article about forming and testing hypotheses for more.

Given that our logistic regression model showed an association between the predictor and outcome variables, and provided we can validate this with qualitative information (such as by talking to our users and analysing the market), it might be reasonable to hypothesise a causal relationship between Time to Onboard, Invited Users and Conversions. We can check this hypothesis too by refining our expectations and executing a real-life experiment such as an A/B test to see whether our hypothesis holds up (more in an upcoming article about implementing A/B tests).

Limitations

One gotcha to look for is multicollinearity, which arises when independent variables in the model are highly correlated with each other. This can distort the coefficients and p-values, making them unreliable. As a product manager, you may not need to understand the intricate details of implementing logistic regression models, but it’s crucial to work closely with your data analysis team to select variables carefully. If multicollinearity is a concern, tools like RegressIt can provide the Variance Inflation Factor (VIF) value, which helps to quantify the severity of multicollinearity in the model.
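
As a rough illustration of what VIF measures, the sketch below computes it in Python for the simplest case of just two predictors, where VIF equals 1 / (1 - r²) and r is their Pearson correlation. With three or more predictors, the r² term comes from regressing each predictor on all the others, which this sketch does not cover. The data values and function names are made up for the example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def vif_two_predictors(xs, ys):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r ** 2)

# Made-up values in which slower onboarders also tend to invite fewer users.
time_to_onboard = [10, 20, 30, 40, 50, 60]
invited_users = [5, 3, 4, 1, 2, 0]

vif = vif_two_predictors(time_to_onboard, invited_users)
print(f"correlation: {pearson_r(time_to_onboard, invited_users):.2f}")
print(f"VIF: {vif:.2f}")
```

A VIF of 1 means no correlation with the other predictor. Commonly cited rules of thumb treat values above roughly 5 or 10 as a warning sign, though the right threshold is a judgment call for your analyst.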

Another crucial aspect to consider is the model’s accuracy. RegressIt provides classification tables, which show how often the model’s predictions match the actual outcomes at the chosen cut-off, broken down into correctly and incorrectly predicted conversions and non-conversions. This can help you understand which outcomes are being predicted more accurately than others.
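
To show what a classification table contains, here is a minimal Python sketch that builds one from predicted probabilities and actual outcomes, using the 0.5 cut-off we configured earlier. The probabilities and outcomes are invented for the example, and the layout is an assumption rather than a copy of RegressIt's output.

```python
def classification_table(probabilities, actuals, cutoff=0.5):
    """Count correct and incorrect predictions at the given cut-off."""
    table = {"true_pos": 0, "false_pos": 0, "true_neg": 0, "false_neg": 0}
    for p, actual in zip(probabilities, actuals):
        predicted = 1 if p >= cutoff else 0
        if predicted == 1 and actual == 1:
            table["true_pos"] += 1
        elif predicted == 1 and actual == 0:
            table["false_pos"] += 1
        elif predicted == 0 and actual == 0:
            table["true_neg"] += 1
        else:
            table["false_neg"] += 1
    return table

def accuracy(table):
    """Share of all predictions that matched the actual outcome."""
    correct = table["true_pos"] + table["true_neg"]
    return correct / sum(table.values())

# Made-up predicted probabilities and actual conversions for six users.
probs = [0.81, 0.35, 0.62, 0.10, 0.55, 0.48]
actual = [1, 0, 0, 0, 1, 1]

table = classification_table(probs, actual)
print(table)
print(f"accuracy: {accuracy(table):.2f}")
```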

What we learned and what to do next

  1. Identifying significant factors: Regression analysis helped us determine which factors may have a statistically significant effect on conversions based on our historical data. The number of invited users and the time it takes to onboard seem to affect conversion, while the marketing channel doesn’t. With this information, we can steer conversations in the right direction and rally resources to analyse further and test improvements.
  2. Quantifying relationships: Regression allowed us to put a number to the strength and direction of the effect between our variables and conversions. This will help us prioritise and focus on the most valuable optimisations. The patterns in our historical data suggest that with every minute onboarding is delayed, the odds of conversion decrease by 2.4%. On the other hand, with every colleague a user invites to their free trial, the odds of conversion increase by 21%. While reducing onboarding time in any app is a sensible endeavour, the magnitude of the effect of invited users is far greater in our case, and we might choose to target it first.
  3. Apply business context: Statistical analysis alone can’t decide our next steps; it must be accompanied by context specific to our business and market. For example, in the case of Time to Onboard, even if the predicted effect on conversions is smaller, it might be cheaper and easier to fix or align better with our strategic aims.
  4. Predictive modelling: Having established the predicted effects with regression, we can use the model to predict future conversions based on changes in the independent variables. For example, suppose our users’ Time to Onboard is 35 minutes. In that case, we can benchmark the predicted probability of conversion as it stands today against what it would be if we optimised it to 25 minutes, by plugging these values, together with the constant and coefficients we get from our tool, into our logistic regression formula.
  5. A/B testing setup: After establishing an association between a factor and conversion, the regression analysis can now guide real-life A/B testing where we can further test our assumption and not rely solely on statistical correlation. The regression can provide a baseline for comparison for our A/B test. (We’ll discuss a practical A/B test in a future article).
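
The 35-versus-25-minute comparison from point 4 can be sketched in Python as follows. Holding everything else equal, plugging new values into the logistic formula is equivalent to scaling the baseline odds by the odds ratio once per unit of change, which is what this sketch does. The odds ratio of 0.976 comes from the analysis above; the 20% baseline probability and the function names are hypothetical.

```python
def probability_to_odds(p):
    return p / (1.0 - p)

def odds_to_probability(odds):
    return odds / (1.0 + odds)

def shifted_probability(baseline_p, odds_ratio, unit_change):
    """Predicted probability after changing one factor by `unit_change` units."""
    new_odds = probability_to_odds(baseline_p) * odds_ratio ** unit_change
    return odds_to_probability(new_odds)

# Hypothetical: 20% of users convert when onboarding takes 35 minutes.
baseline = 0.20

# Cutting onboarding from 35 to 25 minutes is a change of -10 units.
faster = shifted_probability(baseline, odds_ratio=0.976, unit_change=-10)

print(f"at 35 minutes: {baseline:.1%}")
print(f"at 25 minutes: {faster:.1%}")
```

The same helper can be reused for Invited Users by passing its odds ratio of 1.21 and a positive unit change.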

Read Part 2. Hypothesis Testing in Product Development


Written by Stephanos Theodotou

I'm a web developer and product manager merging code with prose and writing about fascinating things I learn.
