Artificial Intelligence
Jul 2, 2022

Zero-inflated regression and explainability for customer retention forecasts

We developed new techniques to improve our ability to model zero-inflated data and to provide our users with the ability to more deeply understand their data.

Shanif Dhanani

At Apteo, we use a variety of learning algorithms and models to help our customers predict what their customers will do next, allowing them to create data-driven marketing campaigns and on-site optimizations based on their customers’ behavior. We have a particularly strong focus on repeat purchases and retention. By helping our brand customers take advantage of their existing customer data, we can help them figure out how to incentivize more purchases from their existing customers, an important aspect of growing their ecommerce businesses.

We continue to iterate on all parts of our product, including the data science infrastructure, and in this post we’ll be going over some of the data-specific optimizations we’ve made to our models and our UX. This post is written for those with a moderate familiarity with machine learning algorithms and techniques, but it may also appeal to anyone who wants to learn new ways to improve their A.I. models. You’ve been warned 😊.


As mentioned above, we use a variety of models to forecast several key metrics, including forward-looking customer lifetime values (CLV), the likelihood of a customer repurchasing a previous product, and the likelihood that a customer will buy a new product at some point in the future.

Until recently, these were individual models designed to forecast key metrics in the following manner:

  • The CLV model was a single regression model designed to forecast the future spend for every customer and combine that with their previous spend to come up with a single value for their customer lifetime value
  • The repeat purchases model was designed to forecast the number of days in the future that it would take for a customer to repurchase a product that they had already purchased
  • The model that predicted the likelihood that a customer would purchase a new product was a classification model that predicted the probability of purchase for every customer + product combination

We use the output of each model to create different product features in our SaaS app, including creating segments of customers that are likely to buy specific products, likely to repurchase in various timeframes, and likely to become high-value customers.

While our original setup was a good start towards helping brands identify what their customers would do next, there were two key issues that we wanted to address and improve upon.

Areas of optimization

Inflated zeros

An interesting thing about customer behavior data for ecommerce stores is that any metric related to retention tends to follow a strongly bimodal distribution. More specifically, for most brands, a large number of customers only purchase once, while a small number of customers make a large number of repeat purchases, driving the long-term value of the brand. These high-value repeat customers are sometimes referred to as whales.

This issue leads to a problem of “zero inflation.” Whenever we try to forecast the future behavior of a customer, there are a large number of “0s” in the label data. In addition, there are a small number of extremely large values due to whales.

Regression models, even sophisticated deep network models, can have a hard time learning the distribution of such data, even after transforming it using logs or Box-Cox. Because the values from the whales tend to inflate all predictions, these models rarely predict a “0”, when in fact most customers should have a 0 for their label. We set out to make these models more robust to this type of data distribution.
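To make the problem concrete, here is a minimal sketch on synthetic data. The 80/20 split and the log-normal spend distribution are assumptions chosen for illustration, not figures from our customers. It shows that a log transform leaves the spike at zero untouched, and that the mean label sits far from the typical customer.

```python
import math
import random
import statistics

random.seed(0)

# Hypothetical labels: 80% of customers never buy again, 20% spend a
# heavy-tailed (log-normal) amount, which produces a few "whales".
labels = [
    0.0 if random.random() < 0.8 else random.lognormvariate(4.0, 1.5)
    for _ in range(10_000)
]

# log1p maps 0 to 0, so the spike at zero is still there after transforming.
log_labels = [math.log1p(y) for y in labels]

zero_share_raw = sum(y == 0.0 for y in labels) / len(labels)
zero_share_log = sum(y == 0.0 for y in log_labels) / len(log_labels)

# The constant prediction that minimizes squared error is the mean, which
# sits well above the median customer (whose label is 0).
mean_label = statistics.fmean(labels)
median_label = statistics.median(labels)

print(f"share of zeros, raw labels:   {zero_share_raw:.2f}")
print(f"share of zeros, log1p labels: {zero_share_log:.2f}")
print(f"mean label: {mean_label:.2f}, median label: {median_label:.2f}")
```

The transform compresses the whales, but it cannot spread out a point mass: every customer with a label of 0 still has a label of 0 afterwards.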


The second issue we were hoping to solve was more of a usability problem. Specifically, brands wanted to understand the main factors driving our models’ predictions. The forecasts were useful on their own, but brands also wanted to break them down into their key drivers so that they could take action.

Until this most recent rollout, our product didn’t provide them with a way to dive deeper into these predictions.

Addressing the issue with better data science

We addressed these issues with two different data science techniques, which I’ll describe below.

Zero-inflated regression

We started with the problem of zero-inflated regression. As mentioned above, we looked into transforming our data using log and Box-Cox transformations, and we also looked into using different models (up until that point we had been using boosted trees). Unfortunately, neither of these approaches did a better job of identifying customers who were not likely to have any future activity, so we instead switched to using a meta model.

We first created a classification model designed to predict which customers would have future activity. We then combined the output of that model with the output of our existing regression model (which we re-trained by using only customers who had repeat purchase activity).

For any customers that the classification model predicted would have no future activity, we assigned a value of 0, and for everyone else, we used the output of the regression model. This simple change led to an improvement in the distribution of our predictions, with a notable number of customers now having a value of 0 assigned to their future activity.
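The combination step described above can be sketched in a few lines. This is an illustration rather than our production code: the function name, the 0.5 threshold, and the sample scores are all hypothetical.

```python
from typing import List, Sequence


def combine_predictions(
    p_active: Sequence[float],
    value_if_active: Sequence[float],
    threshold: float = 0.5,  # hypothetical cut-off, tune on validation data
) -> List[float]:
    """Return 0 for customers classified as inactive, else the regression output."""
    if len(p_active) != len(value_if_active):
        raise ValueError("both models must score the same customers")
    return [
        value if p >= threshold else 0.0
        for p, value in zip(p_active, value_if_active)
    ]


# Hypothetical scores for four customers.
p_active = [0.05, 0.62, 0.30, 0.91]          # classifier: P(any future activity)
value_if_active = [40.0, 85.0, 55.0, 310.0]  # regressor trained on repeat buyers only

combined = combine_predictions(p_active, value_if_active)
print(combined)  # [0.0, 85.0, 0.0, 310.0]
```

A common variation is to multiply the probability by the regression output instead of thresholding. That yields an expected value, but it rarely produces an exact 0, which is the behavior we were after here.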

Interestingly, a logistic regression model performed better on the new dataset (which consisted only of customers that had future purchase activity) than the gradient boosted tree model that we had been using before.

Global explainability

After updating our modeling strategy, we moved on to the task of helping our customers better understand how our models were generating their predictions. We dove into the most popular explainability techniques available today and narrowed our choices down to two: feature importance metrics produced as a byproduct of the modeling process, or global SHAP values aggregated from the individual SHAP scores for records in our validation dataset.

After coming across an excellent post from Scott Lundberg comparing the stability of the two techniques, we decided to go with global explainability metrics produced by aggregated SHAP values, as those tended to be more stable and consistent than the feature importance values from the models.
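For readers unfamiliar with the aggregation, a common way to turn per-record SHAP values into a global score is to average their absolute values per feature. The sketch below assumes that convention. The feature names and attribution values are made up for illustration, and in practice the per-record values would come from an explainability library or the training platform rather than being typed in by hand.

```python
from typing import List, Sequence, Tuple


def global_importance(
    feature_names: Sequence[str],
    shap_rows: Sequence[Sequence[float]],
) -> List[Tuple[str, float]]:
    """Rank features by the mean absolute per-record attribution."""
    if not shap_rows:
        raise ValueError("need at least one record")
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        if len(row) != len(feature_names):
            raise ValueError("each row needs one value per feature")
        for i, value in enumerate(row):
            # Use the magnitude so positive and negative effects don't cancel.
            totals[i] += abs(value)
    scores = [
        (name, total / len(shap_rows))
        for name, total in zip(feature_names, totals)
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical features and per-record attributions for three customers.
feature_names = ["days_since_last_order", "order_count", "avg_order_value"]
shap_rows = [
    [-1.2, 0.4, 0.0],
    [0.8, -0.6, 0.1],
    [-1.0, 0.2, -0.1],
]

ranking = global_importance(feature_names, shap_rows)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Taking the absolute value first matters: a feature that pushes some predictions up and others down would otherwise average out to roughly zero and look unimportant.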

Interestingly, we now had to figure out how to use these values from each of the two models we were now generating for our key metrics (i.e. whether to use global explainability metrics from the classification model or from the regression model).

We ultimately decided to use the regression model to provide explainability metrics for customer lifetime value forecasts, and the classification model for repeat purchase and new product prediction forecasts.

In all cases, our training system (BigQuery ML) provided us with a ranked list of features based on the attribution that each feature was assigned during the modeling process. Using these attribution scores, we were able to assign each feature a category of importance. For example, a feature with a large impact on the model would frequently have an attribution score at or above 1.0. For those types of features, our system automatically assigns a category of “Very High Impact”. Features with a score near 0 are assigned a category of “Minimal Impact”, and features with an attribution score of exactly 0 are assigned a category of “No Impact.”
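As a rough sketch, the mapping from score to category might look like the following. Only the “Very High Impact” cut-off at 1.0 and the “Minimal Impact” and “No Impact” labels come from the description above. The intermediate labels and the 0.5 and 0.1 thresholds are hypothetical placeholders.

```python
def impact_category(score: float) -> str:
    """Map a global attribution score to a human-readable impact category."""
    magnitude = abs(score)
    if magnitude == 0.0:
        return "No Impact"
    if magnitude >= 1.0:
        return "Very High Impact"
    if magnitude >= 0.5:  # hypothetical threshold and label
        return "High Impact"
    if magnitude >= 0.1:  # hypothetical threshold and label
        return "Moderate Impact"
    return "Minimal Impact"


# Hypothetical scores for a handful of features.
for score in (1.3, 0.6, 0.2, 0.02, 0.0):
    print(f"{score:>4}: {impact_category(score)}")
```

Keeping the thresholds in one small function makes it easy to adjust the buckets if attribution scores are distributed differently for a given brand.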

Using these scores and categories, we were able to create a visual table, combined with a human-readable name, description, and horizontal score “progress bar” indicator (as seen in the main image of this post) that provides our customers with the ability to quickly glance at their data to understand what factors are most important.

Furthermore, we allow customers to explore the relationship between each feature and the customer lifetime calculations using scatterplots and boxplots, with an example provided below.

Focusing on the user

As a data scientist, I geek out on any new cool projects that let me dive into data distributions and learning algorithms. But as a product manager and startup founder, I make sure that any new product feature we roll out to our customers can be useful and helpful for their ultimate goal.

Our zero-inflated regression meta models will help us create more accurate segments of customers for our brands, thus enabling them to realize higher conversion rates and return on ad spend. Our new explainability metrics, combined with some of the new segments we’ve created around customers likely to make a repeat purchase, will help brands (particularly those that have subscription products) better understand what drives their customers to make a repeat purchase, enabling them to create more targeted and accurate campaigns.

I look forward to working on more features and offerings like the ones above. As we roll them out, I’ll be sure to describe our process and techniques. The hope is that not only will our customers benefit from more sophisticated approaches to utilizing their data, but that data scientists may also learn new techniques and approaches to handling real-world data.

About the author
Shanif Dhanani
Co-Founder and CEO, Apteo

Shanif Dhanani is the co-founder & CEO of Apteo. Prior to Apteo, Shanif was a data scientist and software engineer at Twitter, and prior to that he was the lead engineer and head of analytics at TapCommerce, a NYC-based ad tech startup acquired by Twitter. He has a passion for all things data and analytics, loves adventure traveling, and generally loves living in New York City.