An end-to-end overview of the data science and software work that goes into creating product recommendations at Apteo.
At Apteo, we believe that intelligently using data can lead to better business decisions, higher sales, and faster growth. In the past year, we’ve focused on applying this philosophy to the world of ecommerce marketing.
We’ve seen how much more effective marketing campaigns are when they incorporate personalization features. Marketing campaigns that target your biggest spenders need to have different offers than marketing campaigns that target your dormant customers.
Emails that contain products specific to each individual customer are more likely to be seen and are more likely to lead to a sale than an email that contains a generic list of best sellers.
And ads that contain products that a customer is more likely to buy can have lower customer acquisition costs and higher conversion rates than generic product ads.
Personalized product recommendations have been a poster child for data science for a while now. In the rest of this article, we’ll take a closer look at how we at Apteo handle the technical work that powers our personalized product recommendations. Our goal is to show you, from soup to nuts, exactly how we create and productize them.
Note that this article will be a lot more technical than what we normally produce, and is primarily intended for software engineers and data scientists. If you’re looking for a higher-level overview, we recommend taking a look at a previous article we wrote titled “What Goes Into A Cross-Sell Recommendation”.
As with any technical project, we need to be clear on what we’re trying to accomplish. Apteo’s product provides two key personalization features:
Given these product requirements, we know that we’ll need a system that can predict probabilities. Specifically, we’ll need a system that can predict the probability that a particular customer will buy a particular product (or no product at all) going forward. Defining the requirements in a simple way like this might seem superfluous, but we’ve found that with most machine learning projects, it’s incredibly helpful to keep the end in mind.
The definition above can also help us get a sense of the type of modeling process and features (individual data attributes) that we’ll use. Using the requirements above, we know that we’ll need a system that can take into account the latest data we have for all of our customers within a store and generate measurements of how likely those customers are to buy specific products within that store. This leads us to using a regression model that combines data related to customer behaviors, demographics, and characteristics with data related to products and overall store characteristics.
Now that we have a sense for the overall requirements and data science process we’ll use, we can start to get specific about how we’ll structure the data. We’ll first need to specify what each observation represents.
Since we know that we’ll be training an algorithm to predict probabilities using the latest data that we have for customers, products, and the overall store, we can begin to envision a data structure that will enable an algorithm to learn how the combination of these data points leads to future purchases.
We know that customer behavior changes over time, and that a customer’s future relationship with each store will be influenced by their previous relationship with that store. This means that either the modeling process or the data structure will need to take into account some sense of change over time.
At Apteo, we account for this process by structuring each observation as a snapshot of a specific point in time for each customer. Each snapshot contains a record of what’s known about that customer at that particular point in time. As a customer changes over time, new snapshots can be generated, and those new snapshots can provide a model with context and a set of references that it can use to identify patterns for how people’s behavior will be modified over time.
The obvious concern that arises with this approach is that the frequency of snapshot creation is unclear. If we create snapshots that are too close together in time, it would not only introduce bias into our model, but it would also increase data storage needs and costs. Alternatively, if we create snapshots that are too far apart, the model may not be able to glean enough information from some of the time-specific features that we’ll use in our modeling (more on that below).
In order to come up with an acceptable solution to this problem, we can rely on what we know about customer behavior to come up with a variable-frequency snapshot generation process. Specifically, we generate snapshots by recording what we know about a customer immediately prior to them making a purchase. We supplement these snapshots with one additional snapshot for each customer that has gone dormant. This allows our models to discover patterns in data that are indicative of a customer who will not make another purchase. Note that we define a dormant customer by using a simple statistical calculation that takes into account a store’s average number of purchases per customer, average time between multiple purchases by the same customer, and the total amount of time that has passed since each customer made their last purchase.
While it might sound complicated, when implemented for stores that have a sufficient number of customers and previous orders, this data construction methodology is not only simple to implement, but it also provides a robust method for a model to learn patterns across many different customers, products, and periods of time.
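The snapshot scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not our production code: the `Snapshot` class, the `build_snapshots` function, and the fixed `dormancy_cutoff` are hypothetical names, and the single cutoff stands in for the statistical dormancy calculation, which is not reproduced here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass(frozen=True)
class Snapshot:
    customer_id: str
    as_of: datetime   # only data strictly before this instant is visible
    is_dormant: bool  # True for the one extra "went dormant" snapshot


def build_snapshots(
    orders: Dict[str, List[datetime]],
    now: datetime,
    dormancy_cutoff: timedelta,
) -> List[Snapshot]:
    """One snapshot just before each purchase, plus one per dormant customer."""
    snapshots: List[Snapshot] = []
    for customer_id, order_times in orders.items():
        ordered = sorted(order_times)
        for purchase_time in ordered:
            snapshots.append(Snapshot(customer_id, purchase_time, is_dormant=False))
        # Assumption: a customer is dormant once their last purchase is older
        # than a fixed cutoff. The real rule also uses store-level averages.
        if ordered and now - ordered[-1] > dormancy_cutoff:
            snapshots.append(Snapshot(customer_id, now, is_dormant=True))
    return snapshots


orders = {
    "alice": [datetime(2021, 1, 5), datetime(2021, 3, 1)],
    "bob": [datetime(2020, 6, 1)],
}
snaps = build_snapshots(
    orders, now=datetime(2021, 3, 15), dormancy_cutoff=timedelta(days=90)
)
```

In this example, alice gets two purchase snapshots, while bob gets one purchase snapshot plus one dormant snapshot because his last order is more than 90 days old.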
Now that we know what our observations represent, we need to very specifically define the features that will be associated with those observations. This is where data science becomes as much of an art as it is a science. In order for us to define features that have predictive value, we need to understand enough of the business that we’re working with to identify the types of features that can lead to a model being able to learn future customer behavior.
We focused primarily on quantifying key aspects of customer behavior and combining that with a few demographic metrics for that customer. You can find a complete list of the features we use here.
You’ll note that the majority of features are specific to a customer’s previous purchases - we look at things like how much a customer has spent, what they’ve bought, how often they buy, how long they have been a customer, and what their favorite products are. When possible, we also take into account what products they’re browsing on a store’s website.
We account for previous behavior by using counting features, which let us calculate things like the total amount spent over the past week, month, or year, and which allow a model to identify patterns across time without the need for constructs like LSTMs or other recurrent networks. This gives us the flexibility to experiment with different learning algorithms while still including metrics of a customer’s behavior over time.
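To make the idea of counting features concrete, here is a small sketch that totals a customer’s spend over several trailing windows as of a snapshot time. The window lengths, the `spend_counting_features` function, and the feature names are illustrative assumptions, not our actual feature definitions.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

# Hypothetical trailing windows, roughly a week, a month, and a year.
WINDOWS = {
    "7d": timedelta(days=7),
    "30d": timedelta(days=30),
    "365d": timedelta(days=365),
}


def spend_counting_features(
    purchases: List[Tuple[datetime, float]],
    as_of: datetime,
) -> Dict[str, float]:
    """Total spend in each trailing window that ends strictly before `as_of`."""
    features: Dict[str, float] = {}
    for name, window in WINDOWS.items():
        start = as_of - window
        features[f"spend_{name}"] = sum(
            (amount for ts, amount in purchases if start <= ts < as_of), 0.0
        )
    return features


purchases = [
    (datetime(2020, 8, 1), 50.0),
    (datetime(2021, 2, 20), 35.0),
    (datetime(2021, 3, 10), 20.0),
    (datetime(2021, 3, 15), 99.0),  # happens at the snapshot time, so excluded
]
features = spend_counting_features(purchases, as_of=datetime(2021, 3, 15))
# features == {"spend_7d": 20.0, "spend_30d": 55.0, "spend_365d": 105.0}
```

Because each window becomes an ordinary numeric column, the same dataset can be fed to any tabular learning algorithm.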
We combine these behavioral metrics with higher-level demographic data like a customer’s geographic location and gender. Finally, we also include some simple product information about a customer’s favorite product, and we’re experimenting with incorporating some simple NLP tactics for their favorite product’s name.
Ultimately, once we have identified the features that we want to use, we track them in the table on the page linked above, and as we experiment with adding new features, we incorporate them into the table.
With our feature list fully defined, we need to ensure that we are collecting and storing data properly so that our data structuring process can later take source data and organize it into a properly structured dataset.
We accomplish this through the combination of webhooks, which collect information about updated orders, and a scheduled job, which is responsible for periodically gathering data from external sources and recording them in our data warehouse. For the most part, these processes work in append-or-overwrite mode on the datasets in our data warehouse. This allows us to continuously gather information as time goes on, which our data structuring process can take advantage of as it constructs our training and prediction datasets.
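As a rough illustration of append-or-overwrite behavior, the sketch below uses an in-memory dictionary as a stand-in for a warehouse table. The `upsert_orders` function and the `order_id` and `updated_at` fields are hypothetical, and ignoring stale updates is an assumption made for this example rather than a description of our pipeline.

```python
from datetime import datetime
from typing import Dict, Iterable


def upsert_orders(table: Dict[str, dict], incoming: Iterable[dict]) -> Dict[str, dict]:
    """Append unseen orders; overwrite stored orders with same-or-newer versions."""
    for row in incoming:
        existing = table.get(row["order_id"])
        # Assumption: an update older than what is already stored is ignored.
        if existing is None or row["updated_at"] >= existing["updated_at"]:
            table[row["order_id"]] = row
    return table


table: Dict[str, dict] = {}
# A webhook delivers a new order (append).
upsert_orders(table, [{"order_id": "A1", "total": 20.0, "updated_at": datetime(2021, 3, 1)}])
# A later delivery updates that order (overwrite) and adds another (append).
upsert_orders(table, [
    {"order_id": "A1", "total": 25.0, "updated_at": datetime(2021, 3, 2)},
    {"order_id": "B7", "total": 10.0, "updated_at": datetime(2021, 3, 2)},
])
```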
As we continue to improve our processes, we occasionally identify new features that we would like to experiment with which we may not already have the raw data for. In cases like these, we identify the types of raw data we’ll need to create our features, create the code needed to collect that data from an external source, and then update our scheduled jobs to collect that data as needed.
For example, we recently incorporated features related to a customer’s on-site browsing behavior into our models. This was not information that was originally present in the main data source that we were using to collect order and transaction data. However, we noted that most of our customers use Klaviyo, which does collect this data (when a customer configures it to do so), and makes this data available via their API.
We were then able to update our scheduled job to collect this data from Klaviyo on a periodic basis and store it in our data warehouse. This makes the data available to downstream processes, in particular the dataset structuring process, which uses it when preparing a dataset for a model to either train on or predict from.
Having completed the work needed to store and structure this data, we are able to implement the training process relatively quickly. We refresh our models on a monthly basis, training them from scratch with a complete dataset every time. Fortunately, since we use BigQuery for our data warehouse, this is a relatively simple process.
Since all of our data is stored in BigQuery, the process of training a model is as simple as:
Step (2) above does require a lot of SQL work, especially given the large number of features we use and the dynamic nature of some of those features. For example, we use counting features to keep track of how often a particular customer buys a particular product, and since each of our models is created specifically for one store, we need to construct these features dynamically. Also of note in step (2): when we create labels for each observation, we look forward in time for each customer to find the next product they purchased. That forward-looking data must be used only for the label and never for the features, since it’s important to avoid lookahead bias and leakage when creating datasets like these.
Depending on the number of customers and orders for each store, the model training process can take as little as a few minutes or as much as a few hours. Fortunately, using BigQuery + SQL to train and store our models means that we don’t have to worry about scale, and it lets us take advantage of BigQuery’s built-in data transformation and data pipelining features, which means less code and maintenance for us.
In order to evaluate a model, we simply need to execute a single SQL statement, which will return a variety of evaluation metrics. Depending on the value of those metrics, we may choose to retrain the model with a different learning algorithm, or keep the model that was just trained. We’ve found that when using more complicated models like XGBoost or DNNs, sometimes it helps to revert to a simpler model like simple regression. This is especially true when stores have less data than we’d like to see when training more complicated models.
Note that at this point, we could also do dynamic hyperparameter optimization. At the moment, we have taken an approach that is less computationally intensive, optimizing parameters more generically first and then training models with those parameters. However, that is an area of future improvement that we’ll take on as we grow.
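The labeling rule can be illustrated outside of SQL with a short Python sketch. The `split_at_snapshot` function and the `NO_PURCHASE` label are hypothetical names for this example. The point is only that features may use purchases strictly before the snapshot time, while the label comes from the first purchase at or after it.

```python
from datetime import datetime
from typing import List, Tuple

Purchase = Tuple[datetime, str]  # (timestamp, product_id)

NO_PURCHASE = "__none__"  # hypothetical label for "bought nothing afterwards"


def split_at_snapshot(
    purchases: List[Purchase], as_of: datetime
) -> Tuple[List[Purchase], str]:
    """Split a customer's purchases into feature history and a training label."""
    ordered = sorted(purchases)
    # Features may only use what was known strictly before the snapshot.
    history = [p for p in ordered if p[0] < as_of]
    # The label is the first product bought at or after the snapshot.
    future = [p for p in ordered if p[0] >= as_of]
    label = future[0][1] if future else NO_PURCHASE
    return history, label


purchases = [
    (datetime(2021, 1, 5), "hat"),
    (datetime(2021, 2, 9), "scarf"),
    (datetime(2021, 3, 1), "gloves"),
]
history, label = split_at_snapshot(purchases, as_of=datetime(2021, 2, 9))
# history == [(datetime(2021, 1, 5), "hat")] and label == "scarf"
```

Keeping the split in one place makes it harder for a feature query to accidentally read data from after the snapshot.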
Once a model is trained and stored on BigQuery, it can be used to generate predictions on the latest information that we have for each customer. We use the same dataset structuring process and queries that we used when training the model, except when we use the model for predictions, we ensure that only the most up-to-date snapshots for each customer are used as input to the model.
On a daily basis, we use our trained models to create these predictions. The model can generate predictions for each customer, outputting probabilities for each customer and the top products they’re likely to buy next.
We then filter out predictions for products that a customer has already purchased (to ensure we are creating truly new cross-sell product recommendations), and we also filter out predictions where the probability of purchase is below our dynamic threshold (constructed by using the set of predictions themselves). Finally, we store the results for later use.
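The two filters can be sketched as follows. The `filter_predictions` function is hypothetical, and the threshold rule here (the mean of all predicted probabilities) is purely an assumption for illustration. The article only states that the threshold is derived from the predictions themselves.

```python
from statistics import mean
from typing import Dict, List, Set, Tuple

Prediction = Tuple[str, str, float]  # (customer_id, product_id, probability)


def filter_predictions(
    predictions: List[Prediction],
    purchased: Dict[str, Set[str]],
) -> List[Prediction]:
    """Drop already-purchased products and predictions below a dynamic threshold."""
    if not predictions:
        return []
    # Assumption: the dynamic threshold is the mean predicted probability.
    threshold = mean(prob for _, _, prob in predictions)
    return [
        (customer, product, prob)
        for customer, product, prob in predictions
        if product not in purchased.get(customer, set()) and prob >= threshold
    ]


predictions = [
    ("alice", "hat", 0.60),
    ("alice", "scarf", 0.30),
    ("bob", "hat", 0.10),
    ("bob", "gloves", 0.40),
]
purchased = {"alice": {"hat"}}
kept = filter_predictions(predictions, purchased)
# kept == [("bob", "gloves", 0.40)]
```

Here the threshold works out to roughly 0.35, so the two low-probability rows are dropped, and alice’s hat prediction is dropped because she already owns one.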
Predictions by themselves are useless without productizing them and making them available to our users, so whereas the data science work may be completed, there’s still more work to be done to make these predictions usable. As mentioned above, we use these predictions in two ways.
First, we create segments of customers that are likely to buy each specific product. We use a scheduled job to aggregate the predictions generated by the model and organize the results into groups of customers that are likely to buy each product. We then make these available within each user’s account as a list of automatically generated segments. These segments are kept up-to-date as predictions are made, and customers move in and out of these segments as needed. If our customers have synced these segments to their marketing tools (e.g. an email tool like Klaviyo), our system will ensure that those segments are also kept up-to-date.
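A simplified version of this aggregation, along with the bookkeeping needed to move customers in and out of segments, might look like the sketch below. The `build_segments` and `segment_changes` functions are hypothetical, and the actual sync to a marketing tool is deliberately left out.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Prediction = Tuple[str, str, float]  # (customer_id, product_id, probability)


def build_segments(predictions: List[Prediction]) -> Dict[str, Set[str]]:
    """Group customers by the product they are predicted to buy."""
    segments: Dict[str, Set[str]] = defaultdict(set)
    for customer_id, product_id, _prob in predictions:
        segments[product_id].add(customer_id)
    return dict(segments)


def segment_changes(
    old: Dict[str, Set[str]], new: Dict[str, Set[str]]
) -> Dict[str, Dict[str, Set[str]]]:
    """For each product, list the customers to add to or remove from its segment."""
    changes = {}
    for product in old.keys() | new.keys():
        before = old.get(product, set())
        after = new.get(product, set())
        changes[product] = {"add": after - before, "remove": before - after}
    return changes


yesterday = {"hat": {"alice", "bob"}}
today = build_segments([
    ("bob", "hat", 0.5),
    ("carol", "hat", 0.7),
    ("alice", "scarf", 0.6),
])
changes = segment_changes(yesterday, today)
# changes["hat"] == {"add": {"carol"}, "remove": {"alice"}}
```

Computing only the differences keeps a downstream sync small, since unchanged memberships never need to be resent.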
Next, we take the top product recommendations for each customer and we sync them to their profiles on Klaviyo. This allows our customers to incorporate these product recommendations into their emails, ensuring they can create dynamic, personalized emails for their customers.
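Selecting the top recommendations per customer is a straightforward ranking step. The sketch below assumes a hypothetical `top_recommendations` function and a default of three products per customer. The article does not say how many products are actually synced, and the call to Klaviyo itself is not shown.

```python
import heapq
from collections import defaultdict
from typing import Dict, List, Tuple

Prediction = Tuple[str, str, float]  # (customer_id, product_id, probability)


def top_recommendations(
    predictions: List[Prediction], n: int = 3
) -> Dict[str, List[str]]:
    """Return each customer's n most probable products, best first."""
    by_customer: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for customer_id, product_id, prob in predictions:
        by_customer[customer_id].append((prob, product_id))
    return {
        customer: [product for _prob, product in heapq.nlargest(n, scored)]
        for customer, scored in by_customer.items()
    }


predictions = [
    ("alice", "hat", 0.6),
    ("alice", "scarf", 0.3),
    ("alice", "gloves", 0.5),
    ("bob", "hat", 0.2),
]
recs = top_recommendations(predictions, n=2)
# recs == {"alice": ["hat", "gloves"], "bob": ["hat"]}
```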
Our hope is that this article gave you a sense for the type of work that goes into creating personalized recommendations for each customer. In the future, we hope to support the use of these personalizations across all marketing channels and on-site in a customer’s store. We continue to research and improve not only new features but new models and new methods for improving our recommendations. Our goal is to bring the same powerful A.I. capabilities that we built at Twitter to all ecommerce brands, even those that don’t have a data science team. If you’re interested in learning more about what we do, feel free to check us out at www.apteo.co or reach out to me directly at firstname.lastname@example.org.
Shanif Dhanani is the co-founder & CEO of Apteo. Prior to Apteo, Shanif was a data scientist and software engineer at Twitter, and prior to that he was the lead engineer and head of analytics at TapCommerce, a NYC-based ad tech startup acquired by Twitter. He has a passion for all things data and analytics, loves adventure traveling, and generally loves living in New York City.