How To Create The Best Dataset For Predictive Insights

Shanif Dhanani
Shanif Dhanani
Co-Founder and CEO, Apteo


  • Predictive analytics, backed by machine learning, can help you find hidden patterns and drivers of your KPIs, but you’ll need to create the right dataset to make the most of it
  • Horizontal tables, where each row represents an example that a machine can learn from, and each column represents something unique we know about that example, is a great way to structure your data for predictive analytics
  • For a hypothetical e-commerce store, a horizontal table could have behavioral attributes with counting features (i.e. how many times did the user make a purchase over the last one month, last three months, etc), account-specific data (i.e. how long ago did the user sign up), and demographic data (age, gender, etc)
  • When creating a horizontal table, make sure to create a column that represents the value of the KPI that you care about - sometimes this will require you to aggregate and transform data appropriately
  • Apteo’s predictive insights feature can help to automate the process. We’re also rolling out templates to make it easier to solve common problems across SaaS, e-commerce, and marketing businesses. Contact us if you want to learn mor
  • Download a sample dataset template here


The best businesses today are using data and machine learning to create market dominating products. 80% of time spent on Netflix can be attributed to their data-driven recommendation system. More than a third of all sales on Amazon are driven by their product recommendation system. And over $70B of ad revenue was generated by Facebook’s AI-driven algorithms in 2019.

While you don’t need to be a tech giant to create amazing products and make well-informed decisions, you do need to start leveraging data in newer and more effective ways. For much of the past decade, this meant hiring an army of data scientists, machine learning engineers, and software developers to help you build out an efficient process for collecting, analyzing, and operationalizing data.

But today, there are tools like Apteo that help you dive deeper into your KPIs and create personalized, automated, data-driven processes without you needing to know a thing about data science. Specifically, features like our predictive insights module can allow employees in your analytics and revenue generation departments to quickly hone in on what’s making the biggest impact on your key metrics, like revenue or user retention, and show you what to focus on to improve your operations.

Predictive Analytics

Business vector created by freepik

The analytics world moves quickly. At one point, creating historical charts of your metrics was enough for a manager to feel like they had a handle on what was going on with their business. Today, you can dive deeper. The idea behind predictive analytics or predictive insights is that you can use sophisticated analytical techniques to break down, at a very granular level, the types of customers or behaviors that lead to growth in your key metrics, and use learnings from those techniques to optimize your business moving forward.

For example, let’s say you run an e-commerce app where you sell clothes and you want to understand why you’re losing customers (an example of churn that we’ll use throughout the rest of this article). A few years ago, you might have been content to see a couple of graphs that displayed your churn rate averaged across all your users, with an optional filter to break things down by some simple demographics, like age or gender. 

But you may never have known that the customers who are male, who order once a month, and who live in New York are the most likely to stick with you for at least a year, while customers who are female, order twice a month, and spend $500 per order are twice as likely to stop purchasing from you in a year than all your other customers.

Today, you can use predictive analytics, which are basically just analytical techniques that use machine learning to surface the most relevant behaviors and segments of your data that impact your KPIs, to surface and act on these sorts of insights (thus the name of our feature: “predictive insights”).

In order to get the most out of predictive analytics, you’re going to need to provide the right type of data so that machines can learn what’s really going on with your business. Sometimes this can be a challenge.

As good as machines are at finding patterns, they still have no common sense when it comes to what your data actually means. So it’s our job to make it as easy as possible for machines to give us what we need. When it comes to predictive analytics, this means creating well-structured, horizontal tables.

Horizontal Tables

Time vector created by stories

When you want a machine to learn something, you need to give it several examples that it can learn from. The idea is that you provide the examples (usually as rows within a table) and different attributes that help to describe those examples (usually as columns with a table) and the machine will use the attributes to learn whatever it can about each example, usually with respect to some important event that happened (in this case, whether or not a user churned).

In theory, this is straightforward. But in practice, creating a table like this, which has lots of different attributes for a given example (also known as a horizontal table or a wide table) can be tricky. Creating the right attributes and examples can require an understanding of both your business and the data that you use to track it.

To help make this clearer, let’s go back to the e-commerce example from above. Let’s say that you have a list of all your transactions, along with the customer ID, the purchase amount, the date of purchase, and maybe several other things that you collect for every transaction. Now let’s assume you want to use this data to help a machine identify who’s likely to churn.

Your first instinct might be to export this data from your sales processing system, upload it to a predictive insights application, click a button, and hope for the best. But when you do this, you’ll be hard-pressed to get anything useful back from the machine.

That’s because the machine doesn’t necessarily know that you have a column in your dataset that has a unique ID of your customers, and that the same customer can by many times over the course of a year, or that it needs to look for the same customer ID across many transactions to keep track of when the last time a customer made a purchase.

We’ll have to transform the data a bit so that all of this information is presented in a way that the machine can use to find patterns. Essentially, what we’ll need to do is create a horizontal table that has customers as rows, and lots of different things that we know about customers as columns. We’ll also need one column in the dataset (traditionally the last column) that tells the machine what happened with this customer - did they churn or not?

In the sections below, we’ll go over some of the nuances that you’ll need to be aware of when creating this dataset.

Churn Analysis - An Example Of A Horizontal Table

The problem

Before we jump in, let’s make sure we have a really good understanding of the problem we’re trying to solve. We want to identify the characteristics of our users that are associated with higher-than-average churn. In order to do that, we need to create a dataset that will make it easy for a machine to learn about what leads to high churn rates. 

We’ll represent this dataset as a single table, where each row will represent a user, and each column will represent something we know about the user, and the final column will represent whether or not the user churned. We’ll also assume that we’re starting out with a basic data extraction from our sales system, which contains transactions, customer IDs, and similar types of data points.

Defining churn

The first thing we should be clear about is what it means to “churn.” Different businesses think about churn differently. Usually, in the case of e-commerce, a churned user is someone who has not made a purchase in some specific amount of time.

Every business will care about different timeframes, and most will actually want to understand churn across multiple timeframes. For the purposes of this article, we’ll consider a churned user to be someone who made their last purchase more than six months ago.

Planning the dataset’s structure

The first thing we should think about is the high-level structure of the dataset. We know that we have transaction-level data, but that we’ll need to get our dataset to describe users, not transactions. We know that we’ll need to have different attributes for users, and that we’ll need to have a final column that describes whether the user churned or not.

In order to reconcile what we know about transactions (single point-in-time events) with users (people that perform different actions continuously over time), we’ll have to start transforming some of our data.

We also know that we may have some additional information about our users that is not stored in our transaction system, for example, demographic data, data about whether a user opens our marketing emails or not, data about when they first registered with our service, etc.

Given these different types of data points, we know we’ll need to do some sort of transformation and combination for some of our data points. For example, we may want to take the average transaction size for a user over the past one month, three months, six months, etc. We may also want to record the maximum transaction size for each user over the past year.

We also know that we won’t need to aggregate other types of data, for example, how long ago the user first joined, or what their zip code or gender is.

Planning all of the items we want to represent ahead of time allows us to get a feel for the type of data transformation and aggregation work we’ll need to do going forward.

Let’s start creating a list.

Attributes that describe our customers

People vector created by pch.vector -

When you create a dataset that will be used to analyze your business, it’s important to use your domain expertise to identify the most relevant data points that you want to include. In the case of most apps, websites, or other services where individual people are involved, we know that we can include certain types of data that will likely help us manage our business better.

We’ll go over three types of data below:

  1. Behavioral characteristics
  2. Account-specific information
  3. Demographic information

Based on our experience, when it comes to people, the most predictive things we can include are behavioral characteristics. What did someone do in the past? How often did they do it? To what extent did they do it? These sorts of attributes are surprisingly useful when it comes to predicting what a user will do in the future.

We can additionally include account-specific information about users that may not be directly related to their behavior, but does provide supplementary information about who they are with respect to our service. For example, this could include information about how often they open our marketing emails, or when they signed up with our service.

Finally, the most obvious type of data is demographic information. Attributes like age, gender, income bracket, and similar types of data have long been used in the world of marketing to segment and target users.

We’ll go over each of these types of data points in the sections below.

Behavioral characteristics

Behavioral data is frequently used to understand and predict what users will do in the future. Because people change their behavior very infrequently, this type of data can be especially relevant and useful for machines to identify and predict aspects of future behavior by individuals.

As we mentioned above, users can take different actions over time. But because our dataset is represented as a single, horizontal table, we need to figure out how to summarize a user’s actions over time into several individual columns.

One of the best ways to do this is by using what’s known as counting features (in machine learning, the word “feature” is just another name for “factor” or “attribute”). Fortunately, the idea behind counting features is pretty simple: create a list of the important events that a user can do (login, make a purchase, buy a pair of jeans, etc), create a list of the different time period that you think will make a difference (previous day, previous week, previous month, etc), and just count how many times each user performed each event over each prior time period.

In addition to counting features, you can take averages where relevant. For example, you may want to take the average transaction size for a user over the previous day, previous week, previous month, etc. and create a new column for each of these values. Doing this will allow a machine to find patterns on how users transaction amounts over time may or may not be relevant.

Here’s an example of the types of attributes we might want to create for our hypothetical e-commerce store:

  • How much did each user buy in the past month, three months, six months, and year
  • How often does each user login to the site
  • What’s the average transaction size for each user
  • How much has their average transaction amount increased or decreased in the past month, three months, six months, and year
  • If this is a subscription box service, how many times did they skip their last box, their last two boxes, and their last three boxes
  • On average, how many t-shirts do they buy per shipment, how many jeans, how many blazers, etc.
  • How many times did they reach out to the customer service team in the past month, past three months, past six months
  • What’s their average rating for these interactions
  • How many friends have they invited over the past month, past three months, and past six months

You can create counting features for any important event within your product or service. You can even do this across all the events that occur within your product. Predictive analytics algorithms will be able to parse through even a large number of attributes to find the most relevant items.

Account-specific information

While behavioral data is great for understanding and predicting future user behavior, it can be supplemented by additional information that you have about how a user has interacted with your business across many different channels. 

For example, it may be helpful to take into account how long ago a user created an account on your system (note that whenever you have a data point that is represented as a date, it can be helpful to turn it into a number that represents how long ago that date occurred, since algorithms aren’t great at understanding dates by themselves). It could also be helpful to know, on average, how often your users read your emails.

These sorts of items can help a predictive analytics algorithm understand how to better contextualize the entirety of the data that you provide it. Below is an example of some of the account-specific attributes we could include for our e-commerce store:

  • If the product provides multiple tiers, what tier are they on
  • How many marketing emails have they received
  • What’s their email open rate
  • How long ago did they sign up
  • How long ago did they convert
  • How did they find out about the service

Demographic information

Demographic information provides some basic stats about people - their age, location, gender, etc. These have been used in marketing for quite some time now as a way to provide basic segmentation and targeting of users. They can be helpful for a variety of different problems, and will likely be useful for the sake or understanding churn in an e-commerce store. We’ve provided a small list of suggested demographic information below, but you can supplement these with any number of additional pieces of information that you have:

  • Age
  • Gender
  • City, state, zip code
  • Income bracket

The final step: creating the KPI column

Once you have the majority of your dataset ready, the final step is to create the column that represents the metric that you want to understand. In our example, that would be churn, but it can be any metric that you want to have a better understanding of.

As we mentioned above, whether or not a user has churned is dependent on both what the user has done (made a purchase or not) and a timeframe. In our case, we want to consider a user as churned if they made their last purchase more than six months ago.

Once you have that item included in your dataset as a column, you’ll be ready to throw your dataset into a predictive insights feature and start diving deeper into understanding what drives churn.

Data formatting

When you create your dataset, it’s also a good idea to standardize the formatting of your data. Here are a couple of examples to keep in mind:

  • If you have a currency column, just use the value itself, and don’t include the currency symbol (i.e. 40.21 instead of $40.21)
  • Same for when you have percentages
  • When you have missing values, try to fill them with reasonable values, or replace them with some sort of special token like [MISSING], which you’ll know how to interpret when you see a report with your results
  • Replace dates with their respective months, and when important, quarter, season, and any other cyclical/seasonal attribute
  • Double check any of your numerical columns to see if there are any crazy values in there that are either too high, too low, negative, or not actually numbers
  • If you have any columns where there are mostly unique values (like a customer ID or a category that only appears a few times), you may consider removing them, since they may not provide a lot of predictive power (and even if they do,  you would still need to figure out what to do with these values if they do provide predictive power, and oftentimes you can’t do anything with this information)

If you use Apteo, we automatically handle many of these issues (and more) for you automatically, but it doesn’t hurt to have a clean dataset.

The mechanics of creating a dataset

People vector created by pch.vector -

In the section above, we gave an example of what to do and why to do it, but we didn’t talk about how. We know that data will likely be stored across multiple tables within a database, and oftentimes, also across multiple systems.

As you can imagine, it could take a bit of planning and effort to create the ideal horizontal table to identify predictive insights on your KPIs, but it’s worth it. Just like anything else, we recommend starting small and only adding additional complexity as you need to.

In the e-commerce example, the first thing you might do is to get a list of all your active users, which might require a SQL query or maybe a data dump from a third party system. Once you have a list of your users, you could start small and collect 3-5 statistics for each one, structure them as columns within a CSV or Excel file, and then run that through Apteo’s predictive insights feature and see what you get. You might have enough to get started, or you may want to enhance your dataset with additional attributes.

You may need to write several SQL statements to aggregate and join your data, or write a quick script to extract data from multiple sources and then join them back together into a single dataset.

The key thing to consider is to plan ahead, start small, work iteratively, and make sure to identify what you’ll do with your reports once you get a feel for what drives your key metrics. If you need help or just want to get a better feel for what to do, get in touch and our data specialists will be happy to learn about your use case and provide guidance.

For those that are interested, we’ve provided a starter template, which you can download here.

Predictive analytics for common SaaS, e-commerce, and marketing problems

We know that creating a horizontal table for predictive analytics may not be immediately intuitive, but once you get the hang of it, it comes naturally. The approach is usually the same for most problems:

  1. Identify your core unit of representation that will act as an example that a machine can learn from (in our case, this was a single customer)
  2. Identify the attributes you want to use to describe that example
  3. Extract attributes about the example and represent them as columns
  4. Extract the KPI you care about and represent it as a column
  5. Use predictive analytics to analyze your KPI

If this still sounds intimidating, we’re here to help. Not only do we help to automate much of the work above, but if you’re working at a SaaS, e-commerce, or marketing company, we’re creating templates to solve your most common problems (like churn, retention, and acquisition). If you’d like to be notified when each of these templates is available, sign up for our newsletter, and if you’d like to request a template for your team, just let us know the details.

In any case, best of luck with your efforts, we’re here to help if you need, and if this guide was helpful for you, drop us a line, we’d love to hear from you, thanks!


Header image created by freepik

About the Author

Shanif Dhanani
Shanif Dhanani
Co-Founder and CEO, Apteo

Shanif Dhanani is the co-founder & CEO of Apteo. Prior to Apteo, Shanif was a data scientist and software engineer at Twitter, and prior to that he was the lead engineer and head of analytics at TapCommerce, a NYC-based ad tech startup acquired by Twitter. He has a passion for all things data and analytics, loves adventure traveling, and generally loves living in New York City.

More from our Blog

Get Started with Apteo for Free

Learn how 1,000+ teams are using Apteo as their lorem ipsum platform.