Getting your data ready for machine learning doesn't have to be hard.
Machine learning, unlike human learning, needs input data in a very particular format, and doesn’t have intuition about how to read data. To AI, the “United States” is different from the “US,” and “34 dollars” is different from “$34.”
Some learning algorithms also need a lot of data, preferably thousands of rows and many columns, or attributes, that are indicative of what we’re trying to predict.
First, let’s look at an ideal dataset, and then explore how to create one.
Suppose we want to predict employee attrition. Attrition is our KPI, and its values are binary: Yes or No. The other columns are our attributes, which are indicative of attrition. For instance, an employee with a low wage, low satisfaction, poor performance, and overtime is more likely to quit.
We always need a KPI column and attribute columns. Other KPIs could be customer churn, conversion, email opened, or any other metric we care about.
Notice that we have several different data types above. The attrition column is text, the satisfaction column has some decimal values, and the department column is categorical.
It’s fine to have different data types, as long as each column is just one data type. We don’t want mixed data types in one column. For instance, if attrition has a value called “0” or “False,” we’d want to cast those into the value “No” instead.
Further, we want to make sure that the same values have identical formatting and spelling. In the overtime column, for instance, “Yes” would be treated differently from “yes” or “Y.” The latter pair of values would need to be cast into the value “Yes.”
The KPI is the heart and soul of machine learning on tabular data. It’s what you’re trying to predict, so choosing the right KPI is crucial.
It should be a meaningful figure, not a vanity metric, and it needs to be something you’ve collected data on for a while. If you’re using tools like Google Analytics, you’re already tracking a ton of data, so it’s more of a business decision in choosing what metric matters the most.
For instance, Google Analytics tracks KPIs like the bounce rate, conversion rate, time on page, organic vs paid sessions, and many more. For many companies, conversion rate is the most important KPI, but you may also want to explore why some visitors immediately leave your site (bounce rate), which attributes impact the time on page, and so on.
Attributes are columns that are predictive of your KPI. If we take the earlier example of employee attrition, things like wage and overtime hours are clearly predictive.
However, something like “employee ID” would likely not be predictive of attrition — it’s just a unique number assigned to an employee, without underlying meaning.
There are broadly a couple of tactics we can use to find attributes for any KPI: research and data analysis. For instance, to find attributes for attrition, we could simply search “employee attrition with machine learning” on Google Scholar, and within seconds we’ll find over a dozen rigorously analyzed predictive attributes: age, tenure, pay, job satisfaction, employees’ perception of fairness, education, marital status, supervision, advancement, recognition, growth potential, burnout, and more.
Another tactic is to explore correlations in our dataset. If an attribute is strongly correlated with our KPI, then we’ll likely want to include it (with one caveat, that we’ll get into in the next section). If, on the other hand, the attribute is not correlated with our KPI, like an ID field, then we know to exclude it.
A common challenge in building machine learning problems is “leakage,” which refers to accidentally putting the KPI in an attribute column.
For example, suppose we wanted to predict death from heart failure. If we included an attribute like “discharge,” which refers to whether or not a patient left the hospital, then we’re leaking the KPI into our attributes. If the patient returned home, then they obviously didn’t die from heart failure.
If we wanted to predict lead conversion, and had an attribute called “money spent,” referring to how much a given customer spent, then we’re again leaking the KPI. A non-converted lead won’t have spent any money, and converted leads will have a non-zero value in that column.
We also want to analyze “compound” attributes to see if they contain our KPI and need to be dropped. For example, I recently built a model to predict whether sales deals would be win or lose in Apteo. The data contained a column called “sales velocity,” which is a compound attribute that includes the win rate. Again, this is an example of KPI leakage, so I dropped the sales velocity column.
In general, models can be improved with more data (more rows and more attributes) and higher-quality data.
Check out these techniques on structuring data to learn how to combine datasets. Excel calls combining datasets “consolidation,” and this can be done in any spreadsheet software or a programming language like Python or R.
At a high-level, if two datasets share a specific column, such as customer ID, you can “merge” on the shared column. If two datasets have identical column names, you can “concatenate” them, or simply paste one dataset at the end of another.
Machine learning isn’t magic, and anyone can reason with what data might be useful for a machine learning model. If a certain attribute might help you make a prediction, it would probably be useful for the model. If an attribute seems irrelevant to you, it probably is, but it wouldn’t hurt to check with research and analyzing correlations.
Frederik Bussler is the Founder of the Security Token Alliance. As a public speaker, he has presented for audiences including IBM, Nikkei, Slush Tokyo, and the Chinese government, and is featured in outlets including Forbes, Yahoo, Thrive Global, Hacker Noon, European Commission sites, and more. Recently, he represented the Alliance as a V20 delegate.