Data transformation doesn't have to be a complex subject.
We’ve all heard about how AI can boost revenue, cut costs, and drive business transformation. That being said, AI isn’t step one. AI is a data-hungry set of tools, and “data transformation” is needed to get data in the right format for AI and data science.
More formally, data transformation is the process of converting data from one form to another. The data transformation process may include data blending, data merging, and data integration.
Since real-world data typically sits in many different areas—whether it’s Salesforce, Mailchimp, Google Analytics, Stripe, or any of a million other tools—data merging is commonly used to combine data. This is also referred to as data discovery.
For example, working at HPE, I was tasked to identify cross-sell opportunities, like selling servers alongside installation services and high-tier support. However, this data sat in several different files, so I had to find a shared column to merge the data on, such as the sales representative IDs. Further, I used “data blending” to create a larger, master dataset . When two datasets have the same header names, they can be concatenated instead of merged.
To give another example, analyzing student success data at Champlain College, I needed to merge silos of student datasets on a shared column, like student ID. You’ll almost always use a unique “ID” column to merge data on, but this ID column won’t be used as an attribute when finding predictive insights.
Often, data can be analyzed without much data transformation. However, to find meaningful insights, the more data you have on your KPI, the better.
For instance, continuing the HPE story, I could have picked a single sales file, say from one geography and one time period, to find cross-selling opportunities. You probably see the issue: With less information, I’ll find fewer insights. By combining data across geographies and timeframes, I can get a better picture of cross-selling opportunities.
The same holds true for analyzing student data. With a dataset of students from just one semester, for instance, I might find some insights into why students drop out, but it’d be better to have a much larger dataset.
In general, AI model accuracy is positively correlated with data quality and data quantity. Data transformation is used to create larger, high quality datasets.
Real-world data is notoriously messy. Getting it into the right format can take time, patience, and a basic understanding of data. Even analysis tools that provide data cleansing and ETL, or “extract, transform, load,” often don’t provide enough automatic data wrangling for data analytics.
Predictive analytics needs data to be in a very particular format, which isn’t always intuitive to us humans. For instance, if one value in your dataset is “United States” and another is “USA,” then those will be read as seperate values. Data mapping is used to convert commonly incorrect values into the right format. Also, if you have a dollar value like “$35,” it might be read as a text datatype, instead of the integer that it is. Often, you’ll have duplicate data as well, which can easily be removed in Excel with the click of a button.
There’s a million ways in which your particular data might be messy, so check out our guide on preparing data for machine learning to learn some of the top tips for data cleaning.
We’ve covered a few of the major kinds of data transformation: Data merging, data blending, and data cleaning.
“Data transformation” is more commonly a statistics term that covers techniques like log transformation, square-root transformation, and arcsine transformation, but in the context of data analysis, data transformation is really any manipulation of data into another format.
Data transformation even includes integrating data into other workflows. For example, you can extract data from tools like Salesforce, Mailchimp, or a data warehouse using their native data export features. You can extract data from these tools and upload the files to Apteo for analysis, or connect the tools directly to Apteo.
If you’re dealing with big data, it’d be easiest to connect the data source directly, rather than exporting it and then uploading it. Directly connecting the data also enables you to analyze data closer to real-time through “refreshes” instead of having to regularly extract data and upload it yourself.
The first step to data transformation is getting a feel for the data. You want to understand what you’re trying to predict, or what the KPI is, and what attributes might impact that KPI.
Then, you need to analyze the data’s quality. As mentioned, data needs to be in a very particular format for analysis. You’ll want to check for null values, which decreases the quality of your data. Most real-world datasets have some missing data, but the less is missing, the better.
You’ll also want to make sure that the data values are in the right “data type.” For example, a dollar value should be written as a number like “45.50,” without the dollar sign. Further, a date column should be in an accepted date format, like MM/DD/YY, rather than some custom format like “24th Oct. ‘16.”
After your data contains a KPI, attributes, and has been cleaned, you’re ready to find predictive insights.
The simplest way to get started is just opening your data up in Excel, Google Sheets, SQL, or Python, and visually checking the data. Do you spot any obvious issues, like missing values, incorrect data types, or unnamed columns?
In Python, you can use the Pandas library to inspect and clean data, as I explored in another article. A commonly used method is .info(), which describes the dataset’s column names, missing value counts, data types, number of rows, and more.
Another powerful method is .replace(), which can be used to replace one value with another, such as “USA” to “United States.” This is also referred to as data mapping, as you’re mapping one set of values to another. Methods like .merge() and .concat() are used to combine different data sources.
Python has many thousands of libraries and methods you can use to explore data. Data scientists know how to use many of these tools, but you need just a few to fix bad data and turn raw data into insights.
As you’ve seen, transforming your data into the right format can be a tedious, time-consuming process, but it isn’t particularly challenging. Unless you’re dealing with a perfect dataset, it’s also a prerequisite to meaningful business intelligence.
If you don’t know how to use the data transformation tools we’ve discussed, like Excel, Google Sheets, or Python, you may want to watch a few videos to brush up on some basic skills like understanding data. There are even no-code tools to automatically clean your data, like Clean Spreadsheets and Pasteur.
Data management can be used to speed up the time it takes to get insights, as higher quality data at the source will mean less data cleaning down the line.
Frederik Bussler is the Founder of the Security Token Alliance. As a public speaker, he has presented for audiences including IBM, Nikkei, Slush Tokyo, and the Chinese government, and is featured in outlets including Forbes, Yahoo, Thrive Global, Hacker Noon, European Commission sites, and more. Recently, he represented the Alliance as a V20 delegate.