What is data munging? Dive deep into this nuanced topic to understand how to clean your data.
Predictive analytics is used by leading companies like Facebook and Google to optimize their products and services. Even if you’re not aware of it, you use AI-driven products every day, from the search results on Google, to the emails that land in your inbox instead of spam, or even the songs recommended to you on Spotify.
However, to use predictive analytics in your organization, having just any dataset is not enough. You need a clean, prepared dataset, with a KPI column, and attribute columns that describe that KPI. In short, you need to do data munging.
Data munging is the process of turning raw data into an actionable format for analytics. Real-world data is notoriously messy, and machine learning is notoriously picky—data needs to be in a particular, clean format.
With the explosion of big data, data munging is more important than ever. Tons of data, sitting around and not being used, doesn’t help anyone.
Data is only useful after data munging, or once it’s in a format ready for analysis.
Data munging and data wrangling are often used interchangeably, although data wrangling is a far more common term.
Both terms refer to the manipulation of data to accommodate the needs of downstream analytics, and have no practical difference.
Data munging isn’t a nice-to-have, it’s a must-have. Not only does data munging open the door to the world of AI, but even simple analytics need clean data.
If you have messy data, visualizations and dashboards might not work at all, or could break at any moment. Further, messy data will make analyses flawed and inaccurate. Clean data is a prerequisite for accurate analytics, both traditional and predictive.
Perhaps the biggest challenge of data munging is that there are countless ways for data to be messy.
First off, there are many data types, like dates, text, integers, floats, and booleans. Each of these data types can have different kinds of errors. If a date is improperly formatted, it may be read as a text field instead of a date. A messy float, like “$3.50,” will likewise be treated as a text field due to the dollar sign. Even a boolean column could be improperly treated as a text column, if, for instance, there are multiple spellings of a condition, like “true” and “True.”
These kinds of errors barely scratch the surface, but you’re likely to only face a small fraction of potential errors in your data. Data munging is the process used to uncover these flaws and fix them, much in the same way that bugs are uncovered and fixed in code.
The data munging process begins with exploratory data analysis, which is when you get a general feel for the data.
Exploratory data analysis, or EDA for short, involves several techniques, including data quality analysis and attribute analysis. Data quality analysis refers to understanding flaws in the data, such as where the empty values are, where there are misspelled categories, where there are null values, where there are outliers and so on.
Once you understand the flaws in the data, you can fix them one by one. A common way to deal with missing values is to fill them with the mean or median value of that column.
EDA can also be used to find meaningful attributes. By creating various visualizations, you can see how the different columns are related to each other, and at a glance see what might impact the KPI. If a certain column seems irrelevant and unrelated to the KPI, now is when you remove that column.
After cleaning the data and determining what attributes are relevant, you can approach any analysis.
If you have a relatively small dataset, with tens of thousands of rows or less, virtually any data munging techniques can be done in Excel or Google Sheets.
Once you’re approaching a larger dataset, these tools start to seriously slow down and limit your workflow. In theory, Excel can work with up to a million rows, but in practice, it can be a pain to work with at that scale.
Instead, you might want to use a tool like Python’s Pandas library, which has countless ways to quickly and easily munge data.
If you’re not sure on how to use Excel, Google Sheets, Python, or other common tools, you may want to take some starter courses on understanding basic data cleaning and data processing techniques. There are also no-code data munging tools like Clean Spreadsheets and Pasteur.
Data munging used to be an even harder process than it is today. Older versions of Excel supported even fewer rows, and there wasn’t as much documentation around data munging tools. While analytics tools like NumPy and SQL have been around for a while, the Python Pandas library, enabling data munging on DataFrames, only came out in 2008.
Often, companies would need a trained data scientist or other access to data science expertise to do data munging. Dealing with missing data, and other complex data issues can still be a time-consuming process.
Today, no-code data munging tools are emerging to automatically munge your data for you, and these will expand dramatically in the future. No one likes data munging, yet countless employee hours are spent on it.
Tools that reduce the time for data munging will thus become more and more important, and will be integrated with traditional and predictive analytics tools, letting business users without data analytics experience find insights.
Frederik Bussler is the Founder of the Security Token Alliance. As a public speaker, he has presented for audiences including IBM, Nikkei, Slush Tokyo, and the Chinese government, and is featured in outlets including Forbes, Yahoo, Thrive Global, Hacker Noon, European Commission sites, and more. Recently, he represented the Alliance as a V20 delegate.