Back in 2014, the CEO of IBM made waves by claiming that “data is becoming a new natural resource.” There were those who agreed with her, and there were those who pointed out that her analogy was flawed (data is man-made, after all).
Regardless of how “natural” data is, organizations that effectively manage, process, and utilize data can reap significant economic gains, much like those that harvest and process natural resources. But in order for firms to effectively monetize data (whether in the form of increased profits or reduced costs), they must implement processes to systematically collect, organize, transform, and apply their data to real-world problems. After all, raw data is just like an untouched mine: full of potential but useless on its own.
Fortunately, organizations can take a staged and structured approach to making the most of their data. Though it does take some effort, by using a well-planned, measured, and thoughtful process, it’s possible for organizations to go from not even knowing what data they have to leveraging automated machine learning systems that make key business decisions seamlessly.
Data is all around us. Nearly every business process, product interaction, and customer communication generates some form of data. The most forward-thinking organizations record all of it, but many organizations don’t even think to keep track of what they have.
The first step in making use of data is to record it. Rather than discarding data as it comes and goes, organizations should begin recording it in a central location so it can be used when they’re ready to implement analytics and machine learning systems. When beginning to record data, organizations should keep track of both obviously valuable data and any data that could have some incremental value. It’s not always easy to see the value of a particular dataset at any given moment, but a long and complete record of historical data allows analysts, engineers, and managers to have a more complete picture of all important metrics when they need them. Furthermore, a dataset with a larger number of variables covering many different areas allows machine learning models to be trained more accurately.
In the tech industry, some of the most valuable data comes from product interactions. The process of capturing and recording these types of events is known as instrumentation. Behavioral data from product interactions can be extremely valuable because it can reflect where users find value, what leads them to convert into paying customers, or what causes them to engage more with a company. When companies instrument their products, they can leverage all of this data down the line to optimize key revenue-generating processes. In addition, this data can be used to track down bugs and optimize server capacity in the present.
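As a minimal sketch of what instrumentation can look like in practice, the snippet below builds a structured, timestamped event record for a product interaction. The event and property names are hypothetical, and in a real system the record would be sent to a logging pipeline or message queue rather than simply printed:

```python
import json
import time
import uuid

def track_event(user_id, event_name, properties=None):
    """Build a structured, timestamped event record for later analysis."""
    return {
        "event_id": str(uuid.uuid4()),   # unique ID for deduplication
        "timestamp": time.time(),        # when the interaction happened
        "user_id": user_id,              # who performed the action
        "event": event_name,             # what they did
        "properties": properties or {},  # any extra context
    }

# Record a (hypothetical) upgrade-button click
event = track_event("user_123", "clicked_upgrade_button", {"plan": "pro"})
print(json.dumps(event, indent=2))
```

Capturing events in a consistent shape like this makes it far easier to analyze them later, whether for revenue optimization, debugging, or capacity planning.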
While recording lots of data can be extremely valuable, it also comes at a cost. Depending on the frequency and velocity of incoming data, it may be necessary to implement a large-scale data warehouse to manage it all. Setting up such a system could require a large upfront investment of time, money, and engineering effort. With that said, low-cost, large-scale systems such as S3 and Google BigQuery do exist today, and working with these systems could significantly reduce the barriers to creating a data store.
Additionally, when setting up a data warehouse, it’s important to consider the schema of the data that will be stored. Structured data is usually better than unstructured data, and while it’s always possible to store data as it comes, imposing a structure on data before you begin recording it can save a lot of time and heartache down the line.
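To illustrate the value of defining a schema up front, here is a small sketch using SQLite as a stand-in for a production warehouse (the table and column names are illustrative). Once the schema is declared, the database itself rejects malformed records instead of silently storing them:

```python
import sqlite3

# Declare the schema before any data is recorded
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        event_id   TEXT PRIMARY KEY,
        user_id    TEXT NOT NULL,
        event_name TEXT NOT NULL,
        created_at TEXT NOT NULL
    )
""")

# A well-formed record is accepted
conn.execute(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    ("e1", "user_123", "signed_up", "2020-01-01T00:00:00Z"),
)

# A record missing required fields is rejected by the schema
try:
    conn.execute("INSERT INTO events (event_id) VALUES ('e2')")
except sqlite3.IntegrityError as err:
    print("rejected:", err)
```

The same principle applies to larger systems: constraints defined once at the schema level protect every downstream analysis from bad data.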
For organizations that are just getting started, it’s important to evaluate whether a standard database system will suffice, or if a more complicated system, composed of multiple data pipelines and ETL jobs, is necessary.
In any case, after an organization has begun to record its data, it can then move on to the next stage in its data journey: organizing it, categorizing it, and visualizing it.
When individual datasets are collected, they are frequently incomplete. User data may need to be supplemented with third-party demographic data, and tables describing individual transactions may need to be joined to information about individual users. Even VLOOKUPs in Excel can help join one set of data to another.
After organizations create individual structured records of their datasets, they should begin to create the structures needed to supplement these datasets with additional data. In database parlance, this is known as “joining” data. Joins can be done on an as-needed basis when creating graphs or visualizations, or they can be formally defined within the data warehouse itself, which allows these systems to enforce data integrity when new records are created.
When data is properly recorded and organized, it can be useful to categorize and catalog it so that it’s visible to everyone within an organization. Normally, datasets, schemas, and databases lie primarily within the realm of data and engineering teams. However, by defining, describing, and surfacing data to the whole organization, multiple teams and people can begin to determine how to best use it to improve a company’s processes.
Once data has been properly recorded, structured, and categorized, the next step in the process is to begin analyzing it, and the first step towards analyzing data is to visualize it. The human mind is designed to process visual imagery and can quickly identify relevant patterns and information from graphs.
Visualizations and aggregations of key variables can bring to light important patterns and insights that may have otherwise gone unnoticed. They can provide direction and guidance on data quality. They can surface trends. Fortunately, there is a large variety of tools that enable high-quality data visualizations, many of which are free or open source, including Apteo’s platform.
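Underneath most charts and dashboards sits a simple aggregation. As a sketch (with made-up data), the snippet below computes the average transaction amount per country, exactly the kind of grouped metric a visualization tool would then plot:

```python
from collections import defaultdict

# Made-up transaction records, as they might come out of a warehouse
transactions = [
    {"country": "US", "amount": 9.99},
    {"country": "US", "amount": 19.99},
    {"country": "DE", "amount": 4.99},
]

# Group by country and compute the average amount
totals, counts = defaultdict(float), defaultdict(int)
for txn in transactions:
    totals[txn["country"]] += txn["amount"]
    counts[txn["country"]] += 1

avg_by_country = {c: totals[c] / counts[c] for c in totals}
print(avg_by_country)
```

Once a metric like this exists, rendering it as a bar chart or time series is the easy part; the value comes from choosing which variables to aggregate and compare.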
The final stage in maximizing data is to use it to make decisions. Data scientists and machine learning engineers are crucial in this step. In a deck he produced about data products, Meninder Purewal, a data scientist at Bank of America, laid out a useful method for categorizing data scientists based on three different types of outputs: ad hoc, strategic, and modeling. Each type of output provides incrementally more automation and systematization. The most data-forward organizations, however, employ all three types of methodologies and data scientists, as there are benefits to incorporating all three types of analysis.
Ad hoc analysis
The most common method for analyzing data involves tactical reports that tend to be one-off answers to ad hoc questions posed by colleagues and managers, or exploratory investigations into new datasets. These types of reports are generally supported by graphs and tables and aren’t immediately scalable. While these analyses may not sound high-impact, they serve an extremely important purpose. They can usually provide enough information for managers to make important decisions, or at the very least, they provide guidance on what additional analyses to conduct next.
Strategic analysis

The next level up involves making strategic, high-impact, broad-reaching decisions by using statistics, A/B experimentation, simulation, and advanced analytics (sometimes even including predictive analytics). Data scientists who work on strategic projects like these are expected to understand business concerns, possess excellent communication skills, and have a breadth of different techniques at their disposal, along with the knowledge of which one to use in any given scenario.
Strategic analysis projects can take weeks or months to complete and are expected to produce in-depth analysis and conclusive direction on where to focus efforts. When managed properly, these types of projects can provide significant value to organizations. However, due to their complexity, combined with an occasional lack of direction and clarity around goals and objectives, they have a high probability of failure and must be managed carefully by experienced analytics managers.
Modeling and automated decision-making
The final level of data-driven decision-making involves the use of machine learning models to make decisions, either for strategic reasons or at scale and on a continuous basis. Machine learning engineers can leverage large amounts of collected data to automate decisions that are too frequent, too complicated, or otherwise too difficult to make manually. The most sophisticated companies can fully automate entire portions of their products and decision-making processes, enabling them to be completely data-driven organizations.
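A toy sketch of what automated decision-making can look like: a scoring function that approves or rejects an action for each user. Here the feature names are illustrative and the weights are hard-coded; in practice they would come from a model trained on the historical data collected in the earlier stages:

```python
# Illustrative weights; a real system would learn these from data
WEIGHTS = {"past_purchases": 0.6, "days_since_signup": -0.01, "support_tickets": -0.2}
THRESHOLD = 0.5

def score(features):
    """Linear score combining a user's feature values."""
    return sum(WEIGHTS[name] * value for name, value in features.items())

def decide(features):
    """Automated decision for a single user; runs the same way at scale."""
    return "approve" if score(features) >= THRESHOLD else "reject"

decision = decide({"past_purchases": 3, "days_since_signup": 30, "support_tickets": 1})
print(decision)  # "approve"
```

The key property is that `decide` needs no human in the loop: once the model is trained and deployed, the same logic can run on every user, every time, continuously.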
Unlike our natural resources, the amount of data available to organizations is increasing rapidly. While some of it may be useless, organizations should be able to leverage a large portion of it to make highly impactful decisions. With the right focus and the proper tools, organizations can steadily transform their processes, beginning to use data more systematically and effectively. Tools like Apteo’s data platform can help to reduce the amount of time it takes to effect this transformation. If you’re interested in learning more about how we think about data, feel free to drop us a line.
Shanif Dhanani is the co-founder & CEO of Apteo. Prior to Apteo, Shanif was a data scientist and software engineer at Twitter, and prior to that he was the lead engineer and head of analytics at TapCommerce, a NYC-based ad tech startup acquired by Twitter. He has a passion for all things data and analytics, loves adventure traveling, and generally loves living in New York City.