At my startup, Apteo, we’re working hard to help companies use data and analytics more efficiently, starting with the world of finance. It’s clear that organizations are making a big push to increase their use of both external and internal data to improve how they make decisions, but they’re all facing significant challenges when attempting to do so.
In my last post, I discussed one of the most common problems that we’ve seen organizations face — discovering relevant data. In this post, I go over another common problem that’s particularly troublesome in the world of finance — how to know if a new dataset can be relevant to one’s workflow. I refer to this issue as “data validation”.
Today’s finance professionals are constantly being sold new data. In fact, according to the site alternativedata.org, there are more than 400 different alternative data providers today.
These providers are selling everything from job statistics to credit card transactions to foot traffic information to satellite imagery analytics. For finance professionals to know whether a dataset will add value to their workflow, they need to test it to see whether it answers key questions or enhances a key part of their research process.
Buy-side firms, sell-side firms, and corporate data teams can all benefit from using data to improve their workflows. Analysts and portfolio managers at buy-side firms frequently use data to model company earnings and revenue. Researchers on the sell-side can leverage data to supplement their theses about whether a company's performance will be strong or weak in the coming quarter. Corporate teams can use data for anything from allocating marketing spend to identifying high-potential hires to improving their internal operations.
But data by itself is useless. It's the application of data to key problems that facilitates informed decision-making. Financial firms frequently use data either as an associative data point in an underlying analysis or as a leading or predictive indicator for a key metric.
Data that’s associated with a key metric can provide analysts with a starting point for researching factors that affect a company’s financial strength. Data that’s predictive of a key metric can help forecast that metric into the future, allowing analysts to create more precise and refined models of a company’s financials.
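The distinction between associative and predictive data can be made concrete with a quick correlation check. The sketch below uses made-up foot-traffic and revenue numbers (purely illustrative, not real data): a contemporaneous correlation captures association, while a lagged correlation hints at whether the data leads the metric.

```python
import numpy as np

# Hypothetical example: foot-traffic counts vs. a revenue proxy (illustrative values)
foot_traffic = np.array([120.0, 135, 150, 160, 155, 170, 180, 190])
revenue = np.array([1.0, 1.1, 1.25, 1.3, 1.28, 1.4, 1.45, 1.55])

# Contemporaneous (associative) correlation
assoc_corr = np.corrcoef(foot_traffic, revenue)[0, 1]

# Lagged (leading-indicator) correlation: does last period's traffic
# relate to this period's revenue?
lead_corr = np.corrcoef(foot_traffic[:-1], revenue[1:])[0, 1]

print(round(assoc_corr, 2), round(lead_corr, 2))
```

A high lagged correlation alone doesn't prove predictive value, but it is a cheap first screen before building a full forecasting model.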
Financial firms can benefit greatly from the intelligent application of data to their workflows. The right type of data can be extremely valuable, but knowing precisely what makes a dataset the “right type” isn’t always straightforward.
While some large organizations have data science teams, many firms don't have the resources to employ expensive data scientists. Even firms that do have data scientists may not have enough of them to get through all of the analysis work required — validation being just a small portion of it. With hundreds of different vendors selling thousands of different datasets, some containing billions of individual records, even firms with data science teams may face a large backlog of datasets to test.
Meanwhile, firms that don’t have data science resources have to either outsource analysis tasks, use simple charting or graphing methods for validating data (which is only doable when the dataset fits into their BI tool), forego validating useful datasets, or use their gut instinct to determine whether to purchase a new dataset.
For those firms that do have data science teams, the process of validating a new dataset isn't always straightforward. Some datasets arrive large, unaggregated, and unclean. These datasets need to be loaded into a data warehouse (and if one doesn't exist, it would need to be created), cleaned (which can involve imputing missing values, joining the dataset to another, combining different attributes into a new attribute, or performing a variety of other transformations), and aggregated before any analysis can be run.
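The cleaning and aggregation steps described above can be sketched in pandas. The schema and values below are hypothetical (not from any real vendor): rows with missing dates are dropped, missing amounts are imputed with a per-merchant median, and the result is aggregated to monthly totals.

```python
import numpy as np
import pandas as pd

# Hypothetical raw transaction records (illustrative schema, not a real vendor feed)
raw = pd.DataFrame({
    "merchant": ["A", "A", "B", "B", "B"],
    "date": pd.to_datetime(
        ["2023-01-05", "2023-02-10", "2023-01-20", "2023-02-02", None]
    ),
    "amount": [10.0, np.nan, 25.0, 30.0, 5.0],
})

# Clean: drop rows missing a date, impute missing amounts with the merchant's median
clean = raw.dropna(subset=["date"]).copy()
clean["amount"] = clean.groupby("merchant")["amount"].transform(
    lambda s: s.fillna(s.median())
)

# Aggregate: total spend per merchant per month
monthly = (
    clean.set_index("date")
         .groupby("merchant")["amount"]
         .resample("MS")
         .sum()
         .reset_index()
)
print(monthly)
```

In practice the same logic would run at warehouse scale (SQL, Spark, etc.), but the shape of the work — drop, impute, join, aggregate — is the same.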
Depending on the data warehouse system, the analysis itself could take significant time and technical expertise to engineer. For example, a credit card transaction dataset containing billions of individual records might be loaded into a Hadoop system, which would then require a Java, Scala, or similar job to parse through all of the records in a relevant ETL job, transforming them as needed (e.g., taking the percent change from one period to the next). The result of that job might then be fed into an analytical model, dashboard, or report that would need to be analyzed further.
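At full scale this transform would run inside the ETL job itself, but the period-over-period calculation it performs is simple. A minimal sketch, assuming the transactions have already been aggregated into monthly totals (the values here are invented):

```python
import pandas as pd

# Hypothetical monthly aggregates produced by an upstream ETL job
monthly_spend = pd.Series(
    [100.0, 110.0, 99.0, 120.0],
    index=pd.period_range("2023-01", periods=4, freq="M"),
)

# Percent change from one period to the next, as described above
pct_change = monthly_spend.pct_change() * 100
print(pct_change.round(1).tolist())  # first entry is NaN (no prior period)
```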
The most sophisticated analyses require statistical methodologies to be applied, or predictive models to be built. Since a dataset needs to be additive to an analyst's workflow, which may already include other datasets, a multivariate approach is required, whereby the new dataset's value is measured incrementally against what the analyst can already get from the data they have.
This may require a data science expert who understands how to build additive models and compare the accuracy of a model that includes the dataset against one that does not. That data scientist would need to understand techniques like stepwise regression, cross-validation, simulation, and any number of other analytical methods.
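The incremental-value test described above can be sketched with ordinary least squares and k-fold cross-validation. Everything here is synthetic and illustrative: `X_base` stands in for the analyst's existing features, `x_new` for the candidate dataset, and the new dataset is judged additive if it lowers cross-validated error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup: predict a key metric y from existing features X_base,
# then test whether the candidate dataset x_new is additive.
n = 200
X_base = rng.normal(size=(n, 2))
x_new = rng.normal(size=(n, 1))
y = X_base @ np.array([1.5, -0.7]) + 0.5 * x_new[:, 0] + rng.normal(scale=0.3, size=n)

def cv_mse(X, y, k=5):
    """k-fold cross-validated mean squared error of an OLS fit."""
    idx = np.arange(len(y))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        Xtr = np.column_stack([np.ones(len(train)), X[train]])
        Xte = np.column_stack([np.ones(len(fold)), X[fold]])
        beta, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        errs.append(np.mean((Xte @ beta - y[fold]) ** 2))
    return float(np.mean(errs))

mse_without = cv_mse(X_base, y)
mse_with = cv_mse(np.hstack([X_base, x_new]), y)
print(mse_without, mse_with)  # the dataset is additive if mse_with is lower
```

Real validation work adds many wrinkles (time-series splits, regularization, multiple candidate datasets), but this is the core comparison: error with the dataset versus error without it.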
Data is becoming more and more complex, and it requires significantly more resources to analyze. The amount of time it takes to analyze data is growing, and the techniques needed to do so are becoming increasingly sophisticated. At Apteo, we're focused on creating a platform that allows domain experts to discover and analyze data without requiring extensive data science expertise.
While we believe that data science expertise will always be needed, we see the world evolving to a point where several standardized tasks can be automated, freeing up time for high-value firm employees to focus on new, creative endeavors. We believe that implementing some standard data science techniques into an intuitive, easy-to-use platform can help professionals answer their questions and achieve their objectives more efficiently.
As a data scientist, I find the process of validating data an interesting technical challenge. But as someone running a data and analytics company dedicated to helping everyone make better decisions, I see it as part of a larger journey that our company is undertaking. We're always looking for feedback, insights, and more information about the problems we're solving, and we'd love to hear from you if you're interested.
If you think we’re completely off-base and have a bunch of great ideas about how to make our product better, we’re always looking for feedback. Feel free to email me at email@example.com.
If you think you could use our product for your own business, feel free to reach out: firstname.lastname@example.org.
Shanif Dhanani is the co-founder & CEO of Apteo. Prior to Apteo, Shanif was a data scientist and software engineer at Twitter, and prior to that he was the lead engineer and head of analytics at TapCommerce, a NYC-based ad tech startup acquired by Twitter. He has a passion for all things data and analytics, loves adventure traveling, and generally loves living in New York City.