Last week, the Apteo team announced the launch of our cloud product. That launch was the culmination of many months of user discovery, problem validation, MVP creation, and beta testing. And while our team is proud of what we’ve built so far, we know that there’s a lot of work to do in order for us to solve some of the key problems we’re attacking.
As I mentioned in my launch post, we’re solving a few key problems around the use of data. Our long-term vision is to make it easy for anyone to incorporate data and advanced analytics into their decision-making. We’re starting in the world of finance, where we’ve seen folks from nearly every function and firm experience several challenges.
One of the most significant challenges that we’ve seen them face is also one of the most basic and fundamental — how to find the data they need. As the world creates a significant amount of new data every day, it becomes both harder and more time-consuming to find a new dataset that might be relevant to one’s work.
Fortunately, there are now a variety of new technologies available that can work together to solve this problem. In this post, I wanted to go over a few of the more interesting ones and how we’re starting to integrate them into our product.
It’s probably no surprise that there’s a large push in the world of business to increase the use of data to make better business decisions. Sometimes, this process is as simple as looking at your own usage data to better understand and optimize against user behavior. But increasingly, in the world of finance, the first step in this process is identifying and gathering datasets that have the potential to be additive to one’s work process.
In the world of sell-side banks, specifically in research and analysis roles, this could mean finding data and deriving insights that provide greater context into one’s research. By providing research that’s backed by hard data, sell-side researchers can increase the rate at which their clients consume their reports, which has a downstream positive effect on their revenue.
In the world of investment and portfolio management, this might look like finding operational or supplier data that can better forecast a company’s earnings, allowing these managers to improve their investment theses in those companies.
Some examples may help to set the stage.
Today, there are several companies that use satellite imagery and advanced visual processing to understand things like oil supplies, geopolitical activities, and even retail sales. For example, it’s now possible to use satellites to analyze and estimate the number of consumers at a store or mall and estimate the change in retail sales at that location from a prior period.
Analysts can subscribe to this data and supplement their financial models with more accurate estimates based on this information.
Today, it’s possible to gather information from publicly available websites. By creating tools that periodically check product pricing on retail websites, analysts can see whether prices are trending up or down, which enables them to better understand and predict future corporate sales and earnings.
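As a rough sketch of how that kind of price tracking works: a tool fetches a product page on a schedule, extracts the listed price, and compares it against an earlier snapshot. The page snippets, price format, and CSS class below are invented for illustration; a real scraper would fetch live pages and handle each retailer’s specific markup.

```python
import re

# Hypothetical product-page snippets captured on two different dates.
# In practice these would be fetched periodically and stored with a
# timestamp; the markup and prices here are made up.
PAGE_WEEK_1 = '<span class="price">$499.99</span>'
PAGE_WEEK_2 = '<span class="price">$479.99</span>'

def extract_price(html: str) -> float:
    """Pull the first dollar amount out of a product page."""
    match = re.search(r"\$([\d,]+\.\d{2})", html)
    if match is None:
        raise ValueError("no price found on page")
    return float(match.group(1).replace(",", ""))

def price_trend(old: float, new: float) -> str:
    """Classify the direction of a price change."""
    if new < old:
        return "falling"
    if new > old:
        return "rising"
    return "flat"

old_price = extract_price(PAGE_WEEK_1)
new_price = extract_price(PAGE_WEEK_2)
print(price_trend(old_price, new_price))  # falling
```

Aggregated across many products and retailers, trends like this become a leading signal on sales.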
Back in 2016, Foursquare, an app that lets users “check in” to whatever venue they’re currently located in, analyzed their data from users visiting Chipotle, and noticed a large drop in activity. That allowed them to accurately forecast a drop in same-store sales for the Mexican fast casual restaurant before nearly anyone else did.
There’s no question that having the right data can lead to making extremely prescient business decisions. But the size of the opportunity is rivaled by the difficulty in acquiring the right data.
Today, data is scattered. Governments make data available on their websites, frequently in hard-to-access locations. There are more than 400 recognized providers of paid alternative data, and on top of that, firms are producing huge quantities of proprietary data every day.
At Apteo, we think about the world of data as a pyramid — each layer is valuable, and each layer presents its own challenges when aggregating and analyzing data.
At the bottom of the pyramid is a huge treasure trove of public data — much of it very useful, but it’s the most scattered, disparate, unstructured, and varied. In the middle of the pyramid lies paid data. This is where we’d place the world of alternative data vendors. Their data can be useful, but is very niche, and while it’s more structured than public data, it can still come in a variety of different formats and structures. At the top of the pyramid lies internal proprietary data. This data tends to be extremely useful for companies, and is less varied and likely more structured than paid or free data.
As you go up the layers of the pyramid, it usually becomes easier to find the data you need, but with that said, there are definite and noticeable challenges in discovering data at each layer.
At the bottom of the pyramid, there’s a world of public data that’s extremely valuable for finance professionals. The U.S. government alone produces statistics on housing, interest rates, population, employment, and a variety of other topics. And the U.S. government isn’t the only public body to produce data. Today, nearly every advanced national government, along with many state and municipal agencies, provides data. In addition, large international organizations such as the World Health Organization and the United Nations also make data available to the public.
Unfortunately, because there are so many providers, each held to its own standards and each with its own limitations when it comes to technical resources, it can require a lot of time and effort just to become aware of an interesting dataset from one of these organizations. Google is a decent place to start, but we’ve heard many times that users either don’t know what to search for, don’t have time to sift through results, or don’t think to search for a particular topic when they’re beginning an analysis of a new problem.
The world of paid data is growing quickly. Much of it comes from data aggregators, but some comes from companies that sell data generated as a byproduct of their normal operations (this is known as data exhaust).
Many of these providers may be too small to market and advertise themselves, so despite the fact that they may have valuable data, it can be difficult to know where to turn when looking for a new dataset. Frequently, though, there are larger issues with paid data, primarily around data validation, which I’ll cover in a later post.
Finally, companies today are producing myriad structured and unstructured datasets. In financial firms that conduct research, we frequently see teams creating longitudinal surveys, in-depth research on companies, sophisticated financial models, and a variety of other advanced insights.
However, these insights and reports are frequently only available to the immediate teams that created them. Many firms don’t have a single, centralized location for structuring and sharing datasets. So even if someone in your firm created a dataset that could be useful to you, oftentimes you’d have to know that they created it, or reach out to someone who manages all things data, in order for you to gain access to it.
Fortunately, though, new technologies are helping solve the problem.
Machine learning is a field that’s near and dear to my heart, and it’s a field that’s advancing quickly. It has become extremely adept at structuring, organizing, and discovering patterns within data. For those very reasons, machine learning and data IO techniques can be used to help alleviate the problem of data discovery.
Today, it’s possible to read and write large amounts of data extremely quickly. Even more impactful, we can now query data quickly and efficiently with minimal, inexpensive infrastructure. Data storage techniques like Parquet on flat files in S3, or key-value and document stores like Redis and MongoDB, make it simple to run data-heavy applications with minimal lag time.
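The core pattern behind those stores is simple: metadata indexed by key, so lookups stay fast no matter how large the catalog grows. Here’s a minimal sketch of that pattern using SQLite’s in-memory mode as a stand-in for a store like Redis; the dataset names and metadata are invented for illustration.

```python
import json
import sqlite3
from typing import Optional

# An in-memory SQLite table acting as a toy key-value store.
# A production system would use Redis, Parquet on S3, etc.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE datasets (key TEXT PRIMARY KEY, value TEXT)")

def put(key: str, metadata: dict) -> None:
    """Store a dataset's metadata under its key."""
    db.execute(
        "INSERT OR REPLACE INTO datasets VALUES (?, ?)",
        (key, json.dumps(metadata)),
    )

def get(key: str) -> Optional[dict]:
    """Fetch metadata by key; the primary-key index keeps this fast."""
    row = db.execute(
        "SELECT value FROM datasets WHERE key = ?", (key,)
    ).fetchone()
    return json.loads(row[0]) if row else None

put("us-housing-starts", {"provider": "U.S. Census Bureau", "format": "csv"})
print(get("us-housing-starts")["provider"])  # U.S. Census Bureau
```

The same access pattern scales from a laptop demo to a distributed store without changing the application logic.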
Perhaps the most noticeable leap in recent machine learning techniques comes from the field of natural language processing. We can now model the statistical relationships between words to a point where we can begin to understand context, grammar, and meaning. This allows us to take advantage of text-based metadata to group and categorize data efficiently.
In addition, semantic search techniques (combined with efficient data storage) make it possible for people to interact with machines through increasingly natural language interfaces. This enables search to return more relevant results, even when users don’t type the exact keywords they’re looking for.
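To make the ranking side of this concrete, here’s a toy version of similarity-based search: dataset descriptions are turned into vectors and scored against a free-text query with cosine similarity. This sketch uses raw word counts, so it only catches shared words; a real semantic system would swap in learned text embeddings so that related-but-different words match too. The catalog entries are invented.

```python
from collections import Counter
from math import sqrt

# A hypothetical catalog of dataset descriptions.
DATASETS = {
    "retail-foot-traffic": "satellite estimates of shoppers at stores and malls",
    "housing-starts": "monthly government statistics on new home construction",
    "oil-storage": "satellite imagery analysis of crude oil storage tanks",
}

def vectorize(text: str) -> Counter:
    """Represent text as a bag of word counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(query: str) -> str:
    """Return the dataset whose description best matches the query."""
    qv = vectorize(query)
    return max(DATASETS, key=lambda k: cosine(qv, vectorize(DATASETS[k])))

print(search("how many shoppers visited stores"))  # retail-foot-traffic
```

The query never names the dataset, yet the right one ranks first; embeddings extend the same scoring to queries that share no words at all with the description.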
When you have a large amount of disparate data, it can be difficult to know whether some or all of that data is useful for the task you care about. However, today there are several techniques that help us determine, at least at a high level, which attributes in which datasets may be relevant to key objectives.
We can now automatically identify which datasets are worth examining further and which may even be indicative or predictive of a key metric (for example, revenue or earnings). By using these automatic techniques, we can further categorize datasets and even begin to associate them with individual stocks, bonds, and financial metrics.
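As a sketch of how such a screening pass might work (all series and figures below are hypothetical), candidate datasets can be scored by their correlation with the metric of interest, and only those that clear a threshold get flagged for a closer look:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# Hypothetical quarterly revenue (the key metric) and candidate series.
revenue = [100, 110, 125, 130, 150]
candidates = {
    "store-foot-traffic": [55, 60, 70, 72, 85],  # tracks revenue closely
    "regional-rainfall": [30, 80, 20, 90, 40],   # essentially noise
}

# Keep only series whose absolute correlation clears a threshold.
relevant = {
    name: round(pearson(series, revenue), 2)
    for name, series in candidates.items()
    if abs(pearson(series, revenue)) > 0.8
}
print(relevant)
```

Real screening is more involved (lags, out-of-sample checks, multiple-comparison corrections), but this captures the basic idea of automatically surfacing candidate datasets for a given metric.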
At Apteo, we’re incorporating many of these techniques into our product. We use Parquet, S3, Redis, and queuing services to ensure our data processing pipeline is optimized for speed.
We use natural language processing to examine any text-based information we have for datasets and use that to categorize those datasets based on their relevance. And we use feature engineering techniques to tag datasets down to the level of an individual company or metric. This way, you can start focusing on the things you really care about (like companies, financial metrics, stocks, bonds, sectors, and keywords) and we’ll show you not only the data, but also how that data can be useful for those things as you’re looking at them.
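To illustrate the tagging idea in its simplest possible form (this is not our actual pipeline; the keyword table, ticker, and descriptions are invented), a dataset’s text metadata can be matched against terms associated with each company or metric:

```python
# Hypothetical mapping from tags (tickers, metrics) to trigger words.
KEYWORDS = {
    "CMG": ["chipotle", "burrito"],
    "retail-sales": ["store", "mall", "shoppers", "retail"],
}

def tag(description: str) -> list:
    """Return the tags whose keywords appear in the description."""
    text = description.lower()
    return [t for t, words in KEYWORDS.items() if any(w in text for w in words)]

print(tag("Foursquare check-ins at Chipotle locations"))      # ['CMG']
print(tag("Satellite counts of shoppers at retail malls"))    # ['retail-sales']
```

In practice, NLP models replace the hand-built keyword table, but the output is the same: each dataset ends up associated with the specific companies and metrics it can inform.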
We’re working hard to make it easier for anyone to find the data they need when they need it, even if they don’t exactly know what they need at the time. But as much as we believe in the power of technology to make workflows easier and more efficient, we also believe that domain expertise and the power of human + machine will be increasingly important and relevant going forward.
While some believe that technology is a threat to the future of jobs, we believe that it will be an incredible enabler for vast amounts of productivity. While software is making greater strides in pattern recognition and computation, it is still far from being able to fully comprehend and understand the vast amount of latent and built-in knowledge that we have as humans.
In the world of finance, at least, there will still be a need for humans to provide domain expertise.
Machine learning algorithms suffer from the curse of dimensionality. We can’t simply throw in the entire kitchen sink of data and hope to get meaningful results back. Neural networks still require vast amounts of data to properly learn from their provided examples. Humans will need to identify useful data using their domain expertise and will have to guide how data will be used to train new algorithms.
Humans will need to interpret the output of many machine learning applications.
Humans will need to disregard spurious correlations.
Humans will need to distinguish leading indicators from lagging indicators.
Humans will need to identify new causes and effects.
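One way to see why the points above matter: with enough candidate features, some will correlate with any target purely by chance. A minimal illustration using nothing but random data (no real datasets involved):

```python
import random

random.seed(0)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# A short "metric" series, like 12 quarters of some KPI.
target = [random.random() for _ in range(12)]

def best_spurious_corr(n_features: int) -> float:
    """Max |correlation| between the target and n purely random features."""
    feats = [[random.random() for _ in range(12)] for _ in range(n_features)]
    return max(abs(pearson(f, target)) for f in feats)

print(best_spurious_corr(5))    # typically modest
print(best_spurious_corr(500))  # typically much higher, by chance alone
```

With 500 random features, the best chance correlation usually looks impressively strong, even though every feature is noise. That’s exactly the trap domain experts help avoid.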
Ultimately, humans will need to be heavily involved in finance. Their roles may change and new roles will emerge, but firms that leverage the power of human + machine will be the long-term winners.
In writing this, I hope to have provided some additional context on one of the biggest problems facing the financial industry today, and to have provided some context on new and exciting technologies that are being applied to solve those problems.
If you’ve made it this far and are excited about what we’re up to, we’re also hiring. Currently we have an open role for a Sr. Software Engineer and would love to hear from you if you’re interested in working on some cool data and analytics challenges.
If you think we’re completely off-base and have a bunch of great ideas about how to make our product better, we’re always looking for feedback. Feel free to email me at firstname.lastname@example.org.
If you think you could use our product for your own business, feel free to reach out: email@example.com.
Shanif Dhanani is the co-founder & CEO of Apteo. Prior to Apteo, Shanif was a data scientist and software engineer at Twitter, and prior to that he was the lead engineer and head of analytics at TapCommerce, a NYC-based ad tech startup acquired by Twitter. He has a passion for all things data and analytics, loves adventure traveling, and generally loves living in New York City.