Erez Katz, CEO and Co-founder, Lucena Research
Prolong the predictive value of Alt Data for Investment
While the big-data revolution is well underway, one of the biggest threats (and there are quite a few) a data provider is bound to face is the pace at which its value proposition may diminish. Commoditization of alternative data is a major challenge in most businesses. In the financial markets, the challenge is even greater, and it stems from two main factors.
First, news of solid and predictive data travels fast. With most data originating from the public domain, it's only a matter of time before someone else jumps on the bandwagon and sets up shop to deliver similar data at a lower price.
Second, the more customers license the same data, the faster its predictive value degrades. This conundrum has forced data providers to establish more granular and innovative pricing schemes that allow for broader distribution of their data offerings.
For example, rather than selling data that covers the S&P 500 to many customers, a data provider may sell their data on a per ticker basis. This allows such data to be offered, theoretically, to 500 distinct users without any data sharing.
No matter how you slice it, unless the data provider has proprietary data or a unique curation algorithm, outstanding customer service, and first-to-market advantage, it's just a matter of time before it faces the threat of overexposure and potential obsolescence.
Prolonging the lifespan of alt data
At Lucena Research, we’ve thought long and hard about this challenge and believe we have found a way not only to extend data’s distribution reach, but more importantly to protect the data provider’s IP (the raw data).
Incorporating a new data set into our platform follows a regimented process that ingests the data and validates its readiness for machine learning research. Once the data has been validated, we move swiftly into feature engineering, the process of creating derived features that are friendly to machine learning. An added benefit is that derived features effectively hide our data partners’ IP. Below is a visual example of the onboarding process for a new alternative data source.
For example, imagine you have a daily price-to-earnings ratio (PE ratio) for every constituent in the S&P 500. An asset’s PE value of 12, for example, matters less than how that PE ratio ranks against its peers (or the entire S&P 500 universe). Ranked PE features are derived from the raw PE values. Beyond how a constituent ranks relative to its peers, we are also interested in how that ranking changes over time.
If IBM’s PE ratio ranking changes from position 150 to position 2 over a 10-day period, that’s meaningful information. The machine can use this information to detect abnormal or unexpected behavior, which could very well turn out to be predictive.
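The ranked-PE idea above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Lucena's implementation: the tickers (AAA, BBB, CCC), the PE values, and the function names cross_sectional_rank and rank_change are all hypothetical, and rank 1 is arbitrarily assigned to the lowest PE.

```python
from typing import Dict, List


def cross_sectional_rank(values: Dict[str, float]) -> Dict[str, int]:
    """Rank tickers on a single day; rank 1 = lowest value.

    Ties are broken alphabetically by ticker (an arbitrary choice for this sketch).
    """
    ordered = sorted(values, key=lambda ticker: (values[ticker], ticker))
    return {ticker: position for position, ticker in enumerate(ordered, start=1)}


def rank_change(daily_values: List[Dict[str, float]], lookback: int) -> Dict[str, int]:
    """Difference between today's rank and the rank `lookback` days earlier."""
    latest = cross_sectional_rank(daily_values[-1])
    earlier = cross_sectional_rank(daily_values[-1 - lookback])
    return {t: latest[t] - earlier[t] for t in latest if t in earlier}


# Hypothetical daily PE ratios for three made-up tickers (oldest day first).
pe_history = [
    {"AAA": 12.0, "BBB": 18.0, "CCC": 25.0},
    {"AAA": 15.0, "BBB": 18.5, "CCC": 24.0},
    {"AAA": 30.0, "BBB": 19.0, "CCC": 23.0},
]

ranks_today = cross_sectional_rank(pe_history[-1])
changes = rank_change(pe_history, lookback=2)
print(ranks_today)
print(changes)
```

Note that the derived features (the ranks and their changes) carry the relative information a model needs without exposing the raw PE values themselves.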
How to minimize data overexposure
In practice, feature engineering is not as straightforward as the example above suggests. We apply quite a few techniques, including bias removal, gap filling, aggregation, interpolation, and normalization. I won't go into detail on each of these techniques here, but feel free to comment below if you would like more information.
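To make three of these techniques concrete, here is a minimal sketch of gap filling, interpolation, and normalization on a toy series. It assumes forward-fill for gap filling, linear interpolation between known neighbors, and z-score normalization; these are common textbook variants chosen for illustration, not necessarily the ones used on Lucena's platform, and all function names are hypothetical.

```python
import math
from typing import List, Optional


def forward_fill(series: List[Optional[float]]) -> List[Optional[float]]:
    """Gap fill: carry the last known value forward (leading gaps stay None)."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def linear_interpolate(series: List[Optional[float]]) -> List[Optional[float]]:
    """Fill interior gaps on a straight line between the nearest known points.

    Leading and trailing gaps are left as None.
    """
    out = list(series)
    known = [i for i, value in enumerate(out) if value is not None]
    for left, right in zip(known, known[1:]):
        step = (out[right] - out[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = out[left] + step * (i - left)
    return out


def zscore(values: List[float]) -> List[float]:
    """Normalize to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


raw = [10.0, None, None, 16.0, 18.0]  # hypothetical series with a two-day gap
filled = forward_fill(raw)
interpolated = linear_interpolate(raw)
normalized = zscore(interpolated)
print(filled)
print(interpolated)
print(normalized)
```

Forward-fill avoids using future values, while interpolation looks ahead to the next known point, so the right choice depends on whether the feature is meant for live prediction or for historical cleanup.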
So far we’ve covered feature engineering, which protects our data partners’ IP, but we haven’t solved the overexposure problem. After all, if we expose the feature-engineered data to too many buy-side clients, we face the same original challenge: alpha decay stemming from overexposure.
In general, there should be a much wider selection of engineered features compared to the raw features. More importantly, Lucena’s derived offerings -- QuantDesk, Smart Data Feeds, and Model Portfolios -- all embed a feature selection process set to identify a subset of features most relevant to a specific mandate.
The goal is to take a data consumer through a set of guided questions that outline the problem they are trying to solve. Our machine learning classifier then sifts through a myriad of data sets and factors and constructs a “custom” data feed best suited to the investment style of interest. This process incorporates multiple orthogonal data sets and factors into a single cohesive data feed designed specifically for the scenario outlined.
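As a rough illustration of what "selecting the features most relevant to a mandate" can mean, the sketch below scores each candidate feature by its absolute Pearson correlation with a target (here, hypothetical forward returns) and keeps the top k. This simple filter is a stand-in chosen for brevity; it is not Lucena's classifier, and the feature names, values, and function names are made up for the example.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def select_features(features: Dict[str, List[float]],
                    target: List[float], k: int) -> List[str]:
    """Return the k feature names with the highest |correlation| to the target."""
    scores = {name: abs(pearson(column, target)) for name, column in features.items()}
    # Sort by descending score, then by name so ties resolve deterministically.
    return sorted(scores, key=lambda name: (-scores[name], name))[:k]


# Hypothetical derived features and forward returns over five periods.
features = {
    "pe_rank_change": [1.0, 2.0, 3.0, 4.0, 5.0],
    "sentiment_z": [5.0, 3.0, 4.0, 1.0, 2.0],
    "noise": [2.0, 2.0, 2.0, 2.0, 2.0],
}
forward_returns = [0.01, 0.02, 0.03, 0.04, 0.05]

selected = select_features(features, forward_returns, k=2)
print(selected)
```

A different mandate would supply a different target (for example, a longer horizon or a different universe), which in turn would surface a different subset of features from the same pool.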
With such a wide array of possible combinations across all 950 factors in our database, the opportunities are virtually endless, and the risk of overexposing one model to the masses is greatly reduced.
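A quick back-of-the-envelope count shows how fast the space grows. The sketch below assumes a feed is defined simply as an unordered subset of k factors out of 950, which ignores weighting, time horizons, and universes, so it understates the real variety. The subset sizes are arbitrary choices for illustration.

```python
import math

TOTAL_FACTORS = 950  # factor count cited in the article

# Number of distinct unordered subsets of k factors, for a few small k.
subset_counts = {k: math.comb(TOTAL_FACTORS, k) for k in (2, 3, 5, 10)}

for k, count in subset_counts.items():
    print(f"{k} factors: {count:,} possible subsets")
```

Even at two factors per feed there are 450,775 distinct pairs, and the count climbs steeply with each additional factor.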
Repackage your offerings to monetize your data
With advanced feature engineering and machine learning model diversity, data can be repurposed for a wide array of users without running the risk of commoditization due to overexposure. The number of ways to combine multiple features from orthogonal data sets is virtually endless. Ultimately, the user receives a customized and cohesive data feed built to their specifications.
More importantly, this delivery mechanism does not jeopardize the data providers’ IP.
Lucena’s platform creates an event study analysis graph, a backtest report, and even a perpetually monitored model portfolio, all based on a set of derived features from multiple unrelated data providers.
In turn, the model can be used for daily data feed delivery so that it can be incorporated into a portfolio manager’s research process.
Using Lucena's investment research platform, QuantDesk, you can select or upload a constituent universe, choose an investment style and time horizon, and send our ML classifier on an “expedition” geared to construct a best-of-breed data feed specific to your needs.
Want to know more about our derived offerings?
Comment below or reach out here.