Train and test ML algo on CQ

lh466822 · October 2019

I’m interested in implementing an ML algorithm on the platform. This would involve the following:

Loading all historical data for a large number of stocks. I would need as much daily data as possible for up to 50 symbols. Ideally 6 or more years of data for training and testing, then another 2 years of unseen data to backtest. This data would need to be loaded into a pandas DataFrame.
A model from Keras or sklearn would then be trained, and the model saved to a pickle file.
The .pkl model file would be loaded during the backtest, unseen data would be processed by the model, and a signal generated.

Can this be done on CQ, and if so, can someone kindly outline how to go about it? It would be great, in particular, to get historical data in a dataframe, and create a kind of research environment. Thanks for any advice!
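
For illustration, the workflow described in this post (chronological split, train, save to a pickle file, reload, generate signals on unseen data) could be sketched roughly as below for a local run. This is only a minimal sketch under stated assumptions: the prices are synthetic, the "model" is a toy threshold rule standing in for a Keras or sklearn model, and the names `fit`, `predict` and `model.pkl` are hypothetical. It uses only the standard library, whereas a real setup would hold the data in a pandas DataFrame, and it says nothing about what the platform itself permits.

```python
import os
import pickle
import random
import tempfile

# Synthetic daily closes stand in for real price history (assumption).
random.seed(0)
closes = [100.0]
for _ in range(1999):
    closes.append(closes[-1] * (1.0 + random.gauss(0.0003, 0.01)))

# Chronological split: never shuffle a time series; the last 25% stays unseen.
cut = int(len(closes) * 0.75)
train, unseen = closes[:cut], closes[cut:]


def fit(prices, lookback=20):
    """Toy 'training': store the mean lookback-day return as a threshold."""
    rets = [prices[i] / prices[i - lookback] - 1.0
            for i in range(lookback, len(prices))]
    return {"lookback": lookback, "threshold": sum(rets) / len(rets)}


def predict(model, prices):
    """Return 1 (long) if the latest lookback-day return beats the threshold, else 0."""
    lb = model["lookback"]
    recent = prices[-1] / prices[-1 - lb] - 1.0
    return 1 if recent > model["threshold"] else 0


model = fit(train)

# Save the fitted model to a .pkl file, then reload it as a backtest would.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as fh:
    pickle.dump(model, fh)
with open(path, "rb") as fh:
    loaded = pickle.load(fh)

# Walk forward through the unseen data, using only prices known at each step.
signals = [predict(loaded, unseen[:t])
           for t in range(loaded["lookback"] + 1, len(unseen) + 1)]
```

The point of the split is that nothing after `cut` is touched during fitting, and each signal is computed only from prices available up to that day.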

ptunney · October 2019

We do encourage ML on our system; however, due to various limits and restrictions, its usage is extremely limited for external users.

The difficulty with training and testing ML is that it requires two things: large amounts of data and large amounts of processing power. Due to licensing agreements, we cannot allow users to download large swaths of data. At the same time, as a free service, we have CPU constraints per user that make it impractical to run ML on the platform. As a third issue, even if you were able to train a model locally with a small amount of data, it would likely result in a pickle file. Pickle files are among the restricted upload file types for our system.

jdessain · December 2021

Some initial reactions. I hope they help.

You need much more than 6 years for training and testing! You want to test your algorithm in all sorts of environments: strong rally, crash, high and low volatility, trend reversal, ... I strongly advise you to use 20+ years of data with AT LEAST 5 years of testing (2016-2020 gives you a crisis, a recovery, high and low vols, ...).
For the model, sklearn does not offer a lot of flexibility. I advise you to invest time and energy in TensorFlow 2.0 (with its Keras API) or PyTorch.
Make sure that your model is reproducible and that the same data gives the same results. As soon as you use LSTM or dropout, this is harder to achieve. That is one reason I prefer PyTorch (reproducibility is easier to achieve).
NEVER rely on a result based on statistical evaluation (like MSE, MAE, RMSE or accuracy, F1, ...). Test it with a real-life investment strategy.
Don't look only at the returns; also look at the Sharpe (or Sortino) ratio, as you want to consider return AND risk. Even better is the "D ratio", which is explained at https://ssrn.com/abstract=3927058 and for which Python code is available on GitHub.
Read, read a lot. And CHECK, as not everything you read is valid.
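
To make the points about testing with an investment strategy and looking at risk-adjusted returns concrete, here is a minimal sketch of how model signals might be turned into strategy returns and scored with annualized Sharpe and Sortino ratios. The return and signal values are made up, the function names are hypothetical, and the conventions (252 trading days, zero risk-free rate, zero target return) are assumptions. The D ratio from the linked paper is not reproduced here.

```python
import math


def strategy_returns(asset_returns, signals):
    """Apply the previous day's signal to the current day's return (no look-ahead)."""
    return [s * r for s, r in zip(signals[:-1], asset_returns[1:])]


def sharpe_ratio(returns, risk_free=0.0, periods=252):
    """Annualized Sharpe ratio: mean excess return over its sample standard deviation."""
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / len(excess)
    var = sum((x - mean) ** 2 for x in excess) / (len(excess) - 1)
    if var == 0.0:
        return float("nan")  # undefined when returns never vary
    return mean / math.sqrt(var) * math.sqrt(periods)


def sortino_ratio(returns, target=0.0, periods=252):
    """Annualized Sortino ratio: mean excess return over downside deviation."""
    excess = [r - target for r in returns]
    mean = sum(excess) / len(excess)
    downside = math.sqrt(sum(min(x, 0.0) ** 2 for x in excess) / len(excess))
    if downside == 0.0:
        return float("nan")  # undefined when there are no losses
    return mean / downside * math.sqrt(periods)


# Made-up daily returns and long/flat signals, for illustration only.
asset_returns = [0.01, -0.02, 0.015, 0.003, -0.007, 0.012]
signals = [0, 1, 1, 1, 0, 1]

strat = strategy_returns(asset_returns, signals)
sharpe = sharpe_ratio(strat)
sortino = sortino_ratio(strat)
```

Shifting the signal by one day matters: a signal computed from today's close can only be traded from the next bar onward, so pairing it with the same day's return would overstate performance.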
