The Automated Stock Selection Process

The stock investment universe is huge. For an automated trading strategy, however, it is even bigger, since every day offers this entire universe to choose from. The buy $ \& $ holder makes one choice on a number of stocks and sticks to it, whereas the automated trader faces a trading decision each day for each stock in the portfolio, be it to hold, buy, or sell, from this immense opportunity set. There is a need to reduce it to a more manageable size.

Say you buy all the stocks in $ \mathbf{SPY} $ equally (bet size $ \$ $20k per trade on a $ \$ $10M portfolio). The payoff of that strategy is $ \sum_1^{500}(h_{0j} \cdot \Delta P) $ where $ h_{0j} $ is the initial quantity bought in each stock $ j=1, \dots, 500 $ and $ \Delta P $ is the price change of that stock over the holding period.

Evidently, the outcome should tend towards that of having bought $ \mathbf{SPY} $ alone: $ \sum_1^{500}( h_{0j} \cdot \Delta P) \to \sum (H_{spy} \cdot \Delta P) $. At least theoretically, that should be the long-term expectation, whether looking backward or forward.

$\quad \quad \mathsf{E} [ \sum_1^{500}(h_{0j} \cdot \Delta P) ] \to \sum (H_{spy} \cdot \Delta P) $

This says that the expected long-term profit should tend to be the same as having bought $ \$ $10M of $ \mathbf{SPY} $ and held on. Nothing more than a simple Buy $ \& $ Hold scenario.
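To make the Buy $ \& $ Hold payoff concrete, here is a minimal sketch of the sum $ \sum (h_{0j} \cdot \Delta P) $ for an equal-dollar allocation. The prices are made-up toy numbers, the function names are mine, and fractional shares are assumed for simplicity.

```python
def initial_holdings(start_prices, bet_size):
    """Shares bought per stock (h_0j) for a fixed dollar bet; fractional shares allowed."""
    return [bet_size / p for p in start_prices]


def buy_and_hold_profit(holdings, start_prices, end_prices):
    """Sum over all stocks of h_0j * (end price - start price)."""
    return sum(h * (p1 - p0) for h, p0, p1 in zip(holdings, start_prices, end_prices))


# Hypothetical toy data: 500 stocks all starting at $50;
# half end 10% higher, half end 5% lower.
start = [50.0] * 500
end = [55.0] * 250 + [47.5] * 250

h0 = initial_holdings(start, bet_size=20_000)
invested = sum(h * p for h, p in zip(h0, start))
profit = buy_and_hold_profit(h0, start, end)

print(f"invested: ${invested:,.0f}  profit: ${profit:,.0f}")
```

Nothing in this calculation involves a trading decision after the initial purchase: the holdings $ h_{0j} $ are fixed on day one and only the prices move.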

However, you decide to trade your way over this decade-long interval, expecting that your kind of expertise will give you an edge and help you generate more than having held on to $\mathbf{SPY}$. But the conundrum you have to tackle is this long-term notion which says that the longer you play, the more your outcome will tend to market averages.

Your Game Expectations

To obtain more, you change your "own" expectation for the game you intend to play to: $ \mathsf{E} [ \sum (H_k \cdot \Delta P) ] > \sum (H_{spy} \cdot \Delta P) $ by designing your own trading strategy $ H_k $. I would expect you to want more than just outperforming the averages, much more, as in:

$\quad \quad \displaystyle{ \mathsf{E} [ \sum (H_k \cdot \Delta P) ] \gg \sum (H_{spy} \cdot \Delta P)} $

It is not the market that is changing in this quest of yours; it is you, by considering your opportunity set of available methods of play. You opt to reconfigure the game into something you think you can do.

You have analyzed the data and determined that if you had done this or that in a simulation, it would have been more profitable than a Buy $ \& $ Hold scenario.

Fortunately, you also realized soon enough that past data is just that, past data, and that future data has no obligation to follow "your" expectations. Nevertheless, you can study the past, observe what worked and what did not, and from there design better systems by building on the shoulders of those you appreciate the most.

Can past data have some hidden gems, anomalies? Yes. You can find a lot of academic papers on that very subject. But those past "gems" might not be available in future data, or might be such rare occurrences that you could not anticipate "when" they will or might happen again. That is another way of saying that your trading methods might be fragile, to put it mildly.

A Unique Selection

The content of $ \mathbf{SPY} $ is a unique set of stocks. In itself, a sub-sample of a much larger stock universe. Taking a sub-sample of stocks from $ \mathbf{SPY} $ (say 100 of its stocks) will generate another unique selection.

There are $ C_{100}^{500} = \frac{500!}{100! \cdot 400!} = 2.04 \times 10^{107} $ combinations in taking 100 stocks out of 500. And yet, the majority of sets will tend to some average performance $ \sum (H_n \cdot \Delta P) \to \sum (H_{spy} \cdot \Delta P) $ where $ n $ could be that 1 set from the $ 2.04 \times 10^{107} $ available. Such a set from the $ \mathbf{SPY} $ would have passed other basic selection criteria such as: high market caps, liquidity, trading volume, and more.

No one is going to try testing all set samples based on whatever criteria or whatever method. It would take millions of lifetimes and a lot more than all the computing power on the planet. $ 10^{107} $ is a really really huge number. The only choice becomes taking a sub-sub-sub-sample of what is available. So small, in fact, that whatever method used in the stock selection process, you could not even express the notion of strict representativeness.

To be representative of the whole would require that we have some statistical measure of some kind on the $ 10^{107} $ possible choices. We cannot express a mean $ \mu $ or some standard deviation $ \sigma $ without having surveyed a statistically significant fraction of the data.

The problem gets worse if you consider 200 stocks out of the 500 in the $ \mathbf{SPY} $. There, the number of combinations would be: $ C_{200}^{500} = \frac{500!}{200! \cdot 300!} = 5.05 \times 10^{144} $. This is not a number that is 35$ \% $ larger than the first one. It is more than $ 10^{37} $ times larger.
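These counts are easy to check. The sketch below uses `math.comb` from Python's standard library (Python 3.8 or later), which works on exact integers; the helper name `n_selections` is just mine for illustration.

```python
import math


def n_selections(universe_size, portfolio_size):
    """Number of distinct portfolios of a given size that can be drawn from a universe."""
    return math.comb(universe_size, portfolio_size)


c100 = n_selections(500, 100)   # 100 stocks out of 500
c200 = n_selections(500, 200)   # 200 stocks out of 500

print(f"C(500, 100) = {c100:.2e}")
print(f"C(500, 200) = {c200:.2e}")
print(f"ratio       = {c200 // c100:.2e}")
```

The printed values should agree with the figures quoted above, and the ratio shows by how many orders of magnitude the opportunity set grows when going from 100 to 200 stocks.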

We are, therefore, forced to accept a very low number of stock selection sets in our simulations. Every time we make a 100-stock or 200-stock selection we should realize that that selection is just 1 in $ 2.04 \times 10^{107} $ or 1 in $ 5.05 \times 10^{144} $ respectively. But, that is not the whole story.

Portfolio Rebalancing

If you are rebalancing your 100 stocks every day, you have a new set of choices which will again result in 1 set out of $ 2.04 \times 10^{107} $. This is to say that your stock selection can change from day to day for some reason or other, and that each such selection is also very, very rare. So rare, in fact, that it should not even be considered as a sample, not even a sub-sub-sub-sample. The number of combinations is simply too large for any one selection to be called representative of the whole, even if in all probability it behaves like the whole, since the majority of those selections will tend to the average outcome anyway.
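The claim that most selections tend to the average outcome can be illustrated, not proven, with a small simulation. Everything here is an assumption: the 500 "returns" are synthetic draws from a normal distribution, and 2,000 random subsets remain a vanishing fraction of the possible sets.

```python
import random
import statistics


def subset_mean_returns(returns, subset_size, n_trials, seed=0):
    """Average return of n_trials random subsets drawn without replacement."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.sample(returns, subset_size))
        for _ in range(n_trials)
    ]


# Hypothetical universe: 500 synthetic returns (mean 8%, std 30%).
data_rng = random.Random(42)
universe = [data_rng.gauss(0.08, 0.30) for _ in range(500)]

universe_mean = statistics.fmean(universe)
means = subset_mean_returns(universe, subset_size=100, n_trials=2000)

print(f"universe mean return:      {universe_mean:.4f}")
print(f"average of subset means:   {statistics.fmean(means):.4f}")
print(f"spread of subset means:    {statistics.pstdev(means):.4f}")
print(f"spread of single stocks:   {statistics.pstdev(universe):.4f}")
```

In this toy setting the subset averages cluster tightly around the universe average, much more tightly than individual stocks do. A selection made by a rule, such as a sort, is not a random draw, so this says nothing about how such a selection will behave.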

As a consequence, people do simplify the problem. For instance, they sort by market capitalization and take the top 100 stocks. This makes it a unique selection too, not a sample, but a 1 in $ 2.04 \times 10^{107} $. Not only that, but it will always be the same for anyone else using the same selection criterion. As such, this "sample" could not be considered as representative of the whole either, but just as a single instance, a one of a kind. It is the selection criterion used that totally determines this unique selection. It is evidently upward biased by design and will also be unique going forward.
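A market-cap sort is fully deterministic, which is why everyone using it ends up with the same set. A minimal sketch, with invented tickers and capitalizations and an alphabetical tie-break that is my own assumption:

```python
def top_by_market_cap(market_caps, n):
    """Return the n tickers with the largest market cap (ties broken alphabetically)."""
    ranked = sorted(market_caps.items(), key=lambda kv: (-kv[1], kv[0]))
    return [ticker for ticker, _ in ranked[:n]]


# Hypothetical tickers and market caps in $ billions, for illustration only.
caps_trader_a = {"AAA": 900.0, "BBB": 450.0, "CCC": 1200.0, "DDD": 80.0, "EEE": 300.0}

# A second trader holds the same data in a different order.
caps_trader_b = dict(reversed(list(caps_trader_a.items())))

selection_a = top_by_market_cap(caps_trader_a, n=3)
selection_b = top_by_market_cap(caps_trader_b, n=3)

print(selection_a)
print(selection_a == selection_b)
```

Given the same data and the same criterion, the two traders cannot end up with different portfolios; only their trading procedures can differ.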

Making such a stock selection ignores $ 2.04 \times 10^{107} - 1 $ other possible choices! Moreover, if many participants adopt the same market capitalization sort, they too are ignoring the majority of other possible selection methods. They end up dealing with the very same set of stocks over and over again, whatever modification they make to their trading procedures.

The notion of market diversity might not really be part of that equation. It is the trading procedures and the number of stocks used that will differentiate those strategies. But, ultimately, it leads to some curve-fitting of the data in order to outperform! And that is not the best way to go.

Reducing Volatility

If you want to reduce volatility, one of the easiest ways is to simply increase the number of stocks in the portfolio. Instead of dealing with only 100 stocks, you go for 200! Then any stock might start by representing only 0.5$ \% $ of the total, which minimizes the impact of any one stock going bad. The converse also applies: those performing better will have their impact reduced too. Diversifying more by increasing the number of stocks will increase the number of possible choices to $ 5.05 \times 10^{144} $. Yet, by going the sorted market capitalization route, you are again left with one and only one set of stocks for each trading day.
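As a rough illustration of this effect, the sketch below uses the textbook formula for an equally weighted portfolio in which every stock has the same volatility $ \sigma $ and every pair has the same correlation $ \rho $: $ \sigma_p = \sigma \sqrt{1/N + (1 - 1/N)\rho} $. Both simplifications, and the 30$ \% $ volatility and correlation values, are my assumptions and not properties of any real stock set.

```python
import math


def equal_weight_volatility(n_stocks, stock_vol, avg_corr):
    """Volatility of an equal-weight portfolio under equal vols and equal pairwise correlation."""
    w = 1.0 / n_stocks
    variance = stock_vol ** 2 * (w + (1.0 - w) * avg_corr)
    return math.sqrt(variance)


# Hypothetical inputs: 30% volatility per stock, two correlation scenarios.
for corr in (0.0, 0.3):
    vol_100 = equal_weight_volatility(100, stock_vol=0.30, avg_corr=corr)
    vol_200 = equal_weight_volatility(200, stock_vol=0.30, avg_corr=corr)
    print(f"corr={corr:.1f}  100 stocks: {vol_100:.4f}  200 stocks: {vol_200:.4f}")

print(f"initial weight per stock with 200 stocks: {1 / 200:.1%}")
```

Going from 100 to 200 stocks lowers the portfolio volatility in both scenarios, but how much it falls depends heavily on the assumed correlation between the stocks.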

If there are $ 2.04 \times 10^{107} $ possible 100-stock portfolios to choose from, then whatever selection method is used might be less than representative. We are not making a selection based on the knowledge of the $ 2.04 \times 10^{107} - 1 $ other choices, we are just making one that has some economic rationale behind it. The largest capitalization stocks have some advantage over others for the simple reason that they have been around for some time and were, in fact, able to get there, meaning reaching their high capitalization status.

Over the past 10 years, should you have taken the highest capitalization stocks by ranking, you would have found that most of the time the same stocks were jockeying for position near the top. Again, selecting by market capitalization led to the same choice for anyone using that stock selection method. Since ranking by market cap is widespread amongst portfolio managers, we should expect to see variations based on the same general theme.

Selection Consistency

Here is a notion I have not seen often or that I consider as neglected in automated trading strategies. Automation is forcing us to consider everything as numbers: how many of this or that, what level is this or that, what are the averages of this or that, always numbers and numbers.

If you want to express some kind of sentiment or opinion, it has to be translated into some numbers. Your program does not answer with: I think, ..., you should take that course of action. It simply computes the data it is given and takes action accordingly based on what it was programmed to do. Nothing more, but also nothing less. It is a machine, a program. You are the strategy designer, and your program will do what you tell it to do.

All this to lead to the notion that your stock selection process should be consistent with your trading methods.

For instance, if you design a trend-following system, then you should select trending stocks and not mean-reverting ones, which would tend to be counterproductive to your set objectives. Trend-following goes for the continuation of the price move, whereas mean-reversion bets on the move going in the opposite direction. Therefore, your stock selection method should aim to capture stocks having demonstrated this trending behavior over their past. Otherwise, you do have a serious stock selection problem.

And if you cannot distinguish trending stocks from mean-reverting ones, then you are in even more trouble. You would be playing a game where you are making bets without following the very nature of your strategy design. All because your stock selection process was not consistent with your trading procedures. If your trading strategy cannot identify mean-reverting stocks, then why play a mean-reverting gig?