Abstract
We give complete algorithms and source code for constructing
(multilevel) statistical industry classifications, including methods for fixing
the number of clusters at each level (and the number of levels). Under the hood,
the constructions rely on clustering algorithms (e.g., k-means). However, what should we
cluster? Correlations? Returns? The answer turns out to be neither, and our
backtests suggest that these details make a sizable difference. We also give an
algorithm and source code for building "hybrid" industry classifications,
which improve off-the-shelf "fundamental" industry classifications by
applying our statistical industry classification methods to them. The
presentation is intended to be pedagogical and geared toward practical
applications in quantitative trading.