Wednesday, April 13, 2016

Gap analysis of stocks: clustering and visualization

We have about a year's worth of top5s stocks from the markets available as data for analysis:
The Top5s stocks graph database

This is great! It tells us which stocks were the market leaders (and market losers) on any given day in the categories of market capitalization, price, and volume (just leaders for volume). And, because these data form a graph over time, you can query the system as to how many times such-and-so stock has been on a top5s-list and on which days, and then, going meta, what the gaps between those appearances are.

We've done studies before showing the frequency of a stock on the top5s-lists and then correlating that to the probability that a stock would show up more than once, ever (not a very good probability for most stocks, as was shown).

Stock Frequency, in general

Uh, actually, that study has not yet been published here, but here are a couple of charts showing how frequently stocks show up on the top5s:
Vast majority of stocks show up on the top5s fewer than ten times

Here we zoom in on the distribution of stocks showing up fewer than ten times on the top5s

Okay, so we are able to see 'stocks' as a 'thing' and how it behaves in aggregate, but what if we wish to focus on individual stocks? What makes one stock in particular like, or not like, others? What are the measures that determine 'likeness'?

Likeness

For this study we measure the data we have: stocks, their appearances on the top5s-lists, and gaps in those appearances. So, for each stock we queried the system for this information (which resulted in this horribly-quoted mess ... lesson learned: just query the endpoint and work with data directly as opposed to exporting via the clunky CSV-export interface). From that query, we reduced those raw data down to a set of 'ScoreCard' metadata, resulting in the following ScoreCard CSV-output.

Much better: much more concise.
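To make that reduction concrete, here is a minimal sketch in Python of boiling one stock's appearance dates down to a scorecard of gap statistics. The function name, the field names, the choice of calendar days as the gap unit, and the sample dates are all assumptions for illustration; they are not the actual ScoreCard layout or data.

```python
from datetime import date
from statistics import mean

def gap_scorecard(appearances):
    """Reduce a stock's top5s appearance dates to show-count and gap statistics."""
    days = sorted(set(appearances))
    # A gap is the number of calendar days between consecutive appearances
    # (an assumption; trading days would work just as well).
    gaps = [(later - earlier).days for earlier, later in zip(days, days[1:])]
    if not gaps:
        # A stock that showed at most once has no gaps to measure.
        return {"shows": len(days), "min_gap": 0, "max_gap": 0, "mean_gap": 0.0}
    return {
        "shows": len(days),
        "min_gap": min(gaps),
        "max_gap": max(gaps),
        "mean_gap": mean(gaps),
    }

# Hypothetical appearance dates, for illustration only.
card = gap_scorecard([
    date(2016, 1, 4), date(2016, 1, 5), date(2016, 1, 12), date(2016, 2, 1),
])
print(card)
```

The point of the exercise is that each stock ends up as a short vector of numbers, which is exactly the shape a clustering algorithm wants.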

But what does it all mean? Yes, individually we know the min, max and mean gaps in appearances on the top5s for a particular stock. That's fine. Are there other stocks like this one? If so, are there groups?

Yes, there are.

Clustering

A good way to cluster related data is to use the k-means algorithm. The implementation we use takes ScoreCards (which, as you see above, are simply vectors) as its input and groups them by similarity.
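For readers who haven't met k-means before, here is a bare-bones sketch of the textbook (Lloyd's) version in Python. It is not the implementation used for this study; the function name, the seeded random choice of starting centroids, and the toy vectors are assumptions made for illustration.

```python
import math
import random

def kmeans(vectors, k, iterations=100, seed=0):
    """Cluster equal-length numeric vectors into k groups (Lloyd's algorithm)."""
    rng = random.Random(seed)
    # Start from k distinct input vectors picked at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignment = None
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clustering has converged
        assignment = new_assignment
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Toy scorecards as (min gap, max gap, mean gap); made-up numbers.
cards = [
    (1, 2, 1.5), (2, 3, 2.5), (1, 3, 2.0),
    (30, 60, 45.0), (28, 55, 40.0), (35, 70, 50.0),
]
labels, centers = kmeans(cards, k=2)
print(labels)
```

The first three toy cards end up sharing one label and the last three the other. Which group gets label 0 depends on the random start, so only the grouping is meaningful, not the label numbers.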

So, step 1: Convert the stocks' data into scorecards:


Step 2: label these scorecards:


Step 3: compact


... all the above is rendered using a system that converts ScoreCards to Cells in SVG.

Now we're ready for clustering. As you can see above, there are lots of scores of similar ilk (and similar ink). So now we cluster, then upload these clusters to our graph database for visualization:
Stocks clustered by gap-likeness

And, now that these clusters are in the graph database, you can query this system, asking, e.g.: show me the clusters that contain AAPL, NFLX and TWTR. It turns out there are two such clusters:


We can now find 'like'-stocks based on similarity from shows (and gaps) on the top5s-lists.

Future Work

There are several directions we can go from here, both in data analysis and in data visualization. For data analysis: are shows on the top5s telling? To me, this is not a very fruitful area of research: AAPL shows up as 'like' 85 other stocks? TWTR is 'like' 110 other stocks? This is unhelpful for me. These stocks are outstanding in their own right and so should be in a rarefied cluster, not a common one. So other indicators, outside top5s appearances, are what make these stocks special: beta? p/e? EPS? market cap? div/yield? I don't know. This requires further research and insights, but those insights apparently won't come from the top5s,* which are helpful in showing that a stock is outstanding by the number of its appearances (more frequent appearances tend to indicate better performers).

For data visualization, the clusters themselves can be distinguished visually by their size (number of members) and by their 'heat' (the common cluster color). Furthermore, each stock has its own 'color' based on its score, and so each stock can be colored individually. I am researching whether Neo4j can programmatically color nodes based on indicated properties (e.g.: a property, such as { color: 13541067 }). If they don't provide it as a plugin or unmanaged extension, then I'll have to use another framework, e.g. Linkurious, sigma.js, or D3.
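Whichever framework ends up doing the drawing, an integer color property like the one above will most likely need to be turned into something a renderer understands. Here is a small sketch, assuming the integer is a packed 24-bit RGB value (that encoding is my assumption here, and the function name is hypothetical):

```python
def color_to_hex(color):
    """Render a packed 24-bit RGB integer as a CSS-style hex string."""
    red = (color >> 16) & 0xFF   # top byte
    green = (color >> 8) & 0xFF  # middle byte
    blue = color & 0xFF          # bottom byte
    return "#{:02x}{:02x}{:02x}".format(red, green, blue)

print(color_to_hex(13541067))  # → #ce9ecb
```

A hex string like this can be handed to SVG or to a JavaScript graph library as an ordinary fill color.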

Other areas of improvement entail incorporating the scorecard metadata into the cell nodes here. There exists work along this line already as ColoredScoreCard, but it needs to be brought up to date with the new indexable scorecard properties; in this way color and the other properties are automatically and meaningfully labeled. The next step is then to link the scorecard node to the actual security, so that once a desirable stock scorecard is found in a cluster, the stock itself can be further explored directly from the linked relationship to that scorecard. These two improvements are simple extensions: the work simply needs to be updated or undertaken.

Update: Scaling Scoring Factors

* Epilogue: I spoke too quickly. The score cards measure a data point (a stock) along several factors which may (and, in fact, do) have significantly different ranges from each other. This being the case, a factor with a large magnitude can dominate the clustering of the score cards, leading to clusters informed only by these large values.

But is that what we want? No. Large values are just large-valued; that doesn't mean they are more important for clustering.

So, to even out the impact of the factors, so that each factor contributes comparable weight to the clustering algorithm, I've employed a scaling algorithm that maps each factor to the range [0..1]. When I apply scaling, we see an entirely new set of clusters:
rescaled clusters
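One common way to map each factor onto [0..1] is min-max scaling, column by column. The sketch below assumes that is the kind of scaling meant here; the function name and the sample numbers are made up for illustration.

```python
def scale_columns(vectors):
    """Min-max scale each factor (column) of the vectors onto [0, 1]."""
    columns = list(zip(*vectors))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for vector in vectors:
        scaled.append([
            # A constant column carries no information, so map it to 0.0
            # rather than dividing by zero.
            (x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(vector, lows, highs)
        ])
    return scaled

# Two factors with very different ranges; made-up numbers.
raw = [(0, 100), (5, 300), (10, 200)]
print(scale_columns(raw))  # → [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
```

After scaling, a difference of 5 in the first factor counts for as much as a difference of 100 in the second, so neither factor drowns out the other when distances are computed.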

The smallest cluster has all the heavy-hitters in the markets, and we see this simply from analytics of their shows on the top5s-lists, whereas before scaling, clustering was not yielding meaningful results.

Scale factors to get meaningful input to clustering from all factor-types.
