Modeling Relevant Sources for an Equity: AMZN

Introduction: Relevance-as-a-Service (RaaS)

Right Relevance (RR) provides curated relevant information and intelligence via API on ~50 thousand topics. This includes:

  • Topic relationships including related topics & semantic information like synonyms.
  • Topical influencers (~2.5M) with score and rank.
  • Topical content and information in the form of articles, videos and conversations.

Additionally, Right Relevance provides an Insights offering that combines the above Topics and Influencers information with real time conversations to provide actionable intelligence with visualizations to enable decision making. The Insights service is applicable to emerging events like elections, conferences, product launches, breaking news developments, outbreaks like Ebola etc.

This report is part of a series applying the Relevance-as-a-Service (RaaS) Insights technology to financial markets intelligence, beginning with financial instruments such as equities, commodities, bonds and forex. The focus of this report is how to model an equity for information, with AMZN as the example.

Hypothesis for Application to Financial Instruments

The scale and availability of data are increasing exponentially. This is a boon overall, but it exposes serious problems: the inability to extract relevance and intelligence from data at this scale, high cost, misinformation and, even more seriously, disinformation (aka fake news) at scale.

We’ve applied the Right Relevance RaaS Insights technology to several scenarios listed below with great success.

Also, we’ve previously outlined how trust from influencers can be inductively applied to the fake news problem by providing a measure of verifiability along with our core value prop of relevance.

This is our attempt to apply the same set of technologies and approaches to financial instruments, starting with equities, with AMZN (Amazon, Inc.) as the example.

In the financial domain, several complex models exist, from high speed low latency quant trading to longer term analysis to back testing among others. Most of these models struggle to handle data at the current scale. Cost and latency are growing problems as the scale continues to increase. Errors due to misinformation and disinformation are increasing risks.

The hypothesis for this analysis rests on identifying relevant and verifiable/trustworthy sources (via influencers) to monitor for any given financial security (AMZN equity in this case), such that we can reduce the users/accounts to monitor by 3 orders of magnitude, which in turn implies roughly 3 orders of magnitude less data to analyze.

We’ll outline three distinct ways to find relevant sources via our analysis. The resulting sets can be combined into a superset or used individually as subsets, based on specific needs.

Data & Duration

The report leverages tweets sampled from January 1st to March 22nd 2017. These, along with Right Relevance topics, topical communities and articles data, form the basis for the analysis.

The phrases used for gathering tweets are: “$amzn”, “amzn stock”,  “amzn stocks”, “amazon stock”, “amazon stocks”, “amzn equity”, “amzn equities”
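As a rough illustration of how the phrase list above could be applied to raw tweet text, here is a minimal sketch. The matching rules (case-insensitive, flexible whitespace, no partial tokens) are our assumptions for the example and are not a description of how the Twitter sampling itself matches terms; the function name is hypothetical.

```python
import re

# Tracking phrases quoted in the report.
PHRASES = [
    "$amzn", "amzn stock", "amzn stocks", "amazon stock",
    "amazon stocks", "amzn equity", "amzn equities",
]

def _compile(phrase):
    # Allow any run of whitespace between words, and require that the
    # phrase is not embedded in a longer token (e.g. "$amznx").
    body = r"\s+".join(re.escape(word) for word in phrase.split())
    return re.compile(r"(?<![\w$])" + body + r"(?!\w)", re.IGNORECASE)

PATTERNS = [_compile(p) for p in PHRASES]

def matches_tracking_phrase(text):
    """Return True if the tweet text contains any tracked phrase."""
    return any(p.search(text) for p in PATTERNS)
```

A tweet such as "Long $AMZN into earnings" passes this filter, while a generic "Great deal on Amazon today" does not.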

Most of the summary report is extracted from the analysis collateral in the form of:

  1. Gephi Communities Graph Visual: Extracts are shown below.
  2. Tableau Online Dashboard: Visualizes graph analysis results, including flocks, top trending terms, top hashtags, top Users/accounts, RR topics, top tweets and several other measures in the form of tables and charts. Faceting is supported per flock, RR topic and Twitter/RR account.

For access to Tableau data and the complete graphs please send email to

The analysis methodology is outlined at

Communities Graph & RR Topics-based Identification

Community detection graph algorithms like Walktrap and InfoMap are used to identify communities (as sub-graphs) in our engagements graph built using Neo4j & R. Graph visualizations are done via Gephi.

The all engagements graph (Fig 1), which includes mentions, shows varied conversations across several communities, since Amazon is the largest online retailer and generates a lot of buzz around deals, advertising, marketing, reviews etc. Most of this is noise from the financial analysis perspective. But we can see that the graph has self-organized into distinct sub-graphs or communities. Several viable financial communities can be seen clustered around well known accounts like Wall Street Journal, CNBC, Jim Cramer, The Street, FOX Business, The Motley Fool, StockTwits and Yahoo Finance, along with several smaller but well respected experts wrt financial analysis. This is very useful for cross referencing as we single out communities that are relevant for our scenario.


Figure 1: All Engagements Communities Gephi Graph

The RTs-only graph is sparser, as expected, with more well-defined communities. This is also useful for pinpointing the right flocks.

Figure 2: RTs-only Communities Gephi Graph
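To make the difference between the all engagements graph and the RTs-only graph concrete, here is a minimal sketch of how weighted, directed engagement edges could be counted from tweets. The tweet field names (`user`, `retweet_of`, `reply_to`, `mentions`) are hypothetical, and this only illustrates the edge-building step; the actual pipeline uses Neo4j and R, with Walktrap/InfoMap applied afterwards.

```python
from collections import Counter

def build_engagement_edges(tweets, kinds=("retweet", "mention", "reply")):
    """Count directed engagement edges (source -> target) of the given kinds.

    Each tweet is a dict with hypothetical keys: 'user', and optionally
    'retweet_of', 'reply_to' and 'mentions' (a list of handles).
    """
    edges = Counter()
    for tweet in tweets:
        src = tweet["user"]
        targets = []
        if "retweet" in kinds and tweet.get("retweet_of"):
            targets.append(tweet["retweet_of"])
        if "reply" in kinds and tweet.get("reply_to"):
            targets.append(tweet["reply_to"])
        if "mention" in kinds:
            targets.extend(tweet.get("mentions", []))
        for dst in targets:
            if dst != src:  # ignore self-engagement
                edges[(src, dst)] += 1
    return edges
```

Calling the function with `kinds=("retweet",)` yields the sparser RTs-only edge set; the default yields the all engagements edge set.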

For our immediate need, we superimposed two Right Relevance topics, ‘stock trading’ & ‘stock markets’ (examples for this case study), over the graph to highlight nodes (aka users/accounts) that have influence in these topics. Fig 3 shows that this brings out the core finance, business, stock trading accounts that have engaged in ‘$amzn’ related conversations with at least a basic amount of engagements.


Figure 3: All Engagements Graph with relevant RR Topics Superimposed
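The superimposition step can be sketched as a simple filter: keep the accounts that are influencers in at least one target topic and that clear a minimum engagement count. The data shapes, the function name and the cutoff value of 2 are assumptions for illustration only; the report does not state the actual engagement threshold.

```python
TARGET_TOPICS = {"stock trading", "stock markets"}

def highlight_nodes(engagement_counts, influencer_topics,
                    target_topics=TARGET_TOPICS, min_engagements=2):
    """Return accounts that are influencers in a target topic and meet
    the engagement floor.

    engagement_counts: dict of account -> number of engagements in the graph.
    influencer_topics: dict of account -> set of topics the account is an
    influencer in. The default min_engagements is an assumed cutoff.
    """
    return sorted(
        account
        for account, count in engagement_counts.items()
        if count >= min_engagements
        and target_topics & influencer_topics.get(account, set())
    )
```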

The data from above graph visuals can be extracted from Tableau tables using the RR Topics as facets. Select the top topics (seen in Fig 4) relevant for our scenario and then click the ‘Top Tables’ tab next to ‘Dashboard’ in the top menu.

Figure 4: Top RR Topics as Facets

This leads to the lists of top accounts by several measures. This data is available via Right Relevance Insights API.

Figure 5: Top Accounts using RR Topics As Facets

The accounts in the graph that are also RR influencers in relevant topics like ‘stock trading’ and ‘stock markets’ form the first set of accounts that we believe need to be monitored for $amzn related news and information.

Network Connectors-based Identification

Right Relevance ‘engagement influence’ measures are calculated by a set of graph analysis algorithms that measure the quality and quantity of engagements (RTs, mentions, replies), reach of tweets etc. within the context of a subject (event, trend etc.).

We apply several methods including PageRank and Betweenness centrality to measure Flock influence. The meaning of rankings within this methodology is documented at Twitter Conversation Performance Measures.

Prior work has repeatedly shown us the susceptibility of PageRank to high engagements and high followers count.

Betweenness centrality, which is a measure of the degree to which a node forms a bridge or critical link between all other users, leads to our top network connectors list. It measures influence in terms of an account’s value as an information and/or communication hub. We have found this to be the leading way to identify valuable accounts as it bubbles up accounts with potentially real influence in terms of news and information dissemination on a given subject.

Figure 6: Top 50 Connector Accounts

Fig 6 is the list of top 50 connector accounts.

This forms the second set of accounts we use to monitor ‘$amzn’ related information.
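For readers who want to see how a connectors ranking can be derived, here is a minimal sketch of betweenness centrality using Brandes' algorithm on an unweighted graph. This is a textbook implementation, not the Right Relevance one; the adjacency-dict input, the function names and the absence of normalization or edge weights are assumptions for the example.

```python
from collections import deque

def betweenness_centrality(adj):
    """Brandes' algorithm for an unweighted graph.

    adj maps node -> iterable of neighbours (directed). For an undirected
    graph, list every edge in both directions; scores then count each
    unordered pair twice.
    """
    nodes = set(adj) | {v for nbrs in adj.values() for v in nbrs}
    score = dict.fromkeys(nodes, 0.0)
    for s in nodes:
        # Breadth-first search from s, counting shortest paths (sigma).
        sigma = dict.fromkeys(nodes, 0)
        dist = dict.fromkeys(nodes, -1)
        preds = {v: [] for v in nodes}
        sigma[s], dist[s] = 1, 0
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj.get(v, ()):
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = dict.fromkeys(nodes, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score

def top_connectors(adj, n=50):
    """Return the n nodes with the highest betweenness (ties broken by name)."""
    score = betweenness_centrality(adj)
    return sorted(score, key=lambda node: (-score[node], node))[:n]
```

On two tightly knit groups joined by a single link, the two accounts at either end of that link come out on top, which is the "bridge" behaviour described above.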

Flocks-based Identification

The engagements or “flocking” in the context of a subject (topic, event etc.) can lead to building of temporal communities with local influence that is not obvious by the standalone influence of the individuals or without the context of the event. The subgraphs aka communities formed by applying community detection graph algorithms are termed as ‘Flocks’.

In this approach, we’ll pinpoint the flocks that are engaged in financial news and information dissemination and use the accounts forming those flocks as our sources.

Fig 7 lists the top 10 flocks in the context of ‘$amzn’ related conversations. The Twitter handle of the top PageRank account that is part of a flock is used as the flock name.


Figure 7: Top Flocks
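The flock naming convention can be sketched in a few lines, assuming community membership and PageRank scores have already been computed. The input shapes and the function name are hypothetical, and ties are resolved arbitrarily here (the first account encountered wins).

```python
def name_flocks(membership, pagerank):
    """Map each flock id to the handle of its highest-PageRank member.

    membership: dict of account handle -> flock id.
    pagerank: dict of account handle -> PageRank score.
    """
    best = {}
    for account, flock_id in membership.items():
        current = best.get(flock_id)
        if current is None or pagerank.get(account, 0.0) > pagerank.get(current, 0.0):
            best[flock_id] = account
    return best
```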

As is obvious from the flock names, several are noisy and not relevant to our financial markets specific context.

Let’s take the first flock ‘FatKidDeals’ as an example. From the graphs in Fig 1 & 2, it’s obvious that this flock forms an engaged community but is self-organized away from the main set of subgraphs. That is a good first indicator that this is probably not a relevant flock.

The top trending terms, hashtags (#fkd) and RR topics (coupons, retailing) in Fig 8 confirm the above and we can safely skip this flock.

Figure 8: Top Trending Terms, Hashtags, RR Topics for flock ‘FatKidDeals’

The next 2 flocks, ‘Walmart’ & ‘amazon’, can be skipped due to similar objections as ‘FatKidDeals’.

We move on and analyze data around the next flock ‘davidmoadel’.

Figure 9: Top Trending Terms, Hashtags, RR Topics for flock ‘davidmoadel’

From the trending terms (equity symbols amzn, aapl, nflx, googl), hashtags (#stocks, #stockmarket, #investing, #trading, #options etc.) and RR topics (‘forex market’, ‘stock markets’, ‘stock trading’, ‘options trading’, ‘daytrading’ etc.) for this flock (Fig 9) it’s clear this is a finance/markets focused flock.
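The flock screening in this report was done by inspecting the dashboards, but the same judgment can be approximated in code. The sketch below scores a flock by the share of its top RR topics that fall in a finance topic list; the list is seeded with the topics named in this report, and both the 0.5 threshold and the function names are assumptions for illustration.

```python
# Finance-oriented RR topics named in this report (an illustrative list).
FINANCE_TOPICS = {
    "forex market", "stock markets", "stock trading",
    "options trading", "daytrading",
}

def finance_share(flock_topics, finance_topics=FINANCE_TOPICS):
    """Fraction of a flock's top topics that are finance topics."""
    if not flock_topics:
        return 0.0
    hits = sum(1 for topic in flock_topics if topic in finance_topics)
    return hits / len(flock_topics)

def is_relevant_flock(flock_topics, threshold=0.5):
    """Treat a flock as finance-focused above an assumed share threshold."""
    return finance_share(flock_topics) >= threshold
```

With the topics quoted above, ‘FatKidDeals’ (coupons, retailing) scores 0 and is skipped, while ‘davidmoadel’ scores 1 and is kept.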

We select the top users using Top Overall, PageRank and Top Connectors measures as our relevant accounts for the selected flock ‘davidmoadel’.

Figure 10: Top Accounts for flock ‘davidmoadel’

Following a similar process, we selected 4 of the top 10 flocks: ‘jimcramer’, ‘SharePlanner’ and ‘alphatrends’ in addition to ‘davidmoadel’. The rest are eliminated as noise.

The set of unique accounts using PageRank, Betweenness Centrality and Top Overall measures for each flock constitutes the third set of accounts we use to monitor ‘$amzn’ related news & information.
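Once the three sets exist, combining them is straightforward. The sketch below builds the superset while recording which method(s) selected each account, so that accounts confirmed by more than one method can be prioritized. The function name and method labels are hypothetical.

```python
def combine_sources(topic_accounts, connector_accounts, flock_accounts):
    """Merge the three account sets into one superset.

    Returns a dict of account -> set of method labels that selected it.
    """
    selected_by = {}
    labelled = (
        ("rr_topics", topic_accounts),
        ("connectors", connector_accounts),
        ("flocks", flock_accounts),
    )
    for method, accounts in labelled:
        for account in accounts:
            selected_by.setdefault(account, set()).add(method)
    return selected_by
```

The keys of the result form the superset to monitor; filtering on the number of labels per account gives stricter subsets.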


Wrt our hypothesis, relevance and verifiability have been deeply tested using the Right Relevance platform for over 3 yrs. Injecting relevant topics and influencers’ graphs from the core platform and cross-checking every account selection by Betweenness Centrality and Flocks provides another layer of confirmation.

Wrt scale, using Twitter sampled data as our testbed, we’re looking at a potential initial scale of ~250M accounts and ~1B Tweets. Our relevant sets are reduced to the 5-10K range for users and fewer than 5K tweets/day for ‘$amzn’ and related terms. Even assuming simple algorithms reduce the original scale by an order of magnitude, we’re looking at a further reduction of about 3 orders of magnitude in accounts and tweets, while providing relevance, verifiability and engagement-based trust. As shown in the US election analysis, we can further mitigate bot impact (some mitigation is ingrained in our initial cutoffs). Also, we can dynamically update these sets on a daily, weekly or monthly basis as required.
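The account-side arithmetic behind this claim can be checked in a few lines, using only the figures quoted above (~250M accounts, one order of magnitude assumed for simple algorithms, 5-10K accounts remaining). The tweet-side figures are left out of the sketch because they are quoted in different units (a total versus a per-day rate).

```python
import math

def orders_of_magnitude(before, after):
    """Number of powers of ten separating two quantities."""
    return math.log10(before / after)

INITIAL_ACCOUNTS = 250_000_000            # ~250M accounts (from the report)
after_simple = INITIAL_ACCOUNTS / 10      # assumed 1 order from simple algorithms

# Further reduction needed to reach the 5-10K relevant accounts.
reduction_low = orders_of_magnitude(after_simple, 10_000)   # about 3.4
reduction_high = orders_of_magnitude(after_simple, 5_000)   # about 3.7
```

Both ends of the range come out between 3 and 4, consistent with the reduction of roughly 3 orders of magnitude stated in the hypothesis.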

Our contention is that this would lead to:

  1. Reduction in noise due to much higher relevance
  2. Higher degree of trust and verifiability in data leading to fewer esp. catastrophic errors
  3. Substantial reduction in cost for data and processing
  4. Decisive latency edge, since both simple and complex models can rapidly churn through far less data and produce results before others
  5. Enabling more complex models that can do deeper analysis, as noise is massively reduced and the signal-to-noise ratio in the data is much higher

Next, we’re building feeds for ‘$amzn’ equity using the sets of accounts above and providing access to our users for trial purposes.

Please contact for more details.
