Modeling Relevant Sources for a Commodity: Crude Oil

Introduction

Right Relevance (RR) provides curated information and intelligence on ~50 thousand topics. This includes:

  • Topic relationships including related topics & semantic information like synonyms.
  • Topical influencers (~2.5M) with score and rank.
  • Topical content and information in the form of articles, videos and conversations.

Additionally, Right Relevance provides an Insights offering that combines the above Topics and Influencers information with real time conversations to provide actionable intelligence with visualizations to enable decision making. The Insights service is applicable to events like elections, emerging technologies, activism, conferences, product launches etc.

This report is part of a series applying the Relevance-as-a-Service (RaaS) Insights technology to financial markets intelligence, starting with financial instruments such as equities, commodities, bonds and forex. It is the second report in the series, after $AMZN. Its focus is to model the most relevant sources for a commodity, with ‘crude oil’ as the example.

Hypothesis for Application to Financial Instruments

The scale and availability of data are increasing exponentially. This is a boon overall but exposes some serious problems: the inability to extract relevance and intelligence from data at this scale, high costs, misinformation and, even more seriously, disinformation (aka fake news) at scale.

We’ve previously outlined how trust from influencers (trusted sources) can be inductively applied to the fake news problem by providing a measure of trust & verifiability in addition to our core value prop of relevance.

We’ve applied the Right Relevance RaaS Insights technology to several scenarios listed below with great success.

This is our attempt to apply the same set of technologies and approaches to financial instruments; in this report, the instrument is a commodity: ‘crude oil’.

In the financial domain, several complex models exist, from high speed low latency quant trading to longer term analysis to back testing among others. Most of these models struggle to handle data at the current scale. Cost and latency are growing problems as the scale continues to increase. Errors due to misinformation and disinformation are increasing risks.

The hypothesis for this analysis rests on identifying relevant and verifiable/trustworthy sources (via influencers) to monitor for any given financial security, such that the number of users/accounts to monitor falls by three orders of magnitude. By induction, the data that needs to be analyzed falls by roughly three orders of magnitude as well.

We’ll outline three distinct ways to find relevant sources via our analysis. The resulting sets can then be combined into a superset, or used individually, based on specific needs.

Data & Duration

The report leverages tweets sampled from April 1st to April 30th, 2017. Together with Right Relevance topics, topical communities and articles data, they form the basis for the analysis.

The phrases used for gathering tweets are: “crudeoil”, “crude oil”, “$cl”, “wti”, “brentoil”, “brent oil”, “nymex oil”, “oilfutures”
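As a rough illustration of how such a phrase list could be applied to raw tweet text, here is a minimal sketch. The phrases are the ones listed above; the matching rules (case-insensitive, whole-token matching) and the names build_matcher and is_relevant are assumptions made for this example, not a description of how the tweets were actually gathered.

```python
import re

# Phrases quoted in the report. The matching rules below are an assumption.
PHRASES = ["crudeoil", "crude oil", "$cl", "wti", "brentoil",
           "brent oil", "nymex oil", "oilfutures"]

def build_matcher(phrases):
    """Compile one case-insensitive regex matching any phrase as a whole token."""
    alternatives = "|".join(re.escape(p) for p in phrases)
    # The lookarounds stop "wti" from matching inside a longer word such as
    # "newtime", while still allowing a leading "#" (hashtags).
    return re.compile(r"(?<![\w$])(?:" + alternatives + r")(?!\w)", re.IGNORECASE)

MATCHER = build_matcher(PHRASES)

def is_relevant(tweet_text):
    """True if the tweet text contains at least one of the tracked phrases."""
    return MATCHER.search(tweet_text) is not None

print(is_relevant("WTI settles higher as inventories fall"))
print(is_relevant("Natural gas rallies on cold weather"))
```

Whole-token matching is a deliberate choice in this sketch: short terms like "wti" and "$cl" would otherwise produce many false positives through substring matches.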

Most of the summary report is extracted from the analysis collateral in the form of:

  1. Gephi Communities Graph Visual: Extracts are shown below.
  2. Tableau Online Dashboard: Visualizes graph analysis results, including flocks, top trending terms, top hashtags, top Users/accounts, RR topics, top tweets and several other measures in the form of tables and charts. Faceting is supported per flock, RR topic and Twitter/RR account.

For access to Tableau data and the complete graphs please send email to biz@rightrelevance.com.

The analysis methodology is outlined at https://info.rightrelevance.com/insights

Communities Graph & RR Topics-based Identification

Community detection graph algorithms like Walktrap and InfoMap are used to identify communities (as sub-graphs) in our engagements graph built using Neo4j & R. Graph visualizations are done via Gephi.

The all-engagements graph (Fig 1), which includes mentions, shows one large diffuse subgraph composed of business and oil/gas news accounts. The several colors within this mass denote contiguous and closely related communities. There are 3-4 other subgraphs that are distinct and separate from the main group. These are revisited in the flocks section to assess their applicability.
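The fully separate subgraphs mentioned above can be isolated with a much simpler technique than Walktrap or InfoMap: plain connected components. The sketch below uses a union-find structure over an edge list. It is not the community detection used in the report and cannot split the contiguous communities inside the main mass; the account names are hypothetical.

```python
def connected_components(edges):
    """Group accounts into subgraphs that share no engagement edge with each other."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    # Largest subgraph first, mirroring "one large mass plus a few separate groups".
    return sorted(groups.values(), key=len, reverse=True)

# Hypothetical engagement edges (who engaged with whom), for illustration only.
edges = [
    ("news_a", "news_b"), ("news_b", "news_c"), ("news_c", "news_d"),
    ("politics_x", "politics_y"),
]
components = connected_components(edges)
print(components)
```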

Figure 1: All Engagements Graph for ‘crude oil’


For our immediate need, we superimposed four Right Relevance topics (‘commodity markets’, ‘oil and gas industry’, ‘energy industry’ and ‘crude oil’, the relevant topics for this case study) over the graph to highlight nodes (aka users/accounts) that have influence in these topics. Fig 2 visualizes the resulting graph.

Figure 2: All Engagements Graph with relevant RR Topics Superimposed


The list of accounts from the graph in Fig 2 can be extracted from Tableau using the RR Topics as facets. Select the top topics relevant to ‘crude oil’, e.g. ‘commodity markets’, ‘oil and gas industry’, ‘energy industry’ and ‘crude oil’ (Fig 3), then click the ‘Top Tables’ tab next to ‘Dashboard’ in the top menu.

Figure 3: Top related RR Topics for ‘crude oil’

The RR Topics faceting produces lists (Fig 4) of top accounts for ‘crude oil’ by several measures. This data is available via the Right Relevance Insights API.

Figure 4: Top ‘crude oil’ Accounts using RR Topics As Facets

Some of the top users by this method are:

Figure 5: Top 4 ‘crude oil’ Accounts using RR Topics As Facets

These lists of RR influencers from the graph and Tableau, who are influencers in topics relevant to our scenario, form the first set of accounts that we believe need to be monitored for ‘crude oil’ related news and information.
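The topic-based selection described in this section amounts to intersecting the accounts in the engagements graph with the influencers of the chosen topics. A minimal sketch follows; the four topic names come from the report, while the function name, the data layout (account to topic-score mapping) and all accounts and scores are hypothetical and do not reflect the Right Relevance Insights API.

```python
RELEVANT_TOPICS = {"commodity markets", "oil and gas industry",
                   "energy industry", "crude oil"}

def topic_influencers(graph_accounts, influencer_topics, relevant=RELEVANT_TOPICS):
    """Return (account, best relevant score, matched topics) for accounts in the
    graph that are influencers in at least one relevant topic, best score first."""
    ranked = []
    for account in graph_accounts:
        scores = influencer_topics.get(account, {})
        hits = {t: s for t, s in scores.items() if t in relevant}
        if hits:
            ranked.append((account, max(hits.values()), sorted(hits)))
    ranked.sort(key=lambda row: (-row[1], row[0]))
    return ranked

# Hypothetical accounts and scores, for illustration only.
graph_accounts = ["acct_a", "acct_b", "acct_c", "acct_d"]
influencer_topics = {
    "acct_a": {"crude oil": 92, "energy industry": 80},
    "acct_b": {"football": 75},            # influencer, but not in a relevant topic
    "acct_c": {"commodity markets": 88},
}                                          # acct_d is not a known influencer

first_set = topic_influencers(graph_accounts, influencer_topics)
for account, score, topics in first_set:
    print(account, score, topics)
```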

Network Connectors-based Identification

Right Relevance ‘engagement influence’ measures are calculated by a set of graph analysis algorithms that measure the quality and quantity of engagements (RTs, mentions, replies), reach of tweets etc. within the context of a subject (event, trend etc.).

We apply several methods, including PageRank and Betweenness Centrality, to measure flock influence. The meaning of rankings within this methodology is documented at Twitter Conversation Performance Measures.

Prior work has repeatedly shown us the susceptibility of PageRank to high engagement and high follower counts. Betweenness Centrality, which measures the degree to which a node forms a bridge or critical link between all other users, leads to our top network connectors list. It is a measure of influence with respect to an account’s value as an information and/or communication hub.
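To make the "bridge" intuition concrete, here is a minimal sketch of betweenness centrality using Brandes' algorithm on an unweighted, undirected graph. It is a textbook formulation, not the report's Neo4j/R pipeline, and the toy graph (two triangles joined through a single account "x") is hypothetical.

```python
from collections import deque

def build_adjacency(edges):
    """Turn an undirected edge list into {node: set(neighbors)}."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def betweenness(adj):
    """Brandes' algorithm: for each node, the number of shortest paths between
    other node pairs that pass through it (fractional when paths tie)."""
    score = {v: 0.0 for v in adj}
    for source in adj:
        order = []                                   # nodes in BFS visit order
        preds = {v: [] for v in adj}                 # predecessors on shortest paths
        paths = {v: 0 for v in adj}                  # shortest-path counts
        dist = {v: -1 for v in adj}
        paths[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:                      # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:           # v lies on a shortest path to w
                    paths[w] += paths[v]
                    preds[w].append(v)
        dependency = {v: 0.0 for v in adj}
        for w in reversed(order):                    # farthest nodes first
            for v in preds[w]:
                dependency[v] += paths[v] / paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Each unordered pair was counted from both endpoints in an undirected graph.
    return {v: s / 2 for v, s in score.items()}

# Hypothetical graph: two tight groups that only communicate through "x".
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "x"),
         ("x", "d"), ("d", "e"), ("e", "f"), ("d", "f")]
scores = betweenness(build_adjacency(edges))
top_connector = max(scores, key=scores.get)
print(top_connector, scores[top_connector])
```

In this toy graph "x" has few connections yet ranks first, because every path between the two groups runs through it. That is the property that separates connectors from accounts that merely have high engagement counts.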

Figure 6: Top 40 Connectors

Fig 6 is the list of the top 40 connector accounts. The top 2 accounts, Javier Blas (@JavierBlas2) and Christopher Johnson (@chris1reuters), are usual suspects from the first method above. Samir Madani (@Samir_Madani), Giovanni Staunovo (@staunovo) and TankerTrackers.com (@TankerTrackers) round out the top 5.

Figure 7: 3 of the Top 5 Connector Accounts

Interestingly, @Samir_Madani is one of the two founders of @TankerTrackers and the primary driver behind the #OOTT hashtag, which is the second most important in our hashtags list after #oil.

We have found Betweenness Centrality to be the leading way to identify valuable accounts as it bubbles up accounts with potentially real influence in terms of news and information dissemination on a given subject.

This forms the second set of accounts we use to monitor ‘crude oil’ related information.

Flocks-based Identification

The engagements or “flocking” in the context of a subject (topic, event etc.) can lead to building of temporal communities with local influence that is not obvious by the standalone influence of the individuals or without the context of the event. The subgraphs aka communities formed by applying community detection graph algorithms are termed as ‘Flocks’.

In this approach, we pinpoint the most important accounts for ‘crude oil’ news and information via flocks analysis. Flocks analysis also helps identify accounts for more fine grained information feeds within the overall crude oil domain.

Fig 8 lists the top 10 flocks in the context of ‘crude oil’ related conversations. The Twitter handle of the top PageRank account that is part of a flock is used as the flock name. The full list is available via the public Tableau dashboard.
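The naming rule above (a flock is named after its top PageRank account) can be sketched as follows. The PageRank here is a basic power iteration with the conventional 0.85 damping factor. The edge direction (from the engaging account to the account engaged with), the fixed iteration count and all account names are assumptions for illustration.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Basic power-iteration PageRank over {account: [accounts it engaged with]}."""
    nodes = set(out_links) | {w for targets in out_links.values() for w in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by accounts with no outgoing edges is spread evenly.
        dangling = sum(rank[v] for v in nodes if not out_links.get(v))
        new_rank = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v, targets in out_links.items():
            for w in targets:
                new_rank[w] += damping * rank[v] / len(targets)
        rank = new_rank
    return rank

def name_flocks(flocks, rank):
    """Name each flock after its member with the highest PageRank."""
    return {flock_id: max(members, key=lambda a: (rank.get(a, 0.0), a))
            for flock_id, members in flocks.items()}

# Hypothetical engagements: u1..u3 retweet lead_a, u4 and u5 retweet lead_b.
out_links = {"u1": ["lead_a"], "u2": ["lead_a"], "u3": ["lead_a"],
             "u4": ["lead_b"], "u5": ["lead_b"]}
flocks = {0: ["lead_a", "u1", "u2", "u3"], 1: ["lead_b", "u4", "u5"]}

rank = pagerank(out_links)
flock_names = name_flocks(flocks, rank)
print(flock_names)
```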

Figure 8: Top 10 ‘crude oil’ Flocks

The top 3 flocks are reviewed below from a selection point of view. The same methodology can be applied to other flocks.

Flock: ‘PrestigeEcon’

The top trending terms (Fig 9) for this flock are all crude oil pricing focused including terms like prices, stocks, futures.

Figure 9: Top Trending Terms for flock ‘PrestigeEcon’

The top hashtags and RR topics (Fig 10) confirm the relevance of this flock to our subject. The second most important hashtag for this flock, #OOTT, is also the second most important for the entire April analysis. It surfaces an extremely interesting community on Twitter, called ‘Organization of Oil Trading Tweeters’, that’s focused on discussing ‘crude oil’ and related topics. Discussions cover prices, production, inventories, exports and anything else that has trading impact.

 

Figure 10: Top Hashtags, RR Topics & Users for flock ‘PrestigeEcon’

@Samir_Madani and @TankerTrackers are among the top 5 users. They’re also the primary drivers behind the #OOTT hashtag, which explains why it’s a top hashtag for this flock. The top 2 users (Fig 11) of this flock are @PrestigeEcon and @BV.

Figure 11: Top 2 Accounts for flock ‘PrestigeEcon’

Flock: ‘Amaka_Ekwo’

The top trending terms for the flock are fairly political in nature with no direct reference to oil.

Figure 12: Top Trending Terms for flock ‘Amaka_Ekwo’

Top hashtags and RR topics show a similarly marginal relation of this flock to crude oil from a trading/pricing point of view; the link exists mainly because Nigeria is a top oil-producing nation.

Figure 13: Top RR Topics & Hashtags for flock ‘Amaka_Ekwo’

Top tweets show fairly concentrated and repetitious conversations alleging British-Nigerian collusion and corruption around crude oil from ‘biafraland’. The self-organizing nature of graph analysis has isolated this flock, with clear separation from the trading/pricing information flocks.

This flock should be considered for a Nigeria-specific oil & politics feed but doesn’t seem relevant from a broader oil pricing/trading point of view.

Flock: ‘JavierBlas2’

The most important user of this flock, @JavierBlas2, has been surfaced by all the measures above so applicability confidence is high to begin with. The top trending terms (Fig 14) confirm the high relevance of this flock to our subject ‘crude oil’.

Figure 14: Top trending terms for flock ‘JavierBlas2’

Top hashtags, RR topics and the top tweet (Fig 15) clearly show that this flock is relevant, leading to the selection of its accounts.

Figure 15: Top Hashtags, RR Topics & Tweet for flock ‘JavierBlas2’

Fig 16 lists the top accounts for this flock.

Figure 16: Top Accounts for flock ‘JavierBlas2’

Conclusions

With respect to our hypothesis, relevance and verifiability have been deeply tested using the Right Relevance platform for over 3 years. Injecting relevant topics and influencers’ graphs from the core technology platform, and cross-checking every account selection by Betweenness Centrality and Flocks, provides another layer of confirmation.
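The cross-checking idea can be sketched as a count of how many of the three methods surface each account. The sets below contain only handles that are named in the text of this report, so they are a small illustrative subset of the real lists; the function names and the min_methods threshold are assumptions for this example.

```python
def cross_check(sets_by_method):
    """Return {account: sorted list of methods that surfaced it}."""
    found = {}
    for method, accounts in sets_by_method.items():
        for account in accounts:
            found.setdefault(account, []).append(method)
    return {account: sorted(methods) for account, methods in found.items()}

def monitoring_set(sets_by_method, min_methods=1):
    """Union of all sets, optionally keeping only accounts confirmed by
    at least min_methods independent methods."""
    confirmed = cross_check(sets_by_method)
    return sorted(a for a, methods in confirmed.items() if len(methods) >= min_methods)

# Only handles mentioned in this report; the full sets are far larger.
sets_by_method = {
    "topics": {"@JavierBlas2", "@chris1reuters"},
    "connectors": {"@JavierBlas2", "@chris1reuters", "@Samir_Madani",
                   "@staunovo", "@TankerTrackers"},
    "flocks": {"@PrestigeEcon", "@BV", "@Samir_Madani",
               "@TankerTrackers", "@JavierBlas2"},
}

confirmed = cross_check(sets_by_method)
print(confirmed["@JavierBlas2"])
print(monitoring_set(sets_by_method, min_methods=2))
```

With min_methods=1 the result is the plain superset of the three sets; raising the threshold trades coverage for confirmation.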

With respect to scale, using Twitter sampled data as our testbed, we’re looking at a potential initial scale of ~300M accounts and ~1B tweets. Our relevant sets are reduced to the 5-10K range for users and to fewer than 5K tweets/day for related terms. Even assuming simple algorithms reduce the original scale by an order of magnitude, we’re looking at a further reduction of about three orders of magnitude in accounts and tweets, while providing relevance, verifiability and engagement-based trust. As shown in the US election analysis, we can further mitigate bot impact (some mitigation is ingrained in our initial cutoffs). Also, we can dynamically update these sets on a daily, weekly or monthly basis as required.
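The account-side arithmetic behind this claim can be checked in a few lines. The figures (~300M accounts, 5-10K monitored accounts, a one-order-of-magnitude cut from simple algorithms) are taken from the paragraph above. The tweet-side comparison is left out because the time period behind the ~1B figure is not stated.

```python
import math

def orders_of_magnitude(before, after):
    """Base-10 log of the reduction factor: 3.0 means a 1000x reduction."""
    return math.log10(before / after)

INITIAL_ACCOUNTS = 300e6                     # ~300M accounts, from the report
BASELINE_ACCOUNTS = INITIAL_ACCOUNTS / 10    # after a "simple algorithm" cut of one order

for monitored in (5_000, 10_000):            # the 5-10K monitored range
    total = orders_of_magnitude(INITIAL_ACCOUNTS, monitored)
    extra = orders_of_magnitude(BASELINE_ACCOUNTS, monitored)
    print(f"{monitored:>6} accounts: {total:.2f} orders overall, "
          f"{extra:.2f} beyond the simple baseline")
```

For accounts, the reduction beyond the simple baseline comes out at roughly 3.5 to 3.8 orders of magnitude, consistent with the "about three orders" stated above.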

Our contention is that this would lead to:

  1. Reduction in noise due to much higher relevance
  2. Higher degree of trust and verifiability in data leading to fewer esp. catastrophic errors
  3. Substantial reduction in cost for data and processing
  4. Decisive latency edge, since both simple and complex models can rapidly churn through far less data and produce results before others
  5. Enabling more complex models that can do deeper analysis, as noise is massively reduced and the signal-to-noise ratio in the data is much higher

Next, we’re building feeds for the ‘crude oil’ commodity using the sets of accounts above and providing access to our users for trial purposes.

Please contact biz@rightrelevance.com for more details.
