US Presidential Election 2016: Final Analysis

 

Introduction & Background

Update Dec 2nd: MIT Tech Review published an article based on this report’s analysis.

Right Relevance (RR) provides curated information and intelligence on ~50 thousand topics. This includes:

  • Topic relationships including related topics & semantic information like synonyms, acronyms.
  • Topical influencers (~2.5M) with score and rank.
  • Topical content and information in the form of articles, videos and conversations.

Additionally, Right Relevance provides an Insights offering that combines the above Topics and Influencers information with real time conversations to provide actionable intelligence with visualizations to enable decision making. The Insights service is highly applicable to emerging events like elections, conferences, product launches, breaking news developments, outbreaks like Ebola etc.

Neo Technology (NeoTech) is the creator of Neo4j, the world’s leading graph database.

In June 2016, Right Relevance in partnership with Neo Technology decided to analyze the 2016 US Presidential Election at scale. Three reports along with a talk at Graph Connect 2016 (links in Appendix) were released prior to this as part of that effort. This report covers analysis of Twitter tweets and Right Relevance topics & influencers from Nov 1 & 2 along with interesting snippets from prior reports.

Interest in this effort came about from our prior work on the UK’s EU referendum, aka Brexit. While analyzing Brexit, we didn’t believe our own results, since both John (Product Manager, Data Science@RR) and Vishal (CEO@RR) thought Remain was going to win. That belief ran against the results of our analysis, so even our final Brexit tweet, the night before the vote, was muted: we were too gun-shy to leave off the question mark. The experience gave us more confidence heading into the US elections. Please check the Appendix for all Brexit related analysis reports.

Data Analyzed for US Elections

Overall, the published reports used over 20M tweets sampled from June to Nov 8th. The final report & graph used over 2M tweets from Nov 1 & 2. Along with these tweets, Right Relevance topics and topical influencer communities form the basis for the analysis.

The phrases used for gathering tweets are:

“clinton”, “hillaryclinton”, “realDonaldTrump”, “trump”, “imwithher”, “obama”, “potus”, “hillary2016”, “StrongerTogether”, “campaign hillary”, “ClintonKaine”, “clintonkaine16”, “clintonkaine2016”, “nevertrump”, “trumppence”, “trumppence16”, “trumppence2016”, “#pence”, “makeamericagreatagain”, “maga”, “neverhillary”, “trump2016”, “campaign donald”, “elections2016”, “electionvoices”, “vote2016”, “GOP”, “RNC”, “DNC”
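As a rough illustration of how a phrase list like this can select tweets, here is a minimal Python sketch. The matching rules are assumptions, not a description of the collection pipeline actually used: matching is case-insensitive, a multi-word phrase matches when all of its words appear anywhere in the tweet, a plain word also matches its hashtag or mention form, and a phrase written with "#" matches only the hashtag. The function names and the shortened phrase list are hypothetical.

```python
import re

# A subset of the tracked phrases listed above, for brevity.
TRACKED_PHRASES = ["clinton", "realDonaldTrump", "imwithher",
                   "campaign hillary", "#pence", "maga", "GOP"]


def tokenize(text):
    """Lowercase the tweet and return its tokens as a set.

    Hashtags and mentions are kept both with and without their prefix,
    so the phrase "maga" can match "#MAGA" (an assumed rule).
    """
    tokens = set()
    for token in re.findall(r"[#@]?\w+", text.lower()):
        tokens.add(token)
        tokens.add(token.lstrip("#@"))
    return tokens


def matches_tracked_phrase(text, phrases=TRACKED_PHRASES):
    """Return True if every word of at least one phrase occurs in the tweet."""
    tokens = tokenize(text)
    return any(
        all(term in tokens for term in phrase.lower().split())
        for phrase in phrases
    )


print(matches_tracked_phrase("Go vote! #MAGA"))          # True
print(matches_tracked_phrase("Mike Pence spoke today"))  # False: only "#pence" is tracked
```

Under these assumed rules, "campaign hillary" matches a tweet containing both words in any order, which is why two-word phrases appear in the list alongside single hashtags.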

The US Election analysis collateral is available in the form of:

  1. Tableau Online Dashboard
  2. Gephi Communities Graph Visual

The analysis methodology is outlined at http://54.244.44.22/insights

Notable Analysis Prior to the Final Report

Below are 2 events of note leading up to the final week and our analysis of the election.

John Swain’s Graph Connect Talk

First, RR Product Manager John Swain gave a talk on Oct 13th at Graph Connect 2016 on the US election work done by RR in partnership with Neo Technology. The talk covered election data analysis using the Neo4j graph database.

The video from the talk has 2 timeframes that are extremely relevant. Some snippets paraphrased are below:

38-38:45 min: Similarity to Brexit very stark; anybody who was anybody supported Remain; no-one, including main proponents of Brexit, believed it was going to happen but it did; there was complacency; turnout low on Remain side; potential it could happen in the US election; the risk of low voter turnout makes the effort to get people registered to vote important.

36-38 min: Establishment commentators (established media, metropolitan elite, whatever you want to call them) more and more prevalent on Hillary Clinton’s side of the graph, leaving Trump’s side quite sparse. No judgement on topics of influence. What’s potentially happening is everyone on Hillary Clinton’s side is starting to listen more and more to those people; it becomes a filter bubble, self-reinforcing; they stop looking out and start to believe what they’re hearing. Very similar to what we detected on Brexit.

Needless to say, the observations above were bang on the money.

Presidential Debate 3 Report Warning

Secondly, the Presidential Debate 3 report, released on Oct 26th, clearly outlined how the “Elites” (influencers) were loaded on one side (Clinton) but higher support and engagement persisted on the Trump (plebs aka “deplorables”) side of the graph. This was visible even excluding the Bot effect.

Also, we noted the following as a conclusion:

There is a common perception that Hillary Clinton is winning the election comfortably. The assertion that there are a large number of bots ‘supporting’ Donald Trump plays to this perception by suggesting that the noisy support for Trump is not real. Based on what we observed in the Brexit election where there is a larger ‘hidden’ support for one side over the other, we would advise some caution over thinking the election is already won by Hillary Clinton.

Communities Graph Visualization

One of the most important steps involves building the engagements graph on retweets, mentions and replies in Neo4j and running several iterations of community detection graph algorithms using R and igraph. The results are then visualized using the Gephi graph visualization tool.
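To make the graph-building step concrete, here is a minimal Python sketch of how retweets, mentions and replies can be turned into weighted directed edges. It is an illustration only: the analysis itself used Neo4j, R and igraph, and the tweet fields ("user", "retweet_of", "reply_to", "mentions") as well as the function names are hypothetical. Community detection is not reproduced here.

```python
from collections import Counter


def build_engagement_graph(tweets):
    """Return weighted directed edges as {(source, target): weight}.

    An edge points from the engaging user to the account being
    retweeted, replied to or mentioned (assumed direction).
    """
    edges = Counter()
    for tweet in tweets:
        source = tweet["user"]
        targets = []
        if tweet.get("retweet_of"):
            targets.append(tweet["retweet_of"])
        if tweet.get("reply_to"):
            targets.append(tweet["reply_to"])
        targets.extend(tweet.get("mentions", []))
        for target in targets:
            if target != source:  # ignore self-engagement
                edges[(source, target)] += 1
    return edges


def weighted_in_degree(edges):
    """Total engagement received by each account."""
    totals = Counter()
    for (_, target), weight in edges.items():
        totals[target] += weight
    return totals


# Toy data, invented for illustration.
sample = [
    {"user": "alice", "retweet_of": "wikileaks"},
    {"user": "bob", "retweet_of": "wikileaks", "mentions": ["FBI"]},
    {"user": "alice", "reply_to": "bob", "mentions": ["wikileaks"]},
]
edges = build_engagement_graph(sample)
print(weighted_in_degree(edges).most_common(1))  # [('wikileaks', 3)]
```

An edge list in this shape can be exported to a graph database or to a tool such as igraph or Gephi for the community detection and layout steps.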

The retweets graph below (green-Clinton, pink-Trump), based on tweets from Nov 1 & 2, shows a strongly partisan split between the Trump and Clinton supporters. It also clearly shows a noticeable skew towards Trump. This is similar to the Brexit graph analysis, where Leave conversations had more traction and engagement than Remain.

BrexitLike_Nov1&2_Election16
Clinton (green) Vs Trump (pink) Twitter Engagements Graph

Looking at the previous report from Oct 26th, we notice that the skew didn’t change substantially from the one seen in the post Presidential Debate 3 analysis. But one major difference is how @wikileaks & the FBI took over the conversation in late October and early November (Nov 1 & 2) compared to the prior iterations. This is confirmed by the Flocks analysis below.

There are several other highly influential accounts, as seen in the graph, that received little or no coverage in the mainstream news media but seem to have made a major impact on the overall engagement for Trump. We’ll cover them in greater detail below in the “Influential” Users section.

Top Conversational Themes

Trending terms and hashtags are used in the context of the overall conversations to identify top conversational themes.

We use latent Dirichlet allocation (LDA) based text analysis of the tweets for identifying high value trending terms.
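The LDA step needs a topic modeling library and is not reproduced here. As a much simpler stand-in for the hashtag side of the analysis, the sketch below counts how many tweets each hashtag appears in. It is plain frequency counting, not LDA, and the function name and sample tweets are made up for illustration.

```python
import re
from collections import Counter


def top_hashtags(tweets, n=3):
    """Return the n hashtags that appear in the most tweets.

    Each hashtag is counted once per tweet, so a single tweet that
    repeats a tag many times cannot dominate the list.
    """
    counts = Counter()
    for text in tweets:
        counts.update(set(re.findall(r"#\w+", text.lower())))
    return counts.most_common(n)


# Invented sample tweets.
sample = [
    "#PodestaEmails26 is out #wikileaks",
    "Read #wikileaks now",
    "#wikileaks #wikileaks #maga",
]
print(top_hashtags(sample, n=1))  # [('#wikileaks', 3)]
```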

MainThemes_HashtagsTrendingTerms.png
Trending Terms & Hashtags from Nov 1&2

The trending terms lists clearly show how FBI (fbi, emails, comey, investigation), wikileaks, Podesta emails, Clinton foundation, corruption took over the overall conversation and engagements.

Hashtag analysis also shows #podestaemails26 and #wikileaks bubbling to the top.

The broader news media (aka mainstream media or MSM) covered the FBI decision to reopen the Hillary Clinton investigation, so that wasn’t unexpected. But the @wikileaks release of John Podesta’s emails, outing the internal functioning of the Clinton Foundation, was not as widely reported. These emails, as seen above, saw very high coverage and engagement on Twitter compared to the MSM and potentially ended up having a much bigger impact on the election than one would gauge from following the news.

“Influential” Users

We apply several methods, including PageRank and Betweenness centrality, to calculate influence by measuring the quality and quantity of engagements (RTs, mentions, replies), reach of tweets, connections etc. within the context of an event. This leads to a measure of Influence which is specific to the event being monitored and is thus temporal in nature.

Significance of each method is documented at: http://54.244.44.22/insights/twitter-conversation-performance-measures/
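For readers unfamiliar with PageRank, here is a minimal pure-Python sketch of the standard power-iteration form applied to a weighted engagement graph. It is illustrative only and not the implementation used for this report. The damping factor of 0.85 and the iteration count are conventional defaults chosen here as assumptions, and the edge format ({(source, target): weight}) is hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on weighted directed edges.

    edges: {(source, target): weight}. Returns {node: score}; the
    scores sum to 1.
    """
    nodes = sorted({node for edge in edges for node in edge})
    count = len(nodes)
    out_weight = {node: 0.0 for node in nodes}
    for (source, _), weight in edges.items():
        out_weight[source] += weight

    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        base = (1 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        for (source, target), weight in edges.items():
            new_rank[target] += damping * rank[source] * weight / out_weight[source]
        rank = new_rank
    return rank


# Toy graph: three users each engage with "hub".
scores = pagerank({("a", "hub"): 1, ("b", "hub"): 2, ("c", "hub"): 1})
print(max(scores, key=scores.get))  # hub
```

Because rank flows along engagement edges, an account that is retweeted or mentioned by many others accumulates a high score, which is consistent with the measure favouring high-reach accounts.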

Rank based Influence Measure

The first two lists below are of the top 25 accounts by PageRank and overall rank on engagements. Overall rank is a normalized rank to reduce the skew towards users with large numbers of followers or a single tweet having a large number of engagements/RTs (often referred to as becoming ‘viral’).

Top_Overall_PageRank.png
Top Users Overall & by PageRank

PageRank brings out the usual suspects like @wikileaks, @HillaryClinton, @realDonaldTrump, @FoxNews, @politico, @CNNPolitics, @FoxNewsInsider, @ABCPolitics, @CNN, @foxnewspolitics, showing the susceptibility of PageRank to high follower counts/reach.

The overall rank, being normalized, usually dampens that bias. In this case, however, it doesn’t look very different, as the main themes and conversations, and thus engagements, were very focused around a small number of trending terms (FBI investigation & wikileaks).

But, since the analysis is based on engagements, in this case on Nov 1 & 2, several interesting users cropped up based on high engagements with their tweets.

For example, @FillWerrel, @girlposts, @Solano66, @vivelafra and @deadparrish (changed handle to @softpaint). These accounts don’t have much to do directly with the news media or the US elections but have been able to create high engagements either in the very short term (1 tweet) or in the longer term based on a series of tweets. Some tweets are listed in the top tweets section.

Centrality based Influence Measure

Connectors and Brokers are measures based on node centrality in graphs. Connectors are based on Betweenness centrality, which is a measure of the degree to which a node forms a bridge or critical link between all other users. Brokers are based on links between identified sub-graphs or communities in our community detection analysis.

We use these as a measure of users’ influence with respect to their value as information communication hubs.
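To illustrate what Betweenness centrality measures, here is a minimal pure-Python sketch of Brandes’ algorithm for an unweighted, undirected graph. It is a textbook version and not the implementation used for this report, and the toy graph (two triangles joined through a single "bridge" account) is invented.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Brandes' algorithm for an unweighted, undirected graph.

    adjacency: {node: set of neighbours}, symmetric, with every node
    present as a key. Returns {node: number of shortest paths between
    other node pairs that pass through it, counted fractionally}.
    """
    centrality = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search, counting shortest paths from source.
        order = []
        predecessors = {node: [] for node in adjacency}
        num_paths = {node: 0 for node in adjacency}
        num_paths[source] = 1
        distance = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    num_paths[w] += num_paths[v]
                    predecessors[w].append(v)
        # Accumulate dependencies from the farthest nodes back to source.
        dependency = {node: 0.0 for node in adjacency}
        while order:
            w = order.pop()
            for v in predecessors[w]:
                dependency[v] += num_paths[v] / num_paths[w] * (1 + dependency[w])
            if w != source:
                centrality[w] += dependency[w]
    # Each undirected pair was visited from both of its endpoints.
    return {node: score / 2 for node, score in centrality.items()}


graph = {
    "l1": {"l2", "l3"}, "l2": {"l1", "l3"}, "l3": {"l1", "l2", "bridge"},
    "bridge": {"l3", "r1"},
    "r1": {"bridge", "r2", "r3"}, "r2": {"r1", "r3"}, "r3": {"r1", "r2"},
}
scores = betweenness_centrality(graph)
print(max(scores, key=scores.get))  # bridge
```

In the toy graph the "bridge" account has few connections but sits on every shortest path between the two clusters, which is the kind of connector role described above.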

TopConnectorsBrokers.png
Top Graph Connectors & Brokers

Right off the bat, it’s obvious that most MSM accounts are not part of this list. These accounts act as news dissemination and communication hubs and are potentially hard to identify a priori but crucial from the point of view of spreading information/messaging broadly.

Reach Vs Rank

Another way to understand influence is to plot reach (followers) against a normalized measure we call rank, which is based on the quality and quantity of engagements. This is another great way to dampen pure follower based metrics and to bring out users that hold more sway within the community itself.

@wikileaks, @HillaryClinton, @realDonaldTrump, @FoxNews, @politico were trimmed to obtain the screenshot since they are outliers on the upper side. They have both very high Reach and very high Rank. The Tableau Online Dashboard has the complete chart.

ReachVsRank
Trimmed Reach Vs Rank chart

The Reach (X-axis) Vs Rank (Y-axis) graph throws up a couple of interesting things immediately:

  1. The high follower accounts like CNN, NY Times, CBS News, ABC News, the White House, Drudge Report, The Hill, NBC News etc. show up with high Reach but are either below or very close to the line dividing Reach from Rank. Their influence, as defined by Rank, is marginal at best when it came to impacting the US election audience. But when they speak, they get a much larger audience, as expected.
  2. The “connectors” group of bloggers etc. like Shane Goldmacher, ViveLaFrance, Tracee Ellis Ross, Jonathan Martin, Jonah Goldberg, Lisa Lerner, MicroSpookyLeaks, Charlie Mahtesian, and focused political accounts like CNN Politics & NYT Politics show higher traction wrt Rank even with relatively lower Reach (in terms of followers) sometimes. This could be due to higher trust and more focused messaging.

Interesting FLOCKS

Flocks are people engaging in conversations around events esp. in context of a specific subject, which is the US Presidential election in this case. This “flocking” can lead to building of temporal communities with local influence that can lead to virality not obvious by the standalone influence of the individuals or without the context of the event.

The flocks below are named after the user/account with the most influence in the flock.

Considering the extremely bipolar nature of the graph, all flocks fall in one group or the other. One interesting thing to note is that @wikileaks leads the Donald Trump flocks confirming the impact it had late in the election campaign cycle.

FlocksUsers.png
Wikileaks Flock & Top Flock Users

HillaryClinton flock leads the flocks for Clinton and the realDonaldTrump flock forms the second biggest Trump flock. This is again noticeably different from the Oct 26th analysis for Debate 3, where HillaryClinton & TheEconomist formed the top 2 flocks and 3 out of top 5 were Clinton supportive flocks.

Ratio of Influencers & Skew with Metrics

The overall graph shows higher support for Trump wrt users. We created ‘influencer’-only metrics especially to compare this with our experience with Brexit, where Remain had a visible ‘influencer/Elite’-skew.

Authority-based Measure

At Right Relevance, we measure the influence of users on social media in over 50k Topics. The measure of a user’s influence within a given topic is called their Authority. Using this definition of Authority, we can filter the two groups of users to show just those users with influence scores above a specified threshold and compare those with the number of users below the threshold (non-influencers).

If we set the score for Authority at 70 the ratios are as follows:

Ratio_Influencers

This shows that there are almost double the number of Influencers (pro rata) on the Clinton side than on the Trump side. We’re not making any judgement about which Topics the user has Authority in, in terms of worthy or valuable Topics. We are just measuring the minimum Authority score a user has within any Topic. For example, in this context, a high score in a Topic like TCOT has the same value as a high score in Journalism. You can find a full list of Topics we found in this conversation in our Tableau dashboard.
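The pro rata comparison described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: a user is counted as an influencer when their Authority score in at least one Topic reaches the threshold (read here as greater than or equal to 70), and the data layout, function name and sample users are invented and are not taken from the report’s data.

```python
def influencer_share(users, threshold=70):
    """Share of users on each side with Authority >= threshold in any Topic.

    users: list of {"side": str, "authority": {topic: score}}.
    Returns {side: influencers / all users on that side}.
    """
    totals = {}
    influencers = {}
    for user in users:
        side = user["side"]
        totals[side] = totals.get(side, 0) + 1
        if any(score >= threshold for score in user["authority"].values()):
            influencers[side] = influencers.get(side, 0) + 1
    return {side: influencers.get(side, 0) / totals[side] for side in totals}


# Invented sample users; the Topic does not matter, only the score.
sample = [
    {"side": "clinton", "authority": {"journalism": 82}},
    {"side": "clinton", "authority": {}},
    {"side": "trump", "authority": {"tcot": 75, "politics": 10}},
    {"side": "trump", "authority": {"tcot": 69}},
    {"side": "trump", "authority": {}},
    {"side": "trump", "authority": {}},
]
print(influencer_share(sample))  # {'clinton': 0.5, 'trump': 0.25}
```

Dividing by the number of users on each side is what makes the comparison pro rata: the side with more users overall does not automatically show more influencers.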

Voice-based Measure

Another metric we use within a given conversation (in this case the US election), based on several graph algorithms, gives us the influence of a user, termed their Voice. If we just look at those users with the highest Voice in the conversation (by removing all users with a low Voice score), we can make a qualitative assessment of the types of users that are associated with each side by a visual inspection of the users on both sides.

InfluencersOnly_Graph.png
@rightrelevance Influencers only Graph

Looking at the graph visualization above, it is clear that there is a much higher proportion of mainstream media and establishment users on the Clinton side. The Trump side has a much higher proportion of users who are specifically Trump supporting accounts.

Role of BOTs

One of the arguments for the high number of users associated with Trump is the theory that there are a large number of bots tweeting pro Trump messages.

We decided to use Prof. Howard’s definition of a Bot (more than 50 Tweets in the day) to observe the ratio of Bots to non-Bots in the two sides of the conversation in our analysis and compare with their analysis.
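The definition above translates directly into a simple count. The sketch below flags any account with more than 50 tweets on a single day; the input format (pairs of user and date string) and the function name are hypothetical, and the sample data is invented.

```python
from collections import Counter


def flag_heavy_tweeters(tweets, max_per_day=50):
    """Return the users with more than max_per_day tweets on any one day.

    tweets: iterable of (user, day) pairs, e.g. ("some_user", "2016-11-01").
    """
    per_user_day = Counter((user, day) for user, day in tweets)
    return {user for (user, _), count in per_user_day.items()
            if count > max_per_day}


# Invented sample: 51 tweets in one day is flagged, 50 is not.
sample = (
    [("botty", "2016-11-01")] * 51
    + [("human", "2016-11-01")] * 50
    + [("human", "2016-11-02")] * 10
)
print(flag_heavy_tweeters(sample))  # {'botty'}
```

The threshold is applied per day rather than to the total, so an account that tweets steadily across both days without exceeding 50 on either is not flagged.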

Process: our data set includes a total of 780k Twitter users. Before our initial analysis we filter out small users and those we identify as bots/spammers with a very high probability. This reduces the number of users to 550k. We use a combination of different techniques to identify these simple bots at this initial stage. We then reduce the size of the data set for visualization to a much smaller 17k. These are shown in the maps in our post. The 17k contains just the most important users in the network by various measures including PageRank & betweenness centrality.

What is interesting, therefore, is that there are still a significant number of bots, by Prof. Howard’s definition, which remain in our data set. We would characterize these as more ‘sophisticated’ bots.

The ratios of bots to all users are as follows.

Bots_Ratio

In the tweets we analyzed there are 3x as many bots on the Trump side as on the Clinton side. Therefore, our analysis also provides evidence in support of Professor Howard’s assertion that there is a much higher proportion of Bots supporting the Trump campaign than supporting Clinton.

Identifying Reliable Verifiable Sources in Advance

In our Early thoughts analysis report released on October 2nd (based on data from Sept 11th), apart from noting the impact of Clinton’s “Deplorables” comment and the “health issues” raised by her stumble outside the World Trade Center memorial, we provided, in advance, a list of what we identified as the most crucial sources of election news, esp. related to Trump financials.

In that report we noted the following:

With the benefit of hindsight, the information discussed here seems quite understandable, even obvious, however at the time when the stories are developing the ability to rapidly assimilate and understand what is important and relevant in a very crowded environment is a crucial asset.

Here is the table showing those users who are involved in reporting the stories regarding Trump in general and financial matters in particular. This list was discovered entirely with machine learning and no human curation up to this point.

Sept11_TopUsers
Top Users from Analysis on Tweets from Sept 11th

The graph shows some of the users highlighted on the network map.

Graph showing influential verifiable news makers for US Election’16

@fahrenthold featured prominently on all those lists.

Fahrenthold

@fahrenthold broke the Trump tapes story on October 7th. In an update to the initial election analysis we noted the below:

Fahrenthold_TrumpTapeRelease.png
Fahrenthold tweet breaking Trump tape story

The ability to identify verifiable hubs of reliable information in advance is of extremely high value considering the scale & availability of information and the massive growth in propaganda and fake news.

Top Interesting Tweets from Nov 1 & 2

@FillWerrel (RT’d by @Solano66) and @girlposts were responsible for the most interesting tweets from Nov 1 & 2, with the highest engagement among the Election’16 crowd.

Some Conclusions

  1. Influencers: There were double the number of Right Relevance Influencers (pro rata) on the Clinton side than on the Trump side.
  2. Media Skew: There was a much higher proportion of mainstream media and establishment users on the Clinton side.
  3. Filter Bubble: Strongly partisan groups forming their own insular bubbles. Since news media esp. journalists were largely in the Clinton bubble, it’s not hard to understand why the election result came as such a surprise to most.
  4. Bots: There was a much higher proportion (factor of 3) of Bots supporting the Trump campaign than supporting Clinton.
  5. Fake News: The ability to identify hubs of relevant information & intelligence in advance is of extremely high value considering the scale & availability of information and the massive growth in propaganda and fake news.

One more thing of note is the multiple failures of polling, including the UK General Election, Brexit and the US Presidential Election. We’re working on a detailed blog post on how, in the age of social media, polling is potentially nowhere near as effective as a data science, ML/AI based analysis of social signals at an extremely high scale.

Appendix

US Election Reports

Brexit Reports