Thursday, May 30, 2019

Exploring Entities-viewed Graph Data

Let's say you have a graph database of entities viewed in a periodical over a five-month span (here, for example). How do you explore the data? Well, we can explore it visually, traversing node-to-node in the graph, or we can get the results of data analysis in a tabular form, because analysts' bread and butter is the spreadsheet, don't you know. Let's explore both approaches.

I. Visually

Case Study 1: April 2019 entity views

Let's say you want to see the top entities for a particular month. And we just so happen to know in which year that month is that we wish to explore, so that's a plus.

First, we request the year node:

match (y:Year) return y

With that query, you get back the Year node (or nodes).



Select it to view the months of that year:



Ah! I'm really interested in the top entities viewed in April. Let's see what they are. Select the April Month node to view them:





And up pops the entities and their view-counts for April. But note something else. Any entities that have shown up in any other month have the view-counts for those months linked to them, as well. Because those other month-nodes are present in this graph-view, you get those linkages manifested "fo' free!"

Case Study 2: From an Entity's Perspective

Let's now explore the graph from the opposite perspective. That is to say, instead of drilling down from the year to the month to the entities of that month, let's look at the graph starting from an entity. Choosing one, let's go with "donald trump" (because I just so happen to know this entity shows up in more than one month, oddly enough).

Our Cypher query to get a particular entity is:

match (e:Entity { name: "donald trump" }) return e

Executing that query, we get:



And – boom! – there you have it, "donald trump." Let's see the months this entity has been viewed, and the view-counts by month. To expand the entity we select it to get:



Well, how many times has this entity been viewed in total? That is to say, across the entire data set? Well, we can eyeball it and say "eh, more than 12?" and we'd not be wrong.

II. Cypher Queries

Case Study 1: April's Entities as a Spreadsheet

We've already been doing some rudimentary cypher queries to get our starting-point nodes. Let's use Cypher to get our data back in spreadsheet-form, because we all so love spreadsheets!

Let's do a simple enough query, repeating what we already saw, visually, above, and query the entities viewed in April. The Cypher query to return that (sorted) result in tabular form is:

match (e:Entity)<-[c:Count]-(m:Month { month: "April"})
return e.name as name,toInteger(c.count) as count
order by count descending

And we get a nice little spreadsheet back:



And in the upper right-hand corner is an 'export CSV'-button, so you can actually download the data as CSV and import that to your spreadsheet; lolneat!

Case Study 2: Donald Trump's Views

Let's answer the previous question: how many views did the "donald trump"-entity get across the entire duration of the data period? The Cypher query that gets us that information is:

match (e:Entity { name: "donald trump"})<-[c:Count]-()
return e.name as name, sum(toInteger(c.count)) as views

This returns the result of:



So, clocking in at 11,709 views, we were correct in saying the entity "donald trump" has more than 12 views in total.

Case Study 3: All Entities Viewed

Okay, we know the view-counts for the entity "donald trump." Great! How about a query that gives us all the entities viewed, sorted by most viewed first. The Cypher query to do this is:

match (e:Entity)<-[c:Count]-()
return e.name as name, sum(toInteger(c.count)) as views
order by views descending

And the results that query shows us is:



Conclusion

So there you have it! Exploring a graph both visually and by extracting data via a cypher query.

Uploading a Graph Database to the Web

Okay, you have a graph database you've created locally, and you now wish to share it on the web. How do you go about doing that? There are several approaches: you can use AWS' graph database, Neptune. You can use GraphGist from neo4j. This article will go with graphenedb.com's graph database.

Simply follow the instructions on their site (here), and you have a graph database uploaded for sharing and collaboration.

Representing Relational Data as a Graph

Let's say you have entities stored in a relational database, and you wish to view these entities as a graph.

First, you do a data extract of the relevant data:

select e.entity as entity,sum(1) as count,'January' as month, 2019 as year
from entity_select es
join entity e on e.id=es.entity_id
where es.selected_dttm > '2019-01-01' and es.selected_dttm < '2019-02-01'
group by e.entity
order by count desc
limit 20;
Save these data out as CSV (may sure it has headers), 

The resulting saved file looks something like this, doesn't it:

entity,count,month,year
donald trump,1478,January,2019
nancy pelosi,413,January,2019
ralph northam,362,January,2019
department of homeland security,357,January,2019
twitter,312,January,2019
barack obama,246,January,2019
michael cohen,228,January,2019
robert mueller,189,January,2019
mitch mcconnell,178,January,2019
fox news,160,January,2019
jim mattis,155,January,2019
u.s.-mexico border,151,January,2019
jason leopold,147,January,2019
pharrell williams,147,January,2019
islamic state,146,January,2019
kirk cox,142,January,2019
mark warner,132,January,2019
moscow,129,January,2019
house of representatives,128,January,2019
elaine luria,128,January,2019

then import these data into your graph datastore:

LOAD CSV WITH HEADERS FROM "file:///Users/geophf/daters/entities-2019-01.csv" AS row
MERGE (y:Year { year: row.year })
MERGE (m:Month { month: row.month })
MERGE (y)<-[o:Of]-(m)
CREATE (e:Entity { name: row.entity })
CREATE (e)<-[c:Count { count: toInt(row.count) }]-(m)

Query your newly created database with the Cypher query:

MATCH (n) RETURN n LIMIT 25

You get the following result from the above data:



Let's repeat this for months February, March, April, and May, remembering to replace the CREATE keyword with MERGE for entities, now that we may have entities crossing months.

Looking at just the January and February data (and replacing the Count label from its type to its count), we see entities crossing months:



When we bring in March's data, we see an entity shared across the three months: Donald Trump. Surprising, n'est-ce pas?



But between March and April, there are several entities in common:



Okay, now let's tie in May. And, when we do tie in May, we observe an interesting phenomenon: entities 'leapfrogging' over months, as interest in them waxes and wanes.



Question: from the above graph, can you compute which entities are most viewed over the entire span of observation?