Thursday, May 30, 2019

Representing Relational Data as a Graph

Let's say you have entities stored in a relational database, and you wish to view these entities as a graph.

First, you do a data extract of the relevant data:

select e.entity as entity,sum(1) as count,'January' as month, 2019 as year
from entity_select es
join entity e on e.id=es.entity_id
where es.selected_dttm > '2019-01-01' and es.selected_dttm < '2019-02-01'
group by e.entity
order by count desc
limit 20;
Save these data out as CSV (may sure it has headers), 

The resulting saved file looks something like this, doesn't it:

entity,count,month,year
donald trump,1478,January,2019
nancy pelosi,413,January,2019
ralph northam,362,January,2019
department of homeland security,357,January,2019
twitter,312,January,2019
barack obama,246,January,2019
michael cohen,228,January,2019
robert mueller,189,January,2019
mitch mcconnell,178,January,2019
fox news,160,January,2019
jim mattis,155,January,2019
u.s.-mexico border,151,January,2019
jason leopold,147,January,2019
pharrell williams,147,January,2019
islamic state,146,January,2019
kirk cox,142,January,2019
mark warner,132,January,2019
moscow,129,January,2019
house of representatives,128,January,2019
elaine luria,128,January,2019

then import these data into your graph datastore:

LOAD CSV WITH HEADERS FROM "file:///Users/geophf/daters/entities-2019-01.csv" AS row
MERGE (y:Year { year: row.year })
MERGE (m:Month { month: row.month })
MERGE (y)<-[o:Of]-(m)
CREATE (e:Entity { name: row.entity })
CREATE (e)<-[c:Count { count: toInt(row.count) }]-(m)

Query your newly created database with the Cypher query:

MATCH (n) RETURN n LIMIT 25

You get the following result from the above data:



Let's repeat this for months February, March, April, and May, remembering to replace the CREATE keyword with MERGE for entities, now that we may have entities crossing months.

Looking at just the January and February data (and replacing the Count label from its type to its count), we see entities crossing months:



When we bring in March's data, we see an entity shared across the three months: Donald Trump. Surprising, n'est-ce pas?



But between March and April, there are several entities in common:



Okay, now let's tie in May. And, when we do tie in May, we observe an interesting phenomenon: entities 'leapfrogging' over months, as interest in them waxes and wanes.



Question: from the above graph, can you compute which entities are most viewed over the entire span of observation?

No comments:

Post a Comment