Saturday, February 13, 2016

Upgrading the Graph Database using ETL

Upgrading to the Neo4j-2.3.2 database: ETL in action.

We have the following etl.perl script:
geophf:1HaskellADay geophf$ cat ~/bin/etl.perl 
#!/usr/bin/perl

# Skip the header line of the top5s file.
$junque = <>;

# Each record is four lines: the date, then the Mkt_Cap, Price, and Volume rows.
# Hand each record off to jsoner.sh, which JSONifies it and posts it to the graph database.
while(<>) {
   chomp;
   $day = $_;
   $one = <>;
   chomp $one;
   $two = <>;
   chomp $two;
   $three = <>;
   chomp $three;
   print `jsoner.sh $day "$one" "$two" "$three"`;
}

which calls this shell script:

geophf:1HaskellADay geophf$ cat ~/bin/jsoner.sh 
#!/bin/bash

# jsonify builds the JSON payload of Cypher statements from the date and the three top-5 rows ...
JSON=`jsonify $1 "$2" "$3" "$4"`

# echo $JSON

# ... which jsoneronious.sh then posts to the graph database, using the credentials set in the environment.
jsoneronious.sh $CYPHERDB_USER $CYPHERDB_PASSWD $CYPHERDB_URL "$JSON" $1

# echo my curl command has $CYPHERDB_USER:$CYPHERDB_PASSWD and $CYPHERDB_URL

echo
echo
echo Saved $1 top 5s to GrapheneDB
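
jsoneronious.sh itself isn't shown here; as a rough sketch (and only a sketch), it boils down to curling the JSON payload at Neo4j 2.x's transactional Cypher endpoint with the GrapheneDB credentials for basic auth, something along these lines, assuming $CYPHERDB_URL is the base REST URL of the instance:

#!/bin/bash
# Hypothetical reconstruction of jsoneronious.sh -- the real script isn't shown above.
# Arguments match the call in jsoner.sh: user, password, base URL, JSON payload, date.

USER=$1
PASSWD=$2
URL=$3
JSON=$4
DAY=$5     # the date; handy for logging, not needed for the POST itself

# POST the batch of Cypher statements to the transactional endpoint;
# -i prints the response headers, as seen in the run below.
curl -i -u "$USER:$PASSWD" \
     -H "Content-Type: application/json" \
     -X POST "$URL/db/data/transaction/commit" \
     -d "$JSON"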

We've created a new GrapheneDB DaaS instance (initialization-step):



See? It's really empty (verification-step):



Okay, let's update the access key-set with our new database name, then let's do this.

[updates database access key-set]
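
Concretely, that key-set is just the three environment variables jsoner.sh reads. Placeholder values here, not the real credentials; GrapheneDB supplies the actual user, password, and connection URL for the new instance:

export CYPHERDB_USER=neo4j-user                          # placeholder
export CYPHERDB_PASSWD=sooperseekrit                     # placeholder
export CYPHERDB_URL=http://your-instance.graphenedb.com:24789   # placeholder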

Then we make sure our top5s data source, http://lpaste.net/raw/4714982275408723968, is up to date by checking the most-recent (last) date:

2016-02-12
Mkt_Cap:JPM,WFC,PTR,GE,ATVI|LMCB,MOG.B,RAI,LPLA
Price:GRPN,ICPT,TCK,AXL,AVP|MOG.B,LPLA,JW.B,LMCB,P
Volume:BAC,GE,CSCO,CHK,QQQ,FCX,ATVI,AAPL,P,C
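
If you don't want to eyeball the whole paste, the last record (each record being four lines) pops right out with a quick curl-and-tail:

curl -s http://lpaste.net/raw/4714982275408723968 | tail -4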

Okay, let's run this thing!

geophf:1HaskellADay geophf$ etl.perl Seer/data/top5s.csv 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2801  100   125  100  2676    557  11935 --:--:-- --:--:-- --:--:-- 11946
HTTP/1.1 100 Continue

HTTP/1.1 200 OK
Server: nginx
Date: Sun, 14 Feb 2016 06:33:58 GMT
Content-Type: application/json
Content-Length: 125
Connection: keep-alive
Access-Control-Allow-Origin: *

{"results":[{"columns":[],"data":[]},{"columns":[],"data":[]},{"columns":[],"data":[]},{"columns":[],"data":[]}],"errors":[]}

Saved 2016-02-12 top 5s to GrapheneDB

That empty results-set, with an empty errors list, is the transactional endpoint's way of saying the batched Cypher statements all committed cleanly. I didn't time it, but it took less than a minute to upload nine months of top5s data. Let's check our results:

Okay, great; let's do a specific query against $NFLX, on which I'm currently writing a case study.
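
The exact Cypher depends on the graph schema jsonify builds; assuming, hypothetically, that each security lands as a node keyed by a symbol property (and the same transactional-endpoint assumption as the jsoneronious.sh sketch above), the query is in the spirit of:

curl -s -u "$CYPHERDB_USER:$CYPHERDB_PASSWD" \
     -H "Content-Type: application/json" \
     -X POST "$CYPHERDB_URL/db/data/transaction/commit" \
     -d '{"statements":[{"statement":"MATCH (s { symbol: \"NFLX\" })--(x) RETURN s, x LIMIT 25"}]}'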

One last check. How much space did we take up in the new database? How much space do we have still available to us? Both for nodes (not so much a worry) and for storage:
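
The storage figures are read off the GrapheneDB console; the node and relationship counts, though, can also be pulled straight from Cypher, e.g.:

curl -s -u "$CYPHERDB_USER:$CYPHERDB_PASSWD" \
     -H "Content-Type: application/json" \
     -X POST "$CYPHERDB_URL/db/data/transaction/commit" \
     -d '{"statements":[{"statement":"MATCH (n) RETURN count(n)"},{"statement":"MATCH ()-[r]->() RETURN count(r)"}]}'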




As you can see, 9 months of data doesn't even scratch the surface of the developers' edition of the database. We have room for, oh, 500 or so years' worth of data before we need to start looking at other options. Sweetness.

BAM! The old database is therefore ready for decommissioning. We just have to remember, before the next scrape, to back up the old database manually and to confirm that the new database is receiving the new top5s information as we scrape it.
