Friday, January 15, 2016

What is HBase for?

Is HBase a good, flexible environment where you can do any kind of exploratory data mining that you need to do? Well, I can answer the question by the way we did it. It may not apply to your scenario, but then again, I don't know what your scenario is.

I have one scenario, and it was our scenario.

Let me explain.

HBase is wonderful, blazingly fast, and very specific. So, we store the stock market ('we' did, when I worked on that project, it's been a year since then, so, yes: I'm old). We stored it on GreenPlum at first. Why? Because we had to be able to handle any kind of query from the users, because that's what they said they wanted, and the users are always right.

That's the flexibility approach. What does that buy you? A limit of 200k rows returned, max, and four hours per query, with a maximum of 16 queries going on at any time.

We had 1600 users, so, yeah, that worked ... greeeeeeeat!

Or not.
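To put a rough number on that, here is a back-of-the-envelope sketch. It assumes the worst case, where every query runs the full four hours, and the variable names are mine, not anything from the actual system.

```python
# Worst-case throughput of the GreenPlum setup described above.
# Assumption: every query takes the full four hours.
MAX_CONCURRENT_QUERIES = 16
HOURS_PER_QUERY = 4
USERS = 1600

# Each of the 16 slots can finish 24 / 4 = 6 queries per day.
queries_per_day = MAX_CONCURRENT_QUERIES * (24 // HOURS_PER_QUERY)

# If every user wants one query, how many days until everyone has had a turn?
days_per_user_turn = USERS / queries_per_day

print(queries_per_day)               # 96
print(round(days_per_user_turn, 1))  # 16.7
```

Under that assumption, the cluster finishes at most 96 queries a day, so each of the 1600 users gets a turn roughly every two and a half weeks.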

So, we did studies, and we found that 85% of the queries were around a very specific query: the order lifecycle. That is, for any stock order (buy or sell), there was an order number for it, and that order number carried through all transactions for that order (placing the order, consolidating the order, sending it from a brokerage to the exchange, then executing the order). It would take hours to get 1 order out of GreenPlum ... because we'd get 6 billion orders every day, and the query to GreenPlum would reconstruct the order in the SQL. Ugh.

So, that's the 85%-rule. So, we simply took that one query, and built our HBase database around that. With order id being part of the index, bam, you got your order, any order, back in seconds.
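Why is that fast? Here is a minimal sketch of the idea, not our actual schema. It is a toy stand-in for an HBase table: rows kept sorted by row key, with a hypothetical key layout of order id plus step number, so every event of one order sits next to the others.

```python
from bisect import bisect_left

# Toy stand-in for an HBase table: rows are kept sorted by row key.
# Hypothetical key layout: "<order_id>#<step>" (an assumption for illustration).
rows = sorted([
    ("ORD0001#1", "placed"),
    ("ORD0001#2", "consolidated"),
    ("ORD0001#3", "sent to exchange"),
    ("ORD0001#4", "executed"),
    ("ORD0002#1", "placed"),
    ("ORD0003#1", "placed"),
    ("ORD0003#2", "executed"),
])
keys = [key for key, _ in rows]

def order_lifecycle(order_id):
    """Return every event for one order by seeking to its key prefix."""
    prefix = order_id + "#"
    i = bisect_left(keys, prefix)  # binary search: jump straight to the first matching row
    events = []
    while i < len(rows) and keys[i].startswith(prefix):
        events.append(rows[i][1])
        i += 1
    return events

print(order_lifecycle("ORD0001"))
# ['placed', 'consolidated', 'sent to exchange', 'executed']
```

The lookup never touches the rows of any other order. It seeks to the prefix, reads a handful of adjacent rows, and stops.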

Great.

But. A year later, people said: oh, I want to do research, I want to scan by date.

Hm. Problem. Date is not in the key, it's in the data, so to get date, you have to know the order (they don't) or you have to do a full table scan, 6 billion rows per day, 5 years of data.
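A rough sketch of what that means, with toy data. The 252 trading days a year is my assumption for the arithmetic, and the little table is made up for illustration.

```python
# Rough size of the full table scan.
# Assumption (mine): about 252 trading days a year.
ROWS_PER_DAY = 6_000_000_000
TRADING_DAYS_PER_YEAR = 252
YEARS = 5

total_rows = ROWS_PER_DAY * TRADING_DAYS_PER_YEAR * YEARS
print(f"{total_rows:,}")  # 7,560,000,000,000

# Toy table keyed by order id; the date lives in the value, not the key.
table = {
    "ORD0001": {"date": "2015-01-05", "side": "buy"},
    "ORD0002": {"date": "2015-01-06", "side": "sell"},
    "ORD0003": {"date": "2015-01-05", "side": "sell"},
}

def orders_on(date):
    """Filter by date: every row must be read, because the key says nothing about date."""
    rows_read = 0
    hits = []
    for order_id, value in table.items():
        rows_read += 1
        if value["date"] == date:
            hits.append(order_id)
    return hits, rows_read

print(orders_on("2015-01-05"))  # (['ORD0001', 'ORD0003'], 3)
```

Two hits, three rows read. Scale the three up to several trillion and you see the problem.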

Impossible. ... that is, now we're doing prototypes with Pegasus and HortonWorks by creating new indices on dates, but it took a month to do this prototype on a month's worth of data, and the database DOUBLED its size to accommodate this new index.
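The general idea of a date index can be sketched like this. To be clear, this is not how our prototype was built; it is a toy with a hypothetical index key of date plus order id, just to show where the extra rows come from.

```python
from bisect import bisect_left

# Main table: keyed by order id, date buried in the value (toy data).
main_table = {
    "ORD0001": {"date": "2015-01-05", "side": "buy"},
    "ORD0002": {"date": "2015-01-06", "side": "sell"},
    "ORD0003": {"date": "2015-01-05", "side": "sell"},
}

# Index table: one extra row per main row, keyed "<date>#<order_id>"
# (a hypothetical layout, chosen so rows for one date are adjacent).
index_keys = sorted(f"{value['date']}#{order_id}"
                    for order_id, value in main_table.items())

def orders_on(date):
    """Seek to the date prefix in the index, then fetch each order by its key."""
    prefix = date + "#"
    i = bisect_left(index_keys, prefix)
    hits = []
    while i < len(index_keys) and index_keys[i].startswith(prefix):
        order_id = index_keys[i].split("#", 1)[1]
        hits.append((order_id, main_table[order_id]["side"]))
        i += 1
    return hits

print(orders_on("2015-01-05"))            # [('ORD0001', 'buy'), ('ORD0003', 'sell')]
print(len(main_table) + len(index_keys))  # 6
```

Three orders, six rows stored. In this toy the row count doubles; how much the bytes grow in a real cluster depends on what you copy into the index rows.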

The date queries now go blazingly fast, but we're still undecided as to whether we want to bite the bullet on the agony of doubling our cluster size, our database size, just so somebody, if they want, can query by date.

Do you see what I'm saying here? HBase is NOT a general-purpose tool. It solves indexed data problems and it solves them blazingly fast. Once you start doing general-purpose queries and have to scan values, you may as well pack up and go home... OR create a new database where your sought values are now part of the indexed sets. So, yes, you can do that, but there's a cost in time (prototyping to ensure you're getting what you need, and then creating the new database from the old database ... row by row) and in space, because now you have a new HBase database sharing the space with your old HBase database. And they are going to share space, even if just for a while, because if the new one blows up, you have to go back to the old, working one, so ... doubling your cost is the least expense you can hope for.

My experience: don't do exploratory querying against HBase outside the indices, that's not what HBase is for. HBase relieves the agony that you had been having with the massive amount of data that you have and the set of VERY standardized queries you run against those data.

Somebody always says: oh, that makes my life so much easier! So, can I do this one time thing that I'll never do again, but I'm just curious, and I don't care that it's an extreme boundary case that nobody cares about, I absolutely NEED this query because reasons.

Yeah. Be firm. This is what this database does. This is what this database DOES NOT do. If you want to do exploratory data mining, give me a start key and an end key and I'll give you a block of data to play with, otherwise the door is over there and here's a quarter to call somebody who cares.
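What that start-key, end-key request looks like, as a minimal sketch over a sorted list of made-up keys. Like an HBase scan, the start key is included and the end key is not.

```python
from bisect import bisect_left

# Toy sorted row keys (hypothetical order ids).
keys = ["ORD0001", "ORD0002", "ORD0003", "ORD0004", "ORD0005"]

def scan(start_key, stop_key):
    """Return the block of keys in [start_key, stop_key): start included, stop excluded."""
    lo = bisect_left(keys, start_key)
    hi = bisect_left(keys, stop_key)
    return keys[lo:hi]

print(scan("ORD0002", "ORD0004"))  # ['ORD0002', 'ORD0003']
```

Two binary searches and a slice. The cost depends on the size of the block you asked for, not on the size of the table.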

My experience. HBase is awesome for what it is for. HBase sucks for things that it's not for, so don't use it for what it's not. Use it for what it is, and then trumpet your successes, harp on them, because people forget that they couldn't even think about exploratory data mining before you had the HBase database giving them the necessary answers first and in a timely fashion.
