HBase and Pig and Titanic

Since NoSQL is the future of humanity and will save the Universe, I’ve thrown together this quick tutorial on how to use it in a (semi-)practical sense.

I’ve used Ambari, locally, to run this experiment.  Although I can’t give a full tutorial on Ambari or Hortonworks, I will provide the following links.  You’ll need to download two files (one of them giant), and there’s plenty of great documentation for installing and using them:

Hortonworks Data Platform

https://hortonworks.com/downloads/

 

Oracle VirtualBox (the latest version is 6.0, but I had some problems with it and reverted to an earlier release)

https://www.virtualbox.org/wiki/Downloads

 

For the sake of simplicity I’m using the Titanic data set (the train.csv file) which you can get from Kaggle:

https://www.kaggle.com/c/titanic

 

The first thing you’ll want to do is upload this dataset into HDFS via Ambari (go to the HDFS Files View).  I’ve put mine in a ‘titanic’ directory. You can do this from the command line too (see the sketch below), but for such a relatively small file I found it easier just to use the dashboard.
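For the record, here’s roughly what the command-line route looks like (run from an SSH session, which we set up next; I’m assuming the stock maria_dev user and the same ‘titanic’ directory):

hdfs dfs -mkdir -p /user/maria_dev/titanic
hdfs dfs -put train.csv /user/maria_dev/titanic/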

You’ll need to SSH into your local sandbox; being on Windows, I’m using PuTTY.
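A minimal sketch of the connection, assuming the sandbox defaults (the Hortonworks sandbox forwards SSH to port 2222, and maria_dev is one of its stock users — adjust if yours differ); PuTTY users can plug the same host and port into the GUI:

ssh maria_dev@127.0.0.1 -p 2222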

Once you have a nice connection, you can start checking out your HBase situation.  To get to the shell just type:

hbase shell

At the HBase prompt you can try a couple of things.  First just type ...

list

… to see a list of current tables.  Ambari automatically installs a few examples for you, but we’ll need to make a new one for our Titanic data.  So just type …

create 'titanic', 'passengers'

… which creates a new table called ‘titanic’ with a column family called ‘passengers’.  If you’re not sure what a column family is, you might want to do a bit of research on HBase and NoSQL in general.  It’s not very difficult, but some background will help when you take a look at the final product.
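To make the idea concrete: every cell in HBase lives at a (row key, family:qualifier) coordinate.  You could insert and read back a single made-up passenger like this (the row key '1' and the value here are purely hypothetical):

put 'titanic', '1', 'passengers:Name', 'Braund, Mr. Owen Harris'
get 'titanic', '1'

Everything we load later will be some qualifier under the ‘passengers’ family.  If you do try this, clean up with deleteall 'titanic', '1' so the table is empty again.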

Now for fun, type…

scan 'titanic'

… which should show you a new table with zero rows.
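You can also double-check the schema with …

describe 'titanic'

… which should list the ‘passengers’ column family and its settings.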

Now type …

exit

… to exit the HBase shell and get back to your normal Linux prompt.  Next you’ll need to get a Pig script onto the box.  The script is as follows.

-- Load the raw CSV. Note that PigStorage(',') splits on every comma,
-- so the quoted Name field (which contains embedded commas) will shift
-- the columns after it for many rows; see the piggybank note below.
A = LOAD '/user/maria_dev/titanic/train.csv'
USING PigStorage(',')
AS (PassengerId:int, Survived:int, Pclass:int, Name:chararray, Sex:chararray, Age:float, SibSp:int, Parch:int, Ticket:chararray, Fare:float, Cabin:chararray, Embarked:chararray);
-- Drop the header row (the row whose Name column is the literal string 'Name')
users = FILTER A BY $3 != 'Name';
DUMP A;
DESCRIBE users;
DUMP users;
-- HBaseStorage takes the first field (PassengerId) as the row key;
-- the remaining fields map, in order, to the columns listed below.
STORE users INTO 'hbase://titanic'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'passengers:Survived, passengers:Pclass, passengers:Name, passengers:Sex, passengers:Age, passengers:SibSp, passengers:Parch, passengers:Ticket, passengers:Fare, passengers:Cabin, passengers:Embarked');
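Two type choices worth flagging: the CSV stores Survived as 0/1 (which Pig won’t cast to a boolean), and Age contains fractional values like 0.42, hence int and float respectively.  The bigger caveat is the comma-splitting: the Name field in train.csv holds quoted, embedded commas (‘Braund, Mr. Owen Harris’), so PigStorage(',') will misalign those rows.  Piggybank’s CSVExcelStorage understands quoted fields; here’s a sketch of the alternative LOAD, assuming piggybank.jar sits at the usual HDP client path (adjust if yours differs):

REGISTER /usr/hdp/current/pig-client/piggybank.jar;
A = LOAD '/user/maria_dev/titanic/train.csv'
USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
AS (PassengerId:int, Survived:int, Pclass:int, Name:chararray, Sex:chararray, Age:float, SibSp:int, Parch:int, Ticket:chararray, Fare:float, Cabin:chararray, Embarked:chararray);

The SKIP_INPUT_HEADER option also makes the header FILTER unnecessary.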

For convenience’s sake I’ve uploaded the script to my server, so you can pull it straight onto your sandbox by typing the following (you lazy bugger) …

wget http://www.matthewhughes.ca/titanic.pig

Now you SHOULD be ready to go!  Simply type …

pig titanic.pig

… and watch the magic happen.  It can take a while, so go get a coffee.
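If plain MapReduce is crawling, recent sandboxes ship with Tez, and Pig can run on it with a flag (assuming Tez is set up on your build):

pig -x tez titanic.pig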

Once it’s done (successfully, we hope), go back into the HBase shell and scan your titanic table (as per the instructions above).  Your Titanic data is now in hBase!  (or HBase, or hbase, or HBaSe, who knows...)
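If the full scan is too much to eyeball, the shell has a couple of handy sanity checks:

count 'titanic'
scan 'titanic', {LIMIT => 5}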

 
