We all know the answer to life is 42. But that's the rounded Integer. What does the Float look like?
MapReduce. The name almost makes sense, which is unusual in this field, and that immediately makes me suspicious.
Key-Value pairing is the bread and butter of MapReduce.
When I first learned about key-value pairs I thought “interesting” in that condescending ‘i dunno gettit’ sort of way; it seemed too arbitrary. How could a simple list of two-stuffs be important? (Not to get too technical, but the “two-stuffs” are often referred to as tuples.)
For some context, a set of key-value pairs looks like this:
(‘matt’, 10) (‘jane’, 3) (‘steve’, 6)
In the above example the keys are matt, jane and steve.
The values are 10, 3 and 6.
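As a rough sketch, here is how those same pairs might look in Python. Python is my assumption here, not something MapReduce requires, and the variable names are made up for illustration.

```python
# A small set of key-value pairs, each stored as a two-item tuple.
pairs = [("matt", 10), ("jane", 3), ("steve", 6)]

# Pull the keys and the values back out of the pairs.
keys = [key for key, value in pairs]
values = [value for key, value in pairs]

print(keys)    # ['matt', 'jane', 'steve']
print(values)  # [10, 3, 6]
```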
It seems a bit too rudimentary to be useful. But if you have millions of these pairs you can figure out a lot of stuff.
MapReduce is a method of turning any set of data into a set of these key-value pairs, and then boiling those pairs down into an answer.
I will demonstrate with my King James Bible example. (It’s my example because I both invented counting AND wrote the King James Bible.) The purpose of this exercise is to count every word in the Bible and figure out how often each one appears.
I’ll go through the two steps:
The Map step goes through the entire book, extracts every word and creates a key-value pair for it.
(‘the’, 1) (‘as’, 1) (‘who’, 1) (‘the’, 1)
Notice that the Map step doesn’t discriminate; it pulls every single word out and pairs it with the number 1. This blows the database up very quickly. It makes a new thing containing every word in the Bible, PLUS an integer value of 1.
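The Map step described above could be sketched like this in Python. This is a toy version under my own assumptions: `map_words` is a hypothetical helper name, and the regular expression is a deliberately crude definition of what counts as a word.

```python
import re

def map_words(text):
    """Map step: emit a (word, 1) pair for every word, duplicates included."""
    # Lowercase everything so 'The' and 'the' end up as the same key.
    words = re.findall(r"[a-z']+", text.lower())
    return [(word, 1) for word in words]

sample = "The as who the"
mapped = map_words(sample)
print(mapped)  # [('the', 1), ('as', 1), ('who', 1), ('the', 1)]
```

Note that nothing has been counted yet. The output is longer than the input, which is exactly the blow-up mentioned above.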
The Reduce step is the part where all the fun stuff happens and where our new, giant data set gets filtered down into something useful.
Every KEY (example ‘the’) that matches with any other KEY (example ‘the’) gets mushed together, and their values get added up.
So the above example would result in:
(‘the’, 2) (‘as’, 1) (‘who’, 1)
… etc …
This is a tiny sample (for learnink purposes). In reality you’d end up with something like this:
(‘the’, 2314) (‘as’, 1265) (‘who’, 576)
… etc …
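The Reduce step can be sketched the same way. Again, this is an illustration rather than how any particular MapReduce framework does it, and `reduce_counts` is a name I made up.

```python
from collections import defaultdict

def reduce_counts(pairs):
    """Reduce step: add up the values of every pair that shares a key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# The output of the Map step for the tiny sample.
mapped = [("the", 1), ("as", 1), ("who", 1), ("the", 1)]
reduced = reduce_counts(mapped)
print(reduced)  # {'the': 2, 'as': 1, 'who': 1}
```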
At first this may seem like a bit of a roundabout way of doing a fairly simple thing. After all, what can be so hard about counting up a bunch of words? It's a computer. But that's exactly why this works the way it does.
It's easy to forget that computers are binary machines. They can be very fast, but a single processor can still only get through so many simple operations per second. When you click on 'count up words' in MS Word, it just does it. But behind the scenes it is presumably doing something much like the example above to get your answer: going through every word and keeping a tally. When you start splitting this work up over many processors (see cluster computing) you can do insane things, like mapping human DNA or counting galaxies in the Universe.
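To give a feel for the splitting-up part, here is a small sketch that cuts the words into chunks, counts each chunk separately, and then merges the partial counts. The threads are only a stand-in for the separate machines of a real cluster, and `count_chunk` and `split` are hypothetical helpers of my own.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_chunk(words):
    """Count one chunk on its own, as a single worker would."""
    return Counter(words)

def split(words, n_chunks):
    """Cut the word list into roughly equal chunks."""
    size = max(1, -(-len(words) // n_chunks))  # ceiling division
    return [words[i:i + size] for i in range(0, len(words), size)]

words = "the as who the as the".split()
chunks = split(words, 3)

# Each chunk is handed to its own worker.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(count_chunk, chunks))

# Final reduce: merge the per-chunk totals into one result.
total = sum(partial_counts, Counter())
print(dict(total))  # {'the': 3, 'as': 2, 'who': 1}
```

The point of the design is that no worker needs to see the whole book. Each one only needs its own chunk, and the merge at the end is just more adding up.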
Wanna keep learning stuff?
Of course you do. Check out my attempt at summarizing the Python data ecosystem: