Cleaning Data 102 – Pesky Texty

If you’re going to be doing any analysis or machine learning with your data, it’s very important to make sure that your data is readable by … a machine!  Imagine that. This often means getting rid of, or converting (“encoding” in smart speak), any data that isn’t in a numerical format.

Dames. Poison.

Computers love numbers.  I also love numbers. And flowery women with black glasses and a random lust for life.

Also booze.

But that is irrelevant because it turns out that computers still hate words.  And pictures. And babies.

Especially pictures of babies.

Sorry love.  I hate your new baby.

So in order to start, it’s important to try and turn your pesky letters into numbers.

Luckily Python can help.

Quote of the Century

Like a lot of data science stuff, you have to make more stuff first before you can make less stuff later.

That’s a sick quote!

Let’s just try dealing with Sex.  This is a good attribute to start with because it can only consist of 2 values, male or female.  Well… for the sake of this lecture let’s just make that assumption.  And don’t get all political on me my pretties.

We’ll use the Titanic data again.

When we originally take a look at the Sex column of the data, this is what we have:

0 male
1 female
2 female
3 female
4 male
5 male
6 male
7 male
8 female


Essentially what we want to do is convert the ‘male’ and ‘female’ values  into 1’s and 0’s.  Because that’s what computers like.  

They also like taking over the Earth and destroying all life.

But that's also irrelevant.

Pandas is beautiful and thankfully it gives us a very simple way of doing what we need doing here with the pd.get_dummies function.

What this function does is create a new dataset that splits all the possible values of your input data into new columns, one column per value, each containing numerical data:

sex = pd.get_dummies(titanic['Sex'], drop_first=True)

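In the above example, if we had left out drop_first=True, the new sex dataset would look like this (columns are named after the values and sorted alphabetically):

female | male
0      | 1
1      | 0
1      | 0

(Newer versions of pandas, 2.0 onward, show True and False here instead of 1 and 0 unless you also pass dtype=int.)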
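
If you want to see that for yourself, here is a minimal sketch. It assumes a tiny hand-typed stand-in for the Titanic data (just a Sex column with five rows) rather than the real file, and it passes dtype=int so the output is 1s and 0s whatever pandas version you are on.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data: only the Sex column, five rows
titanic = pd.DataFrame({"Sex": ["male", "female", "female", "female", "male"]})

# Without drop_first: one column per value, in alphabetical order
both = pd.get_dummies(titanic["Sex"], dtype=int)
print(both)

# With drop_first=True: the first column ('female') is thrown away
sex = pd.get_dummies(titanic["Sex"], drop_first=True, dtype=int)
print(sex)
```

The first print shows a female and a male column; the second shows only male.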


We can then remove the pre-existing Sex column...
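
The removal step could look something like the sketch below. This is one way to do it, not necessarily the only one: it uses DataFrame.drop on a small made-up frame (the Age values are just placeholders) and assigns the result back instead of modifying in place.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data; values are placeholders
titanic = pd.DataFrame({"Sex": ["male", "female", "female"],
                        "Age": [22, 38, 26]})

# Build the numeric column first, while the text column still exists
sex = pd.get_dummies(titanic["Sex"], drop_first=True, dtype=int)

# Drop the original text column (axis=1 means "columns")
titanic = titanic.drop("Sex", axis=1)
print(titanic.columns.tolist())
```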


... and replace it with the new sex column of zeros and ones.

titanic = pd.concat([titanic,sex],axis=1)

If you didn't notice, drop_first=True also dropped one of the two columns in the new dataset you created (the first one, 'female'), because you don't need both a male and a female column: the two values are mutually exclusive. (You can't be male AND female.  Well... in the Titanic times you couldn't be.)

So what you end up with is a new column in your main dataset called 'male' that looks like this:

0 1
1 0
2 0
3 0
4 1
5 1
6 1
7 1
8 0
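
If you are the paranoid type, a quick sanity check is cheap. The sketch below assumes the nine Sex values shown earlier, typed in by hand, plus a PassengerId column as a stand-in for the rest of the data. It runs the whole routine and then confirms the new male column agrees with the original text.

```python
import pandas as pd

# The nine Sex values from the listing above, typed in by hand
titanic = pd.DataFrame({
    "PassengerId": list(range(1, 10)),  # stand-in for the other columns
    "Sex": ["male", "female", "female", "female", "male",
            "male", "male", "male", "female"],
})

# Keep a copy of the text column so we can compare afterwards
original = titanic["Sex"].copy()

# Encode, drop the text column, then bolt the numeric column on
sex = pd.get_dummies(titanic["Sex"], drop_first=True, dtype=int)
titanic = titanic.drop("Sex", axis=1)
titanic = pd.concat([titanic, sex], axis=1)

# True only if every row of 'male' is 1 exactly where the text said 'male'
matches = (titanic["male"] == (original == "male").astype(int)).all()
print(matches)
```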

Done like dinner....


