Cleaning Data 101 – Imputing NULLs

Even though it seems like a bit of a grind, cleaning your data can be the most creative part of the job.

Fun Times

If you’re doing any sort of machine learning with your data, NULL values in your set are going to drive you mental

So, my pretties, let’s start at the beginning and impute the empty data from a set.  (For those of you not as smart as me, imputing is just a fancy way of saying ‘replace’.)

I’ve been using the Titanic data, which is a fairly popular learning set, you can find it here: Titanic Data

I’ve already imported the csv file:

titanic=pd.read_csv("titanic_train.csv")

The first thing to do is create a nice little heat map to see where the NULLS are:

sns.heatmap(titanic.isnull(),yticklabels=False,cbar=False,cmap='cubehelix')

The 'cmap' value in the above command will determine the color pallet used in your heat map.  Feel free to search for other options if you don't like the white on black.

We see in the above plot that there are several NULL values (in white).  The first one's we'll tackle will be in the Age column.

There are lots of ways to impute these values.  I've decided to find the average age of each of the 3 possible Cabin values, and apply this average to each of the missing values determined by which Cabin the missing passenger was traveling in.  The function is a bit sloppy, as was my description, but here's what I concocted:

def impute_age(cols):
  Age = cols[0]
  Pclass = cols[1]
  ageAv = titanic.groupby('Pclass', as_index=False)['Age'].mean()
  if pd.isnull(Age):
    if Pclass == 1:
      return ageAv.loc[0][1]
    elif Pclass == 2:
      return ageAv.loc[1][1]
    else:
      return ageAv.loc[2][1]
    else:
      return Age
Amendum - There was an error in my the above code in my original post.  It has now been corrected.  Now I must commit ritual Seppuku to clear my family name.

And to apply it to the Age values in your data use this:

titanic['Age'] = titanic[['Age','Pclass']].apply(impute_age,axis=1)

Now your heatmap should look like this:

The next column to tackle is the Cabin value.  Since there are tonnes of NULL values in this, and since we don't really need it anyway, let's just drop the whole thing:

titanic.drop('Cabin',axis=1,inplace=True)

Your plot should now look like this:

That little one remaining guy we'll just scrap too:

titanic.dropna(inplace=True)

And voila:

No more NULL values!  And you've still got a solid set of data to use for more exciting things to come.

Leave a Reply

Your email address will not be published. Required fields are marked *