Machine Learning 102 – Supervised Learning

Learning! Apparently computers can do it now. Kind of.

Well, not really.

[Image: a cute ape]

In fact a four-year-old can still learn far more efficiently, and from a fraction of the input. Even with an orangutan for a teacher.

But let’s not get political, my pretties. Shame on you.

There are currently three main ways to educate a computer: supervised learning, unsupervised learning, and reinforcement learning. The first one, Supervised Learning, is what we’ll be talking about today.

Supervised Learning involves taking a set of data and splitting it up, randomly, into a teaching (training) set and a test set. The teaching set is then fed into an algorithm, which tries to learn enough patterns from it to predict the answers hiding in the test set. If it comes out with a decent divination, then the algorithm is solid.

[Image: very simplified]

This is why it’s called supervised: we’re essentially showing the algorithm the ‘correct’ outcomes before feeding it unknown stuff.
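To demystify the "splitting it up randomly" part before we automate it, here's a toy sketch of that split done by hand with plain NumPy (the 100 rows and the 70/30 ratio are just assumptions for illustration):

import numpy as np

data = np.arange(100)                   # stand-in for 100 rows of real data
rng = np.random.default_rng(seed=42)    # seeded so the shuffle is repeatable
shuffled = rng.permutation(len(data))   # shuffle the row indices
cutoff = int(len(data) * 0.7)           # 70% to teach with, 30% to test on
teaching_set = data[shuffled[:cutoff]]
test_set = data[shuffled[cutoff:]]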

You'd never actually chop your data up by hand like that, though, because the sklearn library makes all of this super easy to do. So let's look at how to set this scenario up, starting with splitting the data (step 1).

[Image: an American house]

Assume that you’ve imported some US housing data and you want to train your algorithm to predict the price of a house based on several other variables. First split the dataset into two other sets: one with the variable you want to predict (y) and the other with everything else (X).


# USAhousing is a pandas DataFrame of housing data loaded earlier (e.g. via pd.read_csv)
X = USAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
                'Avg. Area Number of Bedrooms', 'Area Population']]
y = USAhousing['Price']

Now make sure that you've imported this curious little function:

from sklearn.model_selection import train_test_split

The train_test_split function is specifically designed to chop up your data, randomly, into four other datasets: an X and y for training, and an X and y for testing.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

The test_size parameter in the above line indicates what fraction of your data you want to put in the test set. I usually use around 30% (0.3), but this can be adjusted accordingly.
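One side note: because the split is random, every run of your script gives you different sets. If you want a repeatable split, train_test_split also accepts a random_state parameter; a minimal sketch (the seed value 42 is an arbitrary choice of mine):

from sklearn.model_selection import train_test_split

# Same call as above, but seeded so the 'random' split comes out
# identical on every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)  # sanity check: roughly a 70/30 row split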

Now fit a model to the data:

from sklearn import linear_model

lm = linear_model.LinearRegression()    # plain ordinary-least-squares regression
model = lm.fit(X_train, y_train)        # learn from the teaching set
predictions = lm.predict(X_test)        # predict prices for the unseen test set

You now have this nifty little array called predictions, which holds your algorithm's predicted price for every house in the test set. Comparing it against the real answers in y_test tells you how well the algorithm did.
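For a quick peek, you could line a few of those predictions up against the actual prices (a throwaway sketch of my own):

# Compare the first five predicted prices with the real ones in y_test
for predicted, actual in zip(predictions[:5], y_test[:5]):
    print(f"predicted: {predicted:,.0f}   actual: {actual:,.0f}")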

What you do with this data is for another blog...

But in the meantime I know you're smart, my pretties, so see what you can discover.
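If you want somewhere to start digging, sklearn's metrics module is one obvious shovel. A minimal sketch, with two metrics I'd reach for first (your mileage may vary):

from sklearn import metrics

# Mean absolute error: the average amount (in dollars) the predictions miss by
print(metrics.mean_absolute_error(y_test, predictions))

# R-squared: the fraction of price variance the model explains (1.0 is perfect)
print(metrics.r2_score(y_test, predictions))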

