Where to Start with a New Data Problem

Brain Death

So I get a data file (CSV, text, whatever), and my usual first step is to stare at the file in my Downloads folder for a few minutes.  Then maybe change the file name. Then go make some coffee. Then come back and read the name of the file again. Maybe change it back.

I’ll open up some IDE and make a new Python file.  Save it.  Stare at that. Import some libraries… that name sucks, I should change it.

CNN is on, I should probably see what's happening in the world...

My point is that it’s hard to start.  And the best way to start is just to start.  Here’s a good list of things to put in your py file to at least get a handle on what you’re dealing with and hopefully get some juices flowing.

Import Libraries

You might not need them all, but you can always remove them later.  This tactic is probably bad form, but I don't care, it helps...

# data handling
import pandas as pd
import numpy as np
# plotting
import matplotlib.pyplot as plt
import seaborn as sns
# modelling
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
# sample dataset loader (load_boston was removed in scikit-learn 1.2)
from sklearn.datasets import fetch_california_housing

Import Your Data

Without importing your data you're bound to have a tough time figuring out what you're dealing with.  And for some reason, once I've inaugurated a pandas DataFrame I feel like I'm on the way...

customers = pd.read_csv("Ecommerce Customers.csv")
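
If the file is big or unfamiliar, it can help to peek at a couple of rows before loading the whole thing. Here's a minimal sketch of that idea. The CSV below is a made-up in-memory stand-in (the column names and values are hypothetical, not the real Ecommerce Customers file); nrows is an actual read_csv parameter.

```python
import io
import pandas as pd

# Hypothetical stand-in for a file on disk; swap in your own path.
fake_csv = io.StringIO(
    "email,avg_session_length,yearly_spend\n"
    "a@example.com,34.5,587.95\n"
    "b@example.com,31.9,392.20\n"
    "c@example.com,33.0,487.55\n"
)

# Read only the first two rows to check the header and delimiter look right.
peek = pd.read_csv(fake_csv, nrows=2)
print(peek)

# Rewind the in-memory file and load everything (not needed for a real path).
fake_csv.seek(0)
customers_demo = pd.read_csv(fake_csv)
print(customers_demo.shape)
```

With a real file you would pass the path both times and skip the seek.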

Get a First Look at Your Data

I like to start with the very basics.  Just these six lines will give you a tonne of information about your data and where you should start probing...

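print(customers.head())      # first five rows
customers.info()             # prints column dtypes and non-null counts itself (returns None)
print(customers.describe())  # summary statistics for the numeric columns
print(customers.columns)     # column names
print(customers.shape)       # (rows, columns)
print(customers.dtypes)      # data type of each column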
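
One thing those calls don't put front and centre is missing values. Here's a rough sketch of a helper that rolls dtype, missing count and unique count into one table. The name quick_overview and the demo data are my own inventions for illustration, not part of pandas.

```python
import pandas as pd

def quick_overview(df):
    """Return a per-column summary: dtype, missing count, unique count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),  # NaN is not counted as a value
    })

# Hypothetical demo data with one gap in each column.
demo = pd.DataFrame({
    "city": ["Oslo", "Lima", None, "Oslo"],
    "spend": [10.0, None, 7.5, 3.0],
})
overview = quick_overview(demo)
print(overview)
print("duplicate rows:", demo.duplicated().sum())
```

Columns with a large missing count or a suspiciously low unique count are usually good places to start probing.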

Print Some Nice Plots

Everyone likes a good visualization.  It gives you a quick feeling of accomplishment and a head start toward finding gaps, dead-ends, etc...

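snsData = sns.load_dataset('tips')  # sample dataset from seaborn (downloaded on first use)
print(snsData.head())
sns.pairplot(snsData)  # scatter plots for every pair of numeric columns
plt.show()
sns.histplot(snsData['total_bill'], kde=True)  # distplot is deprecated; histplot replaces it
plt.show()
sns.heatmap(snsData.corr(numeric_only=True), annot=True)  # correlations of numeric columns only
plt.show()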
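
The same heatmap idea carries over to your own data, as long as you drop the text columns first. A small sketch, assuming a made-up DataFrame (the function name plot_numeric_correlations and the demo columns are hypothetical):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line for interactive windows
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def plot_numeric_correlations(df):
    """Draw a heatmap of correlations between the numeric columns of df."""
    numeric = df.select_dtypes(include="number")  # ignore text/categorical columns
    corr = numeric.corr()
    fig, ax = plt.subplots(figsize=(6, 5))
    sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm", ax=ax)
    ax.set_title("Correlation of numeric columns")
    return corr, fig

# Hypothetical demo data: one text column, two numeric ones.
demo = pd.DataFrame({
    "plan": ["free", "pro", "pro", "free", "pro"],
    "hours": [1.0, 2.0, 3.0, 4.0, 5.0],
    "spend": [2.0, 4.0, 6.0, 8.0, 10.0],
})
corr, fig = plot_numeric_correlations(demo)
fig.savefig("correlations.png")  # writes the plot to the current directory
print(corr)
```

Returning the correlation table along with the figure means you can also sort or filter it instead of only eyeballing the colours.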

Although none of these things are the answer to your underlying problem, they are a sure-fire way to get the coffee brewed, the TV turned off and your project underway.
