I invented the term Python Data Ecosystem.  I also once made love to Barbara Bush. Believe what you will.

 

A Bunch of Stuff.

I've had a problem with the Python data ecosystem, mainly because I've ignored the basics and winged it as work came in.

When I realized that I didn't really understand what a NumPy array was vs a Pandas DataFrame, I knew it was time to go back to basics.

If you look at the image to the left you will notice how everything is stacked.  You have Python at the bottom, and a bunch of other things built on top.  That's an important concept to recognize.

This is not meant to be an exhaustive outline for use in Data Science (I'm not qualified) but simply a cheat sheet for the main bits.  Where applicable I'll include links to more thorough resources.

Native Python Stuff

It's built in, dummy.  Python comes with the following data types out of the box.  It's where you should start.

Lists

my_list = [1, 2, 3, 4]

In my universe this was called an array.  In Python it's called a list.  As far as I can tell they act much the same, except that a list can grow or shrink and can hold items of different types.
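To make that concrete, here's a quick sketch of the list basics.  The names (my_list, first_two, mixed) are just ones I picked for illustration.

```python
# A list holds items in order and can be changed after creation.
my_list = [1, 2, 3, 4]

my_list.append(5)         # grow the list by one item
first_two = my_list[:2]   # slicing returns a new list
last_item = my_list[-1]   # negative indexes count from the end

# A list can also mix types, which a classic typed array cannot.
mixed = [1, "two", 3.0]

print(my_list, first_two, last_item, mixed)
```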

More on Lists...

Dictionary

my_dict = {'key1': 'value1', 'key2': 'value2'}

This is Python's collection of key-value pairs: you look up a value by its key, and each key can only appear once.  Comes in very handy in Data Science.
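Here's a small sketch of the usual dictionary moves.  The data (names and ages) is made up purely for the example.

```python
# Hypothetical data: names mapped to ages.
ages = {"alice": 31, "bob": 27}

ages["carol"] = 45               # add a new key-value pair
bob_age = ages["bob"]            # look up a value by its key
dave_age = ages.get("dave", 0)   # .get() returns a default if the key is missing

# Loop over keys and values together.
for name, age in ages.items():
    print(name, age)
```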

More on Dictionaries...

Tuple

my_tuple = (1, 2, 3)

There are 2 main differences between this and a List.  Firstly it's immutable; once you create it you can't change it.  The other has more to do with the way it's used.  Tuples are usually used when each position has a fixed meaning.  For example maybe the first item is always the year, the second the month, etc.
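A rough sketch of both points, using a made-up (year, month, day) tuple:

```python
# Each position has a fixed meaning: (year, month, day).
date = (2020, 5, 17)

# Unpacking relies on that fixed order.
year, month, day = date

# Trying to change a tuple raises a TypeError.
try:
    date[0] = 2021
except TypeError as err:
    error_message = str(err)

print(year, month, day)
print(error_message)
```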

More on Tuples...

Set

my_set = {1, 2, 3, 4}

This is sort of a cool little feature, but I'm not sure how useful it would be.  Each item in the set is always unique.  So if you tried to add a 2 to the above example it would simply ignore you because there's already a 2 in there.  Add a 5 and it will take it no problem, though a set is unordered, so there is no real "end" to huck it on.
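Here's a minimal sketch of that behavior, plus the one use I keep bumping into: stripping duplicates out of a list.

```python
my_set = {1, 2, 3, 4}

my_set.add(2)   # already present, so nothing changes
my_set.add(5)   # new value, so it gets added

# Converting a list to a set drops the duplicates.
unique_values = set([3, 3, 1, 2, 1])

print(my_set, unique_values)
```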

More on Sets...

Imported libraries.

This is some code that you can import to make your life much easier.  It's been developed and tested for years.  You will definitely need these libraries.

NumPy

NumPy is built directly on top of Python.  It's designed to make many data processing tasks faster and much easier to implement.

import numpy as np
my_list = [1,2,3]
nparray = np.array(my_list)

For example, you can theoretically define a matrix of values with nested Python lists.  But NumPy makes it much easier to do and also provides built-in functions for performing complicated matrix math.
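As a small sketch of the matrix side of things (the 2x2 values are arbitrary), this is roughly what it looks like:

```python
import numpy as np

# A 2x2 matrix built from nested lists.
matrix = np.array([[1, 2], [3, 4]])

transposed = matrix.T            # swap rows and columns
identity = np.eye(2, dtype=int)  # 2x2 identity matrix
product = matrix @ identity      # matrix multiplication
doubled = matrix * 2             # element-wise multiplication

print(transposed)
print(product)
print(doubled)
```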

The other big advantage is that a NumPy array allows you to broadcast a value or an operation to ALL of the elements in the array with one command.

import numpy as np
nparray = np.arange(6)  # array([0, 1, 2, 3, 4, 5])
nparray[:] = 99         # assign 99 to every element at once
print(nparray)          # [99 99 99 99 99 99]
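The same idea works for arithmetic.  Here's a hedged sketch comparing it with what you'd have to do with a plain list (variable names are mine):

```python
import numpy as np

nparray = np.arange(6)

# One expression applies the operation to every element.
plus_ten = nparray + 10

# With a plain list you have to loop (or use a comprehension).
my_list = [0, 1, 2, 3, 4, 5]
list_plus_ten = [x + 10 for x in my_list]

print(plus_ten)
print(list_plus_ten)
```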

More on NumPy...

Pandas

Pandas is another library, and this one is built on NumPy.  Importing pandas pulls in NumPy behind the scenes, but you'll almost always want both imported:

import pandas as pd
import numpy as np

If you're at all familiar with R, Pandas will look very familiar, mainly because that's what it was built to be.

There are two basic data types:

pd.Series

A Series is like a one-dimensional array but with a built-in index, so every value has a label.
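A quick sketch, with made-up temperatures labeled by day:

```python
import pandas as pd

# Hypothetical data: temperatures labeled by day of the week.
temps = pd.Series([21.5, 19.0, 23.1], index=["mon", "tue", "wed"])

tuesday = temps["tue"]          # look up a value by its label
warm_days = temps[temps > 20]   # filter with a boolean condition

print(temps)
print(warm_days)
```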

pd.DataFrame

The DataFrame is really the bread and butter of Pandas.  In truth it's a table-like object that behaves like a collection of Series, one per column, all sharing the same index.
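Here's a minimal sketch that builds one from a dictionary and pulls a single column back out.  The data is invented for the example.

```python
import pandas as pd

# Hypothetical data: each dictionary key becomes a column.
pdf = pd.DataFrame({
    "name": ["alice", "bob", "carol"],
    "age": [31, 27, 45],
})

# Selecting one column gives you back a Series.
age_column = pdf["age"]

print(pdf)
print(type(age_column))
```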

Whenever you start a project with a Pandas DataFrame, the first thing you should do is get your head around the data.  Here are some useful commands (pdf here is the name of your DataFrame):

pdf.head()

Displays the first 5 rows of your data.

pdf.info()

Displays the number of entries and columns, plus each column's data type and non-null count, which gives some clues as to what they contain.

pdf.describe()

Displays summary statistics (count, mean, std, min, max, quartiles) for the numeric columns of your data.

pdf.columns

Displays your column names (I use this constantly).
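To see all four in one place, here's a rough sketch run against a tiny made-up DataFrame (six rows, so you can see head() trimming it to five):

```python
import pandas as pd

# Hypothetical data: six cities and their temperatures.
pdf = pd.DataFrame({
    "city": ["a", "b", "c", "d", "e", "f"],
    "temp_c": [21.5, 19.0, 23.1, 18.4, 25.0, 20.2],
})

first_rows = pdf.head()           # first 5 rows only
pdf.info()                        # prints column types and non-null counts
summary = pdf.describe()          # statistics for the numeric column(s)
column_names = list(pdf.columns)  # column names as a plain list

print(first_rows)
print(summary)
print(column_names)
```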

More on Pandas...

 

Wanna keep learning stuff?

Of course you do.  Let's tackle the R data ecosystem...