Some Useful (and Simple) PySpark Functions

I’ve been to Spark and back.  But I did leave some of my soul.

According to Apache, Spark was developed to “write applications quickly in Java, Scala, Python, R, and SQL.”

And I’m sure it’s true.  Or at least I’m sure their intentions were noble.

I’m not talking about Scala yet, or Java; those are whole other languages. I’m talking about Spark with Python. Or PySpark, as the Ogilvy-inspired geniuses at Apache marketing call it.

The learning curve is not easy, my pretties, but luckily for you, I’ve managed to sort out some of the basic ecosystem and how it all operates. Brevity is my goal.

This doesn’t include MLlib, GraphX, or streaming, just the basics.
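
Everything below assumes a DataFrame called train. Here’s a minimal setup sketch, assuming Spark 2.x and a local CSV of the PUBG match data (the file name is hypothetical, point it at your own copy):

from pyspark.sql import SparkSession

# The SparkSession is the single entry point for the DataFrame API
spark = SparkSession.builder.appName('pyspark-basics').getOrCreate()

# header=True keeps the column names, inferSchema=True guesses the types
train = spark.read.csv('train.csv', header=True, inferSchema=True)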

Show pairwise frequency of categorical data

train.crosstab('matchType', 'headshotKills').show()

This outputs something like this:

+-----------------------+----+----+---+---+---+---+---+---+
|matchType_headshotKills|   0|   1|  2|  3|  4|  5|  6|  8|
+-----------------------+----+----+---+---+---+---+---+---+
|                duo-fpp|3762| 608|127| 31|  7|  6|  0|  1|
|               solo-fpp|1955| 331| 77| 28|  6|  2|  2|  0|
|         normal-duo-fpp|  19|   4|  1|  0|  0|  0|  0|  0|
|               crashtpp|   1|   0|  0|  0|  0|  0|  0|  0|
|              squad-fpp|6547|1032|216| 56| 14|  4|  1|  0|
|               crashfpp|  35|   1|  0|  0|  0|  0|  0|  0|
|       normal-squad-fpp|  50|   9|  1|  1|  4|  1|  2|  0|
|        normal-solo-fpp|   4|   1|  3|  0|  1|  0|  0|  0|
|                  squad|2397| 345| 70| 24|  5|  2|  1|  0|
|               flarefpp|   5|   0|  0|  0|  0|  0|  0|  0|
|                   solo| 644|  98| 13|  4|  4|  0|  0|  0|
|             normal-duo|   0|   0|  0|  0|  1|  0|  0|  0|
|                    duo|1198| 159| 55|  9|  3|  2|  0|  0|
|               flaretpp|   6|   1|  1|  0|  0|  0|  0|  0|
|           normal-squad|   1|   1|  0|  0|  0|  0|  0|  0|
+-----------------------+----+----+---+---+---+---+---+---+

Returns a dataframe with all duplicate rows removed

train.select('matchType','headshotKills').dropDuplicates().show()

Drop any NA rows

train.dropna().count()  # dropna() returns a new dataframe; count() shows how many rows survive
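
dropna() also takes how, thresh, and subset arguments if you only care about certain columns; a couple of sketches:

# Only drop rows where matchType is null
train.dropna(subset=['matchType'])

# Keep any row with at least 2 non-null values
train.dropna(thresh=2)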

Fill NAs with a constant value

train.fillna(-1)
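
fillna() also accepts a dict when different columns need different defaults; a sketch using columns from the examples here (the fill values are just illustrative):

train.fillna({'kills': 0, 'matchType': 'unknown'})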

A very simple filter

train2 = train.filter(train.headshotKills > 1)
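
To combine conditions, wrap each one in parentheses and use & or | (PySpark overloads the bitwise operators, and the parentheses are mandatory); a sketch with an assumed second condition:

train3 = train.filter((train.headshotKills > 1) & (train.matchType == 'solo'))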

Get the Mean of a Column by Category

train.groupby('matchType').agg({'kills': 'mean'}).show()
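
agg() takes one function per column, so you can hit several columns in a single pass; a sketch using another column from the crosstab above:

train.groupby('matchType').agg({'kills': 'mean', 'headshotKills': 'max'}).show()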

Get row counts for each distinct category in a Column

train.groupby('matchType').count().show()
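
If you just want the number of distinct categories rather than the per-category counts, a sketch:

train.select('matchType').distinct().count()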

Get a 20% sample of a dataframe

t1 = train.sample(False, 0.2, 42)  # withReplacement=False, fraction=0.2 (roughly 20%), seed=42

Create a tuple set from a Column.  Note that DataFrames do NOT support mapping functionality, so you have to explicitly convert to the underlying RDD first (that’s the .rdd call below).

train.select('matchType').rdd.map(lambda x:(x,1)).take(5)
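
The usual reason to build (key, 1) tuples is to tally them. A word-count-style sketch by matchType; note that each x is a Row, so x[0] pulls out the bare string:

train.select('matchType').rdd \
    .map(lambda x: (x[0], 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .take(5)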

Order by a Column

train.orderBy(train.matchType.desc()).show(5)

Add a new Column based on the calculation of another Column

train.withColumn('boosts_new', train.boosts / 2.0).select('boosts', 'boosts_new').show(50)

Drop a Column

train.drop('boosts').columns  # listing .columns just confirms boosts is gone

Using SQL

train.createOrReplaceTempView('train_table')  # registerAsTable is ancient history; this is the current API
spark.sql('select matchType from train_table').show(5)
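
Once the view is registered, any regular SQL works against it. A sketch of the earlier group-by, rewritten in SQL:

spark.sql('select matchType, avg(kills) as mean_kills from train_table group by matchType').show(5)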

That's it for now!
