This is an important concept when performing any kind of predictive analysis. All it means is that the variable you are attempting to predict needs a decent balance between its classes (for a binary target, between the two values).

So if you’re attempting to predict, let’s say, cancer, your data must have a fair balance between positive and negative results. If your data has 10 positive results and a million negatives, you will probably not be able to train a useful model.
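To see why a split like that is a problem, here is a rough sketch using the numbers above. The always-negative "model" is a hypothetical baseline I made up for illustration, not anything from a real library.

```python
# Hypothetical baseline: a "model" that predicts negative for every row
positives = 10
negatives = 1_000_000
total = positives + negatives

# Every negative is classified correctly, every positive is missed
accuracy = negatives / total
recall = 0 / positives  # positives found / actual positives

print('accuracy: {:.4%}'.format(accuracy))
print('recall:   {:.0%}'.format(recall))
```

The accuracy comes out above 99.99% even though not a single cancer case is caught, which is why a headline accuracy number tells you very little on badly imbalanced data.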

Luckily, I found this little function that goes through your data and prints the percentage of rows that fall into each class.

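def print_dx_perc(data_frame, col):
    """Print the percentage of rows that fall into each class of `col`."""
    # Count how many rows carry each distinct value in the column
    dx_vals = data_frame[col].value_counts()
    total = dx_vals.sum()
    # Iterate over (class label, row count) pairs
    for label, count in dx_vals.items():
        print('{0} accounts for {1:.2f}% of the {2} class'.format(label, 100 * count / total, col))

# Assumes breast_cancer is a pandas DataFrame that has already been loaded
print_dx_perc(breast_cancer, 'diagnosis')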
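pandas can also work out these proportions for you: `value_counts` accepts `normalize=True`, which returns fractions instead of raw counts. Here is a minimal sketch on a toy DataFrame; the `toy` frame and its values are made up for illustration and are not the real breast cancer data.

```python
import pandas as pd

# Toy stand-in for the real data; these values are invented for illustration
toy = pd.DataFrame({'diagnosis': ['B', 'B', 'B', 'M']})

# normalize=True gives each class as a fraction of all rows; scale to percent
balance = toy['diagnosis'].value_counts(normalize=True) * 100

for label, pct in balance.items():
    print('{0} accounts for {1:.2f}% of the diagnosis class'.format(label, pct))
```

With three 'B' rows and one 'M' row, this reports a 75/25 split.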