This is an important concept in any kind of predictive analysis. All it means is that the variable you are trying to predict needs a reasonable balance between its classes (for a binary target, between the positive and the negative values).

So if you’re attempting to predict, let’s say, cancer, your data needs a fair balance between positive and negative results. If your data has 10 positive results and a million negatives, you will probably not be able to train a useful model, because it sees too few positives to learn what sets them apart.
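To put rough numbers on that example, here is a small sketch in plain Python. The "always negative" baseline is a hypothetical stand-in I made up for illustration, not a real model; the counts are the ones from the paragraph above.

```python
# Counts taken from the example above
positives = 10
negatives = 1_000_000
total = positives + negatives

# Hypothetical baseline that answers "negative" for every single row
correct_predictions = negatives      # it gets every negative right
positives_found = 0                  # and never flags a positive

accuracy = correct_predictions / total
recall = positives_found / positives

print('Accuracy: {:.4f}%'.format(100 * accuracy))
print('Positive cases found: {:.0f}%'.format(100 * recall))
```

The accuracy looks excellent even though the baseline never finds a single positive case, which is why checking the balance first is worth the trouble.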

Luckily, I found this little function that goes through your data and prints the percentage of rows that fall into each class.

def print_dx_perc(data_frame, col):
    # Count how many rows fall into each class of the target column
    dx_vals = data_frame[col].value_counts()
    total = dx_vals.sum()
    # Print each class label with its share of all rows
    for label, count in dx_vals.items():
        print('{0} accounts for {1:.2f}% of the diagnosis class'.format(label, 100 * count / total))

# breast_cancer is assumed to be a pandas DataFrame loaded earlier
print_dx_perc(breast_cancer, 'diagnosis')
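If you want the percentages back as values instead of printed text, the same idea can be sketched with nothing but the standard library. The name `class_balance` and the toy labels below are made up for illustration; since it takes any iterable of labels, a pandas column should work as input too.

```python
from collections import Counter

def class_balance(labels):
    """Return {class_label: percentage_of_rows} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100 * count / total for label, count in counts.items()}

# Made-up toy labels, not real data
toy_labels = ['B', 'B', 'B', 'M']
balance = class_balance(toy_labels)

for label, pct in balance.items():
    print('{0} accounts for {1:.2f}% of the labels'.format(label, pct))
```

Returning a dictionary makes it easy to compare the smallest share against whatever cutoff you consider too lopsided for your problem.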
