Hypothesis Testing: What Every Data Scientist Should Know

Types of Hypothesis testing along with code samples.

Source: https://online.stat.psu.edu/stat100/book/export/html/698

In this article, I am going to talk about the hypothesis tests that every data scientist should know. I will go through the most commonly used hypothesis testing techniques along with Python code. I hope it helps.

Table of Contents:

  1. Hypothesis testing and its details
  2. Level of significance and P-value
  3. One-tailed and Two-tailed test
  4. Permutation testing
  5. Kolmogorov–Smirnov test
  6. T-test
  7. Z-test
  8. ANOVA-test
  9. Chi-square-test
  10. Conclusion
  11. Reference

Hypothesis testing is an act in statistics whereby an analyst tests an assumption regarding a population parameter. The methodology employed by the analyst depends on the nature of the data used and the reason for the analysis. Hypothesis testing is used to assess the plausibility of a hypothesis by using sample data. Such data may come from a larger population or a data-generating process.

For example, Company C1 produces a medicine that reduces fever in 4 hours, whereas company C2 produces a medicine that reduces fever faster than C1. Now we want to determine if the claims raised are true or not. This is the hypothesis we are making. Continue reading if you want to see how testing is done!

Before diving into the definitions, let's set up the example. Let there be 100 patients. We randomly select 50 patients and give them the medicine by C1, and give the medicine by C2 to the other 50 patients. We find that the mean time taken by C1's medicine is 4 hours across its 50 patients, while C2's medicine takes 2 hours less than C1's across the other 50 patients.

Null hypothesis(Ho): A null hypothesis is a type of hypothesis used in statistics that proposes no difference between certain characteristics of a population (or data-generating process). For example, we state that Company C1 and C2 take the same time to reduce fever.

Alternative hypothesis (H1): It is the opposite of the null hypothesis. The alternative hypothesis is usually what you are trying to find evidence for in hypothesis testing. It is a statement that you or another researcher thinks is true, and sufficient evidence for it leads you to reject the null hypothesis in its favor. For example, in our situation, we state that Company C1 and C2 do not take the same time to reduce fever.

Level of significance: the significance level (alpha) is the threshold we use to decide whether to reject the null hypothesis. It is the probability of rejecting the null hypothesis when it is actually true that we are willing to tolerate. As we know, nothing is 100% certain, so we fix a significance level before running the test; 5% is the most common choice, though values such as 1% or 10% are also used. We then compare this threshold against a quantity computed from the data, called the p-value.

P-value: In statistics, the p-value is the probability of obtaining results at least as extreme as the observed results of a statistical hypothesis test, assuming that the null hypothesis is correct. The p-value is used as an alternative to rejection points to provide the smallest level of significance at which the null hypothesis would be rejected. A smaller p-value means that there is stronger evidence in favor of the alternative hypothesis.

Let the test statistic be X = (4 - 2) = 2 hours for the above problem. We want to find the probability that X is greater than or equal to 2 given that the null hypothesis is true. Let the level of significance be 5%, and suppose the p-value, P(X ≥ 2 hr | Ho), is 1%. Since 1% < 5%, we reject the null hypothesis in favor of the alternative hypothesis. I hope you got the basic intuition. It will become clearer as we go through the tests below.


One-Tailed test: A one-tailed test is a statistical test in which the critical area of a distribution is one-sided so that it is either greater than or less than a certain value, but not both. If the tested sample falls into the one-sided critical area, the alternative hypothesis will be accepted instead of the null hypothesis.

Two-Tailed test: In statistics, a two-tailed test is a method in which the critical area of a distribution is two-sided and tests whether a sample is greater than or less than a certain range of values. It is used in null-hypothesis testing and testing for statistical significance. If the sample being tested falls into either of the critical areas, the alternative hypothesis is accepted instead of the null hypothesis.
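To make the one-tailed versus two-tailed distinction concrete, here is a minimal sketch that turns a z statistic into both kinds of p-value. The helper name `tail_p_values` and the value z = 1.8 are my own illustrative choices, not part of the examples in this article.

```python
from scipy.stats import norm

def tail_p_values(z):
    """Return (upper one-tailed, two-tailed) p-values for a z statistic."""
    one_tailed = norm.sf(z)            # P(Z >= z): only the upper tail counts
    two_tailed = 2 * norm.sf(abs(z))   # P(|Z| >= |z|): both tails count
    return one_tailed, two_tailed

one_tailed, two_tailed = tail_p_values(1.8)
print(f"one-tailed p = {one_tailed:.4f}, two-tailed p = {two_tailed:.4f}")
```

With a 5% significance level, this particular statistic would be significant in the one-tailed test but not in the two-tailed test, which is why the choice of tails should be made before looking at the data.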

Permutation and resampling technique: A permutation test (also called a randomization test, re-randomization test, or an exact test) is a type of statistical significance test in which the distribution of the test statistic under the null hypothesis is obtained by calculating all possible values of the test statistic under all possible rearrangements of the observed data points.

Let us think of two classes of 50 students each. We are interested in the average height of each class and in whether the heights differ between the two classes.

The numbers in this example are purely illustrative. Let the mean height of the 1st class be u1.

Let the mean height of the 2nd class be u2.

Hence, the test statistic is the observed difference in means, delta = u2 - u1. The cases below take the observed difference to be 5 cm.

Step 1: Combine all 100 points into a single set.

Step 2: Randomly assign 50 points from the pooled set to set 1 and the remaining 50 to set 2.

Step 3: Repeat this resampling, say 1000 times, and compute delta (the difference in means between set 2 and set 1) each time:

u2 - u1 = 3cm
u2 - u1 = 5cm
u2 - u1 = 6cm
u2 - u1 = 4cm (1000 times)

Sort the 1000 mean differences in ascending order

Step 4: Now H0: there is no difference in heights between the two classes, and H1: there is a difference in heights between the two classes. The significance level is 5%.

Step 5: In practice there is a single outcome, but let us consider two possible cases.

Case 1: Say that, among the 1000 resampled differences, 200 are greater than 5 cm. The probability of a difference greater than 5 cm given the null hypothesis is 200/1000 = 20% > 5%. A difference this large arises often by chance alone, so we fail to reject the null hypothesis.

Case 2: Say that 30 of the 1000 resampled differences are greater than 5 cm. The probability of a difference greater than 5 cm given the null hypothesis is 30/1000 = 3% < 5%. Hence, we reject the null hypothesis.
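The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under my own assumptions: the heights are synthetic (drawn from normal distributions I picked), and the function name `permutation_test` is hypothetical rather than taken from any library.

```python
import numpy as np

def permutation_test(x, y, n_resamples=1000, seed=0):
    """One-sided permutation test for mean(y) - mean(x) > 0."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = y.mean() - x.mean()
    pooled = np.concatenate([x, y])          # Step 1: combine all points
    count = 0
    for _ in range(n_resamples):             # Step 3: resample many times
        shuffled = rng.permutation(pooled)   # Step 2: random split into two sets
        delta = shuffled[len(x):].mean() - shuffled[:len(x)].mean()
        if delta >= observed:
            count += 1
    # Fraction of resampled differences at least as large as the observed one
    return observed, count / n_resamples

# Synthetic heights in cm (assumed values, for illustration only)
rng = np.random.default_rng(42)
class1 = rng.normal(loc=160, scale=8, size=50)
class2 = rng.normal(loc=165, scale=8, size=50)

observed, p_value = permutation_test(class1, class2)
print(f"observed difference = {observed:.2f} cm, p-value = {p_value:.3f}")
```

The returned p-value is then compared with the chosen significance level exactly as in Case 1 and Case 2.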

Now let's see a real-life case study. Assume a survey shows that the average Black Friday spending of males is much higher (by $500) than that of females. A company that is planning its Black Friday sales wants to know if this is true, so it takes samples of different sizes (100, 500, and 1000) from the population and notes their Black Friday spending. The company wants to know if there is really any difference in spending or if it is just due to chance (with a significance level of 15%). Can you help the company come to a conclusion with the help of the data provided for the different samples?

Stating Null Hypothesis and Alternative Hypothesis:

  • Null Hypothesis H_0:The average spending of males and females is the same i.e., mu_m= mu_f
  • Alternative Hypothesis H_a: The average spending of males is greater than that of females, i.e., mu_m > mu_f

Choosing significance level:

  • The problem statement specifies a significance level of 15%, so we take alpha=0.15

Setting up Test Statistic:

a. It is the evidence that we use to decide whether to reject our null hypothesis
b. The most natural choice for a test statistic of the difference in population means is the difference in the sample means, mu_m-mu_f.

import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# The helper functions used below are defined in the next two snippets
df = pd.read_csv('train.csv')
data_female = np.array(df[df['Gender']=='F']['Purchase'].values)
data_male = np.array(df[df['Gender']=='M']['Purchase'].values)
sample_sizes = [100, 500, 1000]
alpha = 0.15
colrs = ['b', 'g', 'm']  # assumed plot colours; not defined in the original snippet
fig, axs = plt.subplots(1, 3, figsize=(15, 5))
for j, i in enumerate(sample_sizes):
    print("For Sample Size: ", 2*i)
    female_sample = data_female[random.sample(range(0, data_female.shape[0]), i)]
    male_sample = data_male[random.sample(range(0, data_male.shape[0]), i)]
    diff_in_mean = diff_in_samples(male_sample, female_sample, "male", "female")
    # Step 1: combine both samples into one large sample to simulate the null hypothesis
    differences = calculate_p_value(male_sample, female_sample, diff_in_mean, alpha)
    plt_cdfplot_withthreshold(j, colrs[j], differences, threshold=diff_in_mean, sample=i)

In this code snippet, we read the dataset and randomly sample the data for males and females according to the given sample sizes.

def calculate_p_value(sample1, sample2, diff, alpha):
    # Step 2: create a list to store the differences between the mean values of the two sets
    difference = []
    # Pool both samples, then resample the pooled data 1000 times
    total_sample = list(sample1)
    total_sample.extend(sample2)
    total_sample = np.array(total_sample)
    for i in range(0, 1000):
        # Picking 100 random indices
        samples = random.sample(range(0, len(total_sample)), 100)
        # First 50 random numbers are taken as set 1
        set1 = total_sample[samples[:50]].mean()
        # Next 50 random numbers are taken as set 2
        set2 = total_sample[samples[50:]].mean()
        # Taking the difference between the two sets
        difference.append(set1 - set2)
    # Step 3: sort the values and count the number of values greater than the threshold
    difference.sort()
    count = sum(((i > diff) and (i > 0)) for i in difference)
    pValue = count / len(difference)
    print("Percentage of values greater than the difference", diff, " =", pValue*100, "%")
    print("The pValue = ", pValue, "and the significance P(Reject H0 when H0 is true)=", alpha)
    if pValue > alpha:
        print("We fail to reject the null hypothesis")
    else:
        print("We can reject the null hypothesis")
    return difference

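This is the function that computes the p-value. It pools the male and female data points, repeatedly draws 100 pooled points at random, splits them into two sets of 50, and records the difference in their means over 1000 iterations. It then counts how many of those differences exceed the observed difference; that fraction is the p-value. Note that the split is fixed at 50 and 50 regardless of the sample size, so for the larger samples the simulated differences are more spread out than a full permutation of the pooled data would give, and the resulting p-values are conservative.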

def plt_cdfplot_withthreshold(j, c, difference, threshold, sample):
    # Cumulative density of the simulated differences, with the observed difference marked
    sns.kdeplot(difference, cumulative=True, color=c, ax=axs[j])
    axs[j].axvline(threshold, linestyle="--", color='r', label=int(threshold))
    axs[j].set_title("CDF of differences for " + str(sample) + " samples")

def diff_in_samples(dist1, dist2, gender1, gender2):
    print("The average spendings "+str(len(dist1))+" "+gender1+" =", dist1.mean())
    print("The average spendings "+str(len(dist2))+" "+gender2+"=", dist2.mean())
    diff_in_mean = dist1.mean() - dist2.mean()
    print("The difference between mean of "+gender1+" spending and "+gender2+" spendings (diff_"+str(len(dist2))+")=", diff_in_mean)
    return diff_in_mean

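The first function plots the cumulative distribution of the simulated differences, with the observed difference marked as a threshold; the second function computes the observed difference in means.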

The Final result of the above problem is

For Sample Size: 200
The average spendings 100 male = 10656.42
The average spendings 100 female= 9134.68
The difference between mean of male spending and female spendings (diff_100)= 1521.7399999999998
count 49 1000
Percentage of values greater than the difference 1521.7399999999998 = 4.9 %
The pValue = 0.049 and the significance P(Reject H0 when H0 is true)= 0.15
We can reject the null hypothesis
__________________________________________________

For Sample Size: 1000
The average spendings 500 male = 9309.158
The average spendings 500 female= 8644.422
The difference between mean of male spending and female spendings (diff_500)= 664.735999999999
count 254 1000
Percentage of values greater than the difference 664.735999999999 = 25.4 %
The pValue = 0.254 and the significance P(Reject H0 when H0 is true)= 0.15
We fail to reject the null hypothesis
__________________________________________________

For Sample Size: 2000
The average spendings 1000 male = 9637.522
The average spendings 1000 female= 8711.2
The difference between mean of male spending and female spendings (diff_1000)= 926.3220000000001
count 186 1000
Percentage of values greater than the difference 926.3220000000001 = 18.6 %
The pValue = 0.186 and the significance P(Reject H0 when H0 is true)= 0.15
We fail to reject the null hypothesis

The cumulative distribution function for different sizes.

The Kolmogorov-Smirnov test (KS-test) tries to determine if two datasets differ significantly. The KS-test has the advantage of making no assumption about the distribution of data.

KS Test can be performed for two types of problems.

  1. There’s the one-sample Kolmogorov-Smirnov test for testing if a variable follows a given distribution in a population. This “given distribution” is usually -not always- the normal distribution, hence the “Kolmogorov-Smirnov normality test.”
  2. There’s also the independent samples Kolmogorov-Smirnov test for testing if a variable has identical distributions in 2 populations.

1. State the Null hypothesis that both the random variables come from the same distribution

2. State the Alternative hypothesis that both the random variables do not come from the same distribution

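3. Choose a significance level alpha

4. Calculate the D value using the following formula

D_{n,m} = sup_x | F_{1,n}(x) - F_{2,m}(x) |

where F_{1,n} and F_{2,m} are the empirical distribution functions of the first and second samples.

5. The null hypothesis is rejected at level alpha if

D_{n,m} > c(alpha) * sqrt((n + m) / (n * m)), with c(alpha) = sqrt(-ln(alpha / 2) / 2)

n, m = number of points in the two samples.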
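As a rough sketch of steps 4 and 5, the D value can be computed by hand as the largest gap between the two empirical CDFs and checked against `scipy.stats.ks_2samp`. The helper names `ks_statistic` and `ks_critical_value` are hypothetical, the data are synthetic, and the threshold uses the common large-sample approximation c(alpha) * sqrt((n + m) / (n * m)) with c(alpha) = sqrt(-ln(alpha / 2) / 2).

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_statistic(a, b):
    """Maximum vertical distance between the empirical CDFs of a and b."""
    a, b = np.sort(a), np.sort(b)
    points = np.concatenate([a, b])
    # Empirical CDF of each sample evaluated at every observed point
    cdf_a = np.searchsorted(a, points, side="right") / len(a)
    cdf_b = np.searchsorted(b, points, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

def ks_critical_value(n, m, alpha=0.05):
    """Large-sample rejection threshold for the two-sample KS test."""
    c_alpha = np.sqrt(-np.log(alpha / 2) / 2)
    return c_alpha * np.sqrt((n + m) / (n * m))

# Synthetic samples (assumed for illustration): same spread, shifted mean
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 200)
b = rng.normal(0.5, 1.0, 300)

d = ks_statistic(a, b)
threshold = ks_critical_value(len(a), len(b), alpha=0.05)
print("D =", d, "threshold =", threshold, "reject H0:", d > threshold)
print("scipy D =", ks_2samp(a, b).statistic)
```

In practice the p-value returned by `ks_2samp` is the more convenient thing to compare against alpha; the manual threshold is shown only to connect the code to the rejection rule.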

The problem we were given to solve:

A company wants to know if black Friday spendings of males follow a normal distribution. A sample of 30 males was asked about their spending, and their answers were recorded. Determine if this sample comes from a normal distribution with a 5% significance level.

Step 1: Stating null hypothesis and Alternative hypothesis.

Null Hypothesis H0:The black Friday spendings of males follow a normal distribution.
Alternative Hypothesis Ha: The black Friday spendings of males do not follow a normal distribution.

Let us see the distribution of data visually:

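# Taking one sample of size 30 from the unknown distribution (the values, not the indices)
samples = data_male[random.sample(range(0, data_male.shape[0]), 30)]
# Standardise so the sample is comparable with the standard normal
samples = (samples - samples.mean()) / samples.std()
# Taking a sample of size 30 from the known (standard normal) distribution
norm_samples = np.random.normal(loc=0.0, scale=1.0, size=30)
sorted_data = np.sort(samples)
norm_sorted_data = np.sort(norm_samples)
# Empirical CDF heights for each sorted point
yvals = np.arange(len(sorted_data)) / float(len(sorted_data))
norm_yvals = np.arange(len(norm_sorted_data)) / float(len(norm_sorted_data))
plt.plot(sorted_data, yvals, label='Sample dist')
plt.plot(norm_sorted_data, norm_yvals, label='Normal disb')
plt.legend()

Figure: distribution of data for a sample and Normal distribution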

Step 2:Calculating the test statistic

  • Here the test statistic we are dealing with is D_N,M
  • D_N,M is the maximum distance between the CDFs of the two distributions

from scipy.stats import ks_2samp
# Compare the standardised sample with the sample drawn from the standard normal
d, p = ks_2samp(samples, norm_samples)
print('The D value when calculated using scipy.stats api is', d)
print('Corresponding P value for the D is ', p)

The D value when calculated using scipy.stats API is 0.13333333333333333. The corresponding P-value for the D is 0.9578462903438838.

We can clearly observe that the P-value 0.95 > 0.05, and hence we fail to reject the null hypothesis H0 that the sample comes from a normal distribution.

A t-test is a type of inferential statistic used to determine if there is a significant difference between the means of two groups, which may be related to certain features. It is mostly used when the data sets, like the data set recorded as the outcome from flipping a coin 100 times, would follow a normal distribution and may have unknown variances. A t-test is used as a hypothesis testing tool, which allows testing of an assumption applicable to a population.

The t-test comes in three types covered here: 1. one-sample t-test 2. two-sample t-test 3. paired-sample t-test.

One-sample t-test: The one-sample t-test determines whether the sample mean is statistically different from a known or hypothesized population mean. The one-sample t-test is a parametric test.

Example: you have a sample of 13 ages, and you are checking whether the average age is 30 or not (see the Python code below).

from scipy.stats import ttest_1samp
import numpy as np
ages = [32,34,29,22,39,38,37,38,36,30,27,22,22]
ages_mean = np.mean(ages)
# H0: the population mean age is 30
tset, pval = ttest_1samp(ages, 30)
print("p-values", pval)
if pval < 0.05: # alpha value is 0.05 or 5%
    print(" we are rejecting null hypothesis")
else:
    print("we are accepting the null hypothesis")

The result is as follows:


p-values 0.50330

we are accepting the null hypothesis.
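To show what the test computes, the t statistic for the same ages can be reproduced by hand and compared with `ttest_1samp`. This is a small sketch; the names `t_stat` and `p_manual` are mine.

```python
import numpy as np
from scipy.stats import t, ttest_1samp

ages = np.array([32, 34, 29, 22, 39, 38, 37, 38, 36, 30, 27, 22, 22])
mu0 = 30          # hypothesized population mean
n = len(ages)

# t = (sample mean - hypothesized mean) / (sample std / sqrt(n)), with ddof=1
t_stat = (ages.mean() - mu0) / (ages.std(ddof=1) / np.sqrt(n))
# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * t.sf(abs(t_stat), df=n - 1)

t_scipy, p_scipy = ttest_1samp(ages, mu0)
print("manual:", t_stat, p_manual)
print("scipy: ", t_scipy, p_scipy)
```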

Two-sampled T-test: The Independent Samples t-Test or 2-sample t-test compares the means of two independent groups in order to determine whether there is statistical evidence that the associated population means are significantly different. The Independent Samples t-Test is a parametric test. This test is also known as Independent t-Test.

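Example: is there a difference between the mean blood pressure before and the mean blood pressure after? (Python code is given below.) Strictly speaking, before and after measurements taken on the same patients are paired, so the paired t-test described further down is the more appropriate choice for this data; the independent test is shown here for illustration.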

from scipy.stats import ttest_ind
import numpy as np
import pandas as pd
bp = pd.read_csv("blood_pressure.csv")
bpbefore = list(bp["bp_before"])
bpafter = list(bp["bp_after"])
week1_mean = np.mean(bpbefore)
week2_mean = np.mean(bpafter)
print("bp_before mean value:",week1_mean)
print("bp_after mean value:",week2_mean)
week1_std = np.std(bpbefore)
week2_std = np.std(bpafter)
print("bp_before std value:",week1_std)
print("bp_after std value:",week2_std)
# H0: the two population means are equal
ttest,pval = ttest_ind(bpbefore,bpafter)
print("p-value",pval)
if pval <0.05:
    print("we reject null hypothesis")
else:
    print("we accept null hypothesis")

bp_before mean value: 156.45
bp_after mean value: 151.35833333333332
bp_before std value: 11.342288128944706
bp_after std value: 14.118425215141935
p-value 0.002412277478078891
we reject null hypothesis

Paired sampled t-test:- The paired sample t-test is also called the dependent sample t-test. It’s a univariate test that tests for a significant difference between 2 related variables. An example of this is if you were to collect the blood pressure for an individual before and after some treatment, condition, or time point.

H0: the mean difference between the two samples is 0

H1: the mean difference between the two samples is not 0

import pandas as pd
from scipy import stats
df = pd.read_csv("blood_pressure.csv")
# Paired test: each row holds the before and after reading of the same individual
ttest,pval = stats.ttest_rel(df['bp_before'], df['bp_after'])
print(pval)
if pval<0.05:
    print("reject the null hypothesis")
else:
    print("accept null hypothesis")

0.0011297914644840823 reject the null hypothesis
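A useful way to see why this test is called "paired": it is equivalent to a one-sample t-test on the per-individual differences. The sketch below checks this on synthetic before/after readings that I generated for illustration, since the blood pressure file itself is not reproduced here.

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

# Synthetic paired readings (assumed values, not the article's dataset)
rng = np.random.default_rng(1)
before = rng.normal(156, 11, size=120)
after = before - rng.normal(5, 10, size=120)

# Paired t-test on the two columns
t_rel, p_rel = ttest_rel(before, after)
# One-sample t-test on the differences against a mean of 0
t_one, p_one = ttest_1samp(before - after, 0.0)

print("paired:     ", t_rel, p_rel)
print("differences:", t_one, p_one)
```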

A z-test is a statistical test used to determine whether two population means are different when the variances are known, and the sample size is large. The test statistic is assumed to have a normal distribution, and nuisance parameters such as standard deviation should be known in order for an accurate z-test to be performed.

We would use Z-test only if:

  • Your sample size is greater than 30. Otherwise, use a t-test.
  • Data points should be independent of each other. In other words, one data point isn’t related or doesn’t affect another data point.
  • Your data should be normally distributed. However, for large sample sizes (over 30) this doesn’t always matter.
  • Your data should be randomly selected from a population, where each item has an equal chance of being selected.
  • Sample sizes should be equal if at all possible.

One-sample Z test:

For example, we again use a z-test on the blood pressure data, testing against a hypothesized mean of 156 (Python code for this is below).

import pandas as pd
from scipy import stats
from statsmodels.stats import weightstats as stests
df = pd.read_csv("blood_pressure.csv")
# H0: the population mean of bp_before is 156
ztest ,pval = stests.ztest(df['bp_before'], x2=None, value=156)
print(float(pval))
if pval<0.05:
    print("reject the null hypothesis")
else:
    print("accept the null hypothesis")

0.6651614730255063 accept the null hypothesis
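For intuition, a one-sample z statistic can also be computed by hand with NumPy and SciPy. This is a sketch under my own assumptions: the function name `one_sample_z` is hypothetical, the data are synthetic, and the standard deviation is estimated from the sample (with `ddof=1`) rather than known in advance.

```python
import numpy as np
from scipy.stats import norm

def one_sample_z(x, mu0):
    """Two-sided one-sample z-test using the sample standard deviation."""
    x = np.asarray(x, dtype=float)
    standard_error = x.std(ddof=1) / np.sqrt(len(x))
    z = (x.mean() - mu0) / standard_error
    p = 2 * norm.sf(abs(z))   # area in both tails of the standard normal
    return z, p

# Synthetic readings (assumed values, not the article's dataset)
rng = np.random.default_rng(7)
readings = rng.normal(156, 11, size=120)

z, p = one_sample_z(readings, mu0=156)
print("z =", z, "p-value =", p)
```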

Two-sample Z test: In a two-sample z-test, similar to a t-test, we compare two independent data groups and decide whether the means of the two groups are equal or not.

H0: the difference between the means of the two groups is 0

H1: the difference between the means of the two groups is not 0

For example, we compare the blood pressure before and the blood pressure after in the blood pressure data (Python code below).

ztest ,pval1 = stests.ztest(df['bp_before'], x2=df['bp_after'], value=0,alternative='two-sided')
print(float(pval1))
# Compare pval1 (not the one-sample pval from earlier) against alpha
if pval1<0.05:
    print("reject the null hypothesis")
else:
    print("accept the null hypothesis")

0.002162306611369422 reject the null hypothesis

ANOVA (F-test): The t-test works well when dealing with two groups, but sometimes we want to compare more than two groups at the same time. For example, if we wanted to test whether voter age differs based on some categorical variable like race, we would have to compare the means of each level or group of the variable. We could carry out a separate t-test for each pair of groups, but when you conduct many tests you increase the chances of false positives. The analysis of variance, or ANOVA, is a statistical inference test that lets you compare multiple groups at the same time.

F = Between-group variability / Within-group variability

Source: https://towardsdatascience.com/hypothesis-testing-in-machine-learning-using-python-a0dc89e169ce

Unlike the z- and t-distributions, the F-distribution does not have any negative values, because the between-group and within-group variability are never negative due to squaring each deviation.
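The ratio of between-group to within-group variability can be computed directly. The following is a minimal sketch with a hypothetical helper `one_way_f` and synthetic groups of my own choosing, checked against `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy.stats import f_oneway

def one_way_f(*groups):
    """F = between-group mean square / within-group mean square."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k, n = len(groups), len(all_values)
    # Squared distance of each group mean from the grand mean, weighted by group size
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Squared distance of each observation from its own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Synthetic groups (assumed values, for illustration only)
rng = np.random.default_rng(3)
g1 = rng.normal(5.0, 0.6, 10)
g2 = rng.normal(4.7, 0.6, 10)
g3 = rng.normal(5.5, 0.6, 10)

f_manual = one_way_f(g1, g2, g3)
f_scipy, p_scipy = f_oneway(g1, g2, g3)
print("manual F =", f_manual, "scipy F =", f_scipy, "p-value =", p_scipy)
```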

One-way F-test (ANOVA): It tells whether two or more groups are similar or not, based on the similarity of their means and the F-score.

Example: there are 3 different categories of plants with their weights, and we need to check whether all 3 groups are similar or not (Python code below).

import pandas as pd
from scipy import stats
df_anova = pd.read_csv('PlantGrowth.csv')
df_anova = df_anova[['weight','group']]
grps = pd.unique(df_anova.group.values)
# Map each group label to the weights observed in that group
d_data = {grp:df_anova['weight'][df_anova.group == grp] for grp in grps}
F, p = stats.f_oneway(d_data['ctrl'], d_data['trt1'], d_data['trt2'])
print("the p-value for significance is: ", p)
if p<0.05:
    print("reject the null hypothesis")
else:
    print("accept the null hypothesis")

the p-value for significance is: 0.0159099583256229 reject the null hypothesis

Two-way F-test: The two-way F-test is an extension of the one-way F-test. It is used when we have 2 independent variables and 2 or more groups. The two-way F-test does not tell which variable is dominant; if we need to check individual significance, post-hoc testing needs to be performed.

Now let's take a look at the grand mean crop yield (the mean crop yield not broken down by any sub-group), as well as the mean crop yield by each factor and by the factors grouped together.

result of two way F-test

Chi-Square Test- The test is applied when you have two categorical variables from a single population. It is used to determine whether there is a significant association between the two variables.

For example, in an election survey, voters might be classified by gender (male or female) and voting preference (Democrat, Republican, or Independent). We could use a chi-square test for independence to determine whether gender is related to voting preference

check the example in python below

df_chi = pd.read_csv('chi-test.csv')
contingency_table=pd.crosstab(df_chi["Gender"],df_chi["Like Shopping?"])
print('contingency_table :-\n',contingency_table)

#Observed Values
Observed_Values = contingency_table.values
print("Observed Values :-\n",Observed_Values)

Observed Values :- [[2 3] [2 2]]

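from scipy import stats
b = stats.chi2_contingency(contingency_table)  # (statistic, p-value, dof, expected counts)
Expected_Values = b[3]
print("Expected Values :-\n",Expected_Values)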

Expected Values :- [[2.22222222 2.77777778] [1.77777778 2.22222222]]

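You might be wondering where these values come from. Under independence, each expected count is the row total times the column total, divided by the grand total.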
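The expected counts can be reproduced directly from the observed table shown above. A minimal NumPy sketch (the variable names are mine): multiply each row total by each column total and divide by the grand total, then sum (observed - expected)^2 / expected over all cells to get the chi-square statistic.

```python
import numpy as np

observed = np.array([[2, 3],
                     [2, 2]])

row_totals = observed.sum(axis=1)   # totals per row
col_totals = observed.sum(axis=0)   # totals per column
grand_total = observed.sum()

# Expected count for each cell under independence
expected = np.outer(row_totals, col_totals) / grand_total
# Chi-square statistic without any continuity correction
chi_square = ((observed - expected) ** 2 / expected).sum()

print("Expected values:\n", expected)
print("chi-square statistic:", chi_square)
```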

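# Degrees of freedom = (rows - 1) * (columns - 1)
df11 = (Observed_Values.shape[0] - 1) * (Observed_Values.shape[1] - 1)
print("Degree of Freedom:-",df11)
alpha = 0.05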

Degree of Freedom:- 1

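from scipy.stats import chi2
# Sum (observed - expected)^2 / expected over every cell of the table
chi_square=sum([(o-e)**2./e for o,e in zip(Observed_Values,Expected_Values)])
chi_square_statistic=chi_square.sum()
print("chi-square statistic:-",chi_square_statistic)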

chi-square statistic:- 0.09000000000000008


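# Critical value and p-value from the chi-square distribution
critical_value=chi2.ppf(q=1-alpha,df=df11)
p_value=1-chi2.cdf(x=chi_square_statistic,df=df11)
print('Significance level: ',alpha)
print('Degree of Freedom: ',df11)
print('chi-square statistic:',chi_square_statistic)
print('critical_value:',critical_value)
print('p-value:',p_value)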

Significance level: 0.05 Degree of Freedom: 1 chi-square statistic: 0.09000000000000008 critical_value: 3.841458820694124 p-value: 0.7641771556220945

if chi_square_statistic>=critical_value:
    print("Reject H0,There is a relationship between 2 categorical variables")
else:
    print("Retain H0,There is no relationship between 2 categorical variables")
if p_value<=alpha:
    print("Reject H0,There is a relationship between 2 categorical variables")
else:
    print("Retain H0,There is no relationship between 2 categorical variables")

Retain H0, There is no relationship between 2 categorical variables Retain H0, There is no relationship between 2 categorical variables.

Overall, we saw a lot of techniques that can be used to test hypotheses. For example, permutation techniques can be used when we do not know how to compute the distribution of a test statistic. KS tests are employed to check whether distributions are similar or not. We could also use a Q-Q plot for that. However, each test has its own specific criteria. Thank you for your patience in reading this long article.

Please feel free to comment if there are any suggestions or queries. You can also reach me on LinkedIn. Here is a link to my GitHub. Happy reading.


