ANOVA: Not Just for Stat Geeks! How This One Test Can Help You Make Informed Decisions
ANOVA stands for analysis of variance. It is a statistical method used to check whether the means of two or more groups are significantly different from each other.
Code: kshirsagarsiddharth/Anova-Demo-Medium (github.com)
Let’s say you want to buy watermelons and you have 3 options: farm1, farm2 or farm3. You have to decide whether there is any difference between the weights of melons from these farms. Being a data geek, you have recorded the weight of each melon every time you purchased from a farm.
For example, suppose the mean weight was 1.2 kg for farm1, 1.4 kg for farm2 and 0.9 kg for farm3. You can use ANOVA to determine whether these differences are statistically significant or could simply be due to chance. We can define the hypotheses for this problem.
Null Hypothesis: There is no significant difference in mean weight of the fruit.
Alternative Hypothesis: There is a significant difference in mean weights.
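To see what ANOVA actually computes for the melon example, here is a minimal sketch that works out the F-statistic by hand. The individual weights are made up for illustration (chosen so the group means match the 1.2, 1.4 and 0.9 kg above), and the helper name one_way_f is hypothetical, not part of any library.

```python
from statistics import mean

# Hypothetical melon weights in kg (invented for illustration, not real data).
weights = {
    "farm1": [1.1, 1.3, 1.2, 1.2],
    "farm2": [1.4, 1.5, 1.3, 1.4],
    "farm3": [0.9, 0.8, 1.0, 0.9],
}


def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of observations

    # Variation of the group means around the overall mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each observation around its own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


f_stat, df_between, df_within = one_way_f(list(weights.values()))
print(f"F = {f_stat:.2f} with ({df_between}, {df_within}) degrees of freedom")
```

A large F means the farms differ from each other much more than melons differ within a single farm. Turning F into a p-value needs the F distribution, which is what a statistics library does for you.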
Let’s look at a practical case. I was looking to invest in some mutual funds, and after doing some research I found 3 potential funds, but I had enough disposable cash to invest in only 1 of them. To make the decision, I used ANOVA. My reasoning was: “If the mean return of all 3 funds is the same, I can invest in any of them.”
Let’s define a null and alternative hypothesis.
Null: There is no significant difference in the mean return of the 3 funds.
Alternative: At least one fund's mean return is significantly different from the others.
Let’s look at the data. I did some web scraping and extracted the adjusted Net Asset Value (NAV) for the three funds, as observed in the plot.
I have named the table df_master.
Let's plot the NAV of each fund over time.
import plotly.express as px
fig = px.line(df_master, x="Date", y="NAV", color='AMC', template='none')
fig.show()
Now let's use the statsmodels API to get the p-value for this data.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Left of ~ is the numerical value (NAV); right is the categorical grouping variable (AMC)
model = ols('NAV ~ AMC', data=df_master).fit()
anova = sm.stats.anova_lm(model)
print(anova)
model = ols('NAV ~ AMC', data=df_master).fit(): This creates an OLS model object by fitting a linear regression of NAV on AMC using the data from df_master. The fit method returns a regression results object that contains various information about the model fit.
anova = sm.stats.anova_lm(model): This performs an analysis of variance (ANOVA) on the fitted OLS model using the anova_lm function from statsmodels. It returns a DataFrame that contains the ANOVA table with degrees of freedom, sum of squares, mean squares, F-statistic and p-value for each factor in the model.
Let’s look at the result.
Here the p-value is less than 0.05, so we reject the null hypothesis: at least one of the three funds has a significantly different mean return. Note that ANOVA alone does not tell us which fund differs from which.
So, in conclusion, ANOVA can help us make an informed decision whenever we want to compare the means of 3 or more groups.