grocery-website-data_ab-testing
In [1]:
#https://www.kaggle.com/code/songulerdem/a-b-testing-on-grocery-website-data
Project Goal¶
Using the dataset here, we will test whether a change made to a grocery store's website interface increases the number of clicks on the loyalty program page.
- LoggedInFlag => 1 when the user has an account and is logged in
- ServerID => one of the servers the user was routed through
- VisitPageFlag => 1 when the user clicked on the loyalty program page
In [2]:
import pandas as pd
In [3]:
df = pd.read_csv('grocerywebsiteabtestdata.csv')
In [4]:
df
Out[4]:
| | RecordID | IP Address | LoggedInFlag | ServerID | VisitPageFlag |
|---|---|---|---|---|---|
| 0 | 1 | 39.13.114.2 | 1 | 2 | 0 |
| 1 | 2 | 13.3.25.8 | 1 | 1 | 0 |
| 2 | 3 | 247.8.211.8 | 1 | 1 | 0 |
| 3 | 4 | 124.8.220.3 | 0 | 3 | 0 |
| 4 | 5 | 60.10.192.7 | 0 | 2 | 0 |
| ... | ... | ... | ... | ... | ... |
| 184583 | 184584 | 114.8.104.1 | 0 | 1 | 0 |
| 184584 | 184585 | 207.2.110.5 | 0 | 2 | 1 |
| 184585 | 184586 | 170.13.31.9 | 0 | 2 | 0 |
| 184586 | 184587 | 195.14.92.3 | 0 | 3 | 0 |
| 184587 | 184588 | 172.12.115.8 | 0 | 2 | 1 |
184588 rows × 5 columns
In [5]:
# collapse repeated visits: one row per (IP Address, LoggedInFlag, ServerID) with the total number of clicks
df = df.groupby(["IP Address", "LoggedInFlag", "ServerID"])["VisitPageFlag"].sum()
In [6]:
df
Out[6]:
IP Address  LoggedInFlag  ServerID
0.0.108.2   0             1           0
0.0.109.6   1             1           0
0.0.111.8   0             3           0
0.0.160.9   1             2           0
0.0.163.1   0             2           0
                                     ..
99.9.53.7   1             2           0
99.9.65.2   0             2           0
99.9.79.6   1             2           0
99.9.86.3   0             1           1
99.9.86.9   0             1           0
Name: VisitPageFlag, Length: 99763, dtype: int64
In [7]:
df = df.reset_index(name="VisitPageFlagSum")
df.head()
Out[7]:
| | IP Address | LoggedInFlag | ServerID | VisitPageFlagSum |
|---|---|---|---|---|
| 0 | 0.0.108.2 | 0 | 1 | 0 |
| 1 | 0.0.109.6 | 1 | 1 | 0 |
| 2 | 0.0.111.8 | 0 | 3 | 0 |
| 3 | 0.0.160.9 | 1 | 2 | 0 |
| 4 | 0.0.163.1 | 0 | 2 | 0 |
In [8]:
df["VisitPageFlag"] = df["VisitPageFlagSum"].apply(lambda x: 1 if x != 0 else 0)
df.head()
Out[8]:
| | IP Address | LoggedInFlag | ServerID | VisitPageFlagSum | VisitPageFlag |
|---|---|---|---|---|---|
| 0 | 0.0.108.2 | 0 | 1 | 0 | 0 |
| 1 | 0.0.109.6 | 1 | 1 | 0 | 0 |
| 2 | 0.0.111.8 | 0 | 3 | 0 | 0 |
| 3 | 0.0.160.9 | 1 | 2 | 0 | 0 |
| 4 | 0.0.163.1 | 0 | 2 | 0 | 0 |
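As an aside, the same binarization can be written without apply; a minimal vectorized equivalent (a sketch, not from the original notebook):
# vectorized equivalent of the apply/lambda above:
# True where the per-IP click sum is non-zero, cast back to 0/1
df["VisitPageFlag"] = (df["VisitPageFlagSum"] > 0).astype(int)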
In [9]:
#df['VisitPageFlag'].value_counts()
In [10]:
# ServerID 1 serves the new interface (Test); ServerIDs 2 and 3 keep the old one (Control)
df['group'] = df['ServerID'].map({1: 'Test', 2: 'Control', 3: 'Control'})
df.drop(['ServerID', 'VisitPageFlagSum'], axis=1, inplace=True)
In [11]:
df.head()
Out[11]:
| | IP Address | LoggedInFlag | VisitPageFlag | group |
|---|---|---|---|---|
| 0 | 0.0.108.2 | 0 | 0 | Test |
| 1 | 0.0.109.6 | 1 | 0 | Test |
| 2 | 0.0.111.8 | 0 | 0 | Control |
| 3 | 0.0.160.9 | 1 | 0 | Control |
| 4 | 0.0.163.1 | 0 | 0 | Control |
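A quick sanity check of the split (a sketch; the counts quoted in the comment come from the data_info output further down):
# servers 2 and 3 both feed Control, so Control should hold roughly
# two thirds of the visitors
print(df['group'].value_counts())  # Control 66460, Test 33303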
In [12]:
df.to_csv('new.csv')  # save the prepared data for A/B testing (the index is written too, which is why an 'Unnamed: 0' column appears on re-read)
In [13]:
# Data manipulation: first we inspect the data; if there are any problems we will fix them.
# import data_manipulation from AB_experiment
from AB_experiment import data_manipulation
# create an instance to call data_manipulation methods
dm = data_manipulation()
data='new.csv'
column1="group"
column2=["VisitPageFlag"]
quartile1=0.25
quartile3=0.75
info = True
download_df=False
filename='new'
dm.data_info(data,column1,column2,quartile1,quartile3,info,download_df,filename)
Out[13]:
{'1': ['dataframe_shape', {'Observations': 99763, 'Column': 5}],
 '2': ['missing_data_info', {'No missing values'}],
 '3': ['outliers_info', [{'variable_name': 'VisitPageFlag', 'lower_fence': 0.0, 'upper_fence': 0.0,
        'Number_of_obs_less_than_lower_fence': 0, 'Number_of_obs_greater_than_upper_fence': 9978,
        'lower_array': array([], dtype=int64),
        'upper_array': array([  7,  13,  16,  29,  34,  50,  74,  77,  95, 120], dtype=int64)}]],
 '4': ['data_types', [{'object_values': "['IP Address', 'group']"}, {'float_values': '[]'},
        {'int_values': ['Unnamed: 0', 'LoggedInFlag', 'VisitPageFlag']}, {'bool_val': []}]],
 '5': ['numerical_Variables', ['Unnamed: 0', 'LoggedInFlag', 'VisitPageFlag']],
 '6': ['Categorical_variables', ['IP Address', 'group']],
 '7': [{'Unique values count for variable': LoggedInFlag
         1    50250
         0    49513},
       {'Unique values count for variable': VisitPageFlag
         0    89785
         1     9978},
       {'Unique values count for variable': group
         Control    66460
         Test       33303}],
 '8': [['Descriptive statistics-numerical_Variables',
            Unnamed: 0  LoggedInFlag  VisitPageFlag
   count   99763.00000  99763.000000   99763.000000
   mean    49881.00000      0.503694       0.100017
   std     28799.24179      0.499989       0.300024
   min         0.00000      0.000000       0.000000
   25%     24940.50000      0.000000       0.000000
   50%     49881.00000      1.000000       0.000000
   75%     74821.50000      1.000000       0.000000
   max     99762.00000      1.000000       1.000000, '********************'],
       ['Descriptive statistics-Categorical_variables',
             IP Address    group
   count          99763    99763
   unique         99516        2
   top     146.14.105.1  Control
   freq               2    66460, '********************']],
 '9': ['category_stats', [
            VisitPageFlag
                    count  median      mean       std  min  max
   group
   Control          66460     0.0  0.092251  0.289382    0    1
   Test             33303     0.0  0.115515  0.319647    0    1]],
 '10': ['Dataframe',
      Unnamed: 0  IP Address  LoggedInFlag  VisitPageFlag    group
   0           0   0.0.108.2             0              0     Test
   1           1   0.0.109.6             1              0     Test
   2           2   0.0.111.8             0              0  Control
   3           3   0.0.160.9             1              0  Control
   4           4   0.0.163.1             0              0  Control]}
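How data_info derives its fences is internal to the AB_experiment module, but a standard Tukey/IQR rule (an assumption on our part) reproduces the reported numbers for VisitPageFlag:
# sketch: Tukey fences with quartile1=0.25 and quartile3=0.75; for a 0/1
# column both quartiles are 0, so both fences are 0.0 and every click
# lands above the upper fence (the 9978 flagged observations)
import pandas as pd

s = pd.read_csv('new.csv')['VisitPageFlag']
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr    # 0.0
upper_fence = q3 + 1.5 * iqr    # 0.0
print((s > upper_fence).sum())  # 9978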
In [14]:
# The output above shows that the data types of LoggedInFlag and VisitPageFlag are defined
# incorrectly (they are flags, not integers), so we change them using the change_variables function.
data='new.csv'
variables=['LoggedInFlag','VisitPageFlag']
dtype=['bool','bool']
drop_variables=['Unnamed: 0']
download_df=True
filename='new'
dm.change_variables(data,variables,dtype,drop_variables,download_df,filename)
Out[14]:
{'Variable1': ['LoggedInFlag', dtype('bool')]}
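For reference, the equivalent operation in plain pandas (a sketch; change_variables itself belongs to the custom AB_experiment module):
# sketch: drop the stray index column and cast both flags to bool,
# then overwrite new.csv, mirroring what change_variables is asked to do
import pandas as pd

df = pd.read_csv('new.csv')
df = df.drop(columns=['Unnamed: 0'], errors='ignore')  # ignore if already gone
df = df.astype({'LoggedInFlag': 'bool', 'VisitPageFlag': 'bool'})
df.to_csv('new.csv', index=False)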
In [15]:
# After changing the data types we check data_info again
data='new.csv'
column1="group"
column2=["VisitPageFlag"]
quartile1=0.25
quartile3=0.75
info = True
download_df=False
filename='new'
dm.data_info(data,column1,column2,quartile1,quartile3,info,download_df,filename)
Out[15]:
{'1': ['dataframe_shape', {'Observations': 99763, 'Column': 4}],
 '2': ['missing_data_info', {'No missing values'}],
 '3': ['outliers_info', []],
 '4': ['data_types', [{'object_values': "['IP Address', 'group']"}, {'float_values': '[]'},
        {'int_values': []}, {'bool_val': ['LoggedInFlag', 'VisitPageFlag']}]],
 '5': ['numerical_Variables', []],
 '6': ['Categorical_variables', ['IP Address', 'LoggedInFlag', 'VisitPageFlag', 'group']],
 '7': [{'Unique values count for variable': LoggedInFlag
         True     50250
         False    49513},
       {'Unique values count for variable': VisitPageFlag
         False    89785
         True      9978},
       {'Unique values count for variable': group
         Control    66460
         Test       33303}],
 '8': [['Descriptive statistics-numerical_Variables',
             IP Address  LoggedInFlag  VisitPageFlag    group
   count          99763         99763          99763    99763
   unique         99516             2              2        2
   top     146.14.105.1          True          False  Control
   freq               2         50250          89785    66460, '********************'],
       ['Descriptive statistics-Categorical_variables',
             IP Address  LoggedInFlag  VisitPageFlag    group
   count          99763         99763          99763    99763
   unique         99516             2              2        2
   top     146.14.105.1          True          False  Control
   freq               2         50250          89785    66460, '********************']],
 '9': ['category_stats', []],
 '10': ['Dataframe',
      IP Address  LoggedInFlag  VisitPageFlag    group
   0   0.0.108.2         False          False     Test
   1   0.0.109.6          True          False     Test
   2   0.0.111.8         False          False  Control
   3   0.0.160.9          True          False  Control
   4   0.0.163.1         False          False  Control]}
In [16]:
# From the output above we can say that our data has no outliers and no missing values,
# and that the data types of all variables are now correct.
# Now we find the required sample size.
In [17]:
# First we find the baseline conversion rate.
# import stats_test from AB_experiment
from AB_experiment import stats_test
# create an instance to call stats_test methods
st = stats_test()
data='new.csv'
column1="group"
column1_value='Control'
column2='VisitPageFlag'
st.baseline_conversion_rate(data,column1,column1_value,column2)
Out[17]:
{'Baseline conversion rate(p1) of group Control': 0.0923}
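The same figure falls out of a one-line pandas computation (a sketch of what baseline_conversion_rate presumably does internally):
# sketch: share of Control-group visitors who reached the loyalty page
import pandas as pd

df = pd.read_csv('new.csv')
p1 = df.loc[df['group'] == 'Control', 'VisitPageFlag'].mean()
print(round(p1, 4))  # 0.0923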
In [18]:
# Sample size from the baseline conversion rate, with a minimum detectable effect (mde) of 0.007.
p1= 0.0923
mde=0.007
alpha=0.05
power=0.8
n_side=2
st.sample_size(p1,mde,alpha,power, n_side)
Out[18]:
{'Sample size': 27111}
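As a cross-check, statsmodels gives a figure in the same ballpark (a sketch; the custom sample_size function may use a slightly different approximation, so the result need not match 27111 exactly):
# sketch: per-group sample size via Cohen's h and a two-sided normal
# power analysis, detecting a lift from 0.0923 to 0.0993 (mde = 0.007)
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.0923 + 0.007, 0.0923)
n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                 power=0.8, alternative='two-sided')
print(round(n))  # roughly 27,700 per group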
In [19]:
# Now we check the assumptions for all combinations to choose the statistical test for A/B testing.
# import stats_test from AB_experiment
from AB_experiment import stats_test
# create an instance to call stats_test methods
st = stats_test()
data='new.csv'
sample_size=27111
column1="group"
column1_value1='Control'
column1_value2='Test'
column2="VisitPageFlag"
alpha=0.05
paired_data=False
st.AB_Test_assumption(data, sample_size, column1, column1_value1, column1_value2, column2, alpha, paired_data)
Out[19]:
({'Target variable is boolean data type': 'Use Chi-Squared Test'}, {'Note': 'If our data involve time-to-event or survival analysis (e.g., time until a user completes a task), we can use methods such as the log-rank test'})
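The main formal requirement of the chi-squared test is adequate expected cell counts (a common rule of thumb: every expected count at least 5). A quick check on the full data (a sketch, not part of the AB_experiment API):
# sketch: expected counts for the 2x2 group-by-outcome table under H0
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv('new.csv')
table = pd.crosstab(df['group'], df['VisitPageFlag'])
_, _, _, expected = chi2_contingency(table)
print(expected)  # every expected count is in the thousands, far above 5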
Based on the assumption check, we use the Chi-Squared Test for A/B testing¶
Define the null and alternative hypotheses:
- Null hypothesis (H0): there is no significant difference between the proportions of the two groups.
- Alternative hypothesis (Ha): there is a significant difference between the proportions of the two groups.
In [20]:
# import stats_test from AB_experiment
from AB_experiment import stats_test
# create an instance to call stats_test methods
st = stats_test()
# perform the chi-squared test
data='new.csv'
sample_size=27111
column1='group'
column1_value1='Control'
column1_value2='Test'
column2='VisitPageFlag'
alpha=0.05
reverse_experiment=False
st.chi_squared_test(data, sample_size, column1, column1_value1, column1_value2, column2, alpha, reverse_experiment)
Out[20]:
{'Test name': 'Chi-square test',
 'Control group': 'Control',
 'Treatment group': 'Test',
 'Timestamp': '2023-08-29 19:34:11',
 'Sample size': 27111,
 'Status': 'We can reject H0 => group Test is more successful',
 'P-value': 0.0,
 'alpha': 0.05,
 'Test Statistic': 81.13291660783852,
 'Proportion of group Control': 0.0917,
 'Proportion of group Test': 0.1153,
 'Confidence interval of group Control': (0.0883, 0.09517),
 'Confidence interval of group Test': (-0.02873, -0.01848),
 'Confidence interval of difference in groups': (0.11154, 0.11914)}
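The same test can be reproduced with scipy on equal-sized random samples (a sketch; the random_state below is our own assumption, so the statistic will not match the custom function's output exactly):
# sketch: chi-squared test of independence on a 2x2 contingency table
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv('new.csv')
sample = pd.concat([
    df[df['group'] == 'Control'].sample(27111, random_state=42),
    df[df['group'] == 'Test'].sample(27111, random_state=42),
])
table = pd.crosstab(sample['group'], sample['VisitPageFlag'])
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)  # reject H0 when p < alpha (0.05)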
Conclusion:¶
- Looking directly at the click rates of the two groups (Control: 0.0917, Test: 0.1153), we see a difference between them: the new feature shown to the Test group appears to attract more clicks.
- The A/B test confirms that the difference between the two groups' proportions is statistically significant: the rate of clicking on the link was 9.17% in the Control group and rose to 11.53% in the Test group.