Market | Example | Size Range |
---|---|---|
Underwear | bras | 32A to 54G |
Footwear | shoes | 5.5 to 13 |
Apparel | jeans | 28x30 to 44x38 |
Accessories | rings | 5 to 13 |
ss = pd.DataFrame({
'size':["XS","S","M","L","XL"],
'ratio1':[2,10,40,28,20],
'ratio2':[10,15,35,25,15],
})
ax = plt.gca()
ss.plot(kind='line', x='size', y='ratio1',ax=ax)
ss.plot(kind='line', x='size',y='ratio2',color='red',ax=ax)
plt.show()
| | Count | Height (ft) | Weight (lbs) | BMI |
|---|---|---|---|---|
| Male | 9754 | 5'9" | 190 | 27.6 |
| Female | 11423 | 5'4" | 157 | 27.2 |
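As a quick sanity check on the table above, the sketch below computes BMI from the listed heights and weights using the standard imperial formula (703 × weight in pounds ÷ height in inches squared). The function names are illustrative and not part of the original notebook. The results need not match the BMI column exactly: that column is presumably an average of per-respondent BMI values, which is not the same as the BMI of the average height and weight.

```python
def feet_inches_to_inches(feet: int, inches: int) -> int:
    """Convert a height given as feet and inches to total inches."""
    return 12 * feet + inches


def bmi_imperial(height_in: float, weight_lb: float) -> float:
    """Body mass index from height in inches and weight in pounds."""
    return 703.0 * weight_lb / height_in ** 2


# Heights and weights taken from the summary table (5'9" / 190 lbs, 5'4" / 157 lbs)
male_bmi = bmi_imperial(feet_inches_to_inches(5, 9), 190)
female_bmi = bmi_imperial(feet_inches_to_inches(5, 4), 157)

print(f"male: {male_bmi:.1f}, female: {female_bmi:.1f}")
```

The small gap between these values and the tabulated 27.6 and 27.2 is consistent with the table reporting a mean of individual BMIs.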
df_crosstab_age = pd.crosstab(df['SRAGE_P1'], df['OVRWT'] == 1, normalize=True)
df_crosstab_age.plot(kind='bar', stacked=True)
# Overweight status (OVRWT) is self-reported, not computed from BMI.
df_crosstab_sex = pd.crosstab(df['SRSEX'], df['OVRWT'] == 1, normalize=True)
df_crosstab_sex.plot(kind='bar', stacked=True)
| | BODY MASS INDEX | DEFINITION | FREQ | % |
|---|---|---|---|---|
| 1 | 0 - 18.49 | UNDERWEIGHT | 529 | 2.50 |
| 2 | 18.5 - 22.99 | ACCEPTABLE RISK | 4415 | 20.85 |
| 3 | 23.0 - 27.49 | INCREASED RISK | 7700 | 36.36 |
| 4 | 27.5 OR HIGHER | HIGH RISK | 8533 | 40.29 |
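The category boundaries in the table above can be expressed as a small lookup. This is a minimal sketch, assuming the cut-offs 18.5, 23.0 and 27.5 are inclusive lower bounds as the ranges suggest; `bmi_category` and the constant names are hypothetical, not taken from the survey documentation.

```python
from bisect import bisect_right
from typing import Tuple

# Lower bounds of categories 2, 3 and 4, read off the table above
BMI_CUTOFFS = [18.5, 23.0, 27.5]
BMI_LABELS = ["UNDERWEIGHT", "ACCEPTABLE RISK", "INCREASED RISK", "HIGH RISK"]


def bmi_category(bmi: float) -> Tuple[int, str]:
    """Return the (code, definition) pair for a BMI value."""
    # bisect_right counts how many cut-offs are <= bmi, so a value that sits
    # exactly on a cut-off falls into the higher category.
    idx = bisect_right(BMI_CUTOFFS, bmi)
    return idx + 1, BMI_LABELS[idx]


for value in (18.49, 18.5, 25.0, 27.6):
    print(value, bmi_category(value))
```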
#Insert image data for female shirt sizing chart
import matplotlib.image as mpimg
img=mpimg.imread('Womens size chart.png')
imgplot = plt.imshow(img)
plt.show()
#Insert image data for male shirt sizing chart
import matplotlib.image as mpimg
img=mpimg.imread('Sizing-shirts.png')
imgplot = plt.imshow(img)
plt.show()
Health Survey Data conversion
Sizing Chart Data conversion
female = dfs.loc[dfs.Sex == 2]
male = dfs.loc[dfs.Sex == 1]
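# seaborn >= 0.11: pass x/y by keyword; `shade` became `fill`, and
# `shade_lowest=False` corresponds to the default `thresh=0.05`.
ax = sns.kdeplot(x=female.Weight, y=female.Height,
                 cmap="Reds", fill=True, thresh=0.05, cut=0.5)
ax = sns.kdeplot(x=male.Weight, y=male.Height,
                 cmap="Blues", fill=True, thresh=0.05, cut=0.5, cbar=True)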
Size ~ Height + Weight
Steps
# Step 1: Remove any "NA" values from the dataset
dfsf_naout = dfsf.loc[dfsf['Size'] != 'NA']
dfsm_naout = dfsm.loc[dfsm['Size'] != 'NA']
# Step 2: Split the data into train and test sets, stratified by size code
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(dfsf_naout, test_size=0.2, random_state=2020,
                                     stratify=dfsf_naout['SizeCode'])
dm_train, dm_test = train_test_split(dfsm_naout, test_size=0.2, random_state=2020,
                                     stratify=dfsm_naout['SizeCode'])
#Step3: Define predictors and outcome variables
#For women train set
yf = df_train['SizeCode']
Xf = df_train[['Height','Weight']]
#For men train set
ym = dm_train['SizeCode']
Xm = dm_train[['Height','Weight']]
# Step 4: List several forest sizes to try; the number of trees is chosen
# based on where the OOB error levels off:
n_trees = [50, 100, 250, 500, 1000, 1500, 2500]
# For Women
rf_f_dict = dict.fromkeys(n_trees)
# For Men
rf_m_dict = dict.fromkeys(n_trees)
# Step 5: Fit one random forest per forest size
from sklearn.ensemble import RandomForestClassifier
# For women
for num in n_trees:
    print(num)
    rf_f = RandomForestClassifier(n_estimators=num,
                                  oob_score=True,
                                  max_leaf_nodes=8,
                                  random_state=2020,
                                  class_weight='balanced',
                                  verbose=1)
    rf_f.fit(Xf, yf)
    rf_f_dict[num] = rf_f
# For men
for num in n_trees:
    print(num)
    rf_m = RandomForestClassifier(n_estimators=num,
                                  oob_score=True,
                                  max_leaf_nodes=4,
                                  random_state=2020,
                                  class_weight='balanced',
                                  verbose=1)
    rf_m.fit(Xm, ym)
    rf_m_dict[num] = rf_m
# Step 6: Plot the OOB error rate against the number of trees
# For women
# OOB error for each forest size:
oob_error_list_f = [1 - rf_f_dict[n].oob_score_ for n in n_trees]
# Visualize the result:
plt.plot(n_trees, oob_error_list_f, 'bo',
         n_trees, oob_error_list_f, 'k')
# The curve is irregular, but 100 trees appears to give the best performance
forest_f = rf_f_dict[100]
# For men
# OOB error for each forest size:
oob_error_list = [1 - rf_m_dict[n].oob_score_ for n in n_trees]
# Visualize the result:
plt.plot(n_trees, oob_error_list, 'bo',
         n_trees, oob_error_list, 'k')
# Forest size affects performance very differently here.
# The smallest and the largest forests perform about the same,
# so I'd go with the smaller forest as the model of choice.
forest_m = rf_m_dict[50]
forest_m2 = rf_m_dict[2500]
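The choice above (prefer the smaller forest when OOB error is about the same) can be written as an explicit rule. The sketch below is one way to do that, not the notebook's method: `smallest_forest_within` and the tolerance `tol` are hypothetical, and the error values are made-up placeholders rather than results from the fitted forests.

```python
def smallest_forest_within(n_trees, oob_errors, tol=0.005):
    """Return the smallest forest size whose OOB error is within `tol` of the best."""
    best = min(oob_errors)
    # Walk forest sizes from smallest to largest and stop at the first one
    # that is close enough to the lowest observed error.
    for n, err in sorted(zip(n_trees, oob_errors)):
        if err <= best + tol:
            return n


n_trees = [50, 100, 250, 500, 1000, 1500, 2500]
# Hypothetical OOB errors for illustration only (NOT the notebook's values)
oob_errors = [0.120, 0.131, 0.128, 0.126, 0.124, 0.122, 0.119]

chosen = smallest_forest_within(n_trees, oob_errors)
print(chosen)
```

With real data, the list comprehension over `rf_m_dict` or `rf_f_dict` would supply `oob_errors`; the tolerance is a judgment call about how much error is worth trading for a cheaper model.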
#Step7.1: Evaluate Model - set up test data
#For women
yf_test = pd.DataFrame(df_test['SizeCode'])
Xf_test = pd.DataFrame(df_test[['Height','Weight']])
#For men
ym_test = pd.DataFrame(dm_test['SizeCode'])
Xm_test = pd.DataFrame(dm_test[['Height','Weight']])
# Step 7.2: Evaluate model - compare predicted values with test values
# predict() already returns class labels, so round() leaves them unchanged.
# For women
yf_pred_rf = pd.DataFrame(forest_f.predict(Xf_test))
yf_pred_size_rf = round(yf_pred_rf, 0)
# For men
ym_pred_rf = pd.DataFrame(forest_m.predict(Xm_test))
ym_pred_size_rf = round(ym_pred_rf, 0)
F-1 Score
| | Linear Regression | Random Forest |
|---|---|---|
| F-1 Women (micro) | 0.79 | 0.96 |
| F-1 Men (micro) | 0.89 | 0.88 |
| F-1 Women (macro) | 0.41 | 0.96 |
| F-1 Men (macro) | 0.70 | 0.78 |
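The gap between micro and macro F-1 in the table above (for example 0.79 vs 0.41 for women under linear regression) is what one would expect if errors concentrate in rare size classes: micro F-1 pools all predictions, while macro F-1 averages per-class scores with equal weight. The sketch below illustrates the two averages in plain Python on a made-up label set; `f1_scores` is an illustrative helper, and the labels are hypothetical, not the notebook's data.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p gains a false positive
            fn[t] += 1  # true class t gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F-1
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F-1 per class, then take the unweighted mean
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical toy labels: class 1 dominates, classes 0 and 2 are rare
# and are always misclassified as class 1.
y_true = [1] * 8 + [0, 2]
y_pred = [1] * 8 + [1, 1]

micro, macro = f1_scores(y_true, y_pred)
print(f"micro={micro:.2f}, macro={macro:.2f}")
```

Here the classifier never predicts the rare classes, so the pooled score stays high while the per-class average drops sharply.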
Confusion Matrix
conf_mat_m_pct_rf = pd.DataFrame(conf_mat_m_rf/conf_mat_m_rf.sum(axis=0))
round(conf_mat_m_pct_rf,2)
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.96 | 0.05 | 0.0 | 0.00 |
| 1 | 0.04 | 0.88 | 0.0 | 0.00 |
| 2 | 0.00 | 0.07 | 1.0 | 0.73 |
| 3 | 0.00 | 0.00 | 0.0 | 0.27 |
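# Note: this cell divides by row totals (axis=1), whereas the men's matrix uses column totals (axis=0).
# Because sum(axis=1) is one-dimensional, the division divides column j by the total of row j;
# use sum(axis=1, keepdims=True) on a NumPy array if true row normalization is intended.
conf_mat_f_pct_rf = pd.DataFrame(conf_mat_f_rf/conf_mat_f_rf.sum(axis=1))
round(conf_mat_f_pct_rf,2)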
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 1 | 0.0 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 2 | 0.0 | 0.24 | 0.93 | 0.00 | 0.00 | 0.00 | 0.00 |
| 3 | 0.0 | 0.00 | 0.00 | 0.98 | 0.02 | 0.00 | 0.00 |
| 4 | 0.0 | 0.00 | 0.00 | 0.02 | 0.94 | 0.00 | 0.14 |
| 5 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.98 | 0.03 |
| 6 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
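The two confusion matrices above are normalized along different axes, which changes what the diagonal means. The dependency-free sketch below shows both normalizations on a made-up count matrix. It assumes rows are true classes and columns are predicted classes (the scikit-learn convention); the notebook does not show how `conf_mat_m_rf` and `conf_mat_f_rf` were built, so that orientation is an assumption, as are the helper names.

```python
def normalize_rows(cm):
    """Divide each row by its total; with rows = true class, the diagonal is recall."""
    out = []
    for row in cm:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def normalize_columns(cm):
    """Divide each column by its total; with columns = predicted class, the diagonal is precision."""
    col_totals = [sum(col) for col in zip(*cm)]
    return [[v / t if t else 0.0 for v, t in zip(row, col_totals)] for row in cm]


# Hypothetical 3-class counts (rows = true class, columns = predicted class)
cm = [[8, 2, 0],
      [1, 6, 3],
      [0, 0, 5]]

by_row = normalize_rows(cm)
by_col = normalize_columns(cm)

for row in by_row:
    print([round(v, 2) for v in row])
```

A quick check on any normalized matrix is that every row (or every column) sums to 1; if neither does, the division was probably broadcast along the wrong axis.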
| | XXS | XS | S | M | L | XL | XXL |
|---|---|---|---|---|---|---|---|
| Male | 0 | 0 | 29 | 39.2 | 29.6 | 2.3 | 0 |
| Female | 1.3 | 6.6 | 22.5 | 38.9 | 22.1 | 5.1 | 3.4 |
sizeratio = pd.DataFrame({
'size':["XXS","XS","S","M","L","XL","XXL"],
'female':[1.3,6.6,22.5,38.9,22.1,5.1,3.4],
'male':[0,0,29,39.2,29.6,2.3,0]
})
ax = plt.gca()
sizeratio.plot(kind='line', x='size',y='female',color='red',ax=ax)
sizeratio.plot(kind='line', x='size', y='male',ax=ax)
plt.show()
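To turn the predicted size ratios into whole-unit quantities, one option is largest-remainder rounding, which keeps the total exact even though the percentages do not sum to precisely 100. The sketch below applies it to the female ratios from the table; the run size of 1000 units and the `allocate_units` helper are hypothetical and do not appear in the original analysis.

```python
import math


def allocate_units(ratios, total_units):
    """Split total_units across sizes in proportion to ratios (largest-remainder method)."""
    total_ratio = sum(ratios.values())
    exact = {size: total_units * r / total_ratio for size, r in ratios.items()}
    # Start from the rounded-down share of each size
    alloc = {size: math.floor(x) for size, x in exact.items()}
    leftover = total_units - sum(alloc.values())
    # Hand the remaining units to the sizes with the largest fractional parts
    by_remainder = sorted(exact, key=lambda s: exact[s] - alloc[s], reverse=True)
    for size in by_remainder[:leftover]:
        alloc[size] += 1
    return alloc


# Female size ratios (percent) from the table above
female_ratio = {"XXS": 1.3, "XS": 6.6, "S": 22.5, "M": 38.9,
                "L": 22.1, "XL": 5.1, "XXL": 3.4}

# Hypothetical production run of 1000 units
female_units = allocate_units(female_ratio, 1000)
print(female_units)
```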
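# sns.distplot is deprecated in seaborn >= 0.11; histplot with a KDE overlay is the closest replacement.
sns.histplot(dfsm_naout['SizeCode'], kde=True, stat='density')
sns.histplot(dfsf_naout['SizeCode'], kde=True, stat='density')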
| | Men | Women |
|---|---|---|
| F-1 (micro) | 0.88 | 0.96 |
Mean Age | Sample | Population |
---|---|---|
Male | 51 | 35 |
Female | 56 | 37 |
All | 54 | 36 |