Achieve Size-Right Inventory¶

STATS 404¶

Xing Sun¶

3/4/2020¶

>>>>>>> Calibrate Your Inventory with Current Market Size Ratio <<<<<<<¶

Challenges in Size Decisions¶

Market	Example	Size Range
Underwear	bras	32A to 54G
Footwear	shoes	5.5 to 13
Apparel	jeans	28x30 to 44x38
Accessories	rings	5 to 13

Fit as a market segmentation tool
- plus vs. petite
- core vs. fringe

Population size ratio changes
- lifestyle: diet/workout
- demographics: race/age

In [4]:

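import pandas as pd
import matplotlib.pyplot as plt

# Two illustrative size-ratio curves (percent of units per size)
ss = pd.DataFrame({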
    'size':["XS","S","M","L","XL"],
    'ratio1':[2,10,40,28,20],
    'ratio2':[10,15,35,25,15],
})
ax = plt.gca()
ss.plot(kind='line', x='size', y='ratio1',ax=ax)
ss.plot(kind='line', x='size',y='ratio2',color='red',ax=ax)

plt.show()

Costs in Inventory Size Decisions¶

350B estimated apparel market size
70B affected at a 20% size inaccuracy rate
7B can be saved if 10% of businesses calibrate their sizes
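
The three figures above follow from one another by simple multiplication. A minimal sketch of that arithmetic, using only the numbers on this slide (the variable names are illustrative, and values are in billions):

```python
# Figures taken from the slide above, in billions.
market_size_b = 350       # estimated apparel market size
inaccuracy_rate = 0.20    # share of the market affected by size inaccuracy
adoption_rate = 0.10      # share of businesses that calibrate their sizes

# Portion of the market affected by inaccurate sizing.
cost_of_inaccuracy_b = market_size_b * inaccuracy_rate

# Portion recovered if the adopting businesses remove their inaccuracy.
potential_savings_b = cost_of_inaccuracy_b * adoption_rate

print(f"Affected by size inaccuracy: {cost_of_inaccuracy_b:.0f}B")
print(f"Potential savings: {potential_savings_b:.0f}B")
```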

How to Size Up Inventory¶

Health Survey Data
- Height
- Weight

Apparel Industry Sizing Chart
- For Men
- For Women

Statistical Computing
- Size Inference from Height/Weight
- Size Ratio Inference from Survey Sample

>>>>>>> Pilot Project: T-Shirt Size <<<<<<<¶

Health Survey Data¶

Mean Values from CA 2018 Health Survey Data

	Count	Height (ft/in)	Weight (lbs)	BMI
Male	9754	5'9"	190	27.6
Female	11423	5'4"	157	27.2

California Health Interview Survey
Landline and cellphone interviews
21,177 observations
2001-2018

In [66]:

df_crosstab_age = pd.crosstab(df['SRAGE_P1'], df['OVRWT'] == 1, normalize=True)
df_crosstab_age.plot(kind='bar', stacked=True)
# The overweight situation is self-reported, not computed by BMI.

In [68]:

df_crosstab_sex = pd.crosstab(df['SRSEX'], df['OVRWT'] == 1, normalize=True)
df_crosstab_sex.plot(kind='bar', stacked=True)

BMI Reference Chart¶

	BODY MASS INDEX	DEFINITION	FREQ	%
1	0 - 18.49	UNDERWEIGHT	529	2.50
2	18.5 - 22.99	ACCEPTABLE RISK	4415	20.85
3	23.0 - 27.49	INCREASED RISK	7700	36.36
4	27.5 OR HIGHER	HIGH RISK	8533	40.29
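
The bins in this chart can be applied to computed BMI values with `pd.cut`. The sketch below is illustrative only: the four respondents are made up, and the `bmi` helper and column names are assumptions rather than the notebook's own code.

```python
import pandas as pd

def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in meters squared."""
    return weight_kg / height_m ** 2

# Bin edges follow the reference chart above; right=False makes each bin
# include its lower edge, so a BMI of exactly 18.5 is "ACCEPTABLE RISK".
bins = [0, 18.5, 23.0, 27.5, float("inf")]
labels = ["UNDERWEIGHT", "ACCEPTABLE RISK", "INCREASED RISK", "HIGH RISK"]

# Made-up respondents, for illustration only.
people = pd.DataFrame({
    "height_m": [1.60, 1.75, 1.80, 1.65],
    "weight_kg": [45.0, 65.0, 85.0, 90.0],
})
people["bmi"] = bmi(people["weight_kg"], people["height_m"])
people["category"] = pd.cut(people["bmi"], bins=bins, labels=labels, right=False)
print(people)
```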

Apparel Industry Sizing Chart¶

for men: https://tfs.ucoz.com/index/0-131
for women: https://support.stitchfix.com/hc/en-us/articles/204732770-How-to-find-your-size

In [5]:

#Insert image data for female shirt sizing chart
import matplotlib.image as mpimg
img=mpimg.imread('Womens size chart.png')
imgplot = plt.imshow(img)
plt.show()

In [6]:

#Insert image data for male shirt sizing chart
import matplotlib.image as mpimg
img=mpimg.imread('Sizing-shirts.png')
imgplot = plt.imshow(img)
plt.show()

Data Processing¶

Health Survey Data conversion
- 495 variables reduced to 2
- Converted to meters and kilograms
Sizing Chart Data conversion
- Image to function
- Map size to survey data
- Alphabetic and numeric sizing
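
The two conversions above can be sketched as follows. This is a rough illustration, not the notebook's actual code: the `SIZE_CHART` ranges are hypothetical placeholders (the real ranges come from the sizing-chart images), and `to_metric` and `lookup_size` are names invented here. Off-chart combinations are labelled `'NA'`, matching the label the modeling cells filter on.

```python
INCH_TO_M = 0.0254
LB_TO_KG = 0.45359237

def to_metric(feet, inches, pounds):
    """Convert a height in feet/inches and a weight in pounds to meters and kilograms."""
    height_m = (feet * 12 + inches) * INCH_TO_M
    weight_kg = pounds * LB_TO_KG
    return height_m, weight_kg

# HYPOTHETICAL chart rows: (size, min height m, max height m, min weight kg, max weight kg).
SIZE_CHART = [
    ("S", 1.60, 1.70, 50, 65),
    ("M", 1.70, 1.80, 65, 80),
    ("L", 1.80, 1.90, 80, 95),
]

def lookup_size(height_m, weight_kg, chart=SIZE_CHART):
    """Return the first size whose height and weight ranges both contain the person."""
    for size, h_lo, h_hi, w_lo, w_hi in chart:
        if h_lo <= height_m < h_hi and w_lo <= weight_kg < w_hi:
            return size
    return "NA"  # off-chart: no row of the chart matches

height_m, weight_kg = to_metric(5, 9, 160)
print(lookup_size(height_m, weight_kg))
```

A lookup like this leaves gaps wherever height and weight fall in different rows, which is one way "off-chart" observations can arise.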

In [36]:

female = dfs.loc[dfs.Sex == 2]
male = dfs.loc[dfs.Sex == 1]
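# Bivariate density of weight vs. height by sex
# (seaborn >= 0.11: x=/y= keywords, and fill/thresh in place of shade/shade_lowest)
ax = sns.kdeplot(x=female.Weight, y=female.Height,
                 cmap="Reds", fill=True, thresh=0.05, cut=0.5)
ax = sns.kdeplot(x=male.Weight, y=male.Height,
                 cmap="Blues", fill=True, thresh=0.05, cut=0.5, cbar=True)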

Data Limitations¶

10%+ "Off-chart" sizes
- Female: 10.2% (1170 out of 11423)
- Male: 13.6% (1318 out of 9654)

Size chart imbalance between genders
- Female: XXS to XXL
- Male: S to XL

Model: Random Forest¶

Size ~ Height + Weight
Steps
- Remove any off-chart sizing
- Set up Training and Test data
- Define predictors and outcome variables
- Build Random Forest and select the optimal number of trees
- Evaluate results

In [37]:

#Step1: Remove off-chart rows (sizes labelled 'NA') from the dataset
dfsf_naout = dfsf.loc[dfsf['Size']!='NA']
dfsm_naout = dfsm.loc[dfsm['Size']!='NA']

In [38]:

#Step2: Split the data into train and test sets, stratified by size code
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(dfsf_naout, test_size=0.2, random_state=2020,stratify=dfsf_naout['SizeCode'])
dm_train, dm_test = train_test_split(dfsm_naout, test_size=0.2, random_state=2020,stratify=dfsm_naout['SizeCode'])

In [39]:

#Step3: Define predictors and outcome variables

#For women train set
yf = df_train['SizeCode']
Xf = df_train[['Height','Weight']]

#For men train set
ym = dm_train['SizeCode']
Xm = dm_train[['Height','Weight']]

In [40]:

#Step4: Specify a range of forest sizes (number of trees), to determine
#how many trees to use based on where the OOB error levels off:
n_trees = [50, 100, 250, 500, 1000, 1500, 2500]

# For Women
rf_f_dict = dict.fromkeys(n_trees)

# For Men
rf_m_dict = dict.fromkeys(n_trees)

In [41]:

#Step5: Fit a Random Forest for each specified number of trees
from sklearn.ensemble import RandomForestClassifier

#For Women
for num in n_trees:
    print(num)
    rf_f = RandomForestClassifier(n_estimators=num,
                                oob_score=True,
                                max_leaf_nodes=8,
                                random_state=2020,
                                class_weight='balanced',
                                verbose=1)
    rf_f.fit(Xf, yf)
    rf_f_dict[num] = rf_f
    
#For Men
for num in n_trees:
    print(num)
    rf_m = RandomForestClassifier(n_estimators=num,
                                oob_score=True,
                                max_leaf_nodes=4,
                                random_state=2020,
                                class_weight='balanced',
                                verbose=1)
    rf_m.fit(Xm, ym)
    rf_m_dict[num] = rf_m

In [42]:

#Step6: Plot the OOB error rate against the number of trees

# For Women
# OOB error for each forest size:
oob_error_list_f = [1 - rf_f_dict[n].oob_score_ for n in n_trees]

# Visualize the result:
plt.plot(n_trees, oob_error_list_f, 'bo',
         n_trees, oob_error_list_f, 'k')

# The curve is irregular, but 100 trees appears to perform best.

forest_f = rf_f_dict[100]

In [43]:

#For Men

# OOB error for each forest size:
oob_error_list = [1 - rf_m_dict[n].oob_score_ for n in n_trees]

# Visualize the result:
plt.plot(n_trees, oob_error_list, 'bo',
         n_trees, oob_error_list, 'k')

# Forest size affects predictive performance very differently here.
# The smallest and the largest forests perform about the same,
# so the smaller forest is the model of choice.

forest_m = rf_m_dict[50]
forest_m2 = rf_m_dict[2500]
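
Refitting a separate forest for every size repeats a lot of work. One alternative, sketched here under stated assumptions, is scikit-learn's `warm_start=True`, which keeps the trees already grown and only adds new ones on each `fit`. The data is synthetic (`make_classification` stands in for Height/Weight), and `tree_counts`, `oob_errors`, and `best_n` are names invented for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-feature, three-class data standing in for Height/Weight -> SizeCode.
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=2020)

tree_counts = [50, 100, 150, 200]
rf = RandomForestClassifier(warm_start=True, oob_score=True,
                            max_leaf_nodes=8, random_state=2020)

oob_errors = {}
for n in tree_counts:
    rf.set_params(n_estimators=n)  # with warm_start, only the additional trees are fitted
    rf.fit(X, y)
    oob_errors[n] = 1 - rf.oob_score_

# Forest size with the lowest OOB error among those tried.
best_n = min(oob_errors, key=oob_errors.get)
print(oob_errors, best_n)
```

Because the forest grows in place, only the final (largest) model is retained; keeping one model per size, as the cells above do, still requires separate fits.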

In [44]:

#Step7.1: Evaluate Model - set up test data

#For women
yf_test = pd.DataFrame(df_test['SizeCode'])
Xf_test = pd.DataFrame(df_test[['Height','Weight']])

#For men
ym_test = pd.DataFrame(dm_test['SizeCode'])
Xm_test = pd.DataFrame(dm_test[['Height','Weight']])

In [45]:

# Step7.2 - Evaluate Model: compare predicted values with test values
# (the rounding below should be a no-op, since the classifier already returns size codes)

# For Women

yf_pred_rf = pd.DataFrame(forest_f.predict(Xf_test))
yf_pred_size_rf = round(yf_pred_rf,0)

# For Men

ym_pred_rf = pd.DataFrame(forest_m.predict(Xm_test))
ym_pred_size_rf = round(ym_pred_rf,0)

F-1 Score

	Linear Regression	Random Forest
F-1 Women (micro)	0.79	0.96
F-1 Men (micro)	0.89	0.88
F-1 Women (macro)	0.41	0.96
F-1 Men (macro)	0.70	0.78
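
Micro and macro F1 can diverge sharply when a rare class is predicted poorly, which is one possible reading of the gap in the Linear Regression column for women. A small sketch with toy labels (not survey data) using `sklearn.metrics.f1_score`:

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels: class 2 is rare and is never predicted.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# Micro F1 pools all decisions; for single-label multiclass it equals accuracy.
micro = f1_score(y_true, y_pred, average="micro")

# Macro F1 averages per-class F1 with equal weight, so the missed class counts fully.
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(round(micro, 2), round(macro, 2), accuracy_score(y_true, y_pred))
```

Here the per-class F1 values are 1.0, 0.8, and 0.0, so macro F1 is 0.6 while micro F1 is 0.8.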

Confusion Matrix

For Women
For Men

In [64]:

conf_mat_m_pct_rf = pd.DataFrame(conf_mat_m_rf/conf_mat_m_rf.sum(axis=0))
round(conf_mat_m_pct_rf,2)

Out[64]:

	0	1	2	3
0	0.96	0.05	0.0	0.00
1	0.04	0.88	0.0	0.00
2	0.00	0.07	1.0	0.73
3	0.00	0.00	0.0	0.27

In [65]:

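# Caution: sum(axis=1) gives row totals, but the division broadcasts them across
# columns, so column j is divided by the total of row j. The result is neither
# row- nor column-normalized; sum(axis=0) would match the men's table.
conf_mat_f_pct_rf = pd.DataFrame(conf_mat_f_rf/conf_mat_f_rf.sum(axis=1))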
round(conf_mat_f_pct_rf,2)

Out[65]:

	0	1	2	3	4	5	6
0	1.0	0.00	0.00	0.00	0.00	0.00	0.00
1	0.0	1.00	0.00	0.00	0.00	0.00	0.00
2	0.0	0.24	0.93	0.00	0.00	0.00	0.00
3	0.0	0.00	0.00	0.98	0.02	0.00	0.00
4	0.0	0.00	0.00	0.02	0.94	0.00	0.14
5	0.0	0.00	0.00	0.00	0.00	0.98	0.03
6	0.0	0.00	0.00	0.00	0.00	0.00	1.00
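
Normalizing a confusion matrix by row or by column answers different questions, and NumPy broadcasting makes the two easy to mix up. The sketch below uses toy labels (not the notebook's predictions) and shows both explicit forms alongside the `normalize` option of `sklearn.metrics.confusion_matrix`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels, for illustration only.
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 1, 0]

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

# keepdims=True keeps the totals 2-D so the division lines up with the intended axis.
by_row = cm / cm.sum(axis=1, keepdims=True)  # each row sums to 1 (share of each true class)
by_col = cm / cm.sum(axis=0, keepdims=True)  # each column sums to 1 (share of each predicted class)

# scikit-learn can do the same normalization directly.
by_row_sk = confusion_matrix(y_true, y_pred, normalize="true")
by_col_sk = confusion_matrix(y_true, y_pred, normalize="pred")

print(np.round(by_row, 2))
print(np.round(by_col, 2))
```

Without `keepdims=True`, `cm / cm.sum(axis=1)` divides each column by a row total, which produces a table whose rows and columns need not sum to 1.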

Findings¶

Size Ratio (%) from CA 2018 Health Survey Data

	XXS	XS	S	M	L	XL	XXL
Male	0	0	29	39.2	29.6	2.3	0
Female	1.3	6.6	22.5	38.9	22.1	5.1	3.4

In [62]:

sizeratio = pd.DataFrame({
    'size':["XXS","XS","S","M","L","XL","XXL"],
    'female':[1.3,6.6,22.5,38.9,22.1,5.1,3.4],
    'male':[0,0,29,39.2,29.6,2.3,0]
})
ax = plt.gca()
sizeratio.plot(kind='line', x='size',y='female',color='red',ax=ax)
sizeratio.plot(kind='line', x='size', y='male',ax=ax)

plt.show()
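
One way to turn a size ratio into an order is to split a fixed number of units across sizes in proportion to the ratio. The sketch below uses the female ratios from the table above; the `allocate` helper, the largest-remainder rounding, and the order size of 500 are assumptions made for illustration.

```python
import math

# Female size ratios (%) from the Findings table.
female_ratio = {"XXS": 1.3, "XS": 6.6, "S": 22.5, "M": 38.9,
                "L": 22.1, "XL": 5.1, "XXL": 3.4}

def allocate(total_units, ratio):
    """Split total_units across sizes in proportion to ratio (largest-remainder rounding)."""
    weight_sum = sum(ratio.values())
    exact = {size: total_units * r / weight_sum for size, r in ratio.items()}
    units = {size: math.floor(q) for size, q in exact.items()}
    leftover = total_units - sum(units.values())
    # Hand the remaining units to the sizes with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda s: exact[s] - units[s], reverse=True)
    for size in by_fraction[:leftover]:
        units[size] += 1
    return units

order = allocate(500, female_ratio)
print(order)
```

Largest-remainder rounding guarantees the per-size counts add up to the requested total, which plain rounding does not.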

In [63]:

sns.distplot(dfsm_naout['SizeCode'])
sns.distplot(dfsf_naout['SizeCode'])

Next ...¶

Model Improvement
- Complete "off-chart" sizing
- Increase men's size accuracy

	Men	Women
F-1 (micro)	0.88	0.96

Age Adjustment

Mean Age	Sample	Population
Male	51	35
Female	56	37
All	54	36
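
Since the sample is older than the population, one common adjustment is post-stratification: weight each respondent by the ratio of their age group's population share to its sample share, then recompute the size ratio with those weights. The sketch below uses toy respondents and HYPOTHETICAL population shares; the age groups, column names, and numbers are not from the survey.

```python
import pandas as pd

# Toy respondents (not survey data): age group and predicted size.
sample = pd.DataFrame({
    "age_group": ["18-39", "18-39", "40+", "40+", "40+", "40+", "40+", "40+"],
    "size":      ["S",     "M",     "M",   "L",   "L",   "L",   "M",   "L"],
})

# HYPOTHETICAL population shares per age group.
population_share = pd.Series({"18-39": 0.5, "40+": 0.5})

# Weight = population share / sample share, so under-sampled groups count more.
sample_share = sample["age_group"].value_counts(normalize=True)
sample["weight"] = sample["age_group"].map(population_share / sample_share)

unadjusted = sample["size"].value_counts(normalize=True)
adjusted = sample.groupby("size")["weight"].sum() / sample["weight"].sum()

print(unadjusted.round(3))
print(adjusted.round(3))
```

In this toy case the younger group is under-sampled, so sizes that are more common among younger respondents gain share after weighting.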

Size Intelligence¶

Market-Specific Sizes
- Age
- Race
- Education etc

Merchandise Types
- Footwear
- Underwear
- Apparel etc

Regions and Countries
- 50 States
- North America
- Other Markets

Q&A¶

Data Source
Data Processing
Modeling
Findings
Next ...