Achieve Size-Right Inventory

STATS 404

Xing Sun

3/4/2020

>>>>>>> Caliberate Your Inventory with Current Market Size Ratio <<<<<<<

Challenges in Size Decisions

Market Example Size Range
Underwear bras 32A to 54G
Footwear shoes 5.5 to 13
Apparel jeans 28x30 to 44x38
Accessories rings 5 to 13
  • Fitting as market segmenting tool
    • plus vs. petite
    • core vs. fringe
  • Population size ratio changes
    • lifestyle: diet/workout
    • demographics: race/age
In [4]:
ss = pd.DataFrame({
    'size':["XS","S","M","L","XL"],
    'ratio1':[2,10,40,28,20],
    'ratio2':[10,15,35,25,15],
})
ax = plt.gca()
ss.plot(kind='line', x='size', y='ratio1',ax=ax)
ss.plot(kind='line', x='size',y='ratio2',color='red',ax=ax)

plt.show()

Costs in Inventory Size Decisions

  • 350B est. apparel market size
  • 70B if at 20% size inaccuracy rate
  • 7B can be saved if 10% business caliberates sizes

How to Size Up Inventory

  • Health Survey Data
    • Height
    • Weight
  • Apparel Industry Sizing Chart
    • For Men
    • For Women
  • Statistical Computing
    • Size Inference from Height/Weight
    • Size Ratio Inference from Survey Sample

>>>>>>> Pilot Project: T-Shirt Size

Health Survey Data

Mean Values from CA 2018 Health Survey Data
Count Height (ft) Weight (lbs) BMI
Male 9754 5'9" 190 27.6
Female 11423 5'4" 157 27.2
  • California Health Interview Survey
  • Landline and cellphone interviews
  • 21,177 observations
  • 2001-2018
In [66]:
df_crosstab_age = pd.crosstab(df['SRAGE_P1'], df['OVRWT'] == 1, normalize=True)
df_crosstab_age.plot(kind='bar', stacked=True)
# The overweight situation is self-reported, not computed by BMI.
Out[66]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1be8a950>
In [68]:
df_crosstab_sex = pd.crosstab(df['SRSEX'], df['OVRWT'] == 1, normalize=True)
df_crosstab_sex.plot(kind='bar', stacked=True)
Out[68]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1dd327d0>

BMI Reference Chart

BODY MASS INDEX DEFINITION FREQ %
1 0 - 18.49 UNDERWEIGHT 529 2.50
2 18.5 - 22.99 ACCEPTABLE RISK 4415 20.85
3 23.0 - 27.49 INCREASED RISK 7700 36.36
4 27.5 OR HIGHER HIGH RISK 8533 40.29
In [5]:
#Insert image data for female shirt sizing chart
import matplotlib.image as mpimg
img=mpimg.imread('Womens size chart.png')
imgplot = plt.imshow(img)
plt.show()
In [6]:
#Insert image data for male shirt sizing chart
import matplotlib.image as mpimg
img=mpimg.imread('Sizing-shirts.png')
imgplot = plt.imshow(img)
plt.show()

Data Processing

  • Health Survey Data conversion

    • 495 down to 2 variables
    • In meter and kilogram
  • Sizing Chart Data conversion

    • Image to function
    • Map size to survey data
    • Alphatic and numeric sizing
In [36]:
female = dfs.loc[dfs.Sex == 2]
male = dfs.loc[dfs.Sex == 1]
ax = sns.kdeplot(female.Weight, female.Height,
                 cmap="Reds", shade=True, shade_lowest=False, cut = 0.5)
ax = sns.kdeplot(male.Weight, male.Height,
                cmap="Blues", shade=True, shade_lowest=False, cut = 0.5, cbar = True)

Data Limitations

  • 10%+ "Off-chart" sizes
    • Female: 10.2% (1170 out of 11423)
    • Male: 13.6% (1318 out of 9654)
  • Gender size chart unbalance
    • Female: XXS to XXL
    • Male: S to XL

Model: Random Forest

  • Size ~ Height + Weight

  • Steps

    • Remove any off-chart sizing
    • Set up Training and Test data
    • Define predictors and outcome variables
    • Build Random Forest and select the optimal number of trees
    • Evaluate results
In [37]:
#Step1: Remove any "NA" values from the dataset
dfsf_naout = dfsf.loc[dfsf['Size']!='NA']
dfsm_naout = dfsm.loc[dfsm['Size']!='NA']
In [38]:
#Step2: Split out Data into Train and Test
df_train, df_test = train_test_split(dfsf_naout, test_size=0.2, random_state=2020,stratify=dfsf_naout['SizeCode'])
dm_train, dm_test = train_test_split(dfsm_naout, test_size=0.2, random_state=2020,stratify=dfsm_naout['SizeCode'])
In [39]:
#Step3: Define predictors and outcome variables

#For women train set
yf = df_train['SizeCode']
Xf = df_train[['Height','Weight']]

#For men train set
ym = dm_train['SizeCode']
Xm = dm_train[['Height','Weight']]
In [40]:
#Step4: Specify a set of variety of number of trees in forest, to determine
#how many to use based on leveling-off of OOB error:
n_trees = [50, 100, 250, 500, 1000, 1500, 2500]

# For Women
rf_f_dict = dict.fromkeys(n_trees)

# For Men
rf_m_dict = dict.fromkeys(n_trees)
In [41]:
#Step5: Create Random Forest in the specified number of trees

#For Women
for num in n_trees:
    print(num)
    rf_f = RandomForestClassifier(n_estimators=num,
                                oob_score=True,
                                max_leaf_nodes=8,
                                random_state=2020,
                                class_weight='balanced',
                                verbose=1)
    rf_f.fit(Xf, yf)
    rf_f_dict[num] = rf_f
    
#For Men
for num in n_trees:
    print(num)
    rf_m = RandomForestClassifier(n_estimators=num,
                                oob_score=True,
                                max_leaf_nodes=4,
                                random_state=2020,
                                class_weight='balanced',
                                verbose=1)
    rf_m.fit(Xm, ym)
    rf_m_dict[num] = rf_m
50
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:    0.1s finished
100
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.3s finished
250
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 250 out of 250 | elapsed:    0.7s finished
500
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed:    2.3s finished
1000
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    4.0s finished
1500
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1500 out of 1500 | elapsed:    3.8s finished
2500
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 2500 out of 2500 | elapsed:    8.1s finished
50
100
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
250
[Parallel(n_jobs=1)]: Done 250 out of 250 | elapsed:    0.5s finished
500
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed:    1.4s finished
1000
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1000 out of 1000 | elapsed:    2.1s finished
1500
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 1500 out of 1500 | elapsed:    5.1s finished
2500
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 2500 out of 2500 | elapsed:    6.8s finished
In [42]:
#Step6: Graph the number of tree OOB error rate:

# For Women
oob_error_list_f = [None] * len(n_trees)

# Find OOB error for each forest size:
for i in range(len(n_trees)):
    oob_error_list_f[i] = 1 - rf_f_dict[n_trees[i]].oob_score_
else:
    # Visulaize result:
    plt.plot(n_trees, oob_error_list_f, 'bo',
             n_trees, oob_error_list_f, 'k')
    
# It is a weird curve, but it looks like 100 trees has the best performance

forest_f = rf_f_dict[100]
In [43]:
#For Men

oob_error_list = [None] * len(n_trees)

# Find OOB error for each forest size:
for i in range(len(n_trees)):
    oob_error_list[i] = 1 - rf_m_dict[n_trees[i]].oob_score_
else:
    # Visulaize result:
    plt.plot(n_trees, oob_error_list, 'bo',
             n_trees, oob_error_list, 'k')
    
# A very different tree size impact on the predicting performance.
# The smallest number of trees and biggest are about the same performance.
# I'd go with smaller number of tree as model choice.

forest_m = rf_m_dict[50]
forest_m2 = rf_m_dict[2500]
In [44]:
#Step7.1: Evaluate Model - set up test data

#For women
yf_test = pd.DataFrame(df_test['SizeCode'])
Xf_test = pd.DataFrame(df_test[['Height','Weight']])

#For men
ym_test = pd.DataFrame(dm_test['SizeCode'])
Xm_test = pd.DataFrame(dm_test[['Height','Weight']])
In [45]:
# Step7.2 -  Evaluate Model: compare predict value with test value

# For Women

yf_pred_rf = pd.DataFrame(forest_f.predict(Xf_test))
yf_pred_size_rf = round(yf_pred_rf,0)

# For Men

ym_pred_rf = pd.DataFrame(forest_m.predict(Xm_test))
ym_pred_size_rf = round(ym_pred_rf,0)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:    0.0s finished

F-1 Score

Linear Regression Random Forest
F-1 Women (micro) 0.79 0.96
F-1 Men (micro) 0.89 0.88
F-1 Women (macro) 0.41 0.96
F-1 Men (macro) 0.70 0.78

Confusion Matrix

  • For Women
  • For Men
In [64]:
conf_mat_m_pct_rf = pd.DataFrame(conf_mat_m_rf/conf_mat_m_rf.sum(axis=0))
round(conf_mat_m_pct_rf,2)
Out[64]:
0 1 2 3
0 0.96 0.05 0.0 0.00
1 0.04 0.88 0.0 0.00
2 0.00 0.07 1.0 0.73
3 0.00 0.00 0.0 0.27
In [65]:
conf_mat_f_pct_rf = pd.DataFrame(conf_mat_f_rf/conf_mat_f_rf.sum(axis=1))
round(conf_mat_f_pct_rf,2)
Out[65]:
0 1 2 3 4 5 6
0 1.0 0.00 0.00 0.00 0.00 0.00 0.00
1 0.0 1.00 0.00 0.00 0.00 0.00 0.00
2 0.0 0.24 0.93 0.00 0.00 0.00 0.00
3 0.0 0.00 0.00 0.98 0.02 0.00 0.00
4 0.0 0.00 0.00 0.02 0.94 0.00 0.14
5 0.0 0.00 0.00 0.00 0.00 0.98 0.03
6 0.0 0.00 0.00 0.00 0.00 0.00 1.00

Findings

Size Ratio from CA 2018 Health Survey Data
XXS XS S M L XL XXL
Male 0 0 29 39.2 29.6 2.3 0
Female 1.3 6.6 22.5 38.9 22.1 5.1 3.4
In [62]:
sizeratio = pd.DataFrame({
    'size':["XXS","XS","S","M","L","XL","XXL"],
    'female':[1.3,6.6,22.5,38.9,22.1,5.1,3.4],
    'male':[0,0,29,39.2,29.6,2.3,0]
})
ax = plt.gca()
sizeratio.plot(kind='line', x='size',y='female',color='red',ax=ax)
sizeratio.plot(kind='line', x='size', y='male',ax=ax)

plt.show()
In [63]:
sns.distplot(dfsm_naout['SizeCode'])
sns.distplot(dfsf_naout['SizeCode'])
Out[63]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a1d7644d0>

Next ...

  • Model Improvement
    • Complete "off-chart" sizing
    • Increase men's size accuracy
Men Women
F-1 (micro) 0.88 0.96
  • Age Adjustment
Mean Age Sample Population
Male 51 35
Female 56 37
All 54 36

Size Intelligence

  • Market-Specific Sizes
    • Age
    • Race
    • Education etc
  • Merchandise Types
    • Footwear
    • Underwear
    • Apparel etc
  • Regions and Countries
    • 50 States
    • North America
    • Other Markets

Q&A

  • Data Source
  • Data Processing
  • Modeling
  • Findings
  • Next ...

>>>>>>> Caliberate Your Inventory with Current Market Size Ratio <<<<<<<