Skip to content Skip to sidebar Skip to footer

Create A Custom Sklearn Transformermixin That Transforms Categorical Variables Consistently

This question is not a duplicate as someone suggested. Why? Because in that example, all possible values ARE KNOWN. In this example, they aren't. Further, this question - in additi

Solution 1:

I made a blog post to address this. Below is the transformer I built.

classCategoryGrouper(BaseEstimator, TransformerMixin):  
    """A tranformer for combining low count observations for categorical features.

    This transformer will preserve category values that are above a certain
    threshold, while bucketing together all the other values. This will fix issues
    where new data may have an unobserved category value that the training data
    did not have.
    """def__init__(self, threshold=0.05):
        """Initialize method.

        Args:
            threshold (float): The threshold to apply the bucketing when
                categorical values drop below that threshold.
        """
        self.d = defaultdict(list)
        self.threshold = threshold

    deftransform(self, X, **transform_params):
        """Transforms X with new buckets.

        Args:
            X (obj): The dataset to pass to the transformer.

        Returns:
            The transformed X with grouped buckets.
        """
        X_copy = X.copy()
        for col in X_copy.columns:
            X_copy[col] = X_copy[col].apply(lambda x: x if x in self.d[col] else'CategoryGrouperOther')
        return X_copy

    deffit(self, X, y=None, **fit_params):
        """Fits transformer over X.

        Builds a dictionary of lists where the lists are category values of the
        column key for preserving, since they meet the threshold.
        """
        df_rows = len(X.index)
        for col in X.columns:
            calc_col = X.groupby(col)[col].agg(lambda x: (len(x) * 1.0) / df_rows)
            self.d[col] = calc_col[calc_col >= self.threshold].index.tolist()
        return self

Basically, the motivation originally came from me having to handle sparse category values, but then I realized this could be applied to unknown values. The transformer essentially groups sparse category values together, given a threshold, so since unknown values would inherit 0% of the value space, they would get bucketed into a CategoryGrouperOther group.

Here's just a demonstration of the transformer:

# dfs with 100 elements in cat1 and cat2# note how df_test has elements 'g' and 't' in the respective categories (unknown values)
df_train = pd.DataFrame({'cat1': ['a'] * 20 + ['b'] * 30 + ['c'] * 40 + ['d'] * 3 + ['e'] * 4 + ['f'] * 3,
                         'cat2': ['z'] * 25 + ['y'] * 25 + ['x'] * 25 + ['w'] * 20 +['v'] * 5})
df_test = pd.DataFrame({'cat1': ['a'] * 10 + ['b'] * 20 + ['c'] * 5 + ['d'] * 50 + ['e'] * 10 + ['g'] * 5,
                        'cat2': ['z'] * 25 + ['y'] * 55 + ['x'] * 5 + ['w'] * 5 + ['t'] * 10})

catgrouper = CategoryGrouper()
catgrouper.fit(df_train)
df_test_transformed = catgrouper.transform(df_test)

df_test_transformed

    cat1    cat2
0   a   z
1   a   z
2   a   z
3   a   z
4   a   z
5   a   z
6   a   z
7   a   z
8   a   z
9   a   z
10  b   z
11  b   z
12  b   z
13  b   z
14  b   z
15  b   z
16  b   z
17  b   z
18  b   z
19  b   z
20  b   z
21  b   z
22  b   z
23  b   z
24  b   z
25  b   y
26  b   y
27  b   y
28  b   y
29  b   y
... ... ...
70  CategoryGrouperOther    y
71  CategoryGrouperOther    y
72  CategoryGrouperOther    y
73  CategoryGrouperOther    y
74  CategoryGrouperOther    y
75  CategoryGrouperOther    y
76  CategoryGrouperOther    y
77  CategoryGrouperOther    y
78  CategoryGrouperOther    y
79  CategoryGrouperOther    y
80  CategoryGrouperOther    x
81  CategoryGrouperOther    x
82  CategoryGrouperOther    x
83  CategoryGrouperOther    x
84  CategoryGrouperOther    x
85  CategoryGrouperOther    w
86  CategoryGrouperOther    w
87  CategoryGrouperOther    w
88  CategoryGrouperOther    w
89  CategoryGrouperOther    w
90  CategoryGrouperOther    CategoryGrouperOther
91  CategoryGrouperOther    CategoryGrouperOther
92  CategoryGrouperOther    CategoryGrouperOther
93  CategoryGrouperOther    CategoryGrouperOther
94  CategoryGrouperOther    CategoryGrouperOther
95  CategoryGrouperOther    CategoryGrouperOther
96  CategoryGrouperOther    CategoryGrouperOther
97  CategoryGrouperOther    CategoryGrouperOther
98  CategoryGrouperOther    CategoryGrouperOther
99  CategoryGrouperOther    CategoryGrouperOther

Even works when I set threshold to 0 (this will exclusively set unknown values to the 'other' group while preserving all the other category values). I would caution against setting threshold to 0 though, because your training dataset would not have the 'other' category so tweak the threshold to flag at least one value to be the 'other' group:

catgrouper = CategoryGrouper(threshold=0)
catgrouper.fit(df_train)
df_test_transformed = catgrouper.transform(df_test)

df_test_transformed

    cat1    cat2
0   a   z
1   a   z
2   a   z
3   a   z
4   a   z
5   a   z
6   a   z
7   a   z
8   a   z
9   a   z
10  b   z
11  b   z
12  b   z
13  b   z
14  b   z
15  b   z
16  b   z
17  b   z
18  b   z
19  b   z
20  b   z
21  b   z
22  b   z
23  b   z
24  b   z
25  b   y
26  b   y
27  b   y
28  b   y
29  b   y
...... ...
70  d   y
71  d   y
72  d   y
73  d   y
74  d   y
75  d   y
76  d   y
77  d   y
78  d   y
79  d   y
80  d   x
81  d   x
82  d   x
83  d   x
84  d   x
85  e   w
86  e   w
87  e   w
88  e   w
89  e   w
90  e   CategoryGrouperOther
91  e   CategoryGrouperOther
92  e   CategoryGrouperOther
93  e   CategoryGrouperOther
94  e   CategoryGrouperOther
95  CategoryGrouperOther    CategoryGrouperOther
96  CategoryGrouperOther    CategoryGrouperOther
97  CategoryGrouperOther    CategoryGrouperOther
98  CategoryGrouperOther    CategoryGrouperOther
99  CategoryGrouperOther    CategoryGrouperOther

Solution 2:

And like I said, answering my own question. Here's the solution I'm going with for now.

def get_datasets(df):
    trans1= DFTransformer()
    trans2= DFTransformer()
    train = trans1.fit_transform(df.iloc[:, :-1])
    test = trans2.fit_transform(pd.read_pickle(TEST_PICKLE_PATH))
    columns = train.columns.intersection(test.columns).tolist()
    X_train = train[columns]
    y_train = df.iloc[:, -1]
    X_test = test[columns]
    return X_train, y_train, X_test

Solution 3:

If you're worried about your pd.get_dummies() outputting the wrong dimensions you could simply specify the categorical encoding for your columns.

For example:

fit_df = pd.DataFrame({'COUNTRY': ['UK', 'FR', 'IT']}, dtype='category')
fit_categories = fit_df.COUNTRY.cat.categoriespredict_df= pd.DataFrame({'COUNTRY': ['UK']}, dtype='category')
predict_df.COUNTRY = predict_df.COUNTRY.cat.set_categories(fit_categories)
pd.get_dummies(predict_df)

Will return the following table:

   COUNTRY_FR    COUNTRY_IT    COUNTRY_UK
            0             0             1

So in your case, you could simply define your categorical encoding in a config file or have the transformer class track the initial encoding.

This approach could also be extended to handle unseen categorical values by using pd.Series.cat.add_categories

Hope this helps.

See documentation for more information.

Post a Comment for "Create A Custom Sklearn Transformermixin That Transforms Categorical Variables Consistently"