How to include SimpleImputer before CountVectorizer in a scikit-learn Pipeline?

Written by Aionlinecourse

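This article shows how to include SimpleImputer before CountVectorizer in a scikit-learn Pipeline. The two do not chain directly: SimpleImputer expects 2D input and returns 2D output, while CountVectorizer expects a 1D iterable of strings.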
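The shape mismatch is easy to see directly. The following is a minimal sketch, assuming an object-dtype column (set explicitly here) and SimpleImputer's default fill value for string data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Assumption: dtype=object is set explicitly so the column holds plain Python strings
df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]}, dtype=object)

# SimpleImputer needs 2D input, hence df[['text']] rather than df['text']
imputed = SimpleImputer(strategy='constant').fit_transform(df[['text']])

print(imputed.shape)          # 2D: (n_samples, 1)
print(imputed.ravel().shape)  # 1D: (n_samples,), the shape CountVectorizer expects
```

Each solution below is a different way of getting from the first shape to the second inside the Pipeline.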


Solution 1:

The best solution I have found is to insert a custom transformer into the Pipeline that reshapes the output of SimpleImputer from 2D to 1D before it is passed to CountVectorizer.

Here's the complete code:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]})

# Fill missing text with a constant string (the default is 'missing_value')
imp = SimpleImputer(strategy='constant')
vect = CountVectorizer()

# CREATE TRANSFORMER
# np.ravel flattens the imputer's (n_samples, 1) output to shape (n_samples,)
one_dim = FunctionTransformer(np.ravel)

# INCLUDE TRANSFORMER IN PIPELINE
pipe = make_pipeline(imp, one_dim, vect)
pipe.fit_transform(df[['text']]).toarray()

It has been proposed on GitHub that CountVectorizer should allow 2D input as long as the second dimension is 1 (meaning: a single column of data). That modification to CountVectorizer would be a great solution to this problem!

Solution 2:

Another option is to subclass SimpleImputer and override its transform() method so that it returns 1D output:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

class ModifiedSimpleImputer(SimpleImputer):
    def transform(self, X):
        # Impute as usual, then flatten the 2D result to 1D
        return super().transform(X).flatten()

df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]})

imp = ModifiedSimpleImputer(strategy='constant')
vect = CountVectorizer()

pipe = make_pipeline(imp, vect)
pipe.fit_transform(df[['text']]).toarray()
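Flattening only makes sense for a single column: with two or more columns, the values of different columns would be interleaved into one long array. As a possible safeguard, here is a sketch of a variant that raises instead. `SingleColumnImputer` and the extra `other` column are hypothetical names used only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

class SingleColumnImputer(SimpleImputer):
    """Hypothetical variant of the subclass above that refuses multi-column input."""

    def transform(self, X):
        out = super().transform(X)
        if out.shape[1] != 1:
            raise ValueError(f"expected a single column, got {out.shape[1]}")
        return out.ravel()

# Assumption: dtype=object keeps both columns as plain Python strings
df = pd.DataFrame(
    {'text': ['abc def', 'abc ghi', np.nan], 'other': ['x', np.nan, 'z']},
    dtype=object,
)

imp = SingleColumnImputer(strategy='constant')
one_col = imp.fit_transform(df[['text']])
print(one_col.shape)  # 1D output for a single column

try:
    imp.fit_transform(df[['text', 'other']])
except ValueError as exc:
    print(exc)  # two columns are rejected rather than silently flattened
```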

Solution 3:

I use this one-dimensional wrapper for scikit-learn transformers whenever I have one-dimensional data. In your case, it can wrap SimpleImputer so that it accepts and returns one-dimensional data (a pandas Series with string values).

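import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class OneDWrapper(BaseEstimator, TransformerMixin):
    """One dimensional wrapper for sklearn Transformers"""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        # Reshape 1D input to a single 2D column before delegating
        self.transformer.fit(np.array(X).reshape(-1, 1))
        return self

    def transform(self, X, y=None):
        # Delegate on the 2D column, then flatten the result back to 1D
        return self.transformer.transform(
            np.array(X).reshape(-1, 1)).ravel()

    def inverse_transform(self, X, y=None):
        return self.transformer.inverse_transform(
            np.expand_dims(X, axis=1)).ravel()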

Now, you don't need an additional step in the pipeline.

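# df, vect, SimpleImputer and make_pipeline are reused from the previous solutions
one_d_imputer = OneDWrapper(SimpleImputer(strategy='constant'))
pipe = make_pipeline(one_d_imputer, vect)

# Note: we are feeding a pd.Series here, not a single-column DataFrame
pipe.fit_transform(df['text']).toarray()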
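The wrapper is not specific to SimpleImputer. As a rough illustration of the inverse_transform method, the sketch below wraps OrdinalEncoder, which is an assumption chosen for this example and not part of the original answer. The wrapper's logic is copied into the snippet so that it runs on its own.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

class OneDWrapper:
    """One dimensional wrapper for sklearn Transformers (logic copied from above)."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        self.transformer.fit(np.array(X).reshape(-1, 1))
        return self

    def transform(self, X, y=None):
        return self.transformer.transform(np.array(X).reshape(-1, 1)).ravel()

    def inverse_transform(self, X, y=None):
        return self.transformer.inverse_transform(np.expand_dims(X, axis=1)).ravel()

# Assumption: OrdinalEncoder stands in for any transformer with inverse_transform
labels = ['low', 'high', 'low', 'mid']
encoder = OneDWrapper(OrdinalEncoder()).fit(labels)

codes = encoder.transform(labels)            # 1D array of category codes
restored = encoder.inverse_transform(codes)  # 1D array of the original labels
print(codes.shape, restored.tolist())
```

Both directions accept and return 1D data, so the caller never has to reshape by hand.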

Thank you for reading the article.
