How to include SimpleImputer before CountVectorizer in a scikit-learn Pipeline?

Written by Aionlinecourse

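This article shows three ways to include SimpleImputer before CountVectorizer in a scikit-learn Pipeline. The difficulty is that SimpleImputer returns 2D output, while CountVectorizer expects a 1D sequence of strings.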

Solution 1:

The best solution I have found is to insert a custom transformer into the Pipeline that reshapes the output of SimpleImputer from 2D to 1D before it is passed to CountVectorizer.

Here's the complete code:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]})

# Replace missing text with a constant placeholder string
imp = SimpleImputer(strategy='constant')
vect = CountVectorizer()

# CREATE TRANSFORMER
# np.ravel flattens the (n_samples, 1) imputer output to shape (n_samples,)
one_dim = FunctionTransformer(np.ravel)

# INCLUDE TRANSFORMER IN PIPELINE
pipe = make_pipeline(imp, one_dim, vect)

pipe.fit_transform(df[['text']]).toarray()
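To see why the extra step is needed, it can help to look at the shapes involved. The following is a minimal sketch on the same toy data; the variable names imputed, flat and counts are only illustrative and not part of the original solution.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]})

# SimpleImputer takes 2D input and returns a 2D array: one row per sample.
imputed = SimpleImputer(strategy='constant').fit_transform(df[['text']])
print(imputed.shape)  # (3, 1)

# CountVectorizer iterates over documents, so it needs a 1D sequence of strings.
flat = np.ravel(imputed)
print(flat.shape)  # (3,)

vect = CountVectorizer()
counts = vect.fit_transform(flat)

# The placeholder used for missing strings ends up as a token in the vocabulary.
print(sorted(vect.vocabulary_))  # ['abc', 'def', 'ghi', 'missing_value']
print(counts.toarray())
```

Note that the imputed row is counted as a document containing the single token missing_value, so the placeholder becomes a feature of its own.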

It has been proposed on GitHub that CountVectorizer should allow 2D input as long as the second dimension is 1 (meaning: a single column of data). That modification to CountVectorizer would be a great solution to this problem!


Solution 2:

Another option is to subclass SimpleImputer and override its transform() method so that it returns a 1D array:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline


class ModifiedSimpleImputer(SimpleImputer):
    def transform(self, X):
        # Impute as usual, then flatten the (n_samples, 1) result to 1D
        return super().transform(X).flatten()


df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]})

imp = ModifiedSimpleImputer(strategy='constant')
vect = CountVectorizer()

# No separate reshaping step is needed here
pipe = make_pipeline(imp, vect)

pipe.fit_transform(df[['text']]).toarray()
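Once fitted, a pipeline like this can also be applied to new data that contains missing values. Here is a small sketch of that, under a few assumptions: the class name FlatImputer and the new DataFrame are made up for illustration, and fill_value='' is a choice added here (not part of the original answer) so that a missing entry becomes an empty document instead of a placeholder token.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline


class FlatImputer(SimpleImputer):
    """Hypothetical name: a SimpleImputer whose transform() returns a 1D array."""

    def transform(self, X):
        return super().transform(X).ravel()


train = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]})
new = pd.DataFrame({'text': [np.nan, 'def ghi xyz']})  # illustrative unseen data

# Assumption: fill missing text with an empty string rather than the default placeholder.
pipe = make_pipeline(FlatImputer(strategy='constant', fill_value=''), CountVectorizer())
pipe.fit(train[['text']])

# The vocabulary is learned from the training data only.
vocab = sorted(pipe[-1].vocabulary_)
print(vocab)  # ['abc', 'def', 'ghi']

# The missing row becomes all zeros; the unseen word 'xyz' is ignored.
new_counts = pipe.transform(new[['text']]).toarray()
print(new_counts)
```

Whether an empty string or a visible placeholder is the better fill value depends on whether "missing" should count as a signal for the downstream model.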

Solution 3:

I use this one-dimensional wrapper for scikit-learn transformers when my data is one-dimensional. It can also wrap SimpleImputer for one-dimensional input (a pandas Series of strings), as in your case.

import numpy as np


class OneDWrapper:
    """One-dimensional wrapper for sklearn transformers."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        # Reshape the 1D input to a single column, since the wrapped transformer expects 2D data
        self.transformer.fit(np.array(X).reshape(-1, 1))
        return self

    def transform(self, X, y=None):
        # Transform as a single column, then flatten the result back to 1D
        return self.transformer.transform(
            np.array(X).reshape(-1, 1)).ravel()

    def inverse_transform(self, X, y=None):
        return self.transformer.inverse_transform(
            np.expand_dims(X, axis=1)).ravel()

Now, you don't need an additional step in the pipeline.

one_d_imputer = OneDWrapper(SimpleImputer(strategy='constant'))
pipe = make_pipeline(one_d_imputer, vect)
# Note: the input here is a pd.Series, not a one-column DataFrame
pipe.fit_transform(df['text']).toarray()
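A plain class like the wrapper above has no get_params() or set_params(), so scikit-learn utilities cannot see or change the settings of the wrapped imputer. One possible variant, sketched below, inherits from BaseEstimator and TransformerMixin to get those methods. The class name OneDTransformer is hypothetical, inverse_transform is left out for brevity, and cloning the inner transformer in fit() is a design choice made here, not something the original answer does.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline


class OneDTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical variant of the 1D wrapper that exposes get_params()."""

    def __init__(self, transformer):
        # BaseEstimator discovers parameters from the __init__ signature.
        self.transformer = transformer

    def fit(self, X, y=None):
        # Fit a copy so the object passed to __init__ stays unfitted.
        self.transformer_ = clone(self.transformer)
        self.transformer_.fit(np.asarray(X).reshape(-1, 1))
        return self

    def transform(self, X):
        # Reshape to one column, transform, then flatten back to 1D.
        return self.transformer_.transform(np.asarray(X).reshape(-1, 1)).ravel()


series = pd.Series(['abc def', 'abc ghi', np.nan])
pipe = make_pipeline(OneDTransformer(SimpleImputer(strategy='constant')), CountVectorizer())

# Nested parameters of the wrapped imputer are now addressable by name.
params = pipe.get_params()
print(params['onedtransformer__transformer__strategy'])  # constant

# clone() returns an unfitted copy with the same parameters.
fresh = clone(pipe)
counts = fresh.fit_transform(series).toarray()
print(counts)
```

With this variant, a name such as onedtransformer__transformer__strategy could be used in a parameter grid, which is not possible with the plain wrapper class.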


Thank you for reading the article.
