How to include SimpleImputer before CountVectorizer in a scikit-learn Pipeline?
This article explains how to include SimpleImputer before CountVectorizer in a scikit-learn Pipeline.
Solution 1:
The best solution I have found is to insert a custom transformer into the Pipeline that reshapes the output of SimpleImputer from 2D to 1D before it is passed to CountVectorizer.
Here's the complete code:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Example data with a missing value
df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]})

# The imputer fills NaN with a constant ('missing_value' for string data)
imp = SimpleImputer(strategy='constant')
vect = CountVectorizer()

# Create a transformer that reshapes the imputer's 2D output to 1D
one_dim = FunctionTransformer(np.reshape, kw_args={'newshape': -1})

# Include the transformer in the pipeline between imputer and vectorizer
pipe = make_pipeline(imp, one_dim, vect)
pipe.fit_transform(df[['text']]).toarray()
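For reference, here is what this pipeline should produce on the example data. With string input, SimpleImputer(strategy='constant') fills missing values with 'missing_value' by default, so that token joins the vocabulary (this illustrative check assumes scikit-learn >= 1.0 for get_feature_names_out):
print(pipe.fit_transform(df[['text']]).toarray())
# [[1 1 0 0]
#  [1 0 1 0]
#  [0 0 0 1]]
print(vect.get_feature_names_out())
# ['abc' 'def' 'ghi' 'missing_value']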
It has been proposed on GitHub that CountVectorizer should allow 2D input as long as the second dimension is 1 (meaning a single column of data). That modification to CountVectorizer would be a great solution to this problem!
Solution 2:
Another solution is to subclass SimpleImputer and override its transform() method:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

class ModifiedSimpleImputer(SimpleImputer):
    def transform(self, X):
        # Flatten the usual 2D output so CountVectorizer receives 1D input
        return super().transform(X).flatten()

df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]})
imp = ModifiedSimpleImputer(strategy='constant')
vect = CountVectorizer()

pipe = make_pipeline(imp, vect)
pipe.fit_transform(df[['text']]).toarray()
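To see why the override matters: the stock SimpleImputer.transform returns a 2D array of shape (3, 1), while the subclass returns the 1D shape (3,) that CountVectorizer expects. A quick illustrative check:
# Illustrative shape check: the flattened output is 1D
print(imp.fit(df[['text']]).transform(df[['text']]).shape)  # (3,)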
Solution 3:
I use this one-dimensional wrapper for sklearn transformers whenever I have one-dimensional data. In your case, it can wrap SimpleImputer so that it handles a pandas Series of strings directly.
import numpy as np

class OneDWrapper:
    """One-dimensional wrapper for sklearn transformers."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        # Reshape 1D input to the 2D shape the wrapped transformer expects
        self.transformer.fit(np.array(X).reshape(-1, 1))
        return self

    def transform(self, X, y=None):
        # Reshape to 2D for the transformer, then flatten back to 1D
        return self.transformer.transform(
            np.array(X).reshape(-1, 1)).ravel()

    def inverse_transform(self, X, y=None):
        # Re-add the column dimension, invert, then flatten back to 1D
        return self.transformer.inverse_transform(
            np.expand_dims(X, axis=1)).ravel()
Now you don't need an additional step in the pipeline:
one_d_imputer = OneDWrapper(SimpleImputer(strategy='constant'))
pipe = make_pipeline(one_d_imputer, vect)
pipe.fit_transform(df['text']).toarray()  # note we are feeding a pd.Series here!
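As an illustrative check, the wrapped imputer on its own accepts a Series and returns a 1D array, which is exactly the input CountVectorizer needs:
# Illustrative: 1D output from Series input, with NaN filled in
print(one_d_imputer.fit(df['text']).transform(df['text']))
# ['abc def' 'abc ghi' 'missing_value']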
Thank you for reading the article.