How to Easily Solve Multi-Class Classification Problems in Python
1. What is Multi-Class Classification?
Multi-class classification is a machine learning problem in which each example is assigned to one of more than two classes based on its distinct features. While binary classification involves only two labels (for example, 'Yes' or 'No'), multi-class classification deals with three or more categories. For example, determining whether an image depicts a jojoba, a lily, or a rose is a multi-class classification problem with three possible classes (jojoba, lily, rose), whereas determining whether an image contains a rose or not is a binary classification problem (with two possible classes: rose and no rose).
Real-World Applications
Multi-class classification automates everyday categorization tasks, making them faster and more consistent, and its precise categorizations support better decision-making. As a result, multi-class classification problems have extensive practical use across industries. Here are some examples:
- Image Recognition: Assigning images to groups, such as identifying cats, dogs, or birds in pictures.
- Text Classification: Categorizing documents by subject. News articles, for instance, are labeled as politics, sports, entertainment, technology, etc.
- Medical Diagnosis: Categorizing different types of brain tumors at an early stage, which helps with early detection and treatment.
- Sentiment Analysis: Understanding public opinion, such as customer feedback. Customer reviews are divided into positive, negative, and neutral sentiments.
- Speech Recognition: Transcribing spoken words and classifying voice commands given to smart home devices (such as setting alarms, playing music, and controlling lighting).
2. Practical Implementation of Multi-Class Classification Using Python: Butterfly Species Classification
Problem Statement
This project identifies different species of butterflies by classifying their images. It is a multi-class classification problem: there are multiple species (classes) of butterflies, and every image is assigned to exactly one of them.
We will construct and train the model using a CNN architecture. The butterfly dataset hosted on Kaggle contains labeled images of various butterflies and will be used for training so that unseen images can be classified correctly.
Dataset Used
The dataset used for this project is the Butterfly Image Classification Dataset from Kaggle, downloaded using:
!kaggle datasets download -d phucthaiv02/butterfly-image-classification
This dataset contains thousands of images of butterflies, categorized into 10 different species.
Step-by-Step Explanation
1. Importing Libraries
To begin, we import the libraries needed for this project: TensorFlow and Keras to construct the CNN model, and NumPy and Matplotlib to manage the dataset and visualize the output.
import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt
import numpy as np
import os
2. Downloading and Unzipping the Dataset
The dataset is downloaded via the Kaggle API and then extracted. This step ensures we have access to the data.
!kaggle datasets download -d phucthaiv02/butterfly-image-classification
!unzip butterfly-image-classification.zip
After this, the dataset folder structure will contain training and testing images categorized by butterfly species.
3. Loading and Preprocessing the Data
Now, we load the images from the extracted dataset folder and preprocess them by resizing them to a fixed shape (e.g., 150x150). We also normalize the pixel values to a range between 0 and 1 (instead of 0 to 255), which helps the model learn better.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    'butterfly/train',
    target_size=(150, 150),
    batch_size=32,
    class_mode='categorical')

validation_generator = test_datagen.flow_from_directory(
    'butterfly/test',
    target_size=(150, 150),
    batch_size=32,
    class_mode='categorical')
We use flow_from_directory to load images directly from the dataset folder and ImageDataGenerator to handle real-time data augmentation if necessary.
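If the training set turns out to be small, the same ImageDataGenerator can also apply random augmentations on the fly. A minimal sketch follows; the specific ranges are illustrative values, not tuned for this dataset.

train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=20,        # randomly rotate images by up to 20 degrees
    width_shift_range=0.1,    # randomly shift images horizontally
    height_shift_range=0.1,   # randomly shift images vertically
    zoom_range=0.1,           # randomly zoom in on images
    horizontal_flip=True)     # randomly flip images left-right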
4. Building the CNN Model
The next step is to build the CNN model. We use multiple convolutional and pooling layers, followed by dense layers for classification.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(512, activation='relu'),
    layers.Dense(10, activation='softmax')  # 10 butterfly species
])
- The input shape is set to 150x150 pixels, with 3 channels for color images (RGB).
- The final dense layer has 10 units with a softmax activation, corresponding to the 10 different species of butterflies.
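Before compiling, it is worth printing a layer-by-layer summary to confirm that the output shapes and parameter counts match expectations:

model.summary()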
5. Compiling the Model
We compile the model using the Adam optimizer and categorical cross-entropy as the loss function since this is a multi-class classification problem.
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
6. Training the Model
The model is trained on the training set, with a batch size of 32 and for 50 epochs. The validation set is used to evaluate the model during training.
history = model.fit(
    train_generator,
    steps_per_epoch=100,
    epochs=50,
    validation_data=validation_generator,
    validation_steps=50)
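Fifty epochs may be more than the model needs. As a hedge against overfitting, a Keras callback such as EarlyStopping can stop training once the validation loss stops improving; the patience value below is illustrative.

from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss',
                           patience=5,                 # stop after 5 epochs with no improvement
                           restore_best_weights=True)  # keep the best weights seen

history = model.fit(
    train_generator,
    steps_per_epoch=100,
    epochs=50,
    validation_data=validation_generator,
    validation_steps=50,
    callbacks=[early_stop])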
7. Evaluating the Model
After training, the model is evaluated on the held-out validation set to see how well it performs on unseen images.
test_loss, test_acc = model.evaluate(validation_generator)
print(f"Test accuracy: {test_acc}")
This step outputs the accuracy of the model when classifying images from the validation set.
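To classify a single new image, a minimal sketch looks like the following (the file path is a placeholder):

from tensorflow.keras.preprocessing import image

img = image.load_img('some_butterfly.jpg', target_size=(150, 150))  # placeholder path
x = image.img_to_array(img) / 255.0   # apply the same rescaling used in training
x = np.expand_dims(x, axis=0)         # add a batch dimension

probs = model.predict(x)[0]
class_names = list(train_generator.class_indices.keys())
print(class_names[np.argmax(probs)])  # predicted species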
8. Visualizing Training and Validation Accuracy
We can visualize the training and validation accuracy over epochs to check if the model is learning correctly or overfitting.
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.show()
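The same history object also records loss values, so a companion loss plot helps spot overfitting (validation loss rising while training loss keeps falling):

plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend(loc='upper right')
plt.show()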
9. Saving the Model
Once the model has been trained and evaluated, you can save it for future use.
model.save('butterfly_classification_model.h5')
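The saved file can be loaded later to serve predictions without retraining:

loaded_model = tf.keras.models.load_model('butterfly_classification_model.h5')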
3. Comparing Python Libraries for Multi-Class Classification
Popular Python Libraries
Numerous well-known Python libraries support multi-class classification, each with its own advantages:
- Scikit-Learn: User-friendly; the most popular and actively maintained general-purpose ML library.
- XGBoost and LightGBM: Known for speed; designed to train quickly, especially on large datasets.
- CatBoost: Performs well on categorical features.
- Keras/TensorFlow: Suitable for deep learning use cases.
| Library | Best For | Strengths | Weaknesses |
| --- | --- | --- | --- |
| Scikit-learn | Classical ML algorithms | Simple API, wide range of algorithms | Limited deep learning support |
| TensorFlow | Deep learning | Flexibility, large-dataset support | Steeper learning curve |
| Keras | Quick prototyping | Easy to use, builds models quickly | Not as flexible as TensorFlow |
| XGBoost | Structured/tabular data | High performance, efficient | Not for deep learning |
| LightGBM | Large datasets | Fast, efficient | Limited support for unstructured data |
| CatBoost | Categorical data | Native support for categorical variables | Slower on very large datasets |
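As a point of comparison with the CNN above, here is a minimal multi-class sketch with XGBoost on generic tabular features. X_train, y_train, and X_test are assumed to be numeric feature arrays with integer class labels; the hyperparameter values are illustrative.

from xgboost import XGBClassifier

xgb = XGBClassifier(objective='multi:softprob',  # output per-class probabilities
                    n_estimators=200,
                    learning_rate=0.1)
xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_test)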
4. Advanced Techniques for Handling Multi-Class Imbalanced Data
Introduction to Multi-Class Imbalance
Suppose that in the butterfly dataset there are 500 images of one species (e.g., Monarch) but only 50 images of another species (e.g., Blue Morpho). In this case, the model may become biased toward the Monarch class and lose accuracy on the underrepresented species. This is referred to as class imbalance.
Here are some advanced techniques for addressing this issue effectively:
1. Resampling Techniques
Oversampling: Oversampling addresses class imbalance by increasing the number of samples in the minority classes, so that the majority class no longer dominates training.
Methods:
SMOTE (Synthetic Minority Over-sampling Technique): Creates synthetic samples by interpolating between existing minority-class samples in feature space.
ADASYN (Adaptive Synthetic Sampling): Similar to SMOTE, but it generates more synthetic samples in regions where minority-class samples are sparse.
from imblearn.over_sampling import SMOTE
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
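A quick way to confirm the resampling worked is to compare class counts before and after:

from collections import Counter

print("Before:", Counter(y_train))
print("After: ", Counter(y_resampled))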
Undersampling: Undersampling balances uneven datasets by keeping all of the minority-class data and decreasing the size of the majority class.
Methods:
Random Undersampling: Randomly remove samples from the majority class.
Cluster-Based Undersampling: Use clustering techniques to select representative samples from the majority class.
from imblearn.under_sampling import RandomUnderSampler
undersampler = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = undersampler.fit_resample(X_train, y_train)
2. Ensemble Methods
Balanced Random Forest: An extension of the Random Forest algorithm that balances each bootstrap sample to have equal representation from each class.
from imblearn.ensemble import BalancedRandomForestClassifier
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
brf.fit(X_train, y_train)
3. Cost-sensitive Learning
Modify the learning algorithm to make it more sensitive to the minority classes by assigning different misclassification costs. Many algorithms in Scikit-learn support the class_weight parameter to assign weights.
from sklearn.ensemble import RandomForestClassifier
class_weights = {0: 1, 1: 5, 2: 5}
rf = RandomForestClassifier(class_weight=class_weights)
rf.fit(X_train, y_train)
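Rather than choosing weights by hand, scikit-learn can also derive them automatically from the inverse class frequencies:

rf = RandomForestClassifier(class_weight='balanced')  # weights inversely proportional to class frequencies
rf.fit(X_train, y_train)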
5. Multi-Class Problems with High Dimensional Data
The Curse of Dimensionality
In machine learning and statistics, the problems that arise with high-dimensional data are collectively called the 'Curse of Dimensionality': data points become sparse, pairwise distances become nearly uniform, computational costs grow rapidly, models overfit, and there is no easy way to visualize data beyond 2D or 3D. Together, these factors make it difficult to discover patterns, degrade the performance of distance-based algorithms, and impede human understanding of high-dimensional data.
Techniques for Dimensionality Reduction
The following five methods are considered to be the most useful in dimensionality reduction:
1. Principal Component Analysis (PCA)
PCA performs linear dimensionality reduction while retaining as much of the data's variance as possible.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
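The fitted PCA object exposes how much of the original variance each component retains, which helps decide whether two components are enough:

print(pca.explained_variance_ratio_)  # fraction of variance captured by each component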
2. t-Distributed Stochastic Neighbor Embedding (t-SNE)
t-SNE is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two or three-dimensional map.
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2)
X_reduced = tsne.fit_transform(X)
3. Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis (LDA) is a supervised statistical technique that reduces dimensionality while maximizing class separability, projecting the data onto the directions that best distinguish the predetermined classes. Note that the number of components can be at most one less than the number of classes.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X, y)
4. Uniform Manifold Approximation and Projection (UMAP)
UMAP (Uniform Manifold Approximation and Projection) is a machine learning technique that reduces the number of dimensions in a dataset while preserving its geometric structure; it aims to preserve local structure and handle complex relationships.
import umap
umap_model = umap.UMAP(n_components=2)
X_reduced = umap_model.fit_transform(X)
5. Autoencoders
Autoencoders capture complex, non-linear relationships in data by learning to compress it into a low-dimensional code and then reconstruct it.
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

original_dim = X.shape[1]  # number of input features
encoding_dim = 32          # size of the compressed representation (illustrative)

input_data = Input(shape=(original_dim,))
encoded = Dense(encoding_dim, activation='relu')(input_data)
decoded = Dense(original_dim, activation='sigmoid')(encoded)

autoencoder = Model(input_data, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
autoencoder.fit(X, X, epochs=50, batch_size=256, shuffle=True)

# Use the trained encoder half to obtain the reduced representation
encoder = Model(input_data, encoded)
X_reduced = encoder.predict(X)
6. Conclusion and Next Steps
In this article, we discussed multi-class classification, how to implement it practically in Python, and how to deal with class imbalance along the way. With the appropriate libraries and the right techniques for imbalanced data and dimensionality reduction, model performance can be improved substantially.
7. FAQ
- How do I choose between Scikit-Learn, XGBoost, and LightGBM?
A: Consider the project's needs. If you are a beginner or your dataset is small, Scikit-learn is a great fit: it is easy to learn and its algorithms are simple to use. If you need high accuracy and performance on structured data, use XGBoost. If you have very large datasets, LightGBM is optimized to train faster than XGBoost and handles sparse data efficiently.
- What are the best ways to handle class imbalance in multi-class classification problems?
A: To handle class imbalance in multi-class classification, use resampling (oversampling or undersampling), class weighting, ensemble methods like the Balanced Random Forest, synthetic data generation (e.g., SMOTE), and evaluation metrics such as F1-score or balanced accuracy to measure real improvement.
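For the evaluation-metrics part of this answer, scikit-learn's classification_report prints per-class precision, recall, and F1, exposing weak minority classes that plain accuracy hides. Here, y_test and y_pred are assumed to come from an already fitted classifier:

from sklearn.metrics import classification_report, balanced_accuracy_score

print(classification_report(y_test, y_pred))
print("Balanced accuracy:", balanced_accuracy_score(y_test, y_pred))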
- How can I improve the performance of my multi-class classifier on high-dimensional data?
A: To enhance multi-class classifier performance on high-dimensional data, use dimensionality reduction techniques like PCA or t-SNE, feature selection methods, and regularization techniques to minimize noise and overfitting.
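A common way to combine these ideas is a scikit-learn Pipeline that chains dimensionality reduction with a regularized classifier. This is a minimal sketch; the component count is illustrative, and X_train, y_train are assumed to be numeric feature arrays with class labels:

from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

clf = Pipeline([
    ('pca', PCA(n_components=50)),                        # illustrative component count
    ('logreg', LogisticRegression(max_iter=1000, C=1.0))  # L2-regularized by default
])
clf.fit(X_train, y_train)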
Future Directions:
- Understand and explore AutoML where model selection and hyperparameter tuning are automated.
- Think about how you could deploy your model using Flask or Streamlit.
- Dive into time-series multi-class classification for handling sequential data.