How to Easily Solve Multi-Class Classification Problems in Python

Written by Aionlinecourse

1. What is Multi-Class Classification?

Multi-class classification is a machine learning problem in which observations are assigned to one of more than two classes based on their distinct features. While binary classification involves only two labels (for example, 'Yes' or 'No'), multi-class classification deals with three or more categories. For example, determining whether an image depicts a jojoba, a lily, or a rose is a multi-class classification problem with three possible classes (jojoba, lily, rose), whereas determining whether an image contains a rose or not is a binary classification problem (with two possible classes: rose and no rose).
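To make the distinction concrete, here is a minimal multi-class sketch using scikit-learn's built-in Iris dataset, which has three flower classes (this is purely illustrative and separate from the butterfly project below):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # y holds three class labels: 0, 1, 2
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = LogisticRegression(max_iter=200)  # handles multi-class targets natively
clf.fit(X_train, y_train)
print(clf.predict(X_test[:5]))  # each prediction is one of the three classes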

Real-World Applications

Multi-class classification is a common tool for automating everyday problems and making them faster. Because its categorizations are precise, it also enables better decision-making, so multi-class classification has extensive practical use across industries. Here are some examples:

  • Image Recognition: Assigning images to groups, such as identifying different animals (cats, dogs, birds, etc.) in pictures.
  • Text Classification: Classifying documents by subject. News articles, for instance, are categorized as politics, sports, entertainment, technology, etc.
  • Medical Diagnosis: Categorizing different types of brain tumors at an early stage, which helps with early detection and treatment.
  • Sentiment Analysis: Understanding public opinion, such as customer feedback. Customer reviews are divided into positive, negative, and neutral sentiments.
  • Speech Recognition: Transcribing spoken words, such as analyzing voice instructions given to smart home devices (for alarms, music, lighting, and so on).


2. Practical Implementation of Multiclass Classification Using Python: Butterfly Species Classification

Problem Statement

This project identifies different species of butterflies by classifying their images. It is a multi-class classification problem: there are multiple species (classes) of butterflies, and every image is assigned to exactly one of those classes.

We will construct and train the model using a CNN architecture. The butterfly dataset hosted on Kaggle contains labeled images of various butterflies and will be used for training, so that unseen images can be classified appropriately.


Dataset Used

The dataset used for this project is the Butterfly Image Classification Dataset from Kaggle, downloaded using:

!kaggle datasets download -d phucthaiv02/butterfly-image-classification


This dataset contains thousands of images of butterflies, categorized into 10 different species.

Step-by-Step Explanation

1. Importing Libraries

To begin, we import the libraries this project relies on most: TensorFlow and Keras to construct the CNN model, and NumPy and Matplotlib to manage the data and visualize the results.

import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import matplotlib.pyplot as plt
import numpy as np
import os


2. Downloading and Unzipping the Dataset

The dataset is downloaded via the Kaggle API and then extracted. This step ensures we have access to the data.

!kaggle datasets download -d phucthaiv02/butterfly-image-classification
!unzip butterfly-image-classification.zip

After this, the dataset folder structure will contain training and testing images categorized by butterfly species.
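As a quick sanity check (assuming the archive extracts to a butterfly/ folder with one subfolder per species, which is the layout the flow_from_directory calls below expect), we can list the class folders using the os module imported earlier:

print(os.listdir('butterfly/train'))  # one subfolder per butterfly species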

3. Loading and Preprocessing the Data

Now, we load the images from the extracted dataset folder and preprocess them by resizing them to a fixed shape (e.g., 150x150). We also normalize the pixel values to a range between 0 and 1 (instead of 0 to 255), which helps the model learn better.

from tensorflow.keras.preprocessing.image import ImageDataGenerator
train_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)
train_generator = train_datagen.flow_from_directory(
    'butterfly/train',
    target_size=(150, 150),
    batch_size=32,
    class_mode='categorical')
validation_generator = test_datagen.flow_from_directory(
    'butterfly/test',
    target_size=(150, 150),
    batch_size=32,
    class_mode='categorical')

We use flow_from_directory to load images directly from the dataset folders, and ImageDataGenerator can also apply real-time data augmentation if necessary.
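If augmentation is needed, the training generator can be replaced with one that applies random transformations on the fly. A sketch with illustrative parameter values:

# Optional: training generator with on-the-fly augmentation (values are illustrative).
train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=20,       # rotate images randomly by up to 20 degrees
    width_shift_range=0.1,   # shift images horizontally by up to 10%
    height_shift_range=0.1,  # shift images vertically by up to 10%
    horizontal_flip=True)    # flip images left-right at random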

4. Building the CNN Model

The next step is to build the CNN model. We use multiple convolutional and pooling layers, followed by dense layers for classification.

model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(512, activation='relu'),
    layers.Dense(10, activation='softmax')  # 10 butterfly species
])
    • The input shape is set to 150x150 pixels, with 3 channels for color images (RGB).
    • The final dense layer has 10 units with a softmax activation, corresponding to the 10 different species of butterflies.

5. Compiling the Model

We compile the model using the Adam optimizer and categorical cross-entropy as the loss function since this is a multi-class classification problem.

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])


6. Training the Model

The model is trained for 50 epochs, drawing batches of 32 images from the training generator. The validation set is used to evaluate the model during training.

history = model.fit(
    train_generator,
    steps_per_epoch=100,  
    epochs=50,
    validation_data=validation_generator,
    validation_steps=50)


7. Evaluating the Model

After training, the model is evaluated on the validation data to see how well it performs on unseen images.

test_loss, test_acc = model.evaluate(validation_generator)
print(f"Test accuracy: {test_acc}")

This step outputs the accuracy of the model when classifying images from the validation set.
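To watch the model classify a single unseen image, load it, apply the same preprocessing, and take the argmax of the softmax output. A minimal sketch (the image path is a placeholder; class_indices comes from the training generator):

from tensorflow.keras.preprocessing.image import load_img, img_to_array

img = load_img('some_butterfly.jpg', target_size=(150, 150))  # placeholder path
x = img_to_array(img) / 255.0      # same rescaling as the generators
x = np.expand_dims(x, axis=0)      # add a batch dimension
probs = model.predict(x)[0]        # softmax probabilities over the 10 classes
class_names = list(train_generator.class_indices.keys())
print(class_names[np.argmax(probs)])  # predicted species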


8. Visualizing Training and Validation Accuracy

We can visualize the training and validation accuracy over epochs to check if the model is learning correctly or overfitting.

plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.show()

9. Saving the Model

Once the model has been trained and evaluated, you can save it for future use.

model.save('butterfly_classification_model.h5')
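The saved model can later be restored without retraining:

loaded_model = models.load_model('butterfly_classification_model.h5')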


3. Comparing Python Libraries for Multi-Class Classification

Popular Python Libraries

Several well-known Python libraries support multi-class classification, each with its own advantages:

  • Scikit-Learn: User-friendly; the most popular and best-maintained general-purpose ML library.
  • XGBoost and LightGBM: Known for their speed; designed to be faster than most alternatives, especially on large datasets.
  • CatBoost: Performs well on categorical features.
  • Keras/TensorFlow: Suitable for deep learning use cases.


| Library      | Best For                | Strengths                                  | Weaknesses                              |
|--------------|-------------------------|--------------------------------------------|-----------------------------------------|
| Scikit-learn | Classical ML algorithms | Simple API, wide range of algorithms       | Limited deep learning support           |
| TensorFlow   | Deep learning           | Flexibility, large-dataset support         | Steeper learning curve                  |
| Keras        | Quick prototyping       | Easy to use, builds models quickly         | Not as flexible as TensorFlow           |
| XGBoost      | Structured/tabular data | High performance, efficient                | Not for deep learning                   |
| LightGBM     | Large datasets          | Fast, efficient                            | Limited support for unstructured data   |
| CatBoost     | Categorical data        | Native support for categorical variables   | Slower on very big datasets             |
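For comparison with the Keras workflow above, here is a minimal multi-class sketch with XGBoost on tabular features (X_train, y_train, and X_test are placeholders for any feature matrix with integer class labels):

from xgboost import XGBClassifier

# XGBClassifier infers the number of classes from y_train;
# 'multi:softprob' produces one probability per class.
xgb = XGBClassifier(objective='multi:softprob', eval_metric='mlogloss')
xgb.fit(X_train, y_train)
print(xgb.predict(X_test[:5]))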


4. Advanced Techniques for Handling Multi-Class Imbalanced Data

Introduction to Multi-Class Imbalance

Suppose that in the butterfly dataset there are 500 images of one species (e.g., Monarch) but only 50 images of another (e.g., Blue Morpho). In this case, the model may become biased toward the Monarch class and lose accuracy on the other species. This is referred to as class imbalance.

Here are some advanced techniques for addressing this issue effectively:

1. Resampling Techniques

Oversampling: Oversampling addresses class imbalance, where a majority class outweighs the minority classes, by augmenting the existing data. It mitigates the effect of the imbalance by increasing the number of minority-class samples.

Methods:

SMOTE (Synthetic Minority Over-sampling Technique): Creates synthetic samples by interpolating in the feature space between existing minority-class samples.

ADASYN (Adaptive Synthetic Sampling): Similar to SMOTE, but it concentrates its synthetic samples in regions where minority-class examples are sparse and harder to learn.

from imblearn.over_sampling import SMOTE
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
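ADASYN uses the same imblearn interface:

from imblearn.over_sampling import ADASYN
adasyn = ADASYN(random_state=42)
X_resampled, y_resampled = adasyn.fit_resample(X_train, y_train)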


Undersampling: Undersampling balances uneven datasets by keeping all of the minority-class data while decreasing the size of the majority class.

Methods:

Random Undersampling: Randomly remove samples from the majority class.

Cluster-Based Undersampling: Use clustering techniques to select representative samples from the majority class.

from imblearn.under_sampling import RandomUnderSampler
undersampler = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = undersampler.fit_resample(X_train, y_train)


2. Ensemble Methods

Balanced Random Forest: An extension of the Random Forest algorithm that balances each bootstrap sample to have equal representation from each class.

from imblearn.ensemble import BalancedRandomForestClassifier
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
brf.fit(X_train, y_train)


3. Cost-sensitive Learning

Modify the learning algorithm to be more sensitive to the minority classes by assigning different misclassification costs. Many algorithms in Scikit-learn support a class_weight parameter for this; passing class_weight='balanced' weights classes inversely to their frequencies automatically.

from sklearn.ensemble import RandomForestClassifier
class_weights = {0: 1, 1: 5, 2: 5}  # misclassifying classes 1 and 2 costs five times more
rf = RandomForestClassifier(class_weight=class_weights)
rf.fit(X_train, y_train)


5. Multi-Class Problems with High-Dimensional Data

The Curse of Dimensionality

In machine learning and statistics, the difficulties that arise when data is high-dimensional are collectively called the 'Curse of Dimensionality': data points become sparse, distances between points become nearly uniform, computational costs grow exponentially, models overfit, and there is no easy way to visualize data beyond 2D or 3D. Together, these factors make it hard to discover patterns, reduce the performance of distance-based algorithms, and impede our ability to understand high-dimensional data.

Techniques for Dimensionality Reduction

The following five methods are among the most widely used for dimensionality reduction:

1. Principal Component Analysis (PCA)

PCA performs linear dimensionality reduction while retaining as much variance as possible.

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
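It is worth checking how much variance the retained components explain:

print(pca.explained_variance_ratio_)        # variance explained by each component
print(pca.explained_variance_ratio_.sum())  # total variance retained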

2. t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two or three-dimensional map.

from sklearn.manifold import TSNE
tsne = TSNE(n_components=2)
X_reduced = tsne.fit_transform(X)

3. Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is a supervised dimensionality reduction technique focused on class separability: it projects the data so that predetermined classes are separated as well as possible. Note that it can produce at most one fewer component than there are classes.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X, y)

4. Uniform Manifold Approximation and Projection (UMAP)

UMAP (Uniform Manifold Approximation and Projection) is a machine learning technique that reduces the number of dimensions in a dataset while preserving its geometric structure; it aims to preserve local structure and handle complex relationships.

import umap
umap_model = umap.UMAP(n_components=2)
X_reduced = umap_model.fit_transform(X)

5. Autoencoders

Autoencoders are neural networks that capture complex, non-linear relationships in data by learning to compress and then reconstruct it.

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

original_dim = X.shape[1]  # number of input features
encoding_dim = 32          # size of the compressed representation (illustrative)

input_data = Input(shape=(original_dim,))
encoded = Dense(encoding_dim, activation='relu')(input_data)   # compression
decoded = Dense(original_dim, activation='sigmoid')(encoded)   # reconstruction
autoencoder = Model(input_data, decoded)
# binary_crossentropy with a sigmoid output assumes X is scaled to [0, 1]
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
autoencoder.fit(X, X, epochs=50, batch_size=256, shuffle=True)
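The fit above only teaches the network to reconstruct X; to obtain the reduced representation itself, extract the encoder half and run it on the data:

encoder = Model(input_data, encoded)  # the compression half of the autoencoder
X_reduced = encoder.predict(X)        # shape: (n_samples, encoding_dim)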


6. Conclusion and Next Steps

In this article, we discussed multi-class classification, how to implement it practically in Python, and how to deal with class imbalance along the way. With the right libraries and appropriate techniques for imbalanced data and dimensionality reduction, model performance can be improved substantially.


7. FAQ

  • How do I choose between Scikit-Learn, XGBoost, and LightGBM?

    A: Consider the project's needs. If you are a beginner or your dataset is small, Scikit-Learn is a great choice: it is convenient for learning machine learning and its algorithms are easy to use. If you need high accuracy and performance on structured data, use XGBoost. For very large datasets, LightGBM is optimized to train faster than XGBoost and handles sparse data well.

  • What are the best ways to handle a class imbalance in multi-class classification problems?

    A: To handle class imbalance in multi-class classification, use resampling (oversampling or undersampling), class weighting, ensemble methods such as XGBoost, and synthetic data generation, and evaluate with imbalance-aware metrics such as F1-score or balanced accuracy (see the short metrics sketch after this FAQ).

  • How can I improve the performance of my multi-class classifier on high-dimensional data?

    A: To enhance multi-class classifier performance on high-dimensional data, use dimensionality reduction techniques like PCA or t-SNE, feature selection methods, and regularization techniques to minimize noise and overfitting.
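As mentioned in the class-imbalance answer above, imbalance-aware metrics are straightforward to compute with scikit-learn (y_test and y_pred are placeholders for true and predicted labels):

from sklearn.metrics import f1_score, balanced_accuracy_score

print(f1_score(y_test, y_pred, average='macro'))  # weights every class equally
print(balanced_accuracy_score(y_test, y_pred))    # mean of per-class recall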

Future Directions:

  • Understand and explore AutoML where model selection and hyperparameter tuning are automated.
  • Think about how you could deploy your model using Flask or Streamlit.
  • Dive into time-series multi-class classification for handling sequential data.