10 Feature Engineering Techniques for Machine Learning

Written by Aionlinecourse


1. What is Feature Engineering

Feature engineering is the process of transforming raw data into informative features that improve your model's performance. Whether you are working on a simple or a complex machine-learning problem, feature engineering is the backbone of your model's success. With the right features, you can surface patterns and relationships in the data that would otherwise be missed. Once you understand how feature engineering works, you can reach a whole new level of quality in your predictive models.

Domain knowledge lets you transform data in a way that reflects the context of the problem. This produces more valuable and meaningful features, enabling your model to generalize and predict more accurately.

2. Advanced Imputation Techniques

Imputation deals with datasets in which some values are missing. Basic techniques such as filling with the mean do work, but more advanced techniques often produce better results.

Multivariate Imputation

Multivariate imputation handles missing data systematically by modeling each feature with missing values as a function of the other features. It preserves relationships present in the original dataset, reduces bias compared with single-value imputation, and is well suited to datasets with large amounts of missing data, whether the values are missing at random or not.

Example: Multivariate Imputation Using Iterative Imputer


from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import pandas as pd
import numpy as np
# Create a sample dataset with missing values
data = pd.DataFrame({
    'Age': [25, 27, np.nan, 35, 40],
    'Income': [50000, 52000, 45000, np.nan, 80000],
    'Education': [16, 14, 15, 16, np.nan]
})
imputer = IterativeImputer(max_iter=10, random_state=0)
# Apply imputer to data
imputed_data = imputer.fit_transform(data)
# Convert imputed data to a DataFrame
data_imputed = pd.DataFrame(imputed_data, columns=data.columns)
print(data_imputed)


Predictive Imputation Models

Predictive imputation models are another powerful way to handle missing data: they treat imputation itself as a prediction problem. Instead of guessing missing values with averages or medians, these models use machine learning to predict them from patterns in the other features. This keeps relationships within the data intact and often leads to higher accuracy.

When a value is missing, you use the known values in your dataset to predict it. For instance, in a customer dataset you could estimate missing income values from age, geographical location, and spending behavior.

Example: Predictive Imputation Using Random Forest

from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
import pandas as pd
# Create a dataset with missing values
data = pd.DataFrame({
    'Age': [25, 27, None, 35, 40],
    'Income': [50000, None, 45000, 75000, 80000]
})
# Step 1: Fill missing values in the predictor column (Age) with the mean
age_imputer = SimpleImputer(strategy='mean')
data['Age'] = age_imputer.fit_transform(data[['Age']]).ravel()
# Step 2: Train a random forest on the rows where Income is known
known = data['Income'].notnull()
rf_imputer = RandomForestRegressor(random_state=0)
rf_imputer.fit(data.loc[known, ['Age']], data.loc[known, 'Income'])
# Step 3: Predict the missing Income values and fill them in the original dataset
data.loc[~known, 'Income'] = rf_imputer.predict(data.loc[~known, ['Age']])
print(data)


3. Encoding Complex Categorical Variables

When handling high-cardinality or multi-level categories, encoding complex categorical variables calls for methods more advanced than traditional one-hot or label encoding. Such techniques help models capture the intricate relationships hidden within categorical variables, improving both performance and usability.

Ordinal Encoding

Ordinal encoding converts category ranks into numerical form while preserving their hierarchical order. Nominal categories have no inherent order, whereas ordinal categories possess an intrinsic rank. The process assigns each category a numerical value that represents its position in that order.

For example, the quality levels "Low," "Medium," and "High" may be represented as 1, 2, and 3. Such encoding preserves the order in which the values should be interpreted, giving the model critical insight into how the categories relate to one another.

Example:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
data = pd.DataFrame({
    'Quality': ['Low', 'Medium', 'High', 'Medium', 'Low']
})
encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
data['Quality_encoded'] = encoder.fit_transform(data[['Quality']])
print(data)


Why Use It?

  • Preserves Order: The ordering of categories is encoded, helping the model understand the data better.
  • Boosts Performance: Helps the model interpret the ranking, which can improve accuracy.

Cluster Encoding

Cluster encoding groups categories that share traits or respond similarly to other factors. Instead of keeping every distinct category, it combines categories with comparable patterns, responses, or feature interactions into clusters, often using an algorithm such as K-Means. After clustering, each original category is replaced by the label of its cluster. Grouping categories in this way reduces the number of distinct values, making the data easier to manage.

This works well for:

  • Datasets having many distinct category values allow high-dimensional encoding.
  • Grouping similar goal responses (e.g., customer segmentation, purchasing patterns).

Example: Cluster Encoding Using K-Means

import pandas as pd
from sklearn.cluster import KMeans
# Sample data
data = pd.DataFrame({
    'Category': ['A', 'B', 'C', 'D', 'E'],
    'Response': [1.2, 1.5, 3.8, 3.6, 5.1]
})
# Group categories into clusters based on their response values
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
data['Cluster'] = kmeans.fit_predict(data[['Response']])
print(data)


4. Feature Scaling Beyond Standardization

Feature scaling is an important step when working with machine learning algorithms. However, traditional methods such as min-max scaling or standardization may not be the best choice for every dataset. More advanced scaling methods can outperform standardization because they address concerns such as skewness or outliers. Here are two powerful alternatives:

1. Robust Scaling

Robust scaling minimizes the effect of outliers by scaling features on the basis of percentiles rather than the mean and standard deviation. This makes it highly effective for datasets that contain outliers or very large values.

  • How it Works: Robust scaling centers the data on the median and scales it by the interquartile range (IQR).
  • When to Use: When your data contains large outliers that would mislead regular scaling methods.

Example: Applying Robust Scaling in Python

from sklearn.preprocessing import RobustScaler
import pandas as pd
data = pd.DataFrame({'Feature': [10, 20, 30, 1000, 50]})
scaler = RobustScaler()
data['Scaled_Feature'] = scaler.fit_transform(data[['Feature']])
print(data)


2. Quantile Transformation

Quantile transformation maps the data points of your distribution onto a specified target distribution, such as uniform or normal. Because the mapping is based on ranks rather than raw values, it is also convenient for handling outliers.

  • How it Works: Each data point is mapped to its corresponding quantile of the target distribution, transforming the input data into the required distribution.
  • When to Use: For skewed data, use quantile transformation to normalize the feature distribution.

Example: Applying Quantile Transformation in Python

from sklearn.preprocessing import QuantileTransformer
import pandas as pd
data = pd.DataFrame({'Feature': [1, 2, 3, 10, 100]})
transformer = QuantileTransformer(output_distribution='normal', n_quantiles=5)  # n_quantiles should not exceed the number of samples
data['Transformed_Feature'] = transformer.fit_transform(data[['Feature']])
print(data)


5. Feature Engineering for Sequence Data

Sequence data, such as time series, needs its own feature engineering procedures to extract temporal features properly. Here are three common approaches:

1. Lag Features

Lag features incorporate values from earlier time steps, offering insight into the relationship between past and current values. By including lagged values, you enable the model to "rewind" and explore patterns over time. For example, when forecasting stock prices, adding lag features for the previous day's prices helps the model recognize patterns.

import pandas as pd
data = pd.DataFrame({
    'Date': pd.date_range(start='2023-01-01', periods=5, freq='D'),
    'value': [10, 12, 15, 18, 20]
})
data['lag_1'] = data['value'].shift(1)
data['lag_2'] = data['value'].shift(2)
print(data)


2. Rolling Statistics

Rolling statistics, most often moving averages, summarize information over a fixed window to reveal the overall direction and filter out short-term variation. They come in handy when dealing with noisy datasets. For instance, in a dataset of sales figures, a 3-day moving average can help show the overall trend of sales.

data['rolling_mean_3'] = data['value'].rolling(window=3).mean()  
data['rolling_var_3'] = data['value'].rolling(window=3).var()
print(data)


3. Differencing

Differencing subtracts the value at the previous time step from the current value, which helps remove trends and seasonal patterns and makes the data more stationary. For instance, with temperature data, differencing captures the daily change rather than the absolute value, which makes underlying trends easier to model.

data['diff_1'] = data['value'].diff(periods=1)
print(data)


6. Geo-Location-Based Feature Engineering

Geo-location-based feature engineering extracts valuable information from geographical data, which is especially useful for systems that need to understand geographic trends and patterns, such as real estate, logistics, and retail. You can define additional variables from latitude, longitude, and other geographical attributes, enabling the model to take geographical factors into account.

Spatial Binning

Spatial binning groups data points into cells of a geospatial grid or region. For example, a model may treat a city as one geographical area but define smaller neighborhood bins within that city to help identify local trends.

import pandas as pd
import numpy as np
data = pd.DataFrame({'Latitude': [40.7728, 34.0552, 41.8741],
                    'Longitude': [-74.0060, -118.2437, -87.6298]})

data['lat_bin'] = (data['Latitude'] // 0.1) * 0.1
data['lon_bin'] = (data['Longitude'] // 0.1) * 0.1
print(data)


Distance Metrics

Distance metrics measure the distance from each data point to key geographic points such as city centers or other landmarks. For example, real estate valuation often considers the distance to the nearest school or to the city center.

from geopy.distance import geodesic

reference_point = (40.7128, -74.0060)  
data['distance_to_reference'] = data.apply(
    lambda row: geodesic((row['Latitude'], row['Longitude']), reference_point).km, axis=1
)
print(data)


Regional Aggregates

Regional aggregates summarize a metric over a spatial area, such as the bins created above. For example, you can compute the mean property price for a specific area by averaging over all the properties located within that area or within a given radius.

# Assumes the DataFrame also contains a 'property_price' column for each location
data['avg_neighborhood_price'] = data.groupby(['lat_bin', 'lon_bin'])['property_price'].transform('mean')


7. Text Feature Extraction for Non-NLP Tasks

Text data is rich in information, even outside the NLP domain.

TF-IDF Embeddings in Structured Data

TF-IDF features quantify how significant a term is within a document relative to the whole corpus. Including this information in structured datasets makes text columns valuable even for models that are not specifically built for NLP.

from sklearn.feature_extraction.text import TfidfVectorizer
# Assumes 'data' contains a 'text_column' of raw text strings
vectorizer = TfidfVectorizer()
tfidf_features = vectorizer.fit_transform(data['text_column'])


Sentiment-Based Features

Sentiment analysis scores capture the overall emotional tone of a piece of text. Adding sentiment as a feature helps the model understand emotional context, which matters in domains such as customer reviews or feedback.

from textblob import TextBlob
# Assumes 'data' contains a 'text_column' of raw text strings
data['sentiment'] = data['text_column'].apply(lambda x: TextBlob(x).sentiment.polarity)


Textual Feature Ratios

Traditional content analysis tends to be qualitative, but textual ratios such as keyword density can quantify it. The linguistic features obtained from these ratios are often incorporated into models, enriching qualitative data without requiring full NLP methods.
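
As a minimal, self-contained sketch, the snippet below computes a keyword-density ratio and a word count; the 'text_column' values and the keyword list are illustrative assumptions, not part of any dataset used earlier.

import pandas as pd
# Hypothetical example data; the column name and keywords are only for illustration
data = pd.DataFrame({'text_column': [
    'great product and fast delivery',
    'delivery was slow but product quality is great',
    'terrible support'
]})
keywords = {'great', 'quality', 'delivery'}
def keyword_density(text):
    words = text.lower().split()
    # Fraction of words that belong to the keyword set
    return sum(w in keywords for w in words) / len(words) if words else 0.0
data['keyword_density'] = data['text_column'].apply(keyword_density)
data['word_count'] = data['text_column'].str.split().str.len()
print(data)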

8. Domain-Specific Transformations

Some data needs domain-specific transformations to achieve better model performance.

Box-Cox Transformation

The Box-Cox transformation stabilizes the variance of skewed distributions. As a power transformation, it brings your data closer to a normal distribution, which simplifies many modeling techniques. Note that it requires strictly positive input values.

from scipy.stats import boxcox
import numpy as np
data = np.array([1, 2, 3, 10, 100])
data_transformed, _ = boxcox(data)  
print(data_transformed)


Yeo-Johnson Transformation

The Yeo-Johnson transformation is effective when the data contains zero or negative values. It is a flexible power transform, useful when your dataset is not strictly positive or is skewed.

from sklearn.preprocessing import PowerTransformer
import numpy as np
data = np.array([-1, 0, 1, 10, 100]).reshape(-1, 1)
transformer = PowerTransformer(method='yeo-johnson')
data_transformed = transformer.fit_transform(data)
print(data_transformed)


Winsorizing

Winsorizing caps extreme values at specified percentiles. This reduces the influence of outliers, which might otherwise skew model predictions.

import numpy as np
from scipy.stats.mstats import winsorize
data = np.array([1, 2, 3, 10, 100])
data_winsorized = winsorize(data, limits=[0.05, 0.05]) 
print(data_winsorized)


9. Feature Engineering from Graph-Based Data

Graph data captures how entities are related, depend on each other, or form groups. Social networks, recommendation systems, and biological networks are all naturally represented as graphs. We can improve model performance in these areas by deriving features from key network measures.

Network Features

Network features capture structural and connectivity measures such as centrality and clustering coefficients, which describe how important a particular node is within the network. Centrality metrics, for example, indicate how highly connected or influential a node is, which makes them very useful as model inputs.

import networkx as nx
G = nx.Graph()
G.add_edges_from([(1, 2), (1, 3), (2, 3), (3, 4)])

degree_centrality = nx.degree_centrality(G)
for node, centrality in degree_centrality.items():
    print(f"Node {node}: Degree Centrality = {centrality}")


Community Detection Features

Community detection identifies clusters of densely connected nodes in a network. Adding community labels as categorical features helps the model capture and explain group-level patterns that other data forms often miss.

from networkx.algorithms.community import greedy_modularity_communities
communities = list(greedy_modularity_communities(G))
community_labels = {node: i for i, community in enumerate(communities) for node in community}
print(community_labels)


Subgraph Features

Subgraph features count particular local structures found in the network, such as triangles. They capture local relationships and therefore help in predicting group behavior.

triangles = nx.triangles(G)
for node, triangle_count in triangles.items():
    print(f"Node {node}: Triangle Count = {triangle_count}")


10. Feature Engineering with Domain-Adaptive Models

Advanced models offer pre-built feature engineering, letting you tap into vast pre-trained knowledge bases.

Feature Extraction from Pre-Trained Models

Pre-trained models such as BERT for text and ResNet for images learn rich, high-level representations that can be reused as features for a downstream model. Feeding these feature vectors into your pipeline embeds additional domain knowledge into the data.
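
As a minimal sketch, the snippet below assumes torchvision (with pre-trained weights) is installed and uses ResNet-18 as a frozen image feature extractor; the path 'example.jpg' is a placeholder for any RGB image.

import torch
from torchvision import models
from PIL import Image
# Load a pre-trained ResNet-18 and drop its classification head so that
# the network outputs a 512-dimensional feature vector instead of class scores
weights = models.ResNet18_Weights.DEFAULT
resnet = models.resnet18(weights=weights)
resnet.fc = torch.nn.Identity()
resnet.eval()
# Preprocess the image with the same transforms the weights were trained with
preprocess = weights.transforms()
image = preprocess(Image.open('example.jpg').convert('RGB')).unsqueeze(0)
with torch.no_grad():
    features = resnet(image)  # shape: (1, 512); usable as features downstream
print(features.shape)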

Autoencoders for Feature Compression

An autoencoder is an unsupervised model that encodes input data into a compressed form. Its architecture includes a bottleneck layer that performs dimensionality reduction, producing a low-dimensional feature space that retains the critical information, which is especially useful for high-dimensional data.
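
A minimal sketch of feature compression with an autoencoder, assuming TensorFlow/Keras is available; the data, layer sizes, and training settings below are illustrative placeholders rather than tuned values.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
n_features, bottleneck_dim = 20, 4
X = np.random.rand(500, n_features)  # placeholder high-dimensional data
# Encoder compresses the input into the bottleneck; decoder reconstructs it
inputs = keras.Input(shape=(n_features,))
encoded = layers.Dense(bottleneck_dim, activation='relu')(inputs)  # bottleneck layer
decoded = layers.Dense(n_features, activation='linear')(encoded)
autoencoder = keras.Model(inputs, decoded)
encoder = keras.Model(inputs, encoded)  # reuse the encoder as a feature extractor
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)
compressed_features = encoder.predict(X)  # shape: (500, 4)
print(compressed_features.shape)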

Contrastive Learning for Features

Contrastive learning learns features by pulling similar instances together in the feature space and pushing dissimilar ones apart. The resulting embedding space is learned directly from these comparisons. This approach is particularly appealing for use cases with large amounts of unlabeled data, such as customer activity analysis or outlier detection.
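
A minimal sketch of the underlying idea, assuming PyTorch is available; the embeddings and pair labels are random placeholders, and the margin-based loss shown is just one common formulation of contrastive learning.

import torch
import torch.nn.functional as F
def contrastive_loss(z1, z2, label, margin=1.0):
    # label = 1 for similar pairs, 0 for dissimilar pairs
    distance = F.pairwise_distance(z1, z2)
    # Pull similar pairs together; push dissimilar pairs at least `margin` apart
    return (label * distance.pow(2) +
            (1 - label) * F.relu(margin - distance).pow(2)).mean()
z1 = torch.randn(4, 8)  # placeholder embeddings from an encoder
z2 = torch.randn(4, 8)
label = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(contrastive_loss(z1, z2, label))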

11. Summary and Best Practices

Feature engineering is an essential step in improving the predictive performance of machine learning models. The key is to carefully construct features that reveal the hidden patterns, relationships, and anomalies unique to your data, lifting model quality and accuracy. The techniques covered here include handling missing data with advanced imputation, encoding complex categorical variables, scaling features beyond the standard approaches, and applying domain-specific transformations. Here's a quick recap:

  • Use advanced imputation and encoding for missing values and complex categorical data.
  • Handle skewed distributions and outliers with appropriate transformations.
  • Explore sequence data, text features, and graph-based data.
  • Domain understanding is essential; keep improving features as you gain more insight.

FAQs

How do I know which feature engineering technique to use?

Start by understanding your data type and domain context. For instance, time series data naturally benefits from sequence features (e.g., lagged values), while high-cardinality categorical data might warrant more advanced encoding methods such as cluster encoding.

Is feature engineering always necessary?

Yes, well-designed features are what make for high-performing models. Even where automatic feature selection is available, manual feature engineering often uncovers valuable insights that further improve prediction quality.

Can I employ several techniques of feature engineering in one analysis?

Of course. For instance, you can combine cluster encoding with other features, such as graph-based features, to improve the analysis of user behavior.

Under what circumstances is it beneficial to apply domain-adaptive models for extracting features?

Domain-adaptive models are effective when dealing with complex data such as text, images, or large volumes of raw data, where hand-crafting good features is difficult. Pre-trained models already capture detailed, domain-specific representations that you can reuse as features.