How to use a custom embedding model locally with LangChain?
LangChain has brought a real shift to the field of Natural Language Processing. It is a powerful framework designed to work with language models, and it makes it straightforward to integrate and use customized embedding models. In Natural Language Processing (NLP), embeddings are critical for understanding and representing a text's semantic meaning. If you want to build a custom chatbot, a recommendation system, or any other application that requires text processing, using a custom embedding model can significantly enhance your system's performance.
Solution 1:
from langchain.embeddings import HuggingFaceEmbeddings

modelPath = "BAAI/bge-large-en-v1.5"

# Create a dictionary with model configuration options, specifying to use the CPU for computations
# If using an Apple M1/M2, use device "mps" instead (this will use Apple Metal)
model_kwargs = {'device': 'cpu'}

# Create a dictionary with encoding options, setting 'normalize_embeddings' to True
encode_kwargs = {'normalize_embeddings': True}

# Initialize an instance of HuggingFaceEmbeddings with the specified parameters
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,        # Provide the pre-trained model's path
    model_kwargs=model_kwargs,   # Pass the model configuration options
    encode_kwargs=encode_kwargs  # Pass the encoding options
)
This is how you can use it locally: the model is downloaded once and loaded from the local cache on subsequent runs. Then you can go ahead and use it, for example with Chroma:
from langchain.vectorstores import Chroma

db = Chroma.from_documents(texts, embedding=embeddings)
retriever = db.as_retriever(
    search_type="mmr",  # Also test "similarity"
    search_kwargs={"k": 20},
)
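Once the retriever is set up, you can query it directly. This is a minimal usage sketch, assuming texts already holds your split documents and using a placeholder question:

docs = retriever.get_relevant_documents("How do embeddings work?")
for doc in docs:
    print(doc.page_content[:200])

Depending on your LangChain version, retriever.invoke("How do embeddings work?") is the newer equivalent of get_relevant_documents.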
Solution 2:
You can create your own class and implement methods such as embed_documents. If you strictly adhere to typing, you can extend the Embeddings class (from langchain_core.embeddings.embeddings import Embeddings) and implement the abstract methods there. You can find the class implementation here.
Below is a small working custom embedding class I used with semantic chunking.
from sentence_transformers import SentenceTransformer
from langchain_experimental.text_splitter import SemanticChunker
from typing import List

class MyEmbeddings:
    def __init__(self):
        self.model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self.model.encode(t).tolist() for t in texts]

embeddings = MyEmbeddings()
splitter = SemanticChunker(embeddings)
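If you want to strictly adhere to typing as mentioned above, the same idea can be expressed by subclassing the Embeddings base class. Here is a minimal sketch; the class name TypedEmbeddings is just an illustrative choice, and it assumes the same all-MiniLM-L6-v2 model:

from typing import List
from sentence_transformers import SentenceTransformer
from langchain_core.embeddings.embeddings import Embeddings
from langchain_experimental.text_splitter import SemanticChunker

class TypedEmbeddings(Embeddings):
    # Implements both abstract methods of LangChain's Embeddings interface
    def __init__(self, model_name: str = "sentence-transformers/all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Embed a batch of documents, one vector per text
        return [self.model.encode(t).tolist() for t in texts]

    def embed_query(self, query: str) -> List[float]:
        # Embed a single query string into a flat list of floats
        return self.model.encode(query).tolist()

splitter = SemanticChunker(TypedEmbeddings())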
Solution 3:
In order to use embeddings with something like LangChain, you need to include both the embed_documents and embed_query methods. Otherwise, routines that embed the query string, such as similarity search on a vector store, will fail.
Like so...
from sentence_transformers import SentenceTransformer
from typing import List

class MyEmbeddings:
    def __init__(self, model):
        self.model = SentenceTransformer(model, trust_remote_code=True)

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return [self.model.encode(t).tolist() for t in texts]

    def embed_query(self, query: str) -> List[float]:
        # Return a flat list of floats for a single query string
        return self.model.encode(query).tolist()
...
embeddings = MyEmbeddings('your model name')

chromadb = Chroma.from_documents(
    documents=your_docs,
    embedding=embeddings,
)
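With both methods defined, query-time operations also work, because the vector store calls embed_documents when indexing and embed_query when searching. A minimal usage sketch with a placeholder question:

results = chromadb.similarity_search("your question here", k=5)
for doc in results:
    print(doc.page_content)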
If you follow the steps above, you can efficiently load, integrate, and use your custom embedding model locally, giving your applications accurate and tailored text representations. Whether you are developing an NLP-based solution or a text classification system, LangChain makes it easy to plug in your own embeddings, and integrating a custom embedding model with LangChain opens up numerous opportunities in advanced text processing and NLP applications.
Thank you for reading the article.