A Coding Implementation to Build a Conversational Research Assistant with FAISS, Langchain, Pypdf, and TinyLlama-1.1B-Chat-v1.0

RAG-powered conversational research assistants address the limitations of traditional language models by combining them with information retrieval systems. The system searches through specific knowledge bases, retrieves relevant information, and presents it conversationally with proper citations. This approach reduces hallucinations, handles domain-specific knowledge, and grounds responses in retrieved text. In this tutorial, we will demonstrate building such an assistant using the open-source model TinyLlama-1.1B-Chat-v1.0 from Hugging Face, FAISS from Meta, and the LangChain framework to answer questions about scientific papers.
First, let’s install the necessary libraries:
!pip install langchain-community langchain pypdf sentence-transformers faiss-cpu transformers accelerate einops
Now, let’s import the required libraries:
import os
import torch
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chains import ConversationalRetrievalChain
from langchain_community.llms import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import pandas as pd
from IPython.display import display, Markdown
Next, we mount Google Drive so we can save papers there in a later step:
from google.colab import drive
drive.mount('/content/drive')
print("Google Drive mounted")
For our knowledge base, we’ll use PDF documents of scientific papers. Let’s create a function to load and process these documents:
def load_documents(pdf_folder_path):
    documents = []
    if not pdf_folder_path:
        print("Downloading a sample paper...")
        !wget -q https://arxiv.org/pdf/1706.03762.pdf -O attention.pdf
        pdf_docs = ["attention.pdf"]
    else:
        pdf_docs = [os.path.join(pdf_folder_path, f) for f in os.listdir(pdf_folder_path)
                    if f.endswith('.pdf')]
    print(f"Found {len(pdf_docs)} PDF documents")
    for pdf_path in pdf_docs:
        try:
            loader = PyPDFLoader(pdf_path)
            documents.extend(loader.load())
            print(f"Loaded: {pdf_path}")
        except Exception as e:
            print(f"Error loading {pdf_path}: {e}")
    return documents
documents = load_documents("")
Next, we need to split these documents into smaller chunks for efficient retrieval:
def split_documents(documents):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Split {len(documents)} documents into {len(chunks)} chunks")
    return chunks
chunks = split_documents(documents)
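You can optionally verify the split: each chunk should be at most about 1,000 characters (the chunk_size above), and it keeps the page-level metadata of the page it came from:
# Optional check: chunk sizes and preserved metadata for the first few chunks.
for chunk in chunks[:3]:
    print(len(chunk.page_content), "chars |", chunk.metadata)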
We’ll use sentence-transformers to create vector embeddings for our document chunks:
def create_vector_store(chunks):
    print("Loading embedding model...")
    embedding_model = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        model_kwargs={'device': 'cuda' if torch.cuda.is_available() else 'cpu'}
    )
    print("Creating vector store...")
    vector_store = FAISS.from_documents(chunks, embedding_model)
    print("Vector store created successfully!")
    return vector_store
vector_store = create_vector_store(chunks)
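Before wiring the index into a chain, it is worth confirming that retrieval returns sensible passages. Here is a minimal check using FAISS's similarity_search directly (the example query is just an illustration):
# Minimal retrieval check: query the FAISS index and preview the top matches.
for doc in vector_store.similarity_search("What is multi-head attention?", k=2):
    print(doc.metadata, "->", doc.page_content[:150], "...")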
Now, let’s load an open-source language model to generate responses. We’ll use TinyLlama, which is small enough to run on Colab but still powerful enough for our task:
def load_language_model():
    print("Loading language model...")
    model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    try:
        import subprocess
        print("Installing/updating bitsandbytes...")
        subprocess.check_call(["pip", "install", "-U", "bitsandbytes"])
        print("Successfully installed/updated bitsandbytes")
    except Exception:
        print("Could not update bitsandbytes, will proceed without 8-bit quantization")
    from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, pipeline
    import torch
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    if torch.cuda.is_available():
        try:
            quantization_config = BitsAndBytesConfig(
                load_in_8bit=True,
                llm_int8_threshold=6.0,
                llm_int8_has_fp16_weight=False
            )
            model = AutoModelForCausalLM.from_pretrained(
                model_id,
                torch_dtype=torch.bfloat16,
                device_map="auto",
                quantization_config=quantization_config
            )
            print("Model loaded with 8-bit quantization")
        except Exception as e:
            print(f"Error with quantization: {e}")
            print("Falling back to standard model loading without quantization")
            model = AutoModelForCausalLM.from_pretrained(
                model_id,
                torch_dtype=torch.bfloat16,
                device_map="auto"
            )
    else:
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.float32,
            device_map="auto"
        )
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_length=2048,
        temperature=0.2,
        top_p=0.95,
        repetition_penalty=1.2,
        return_full_text=False
    )
    from langchain_community.llms import HuggingFacePipeline
    llm = HuggingFacePipeline(pipeline=pipe)
    print("Language model loaded successfully!")
    return llm
llm = load_language_model()
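An optional smoke test confirms the wrapped pipeline generates text before we connect it to the retriever (the prompt below is just an illustration; generation can be slow on CPU):
# Quick smoke test of the LangChain-wrapped pipeline.
# llm.invoke works on recent LangChain versions; older versions also accept llm(prompt).
print(llm.invoke("Explain the attention mechanism in one sentence."))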
Now, let’s build our assistant by combining the vector store and language model:
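The wiring below is a minimal sketch (the ConversationalRetrievalChain setup, the k=3 retriever setting, and the closure-based chat history are illustrative choices): it retrieves the most relevant chunks for each question, keeps track of the conversation, and returns both the answer and the retrieved source documents so they can be cited.
def create_research_assistant(vector_store, llm):
    # Retrieve the top-k most similar chunks for each question (k=3 is an assumption).
    retriever = vector_store.as_retriever(search_kwargs={"k": 3})
    # The chain condenses the chat history plus the new question, retrieves
    # supporting chunks, and asks the LLM to answer from them.
    qa_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=retriever,
        return_source_documents=True,
    )
    chat_history = []

    def research_assistant(query, return_sources=False):
        result = qa_chain({"question": query, "chat_history": chat_history})
        answer = result["answer"]
        sources = result.get("source_documents", [])
        chat_history.append((query, answer))
        if return_sources:
            return answer, sources
        return answer

    return research_assistant
With the assistant in place, a small helper formats each answer together with a preview of the source chunks it was grounded in: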
import textwrap

def format_research_assistant_output(query, response, sources):
    output = f"\n{'=' * 50}\n"
    output += f"USER QUERY: {query}\n"
    output += f"{'-' * 50}\n\n"
    output += f"ASSISTANT RESPONSE:\n{response}\n\n"
    output += f"{'-' * 50}\n"
    output += f"SOURCES REFERENCED:\n\n"
    for i, doc in enumerate(sources):
        output += f"Source #{i+1}:\n"
        content_preview = doc.page_content[:200] + "..." if len(doc.page_content) > 200 else doc.page_content
        wrapped_content = textwrap.fill(content_preview, width=80)
        output += f"{wrapped_content}\n\n"
    output += f"{'=' * 50}\n"
    return output
research_assistant = create_research_assistant(vector_store, llm)
test_queries = [
    "What is the key idea behind the Transformer model?",
    "Explain self-attention mechanism in simple terms.",
    "Who are the authors of the paper?",
    "What are the main advantages of using attention mechanisms?"
]

for query in test_queries:
    response, sources = research_assistant(query, return_sources=True)
    formatted_output = format_research_assistant_output(query, response, sources)
    print(formatted_output)
In this tutorial, we built a conversational research assistant using Retrieval-Augmented Generation with open-source models. RAG enhances language models by integrating document retrieval, reducing hallucination, and ensuring domain-specific accuracy. The guide walks through setting up the environment, processing scientific papers, creating vector embeddings using FAISS and sentence transformers, and integrating an open-source language model like TinyLlama. The assistant retrieves relevant document chunks and generates responses with citations. This implementation allows users to query a knowledge base, making AI-powered research more reliable and efficient for answering domain-specific questions.