

Technical Document Retrieval using Qdrant, PyMuPDF and DSPy
Many of our customers manage a large corpus of technical documents for machines, detailing operation procedures and extensive usage guidelines. These documents are critical for daily operations, but their length and complexity make it challenging and time-consuming to manually search for specific information.
Out-of-the-box RAG solutions frequently fall short in delivering the quality of results needed, as they are typically designed to handle a broad range of document types and use cases, often lacking specialized domain knowledge. Custom solutions that leverage open-source tools and are tailored to the specific needs of the domain usually offer significantly better results.
These documents also often include images, diagrams and tables that are essential for understanding the full context. While text-based searches in vector databases can find relevant passages, they usually overlook the context provided by these visuals.
Introduction
In this article, we explore Qdrant for efficient storage and retrieval of embeddings, integrate gpt-4o-mini for Optical Character Recognition (OCR) to extract additional context from images, and combine the retrieved passages and image context with DSPy into a Retrieval-Augmented Generation (RAG) pipeline.
Why Qdrant?
When developing real-world applications, speed is crucial. By using the Hierarchical Navigable Small World (HNSW) indexing technique, Qdrant ensures rapid search times, even as the volume of data increases significantly. HNSW optimizes the search process through a multi-layered graph structure, reducing the time complexity of locating similar vectors. This allows Qdrant to scale efficiently to millions of vectors while maintaining swift query responses.
Qdrant further enhances performance through advanced methods like vector quantization, which minimize memory usage without compromising speed.
In fact, Qdrant achieves up to 15 times higher query throughput compared to pgvector, all while delivering high accuracy. These results highlight why Qdrant has gained popularity among organizations building efficient generative AI applications.
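Both the HNSW index and quantization can be configured when a collection is created with the Python client. Below is a minimal sketch with illustrative, untuned parameter values; the collection name is just a placeholder:

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# Sketch: tune the HNSW graph and enable scalar quantization at collection creation
client.create_collection(
    collection_name="manuals-demo",  # placeholder name
    vectors_config=models.VectorParams(size=3072, distance=models.Distance.COSINE),
    hnsw_config=models.HnswConfigDiff(
        m=16,              # edges per node in the HNSW graph
        ef_construct=128,  # build-time search breadth
    ),
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,  # keep 8-bit quantized vectors to reduce memory usage
            always_ram=True,
        )
    ),
)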
Now when it comes to retrieval systems, we also need to choose an embedding model that fits our task.
How do I choose the right embedding model?
A great place to start is the Massive Text Embedding Benchmark (MTEB) Leaderboard which provides an overview of embedding models, including both proprietary and open-source options.
The leaderboard ranks models based on their performance across a range of tasks, such as Reranking, Summarization and Retrieval. By selecting a model that performs well across these tasks, you can ensure that your embeddings are both accurate and versatile.

For our task, we have chosen the text-embedding-3-large model from OpenAI, which is well-suited for our document retrieval use case. It provides high-quality embeddings and ease of use through the Embeddings API.
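A quick illustration of the Embeddings API (this assumes OPENAI_API_KEY is set in your environment):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-large",
    input=["How do I reset the signal lamp?"],
)
vector = response.data[0].embedding
print(len(vector))  # 3072 dimensions, which is why the Qdrant collection below uses size=3072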
Now let's see how all of this comes together.
Prerequisites
- Installing Dependencies:
# Databases
pip install qdrant-client fastembed pymongo

# PDF processing
pip install pymupdf pymupdf4llm

# Text embedding
pip install openai tiktoken

# Utilities
pip install python-dotenv uuid6 tqdm
- Qdrant Setup: To get started we set up the Qdrant Server. We can use the official Docker image provided by Qdrant.
- MongoDB Setup: We set up MongoDB as our image storage.
We create the following docker-compose.yml:
services:
  qdrant:
    container_name: qdrant
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
      - "6334:6334"
    env_file:
      - .env
  mongodb:
    container_name: mongodb
    image: mongo:latest
    hostname: mongodb
    ports:
      - "27017:27017"
- Environment Variables: We store the required environment variables in a .env file, like this:
HOST=host.docker.internal
QDRANT_PORT=6333
MONGODB_PORT=27017
OPENAI_API_KEY=<your openai api key>
- Once your configuration is ready, we can start up both the Qdrant and MongoDB containers:
docker-compose up -d --build
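Optionally, we can run a quick sanity check that both services are reachable before ingesting anything. The sketch below assumes it is run from the host machine, hence localhost instead of host.docker.internal:

from pymongo import MongoClient
from qdrant_client import QdrantClient

# Verify Qdrant responds: a fresh instance returns an empty list of collections
qdrant = QdrantClient(url="http://localhost:6333", timeout=10)
print(qdrant.get_collections())

# Verify MongoDB responds: ping returns {'ok': 1.0} when the server is up
mongo = MongoClient("localhost", 27017)
print(mongo.admin.command("ping"))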
- Setup MongoDB Class: We store our images separately in a MongoDB collection.
A multi-modal embedding approach, though helpful for capturing both text and images in a single vector space, might lack exact matching between the two. For simplicity, and to ensure contextual alignment, we opted to store all images separately in a MongoDB collection and match each image to its text chunk by a shared UUID.
import os

import pymongo
from dotenv import load_dotenv
from pymongo.errors import DuplicateKeyError

load_dotenv()


class MongoDBWrapper:
    """Wrapper class for MongoDB Client."""

    def __init__(self):
        # HOST and MONGODB_PORT come from the .env file
        self.client = pymongo.MongoClient(
            os.getenv("HOST"), int(os.getenv("MONGODB_PORT"))
        )
        self.db_name = "my_pdfs"

        self.db = self.client[self.db_name]
        self.images_collection = self.db["images"]

        # Create compound unique index for machine and page
        self.images_collection.create_index(
            [("machine", pymongo.ASCENDING), ("page", pymongo.ASCENDING)], unique=True
        )

    def upsert(
        self,
        uuid: str,
        image: str,
        machine: str,
        page: int,
    ) -> None:
        """Upsert data in MongoDB."""
        try:
            self.images_collection.update_one(
                {"id": uuid, "machine": machine, "page": page},
                {"$set": {"image": image}},
                upsert=True,
            )
        except DuplicateKeyError:
            pass
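The retrieval step later in this article also looks images up again by their UUID via mongodb.get(...). That helper is not part of the snippet above; a minimal sketch of what it could look like, assuming the id field written by upsert is used as the lookup key:

    def get(self, uuid: str) -> dict:
        """Fetch the stored image document by its UUID (hypothetical helper used by process_query later)."""
        return self.images_collection.find_one({"id": uuid})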
Step-by-Step Guide: Document Retrieval
- Setup Chunk Class: We will use a simple Chunk class as a structured container to store the data extracted from PyMuPDF, organizing the elements we extract from PDF documents.
from typing import Optional

from PIL import Image


class Chunk:
    """Chunking Class for Qdrant to store data."""

    def __init__(
        self,
        texts: Optional[list[str]] = None,
        images: Optional[list[Image.Image]] = None,
        path: Optional[str] = None,
    ):
        self.texts = texts or []
        self.images = images or []
        self.path = path
- Qdrant Client Setup: We initialize the Qdrant Client with the following class:
import os

from dotenv import load_dotenv
from qdrant_client import QdrantClient

# Load environment variables
load_dotenv()


class QdrantWrapper:
    """Wrapper for Qdrant Client."""

    def __init__(self):
        self.url = f"http://{os.getenv('HOST')}:{os.getenv('QDRANT_PORT')}"
        self.client = QdrantClient(url=self.url, timeout=10)
        self.embedding_model_name = "text-embedding-3-large"
- Document Preprocessing: We preprocess our document by chunking it into images and text using PyMuPDF and preparing the data for embedding:
import logging
import re

import fitz
import pymupdf4llm
from PIL import Image
from tqdm import tqdm

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(message)s", datefmt="%d-%b-%y %H:%M:%S"
)


def chunk_document(
    self,
    file_path: str,
    dpi: int = 600,
    image_format: str = "png",
    table_strategy: str = "lines",
) -> list:
    """
    Extract text, tables and images from PDF file.

    Args:
        file_path (str): Path to the PDF file.
        dpi (int): Resolution for image extraction (default: 600).
        image_format (str): Format for extracted images (default: png).
        table_strategy (str): Strategy for table extraction (default: lines).
    """
    chunks = []
    doc = fitz.open(file_path)

    try:
        # Use PyMuPDF4LLM to read and convert PDF into markdown-like format
        md_reader = pymupdf4llm.to_markdown(
            doc,
            page_chunks=True,
            dpi=dpi,
            image_format=image_format,
            table_strategy=table_strategy,
        )

        # Iterate through pages and extract text, tables, and images
        for i, page in tqdm(enumerate(doc), total=len(doc), desc=f"Scraping {file_path} with pymupdf4llm"):
            page_data = md_reader[i]
            text = page_data["text"]
            cleaned_text = re.sub(r"\n{3,}", "\n\n", text).strip()

            pix = page.get_pixmap()
            img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
            chunks.append(Chunk([cleaned_text], [img], file_path))
    except Exception as e:
        logging.error(f"Failed to scrape {file_path} with pymupdf4llm. Falling back to PyMuPDF. Error: {e}")

        # Fallback method using basic PyMuPDF
        for i in tqdm(range(len(doc)), desc=f"Scraping {file_path} with PyMuPDF"):
            page = doc[i]
            text = page.get_text()
            cleaned_text = re.sub(r"\n{3,}", "\n\n", text).strip()

            pix = page.get_pixmap()
            img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
            chunks.append(Chunk([cleaned_text], [img], file_path))
    finally:
        doc.close()
    return chunks
import io

import tiktoken


def preprocess_chunks(self, chunks: list) -> tuple[list[str], list[bytes]]:
    """Preprocesses data chunks."""

    # Extract texts and images from chunks
    texts = [chunk.texts[0].strip().replace("\n \n", "").replace("\n", " ") for chunk in chunks]
    images = [chunk.images[0] for chunk in chunks]

    def _convert_image_to_binary(image) -> bytes:
        """Convert an image to binary data (PNG format)."""
        buffer = io.BytesIO()
        image.save(buffer, format="PNG")
        binary_string = buffer.getvalue()
        buffer.close()
        return binary_string

    def _num_tokens_from_string(string: str) -> int:
        """Calculate and validate the number of tokens in a text string."""
        encoding = tiktoken.get_encoding(
            tiktoken.encoding_for_model(self.embedding_model_name).name
        )
        num_tokens = len(encoding.encode(string))
        if num_tokens > 8192:
            raise ValueError(f"Input is too long ({num_tokens} > 8192)")
        return num_tokens

    def _sanitize(text: str) -> str:
        """Sanitize and return non-empty text."""
        if not text:
            return " "
        return text

    # Sanitize and validate texts
    sanitized_texts = []
    for text in texts:
        _num_tokens_from_string(text)
        sanitized_text = _sanitize(text)

        if sanitized_text == " \n":
            sanitized_text = "Empty string \n"

        sanitized_texts.append(sanitized_text)

    # Convert images to binary format
    binary_images = [_convert_image_to_binary(image) for image in images]
    return sanitized_texts, binary_images
- Embedding: We embed the preprocessed texts using OpenAI's text-embedding-3-large embedding model:
import os

import openai
from dotenv import load_dotenv
from qdrant_client.models import PointStruct
from uuid6 import uuid7

load_dotenv()

openai_client = openai.Client(api_key=os.getenv("OPENAI_API_KEY"))


def embed_and_upload(self, texts: list, images: list, machine: str) -> list[PointStruct]:
    """Embeds the preprocessed texts and uploads images to MongoDB."""
    # Initialize MongoDB wrapper
    mongodb = MongoDBWrapper()

    # Create embeddings using OpenAI's API
    embeddings_response = openai_client.embeddings.create(
        input=texts, model=self.embedding_model_name
    )

    points = []
    for idx, (data, text, image) in enumerate(zip(embeddings_response.data, texts, images)):
        # Generate a UUID for each point
        uuid_str = uuid7().hex

        # Calculate the page number
        page = idx + 1

        # Create a point structure for Qdrant
        point = PointStruct(
            id=uuid_str,
            vector=data.embedding,
            payload={"id": uuid_str, "page": page, "machine": machine, "text": text},
        )
        points.append(point)

        # Upsert image data into MongoDB
        mongodb.upsert(uuid=uuid_str, image=str(image), machine=machine, page=page)
    return points
- Upload to Qdrant: For the vector parameters we use size=3072 and distance=Distance.COSINE. To avoid timeout errors, we chunk the embeddings into a maximum of 200 points per request:
from qdrant_client.models import Distance, VectorParams

from qdrant_wrapper import QdrantWrapper

# Initialize Qdrant wrapper
qdrant = QdrantWrapper()
collection_name = "pdf-manuals"

# Create Qdrant collection
qdrant.client.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=3072, distance=Distance.COSINE),
)

# Get chunks
chunks = qdrant.chunk_document(file_path="MLC.pdf")

# Preprocess chunks
texts, images = qdrant.preprocess_chunks(chunks)

# Embed and upload embeddings
points = qdrant.embed_and_upload(texts, images, machine="MLC")

# Upload embeddings to Qdrant in batches of at most 200 points
if len(points) >= 200:
    batches = [points[i : i + 200] for i in range(0, len(points), 200)]

    for batch in batches:
        qdrant.client.upsert(collection_name, batch)
else:
    qdrant.client.upsert(collection_name, points)
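As a side note, newer versions of qdrant-client can take care of this batching themselves via upload_points; a minimal sketch, assuming a client version that provides it:

# Sketch: let the client batch the upload (equivalent to the manual 200-point loop above)
qdrant.client.upload_points(
    collection_name=collection_name,
    points=points,
    batch_size=200,
)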
- Search in Qdrant: We want to implement a query filter to match the exact machine we're searching for. By applying a must filter, we ensure that only relevant technical documents are included, effectively preventing any overlap with unrelated PDF documents:
from qdrant_client import models


def search(
    self,
    collection_name: str,
    query: str,
    machine: str,
    limit: int = 10,
) -> list:
    """Search in Qdrant collection."""
    search_result = self.client.search(
        collection_name=collection_name,
        query_filter=models.Filter(
            must=[
                models.FieldCondition(
                    key="machine",
                    match=models.MatchValue(value=machine),
                )
            ]
        ),
        query_vector=openai_client.embeddings.create(
            input=[query], model=self.embedding_model_name
        )
        .data[0]
        .embedding,
        limit=limit,
    )
    return search_result


# Initialize Qdrant wrapper
qdrant = QdrantWrapper()

# Search example
search_result = qdrant.search(
    collection_name="pdf-manuals",
    query="The signal lamp of my MLC machine is blinking green. What does that mean?",
    machine="MLC",
)
top_result = search_result[0]
- Output: We get the top result from the search:
image = '<PIL.PngImagePlugin.PngImageFile image mode=RGB size=596x842 at 0x7FE5368C3E20>'
machine = 'MLC'
page = 4
passage = '**1.1.1** **Schlüsselberechtigung** |Farbe|Anwender|Berechtigung / Funktion| |---|---|---| |grau (kein Schlüssel)|Weber (weaver)|Maschine in Produktion halten (minimale Geräteeinstellungen vornehmen, Produkti- onseinstellungen1) ansehen)| |blau|Einrichter (fitter)|Maschine mechanisch und textiltechnisch einrichten und verwalten (Geräte einstellen und Produktionseinstellungen1) vornehmen, Update und Diagnose durchführen, Zusatz- informationen ansehen)| |gelb|Vorgesetzter (supervisior)|Maschine statistisch und netzwerktechnisch einrichten und verwalten (Zeit und Schich- ten konfigurieren, Netzwerkeinstellungen vornehmen)| 1) Produktionseinstellungen = Auftrag, Artikel, Muster **1.2** **Signallampe** Die Maschine verfügt über folgende Störungsanzeigen: Multifunktions-Signalleuchte Störungsmeldungen MÜDATA-Display **RGB Multifunktions-Signalleuchte** Lampe leuchtet Lampe blink |Farbe|Muster|Fehler| |---|---|---| |dunkel||Maschine läuft| |weiss||System bereit, Stop| |weiss||Initialisierung läuft| |rot||No...'
query = 'The signal lamp of my MLC machine is blinking green. What does that mean?'
score = 0.46674937
Image to Text
To improve the retrieval accuracy, we could use OpenAI's vision capabilities to extract additional context from images.
Let's say we have the following table, where we want to query for specific information:

Our uploaded text embeddings do not include the symbols (Muster) from the second column as context, which makes it impossible to get an accurate result when we later feed the query and context to an LLM.
- OpenAI Vision: We will use gpt-4o-mini to reformat this image to text as an OCR task:
import base64
import io
import os

import requests
from dotenv import load_dotenv
from PIL import Image

load_dotenv()


class GPT:
    """Wrapper class to interact with OpenAI API Vision."""

    def __call__(self, image: Image.Image):
        """Analyze image with OpenAI API."""
        buffered = io.BytesIO()
        image.save(buffered, format="PNG")
        image_bytes = buffered.getvalue()

        # Encode image to base64
        base64_image = base64.b64encode(image_bytes).decode("utf-8")

        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}",
        }

        payload = {
            "model": "gpt-4o-mini",
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": "Format this image precisely to text. Be very particular with your formatting.",
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/png;base64,{base64_image}"
                            },
                        },
                    ],
                },
            ],
            "max_tokens": 400,
            "top_p": 0.1,
        }

        response = requests.post(
            "https://api.openai.com/v1/chat/completions", headers=headers, json=payload
        )
        response.raise_for_status()
        return response.json().get("choices")[0].get("message").get("content")


openai_vision = GPT()
image_context = openai_vision(top_result.image)
- Output: We did not get a perfect result, but we were able to improve the quality of the context as input to an LLM:
**RGB Multifunktions-Signalleuchte**

⊙ Lampe leuchtet
⊗ Lampe blinkt

| Farbe | Muster | Fehler |
|-------|--------|--------|
| dunkel | Maschine läuft | |
| weiss | System bereit, Stop | |
| weiss | Initialisierung läuft | |
| rot | Not-HALT | |
| rot | Webstellanabdeckung geöffnet | |
| blau | Auftragsende | |
| grün | Schussfadenbruch | |
| grün | Kettfadenbruch / Scheibeltstopp | |
| pink | Handling Mode / Webstellanabdeckung geschlossen | |
| pink | Handling Mode und Webstellanabdeckung geöffnet | |
| gelb | Hilfsfadenbruch | |
| türkis | Aufwickelsicherung |
- DSPy Module: We can use DSPy to feed all the relevant information to gpt-4o-mini.
Let's first define a class to store the relevant information.
import ast
import io
from typing import Optional

from PIL import Image
from qdrant_client.http.exceptions import UnexpectedResponse

from mongodb_wrapper import MongoDBWrapper
from qdrant_wrapper import QdrantWrapper


class Response:
    """Response object for Qdrant search."""

    def __init__(
        self,
        query: str,
        machine_type: str,
        passage: Optional[str] = None,
        image: Optional[Image.Image] = None,
        page: Optional[int] = None,
        score: Optional[float] = None,
    ):
        self.query = query
        self.machine_type = machine_type
        self.passage = passage
        self.image = image
        self.page = page
        self.score = score

    def process_query(self, k: int) -> list["Response"]:
        """Process the query and return the top k results."""
        qdrant = QdrantWrapper()
        mongodb = MongoDBWrapper()

        # Vector search and show results
        try:
            search_results = qdrant.search(
                "pdf-manuals", self.query, self.machine_type
            )
        except UnexpectedResponse as e:
            raise ValueError(f"Something went wrong with the Qdrant API: {e}")

        top_results = []
        for result in search_results[:k]:
            payload = result.payload
            passage = payload["text"]
            page = payload["page"]

            # Get image with the same UUID
            collection_item = mongodb.get(payload["id"])

            byte_data = ast.literal_eval(collection_item["image"])
            image = Image.open(io.BytesIO(byte_data)) if byte_data else None

            top_results.append(
                Response(
                    self.query,
                    self.machine_type,
                    passage,
                    image,
                    page,
                    result.score,
                )
            )
        return top_results
Let's define our retriever model and the input and output schemas, then build out the RAG pipeline.
import io
from typing import Optional

import dspy
from pydantic import BaseModel, Field

from gpt_wrapper import GPT
from qdrant_wrapper import Response

mini = dspy.OpenAI(model="gpt-4o-mini", max_tokens=500)
dspy.configure(lm=mini, trace=["Test"])


class QdrantRMClient(dspy.Retrieve):
    """Custom Qdrant Retrieval Model Client."""

    def __init__(self, machine_type: str, k: int = 3) -> None:
        super().__init__(k=k)
        self.machine_type = machine_type

    def forward(self, query: str, k: int) -> Response:
        """Forward pass of the QdrantRMClient."""
        k = k if k else self.k
        response = Response(query, self.machine_type).process_query(k)
        return dspy.Prediction(passages=response)


class Input(BaseModel):
    """Input Schema for the RAG model."""

    context: str = Field(description="May contain relevant facts")
    question: str = Field()
    image_context: Optional[str] = Field(description="May contain additional facts")


class Output(BaseModel):
    """Output Schema for the RAG model."""

    answer: str = Field(description="The answer for the question")
    pdf_path: str = Field()
    image: bytes = Field()
    page: str = Field()
    machine_type: str = Field()


class QASignature(dspy.Signature):
    """Answer the question based on the context provided."""

    input: Input = dspy.InputField()
    output: Output = dspy.OutputField()


class RAG(dspy.Module):
    """PDF Manuals RAG Pipeline."""

    def __init__(self, machine_type: str, max_hops: int = 3) -> None:
        super().__init__()

        self.predictor = dspy.TypedChainOfThought(QASignature)
        self.qdrant_retrieve = QdrantRMClient(machine_type)
        self.openai_vision_retrieve = GPT()
        self.max_hops = max_hops

    def forward(self, question: str) -> dspy.Prediction:
        """Forward pass of the RAG model."""
        # Retrieve the top k passages from Qdrant and keep the best one
        retrieval = self.qdrant_retrieve(question, k=self.max_hops).passages[0]

        buffered = io.BytesIO()
        retrieval.image.save(buffered, format="PNG")
        image_bytes = buffered.getvalue()

        input_fields = Input(
            context=retrieval.passage,
            question=question,
            image_context=self.openai_vision_retrieve(retrieval.image),
        )
        prediction = self.predictor(input=input_fields)
        return dspy.Prediction(
            answer=prediction.output.answer,
            pdf_path=retrieval.path,
            image=image_bytes,
            page=retrieval.page,
            machine_type=retrieval.machine_type,
        )


if __name__ == "__main__":
    machine_type = "MLC"
    question = "The signal lamp of my MLC machine is blinking green. What does that mean?"
    rag = RAG(machine_type)
    pred = rag.forward(question)
    print(pred.answer)
LLM Output
'The blinking green signal lamp on your MLC machine indicates a shot thread break or a warp thread break/separation sheet stop.'
Limitations
OCR tools, including those used by language models, sometimes struggle with extracting non-standard characters, small icons, or special symbols (like the circles in this case). If the symbols are rendered in a smaller or less clear font, OCR may fail to recognize them. The accuracy largely depends on the complexity and quality of the image you are analyzing and, where needed, on how well the image has been preprocessed.
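A minimal sketch of the kind of preprocessing that can help, upscaling the page image and boosting its contrast with Pillow before handing it to the vision model (the scale factor and contrast value are illustrative):

from PIL import Image, ImageEnhance

def preprocess_for_ocr(image: Image.Image, scale: int = 2, contrast: float = 1.5) -> Image.Image:
    """Upscale and increase contrast so small symbols survive OCR better (illustrative values)."""
    resized = image.resize((image.width * scale, image.height * scale), Image.LANCZOS)
    return ImageEnhance.Contrast(resized).enhance(contrast)

# Usage with the GPT wrapper defined above:
# image_context = openai_vision(preprocess_for_ocr(top_result.image))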
Conclusion
In this article, we have demonstrated how to build a technical document retrieval system using Qdrant for storing and retrieving text embeddings, OpenAI's gpt-4o-mini for OCR, PyMuPDF for PDF extraction, and DSPy to combine the retrieved passages and image context into a unified pipeline.
Now if we want to further enhance the retrieval of our documents, we could fine-tune a similarity model on a subset of our data using Quaterion, a framework designed for fine-tuning similarity learning models. This would allow us to better align our embeddings with the specific nuances of our dataset.
If you want to learn more, check out Quaterion's Quick Start Guide.
