
How to run text embeddings on a PDF and upload to Pinecone Vector Database

1 min read · Oct 3, 2023

import os
import re
import openai
import pinecone
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

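# Initialize OpenAI (openai.Embedding below requires the pre-1.0 openai package)
# Assumes the key is stored in the OPENAI_API_KEY environment variable
openai.api_key = os.environ["OPENAI_API_KEY"]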
MODEL = "text-embedding-ada-002"

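# Initialize Pinecone (pinecone.init requires pinecone-client before 3.0)
# Assumes the key is stored in the PINECONE_API_KEY environment variable
pinecone.init(api_key=os.environ["PINECONE_API_KEY"], environment='gcp-starter')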

# Define the index name
index_name = "hs-codes"

# Create the index if it doesn't exist
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=1536)

# Instantiate the index
index = pinecone.Index(index_name)

# Define a function to preprocess text
def preprocess_text(text):
    # Replace consecutive spaces, newlines and tabs
    text = re.sub(r'\s+', ' ', text)
    return text

def process_pdf(file_path):
    # create a loader
    loader = PyPDFLoader(file_path)
    # load your data
    data = loader.load()
    # Split your data up into smaller documents with Chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    documents = text_splitter.split_documents(data)
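    # Extract each chunk's text and normalize whitespace
    # (str(doc) would embed the Document repr, metadata included)
    texts = [preprocess_text(doc.page_content) for doc in documents]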
    return texts

# Define a function to create embeddings
def create_embeddings(texts):
    embeddings_list = []
    for text in texts:
        res = openai.Embedding.create(input=[text], engine=MODEL)
        embeddings_list.append(res['data'][0]['embedding'])
    return embeddings_list

# Define a function to upsert embeddings to Pinecone
def upsert_embeddings_to_pinecone(index, embeddings, ids):
    # Pair each id with its embedding; zip stops at the shorter sequence
    index.upsert(vectors=list(zip(ids, embeddings)))

# Process a PDF and create embeddings
file_path = "your_pdf_here.pdf"  # Replace with your actual file path
texts = process_pdf(file_path)
embeddings = create_embeddings(texts)

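# Upsert the embeddings to Pinecone, one unique id per chunk
# (a single id would upsert only the first chunk)
ids = [f"{file_path}-{i}" for i in range(len(embeddings))]
upsert_embeddings_to_pinecone(index, embeddings, ids)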
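
A long PDF can produce hundreds of chunks, and a single upsert call carrying all of them may run into request size limits. The sketch below is not part of the original script: it groups (id, embedding) pairs into fixed-size batches. The helper name `batched_vectors` and the default batch size of 100 are assumptions, so check the Pinecone documentation for current limits.

```python
from typing import Iterator, List, Sequence, Tuple

Vector = Tuple[str, List[float]]


def batched_vectors(
    ids: Sequence[str],
    embeddings: Sequence[List[float]],
    batch_size: int = 100,  # assumed default, not a documented limit
) -> Iterator[List[Vector]]:
    """Yield lists of (id, embedding) pairs, at most batch_size per list."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # zip silently drops extras, so fail loudly on a length mismatch
    if len(ids) != len(embeddings):
        raise ValueError("ids and embeddings must have the same length")
    vectors = list(zip(ids, embeddings))
    for start in range(0, len(vectors), batch_size):
        yield vectors[start:start + batch_size]
```

Each yielded batch could then be passed to `index.upsert(vectors=batch)` in a loop. The length check is deliberate: with mismatched inputs, `zip` would quietly discard the extra chunks instead of raising an error.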

Written by Felix Lu
