Summarize large files - an introduction

ChatPDF providers, where you can question large files with large language models (LLMs), are sprouting like mushrooms. The technique is mainly based on vector embeddings stored in a vector index or vector database: based on the question, semantically relevant chunks from the file are passed to the LLM so it can compose an answer.

While this technique is cool, it is limited when you ask a question that spans the entire text, such as generating a summary, since that requires not a couple of chunks, but the full text.
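To make the limitation concrete, here is a toy sketch of the retrieval step. The three-dimensional vectors and the `top_k_chunks` helper are made up for illustration; real embeddings come from an embedding model and have hundreds of dimensions. The point is only that the LLM gets to see the top few chunks and nothing else.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_chunks(question_vector, chunk_vectors, k=2):
    """Return the indices of the k chunks most similar to the question."""
    ranked = sorted(
        range(len(chunk_vectors)),
        key=lambda i: cosine_similarity(question_vector, chunk_vectors[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy "embeddings", one per chunk of the file.
chunk_vectors = [
    [0.9, 0.1, 0.0],  # chunk 0
    [0.0, 1.0, 0.0],  # chunk 1
    [0.7, 0.3, 0.1],  # chunk 2
    [0.0, 0.0, 1.0],  # chunk 3
]
question_vector = [1.0, 0.2, 0.0]

selected = top_k_chunks(question_vector, chunk_vectors, k=2)
print(selected)  # [0, 2]
```

Chunks 1 and 3 never reach the model, which is fine for a targeted question but not for a summary of the whole file.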

I will explore how exactly to ask such questions of a large file.

Load the PDF

For this I’ll use the Beginners Guide to Home Gardening by Audree Khalishah:

36 pages
17,759 GPT-3 tokens

import tiktoken
from pypdf import PdfReader

reader = PdfReader("Home-Garden-Manual-2022-1.pdf")
text = "\n".join([page.extract_text() for page in reader.pages])
encoding = tiktoken.encoding_for_model("gpt-4")
tokens = encoding.encode(text)

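We introduce a helper function based on the openai module that asks a gpt-3.5-turbo model (here an Azure OpenAI deployment) for a concise summary of the given text. The helper reads the Azure OpenAI key, API version, endpoint, and deployment name from a .env file; the LangChain example further down expects the OpenAI API key in the environment variable OPENAI_API_KEY.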

import openai

from dotenv import dotenv_values

# Read the Azure OpenAI settings from a local .env file
config = dotenv_values(".env")

# Synchronous client, so summarize_text can be called like a regular function
client = openai.AzureOpenAI(
    api_key=config["AZURE_OPENAI_KEY"],
    api_version=config["AZURE_OPENAI_VERSION"],
    azure_endpoint=config["AZURE_OPENAI_ENDPOINT"],
)

def summarize_text(content):
    """Ask the model for a concise summary of `content`."""
    _prompt = (
        f"""Write a concise summary of the following:\n\n"{content}"\n\nCONCISE SUMMARY:"""
    )
    completion = client.chat.completions.create(
        model=config["AZURE_OPENAI_DEPLOYMENT"],
        temperature=0,
        messages=[{"role": "user", "content": _prompt}],
    )
    return completion.choices[0].message.content

Stuff

Stuff is the practice of simply putting the entire text into a single request. Here this fails because the full text has too many tokens (more than 4097). What we can do is take the first 3500 tokens, leaving some room for the LLM to respond:

summarize_text(encoding.decode(tokens[:3500]))

This manual, written and illustrated by Audree Khalishah and edited by Jenn Halpin, is a beginner’s guide to home gardening specifically for residents of Carlisle, Pennsylvania. It is part of ongoing research conducted by Dickinson College and the Burpee Foundation to understand residents’ attitudes towards home gardening. […]

This is impressive already, but it was generated from only the first part of the text.
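The 3500-token cutoff follows from a simple budget: the context limit minus whatever is reserved for the answer. The sketch below spells that arithmetic out with a hypothetical `truncate_to_budget` helper and a stand-in token list of the same length as the manual. It ignores the few tokens taken up by the prompt template itself, so treat the numbers as approximate.

```python
def truncate_to_budget(tokens, context_limit=4097, reserved_for_answer=597):
    """Keep only as many leading tokens as fit next to the reserved answer space."""
    budget = context_limit - reserved_for_answer
    if budget <= 0:
        raise ValueError("nothing left for the input after reserving answer space")
    return tokens[:budget]


# Stand-in for the real token list: same length as the manual (17,759 tokens).
fake_tokens = list(range(17759))

kept = truncate_to_budget(fake_tokens)
coverage = len(kept) / len(fake_tokens)

print(len(kept))           # 3500
print(round(coverage, 2))  # 0.2
```

In other words, the Stuff summary above is based on roughly the first fifth of the document.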

MapReduce

A common practice for working with large datasets is to break the task down, here by splitting the text into chunks. In the map phase, the question is asked of every chunk; in the reduce phase, the intermediate results are combined into a single answer.

Now we loop over 6 chunks of tokens to generate 6 summaries, and then join all summaries into a single request.

import numpy as np

summaries = [
    summarize_text(encoding.decode(token_chunk))
    for token_chunk in np.array_split(tokens, 6)
]
summarize_text("\n\n".join(summaries))

This manual is a beginner’s guide to home gardening in Carlisle, Pennsylvania. It provides step-by-step instructions for starting a garden, information on pests and diseases, and guidance on soil preparation and care. It also includes tips for buying or DIY-ing seedlings, controlling pests and diseases, and remedies for gardening. Additionally, it discusses different garden models, rainwater harvesting, and composting. The manual aims to inspire and support Carlisle residents in their gardening endeavors.
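The two phases can also be tried out without any API call. The following is a minimal sketch in which `split_evenly`, `map_reduce`, and the stand-in "summarizer" are all hypothetical helpers written for this illustration; the splitting rule is meant to mirror what `np.array_split` does (the first chunks get one extra item when the length does not divide evenly).

```python
def split_evenly(items, n_chunks):
    """Split a list into n_chunks contiguous parts whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def map_reduce(items, n_chunks, map_fn, reduce_fn):
    """Apply map_fn to every chunk, then combine the partial results with reduce_fn."""
    partials = [map_fn(chunk) for chunk in split_evenly(items, n_chunks)]
    return reduce_fn(partials)


# Stand-in "summarizer": keep only the first word of each chunk.
words = "soil seeds water pests compost harvest mulch".split()
result = map_reduce(
    words,
    n_chunks=3,
    map_fn=lambda chunk: chunk[0],
    reduce_fn=lambda parts: " / ".join(parts),
)
print(result)  # soil / pests / harvest

# Chunk sizes for a document of 17,759 tokens split six ways.
sizes = [len(chunk) for chunk in split_evenly(list(range(17759)), 6)]
print(sizes)  # [2960, 2960, 2960, 2960, 2960, 2959]
```

With six chunks, each map request carries just under 3000 tokens, which leaves room for the prompt and the answer within the 4097-token limit.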

LangChain MapReduce

What we have done above is a manual MapReduce that you can easily replicate with LangChain:

from langchain.chains.summarize import load_summarize_chain
from langchain.docstore.document import Document
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")

docs = [
    Document(page_content=encoding.decode(token_chunk))
    for token_chunk in np.array_split(tokens, 6)
]

chain = load_summarize_chain(llm, chain_type="map_reduce")
chain.run(docs)

This manual is a beginner’s guide to home gardening in Carlisle, Pennsylvania. It provides information on overcoming barriers to gardening and offers technical assistance, resources, and support for residents. The manual covers topics such as soil preparation, pest and disease control, different planting methods, and composting. It also provides tips for buying or DIY-ing seedlings, watering techniques, using mulch, companion planting, and attracting beneficial insects. Additionally, it discusses common pests and diseases in gardens and provides remedies for repelling pests and controlling diseases. The manual also offers remedies for gardening, tips for weed control, and different garden models based on time and funding constraints. It includes information on different crops, building a rain barrel, and composting at home.

Conclusion

MapReduce is a technique that can be used with an LLM to ask questions to the entire text of large files.

Some further questions:

Both summaries took about 30 seconds to generate. Can we further optimize for response time?
Can we further improve the output quality?
Is there a way to measure and compare the output quality?
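On the response-time question, one possible direction is to run the map phase concurrently instead of one chunk after another, since the per-chunk requests do not depend on each other. The sketch below only illustrates the pattern with `asyncio.gather`; `fake_summarize` is a hypothetical stand-in that sleeps briefly instead of calling a model, and whether this helps in practice depends on rate limits of the API in use.

```python
import asyncio


async def fake_summarize(text, delay=0.05):
    """Stand-in for an async LLM call: wait a little, then return a dummy result."""
    await asyncio.sleep(delay)
    return text.upper()


async def summarize_concurrently(chunks):
    """Start all per-chunk requests at once; results come back in input order."""
    return await asyncio.gather(*(fake_summarize(chunk) for chunk in chunks))


chunks = ["soil", "seeds", "water"]
partials = asyncio.run(summarize_concurrently(chunks))
print(partials)  # ['SOIL', 'SEEDS', 'WATER']
```

The reduce step still has to wait for all partial summaries, so the total time is roughly the slowest map request plus the final request.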
