Watch less, learn more: Summarize YouTube videos with LLMs
Turn hours of watching into minutes of reading
I have one big YouTube problem.
(Two if I also count the annoying ads.)
I have too many videos in my “Watch later” playlist.
It’s a shame, because I:
feel bad for having unfinished business,
miss out on useful knowledge.
The solution?
I wrote a script that, given a YouTube video URL, uses a large language model (LLM) to summarize the video in seconds.
Then I can decide if I want to watch the whole video or purge it from my playlist.
I’ve already saved hours with this method.
I’ll show you how.
1. How the script works from a bird's-eye view
You add a YouTube video URL.
The script fetches the transcript of the video at that URL.
The transcript gets passed to an LLM.
The LLM summarizes the video for you.
Everything takes place in a Streamlit app, so it looks cool:
2. How the script works step-by-step
You can find the GitHub repo here; in this article we’ll go through these steps:
The prerequisites (can’t be avoided)
Basic variables and some background knowledge (nearing the fun stuff)
Bringing the script to life with functions (the fun stuff)
Applying Streamlit (makes the project look cool)
2.1 The prerequisites (can’t be avoided)
You know the agonizing feeling when you just want to code, but first you hAVe tO sEt sOMe stuFF uP…?
Let’s get over it as quickly as possible.
First, install these libraries:1
pip install langchain langchain-anthropic langchain-community python-dotenv streamlit tiktoken transformers youtube-transcript-api
Then import the libraries/modules:
import os
from dotenv import load_dotenv
from langchain_anthropic import ChatAnthropic
from langchain_community.document_loaders import YoutubeLoader
from langchain.chains.summarize import load_summarize_chain
from langchain.schema import Document
import streamlit as st
import tiktoken
You’ll also need two things set up:
An API key for a large language model.
I use Claude 3.5 Sonnet by Anthropic, but feel free to use the LLM of your choice.
You can create an Anthropic account here, an OpenAI account here, then claim your API key. But really, you may use any other LLM that’s integrated with LangChain. Just make sure you update the code accordingly.
A .env file where you store your LLM’s API key. If you need a refresher, here’s my guide on what .env is and how to create it. Add your API key to your .env like this:
ANTHROPIC_API_KEY=totally-made-up-API-key-value-pls-create-your-own
2.2 Basic variables and some background knowledge (nearing the fun stuff)
Alright, moving on:
load_dotenv()
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")
llm = ChatAnthropic(model="claude-3-5-sonnet-20240620")
llm_context_window = 195_000
With this, we accomplish three things:
Access the .env file (load_dotenv()), and store the API key in a variable (ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")).
Pick the LLM that’ll summarize the videos (llm = ChatAnthropic(model="claude-3-5-sonnet-20240620")).
Define the LLM’s context window (llm_context_window = 195_000).
The context window determines how much text (=video transcript), converted to tokens, the LLM can keep in mind at once.
We’ll check llm_context_window against the video transcript’s token count. If the transcript fits within the context window, the LLM will summarize the video in one sitting. If not, we’ll split the transcript into manageable chunks (each fitting within the context window), summarize them one by one, and then summarize these summaries to get the final video summary. More on this later.
Note: Claude 3.5 Sonnet’s context window is 200,000 tokens, but in the variable I defined only 195,000. That’s because I left some tokens for the prompt that instructs the LLM to summarize the video. Why did I leave 5k tokens for the prompt? I scientifically picked this number at random. We could probably do with less, but it’s not important now.
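If you want to make that token budget explicit, here’s a tiny sketch (the names model_context_window and prompt_reserve are mine, not part of the script):
model_context_window = 200_000  # Claude 3.5 Sonnet's advertised context window
prompt_reserve = 5_000          # headroom for the summarization prompt
llm_context_window = model_context_window - prompt_reserve  # 195_000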
Okay.
We can finally proceed to the fun part!
2.3 Bringing the script to life with functions (the fun stuff)
This gon’ be eazy-peazy.
(Zs instead of Ss. I’ve heard it’s cooler.)
We’ll work with three functions:
get_video_docs() loads data (=docs) about a YouTube video, like the transcript (just what we need!) or the video’s YouTube id.
count_tokens() counts the number of tokens in the video transcript received from get_video_docs(), so the script can decide how the LLM should summarize the video (all at once or split up).
generate_video_summary() outputs the LLM’s video summary.
Let’s go over these in more detail.
2.3.1 get_video_docs() gets the video transcript
def get_video_docs(video_url: str) -> list[Document]:
"""
Creates a list of Documents containing video data
like the video's transcript or YouTube id.
"""
loader = YoutubeLoader.from_youtube_url(
video_url, add_video_info=False
)
video_docs = loader.load()
return video_docs
The get_video_docs() function expects a YouTube video URL as input, then uses LangChain’s YoutubeLoader to create a list of LangChain Documents.
The first (and only) Document holds the transcript of the video that the LLM can summarize.
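If you want to peek at what the loader returns, something like this works (the URL and the printed metadata are just an example):
docs = get_video_docs("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
print(docs[0].metadata)            # e.g. {'source': 'dQw4w9WgXcQ'}
print(docs[0].page_content[:200])  # first 200 characters of the transcript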
2.3.2 count_tokens() counts the number of tokens in the video transcript
def count_tokens(video_transcript: str, encoding_name: str = "cl100k_base") -> int:
"""Returns the number of tokens in a text string."""
encoding = tiktoken.get_encoding(encoding_name)
num_tokens = len(encoding.encode(video_transcript))
return num_tokens
Here we use tiktoken2 to count how many tokens are in the video transcript.
We do this so that, in the next step, the script can pick a summarization strategy for the LLM.
count_tokens() first gets an encoding by the name cl100k_base.
The encoding specifies how the video transcript is converted into tokens. We use the cl100k_base encoding because newer OpenAI models (like GPT-4) use it.
The encode() method splits the video transcript into a list of tokens; since the result is a list, we can count its length with len() to get the number of tokens in the transcript (num_tokens).
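A quick sanity check of count_tokens() (the count holds for the cl100k_base encoding):
print(count_tokens("Hello, world!"))  # 4 tokens: "Hello", ",", " world", "!"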
2.3.3 generate_video_summary() creates the video summary
def generate_video_summary(video_url: str, chain_type: str = "stuff") -> str:
"""
Generates the summary of the video.
Before the generation, checks the token count and decides which approach to use
for summary: "map_reduce" if the video transcript token count surpasses
the used LLM's context window, otherwise it defaults to "stuff".
"""
video_docs = get_video_docs(video_url)
video_transcript = video_docs[0].page_content
num_tokens = count_tokens(video_transcript)
if num_tokens > llm_context_window:
chain_type = "map_reduce"
chain = load_summarize_chain(llm, chain_type=chain_type)
result = chain.invoke(video_docs)
video_summary = result["output_text"]
return video_summary
We utilize the previous two functions, sprinkle in something new, and soon we get to see the video summaries.
get_video_docs() gets data about the YouTube video, and we extract the video’s transcript into a separate variable:
video_docs = get_video_docs(video_url)
video_transcript = video_docs[0].page_content
Then we count the number of tokens in video_transcript with count_tokens():
num_tokens = count_tokens(video_transcript)
And we check num_tokens against the LLM’s context window:
if num_tokens > llm_context_window:
chain_type = "map_reduce"
If the token count is within llm_context_window, we use the summarization approach called stuff (defined as a default argument in generate_video_summary()).
With stuff, we give the whole video transcript to the LLM: “Hey, LLM, here’s the full video transcript, summarize it for me, pls!”
We can do this when the video transcript is not that long and it fits into the LLM’s context window together with the summarization prompt.
map_reduce comes to the rescue when the video transcript is enormous, and the LLM can’t handle it all at once.
In such cases, the video transcript is split into smaller batches that each get summarized separately, then the LLM summarizes these separate summaries: “Hey, LLM, here’s video_transcript_part_1, video_transcript_part_2, video_transcript_part_n; summarize each of them, then give me a final summary of the small summaries!”
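One caveat worth knowing: YoutubeLoader typically returns the whole transcript as a single Document, and map_reduce summarizes per Document. If you want the split into batches to be explicit, here’s a hedged sketch using LangChain’s RecursiveCharacterTextSplitter (the chunk sizes are illustrative, not tuned):
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Roughly 100k characters per chunk, so each piece plus the prompt
# fits comfortably inside the context window.
splitter = RecursiveCharacterTextSplitter(chunk_size=100_000, chunk_overlap=1_000)
split_docs = splitter.split_documents(video_docs)
chain = load_summarize_chain(llm, chain_type="map_reduce")
result = chain.invoke(split_docs)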
Once the script has chosen a summarization strategy, we use load_summarize_chain to get the video summary:
chain = load_summarize_chain(llm, chain_type=chain_type)
result = chain.invoke(video_docs)
video_summary = result["output_text"]
return video_summary
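Putting it all together, a call looks like this (the URL is just an example):
summary = generate_video_summary("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
print(summary)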
At this point we’re cool, but we’ll take it one step further with Streamlit to be even cooler.
2.4 Applying Streamlit (makes the project look cool)
Let’s make this project look like something you’d happily share with your friends without making them want to scratch their eyes out.
This is where Streamlit comes in; it was built so you and I can easily showcase our data projects. Thanks, Streamlit! ✌️
(It’s not an affiliate link.)
(It’d be nice, though.)
Now we place the main code inside if __name__ == "__main__": and apply the Streamlit magic to it:
if __name__ == "__main__":
    st.title("⚡ Summarize YouTube videos ⚡")
    st.subheader("So you can save time for what matters.")

    # Accept the common YouTube URL formats (with/without "www.", plus youtu.be)
    valid_prefixes = (
        "https://www.youtube.com/watch?v=",
        "https://youtube.com/watch?v=",
        "https://youtu.be/",
    )

    with st.form("my_form"):
        video_url = st.text_area(
            "Paste the URL of the YouTube video you want summarized:"
        )
        submitted = st.form_submit_button("Get my summary")
        if submitted and video_url.startswith(valid_prefixes):
            with st.spinner("Summarizing..."):
                video_summary = generate_video_summary(video_url)
            st.info(video_summary)
        if submitted and not video_url.startswith(valid_prefixes):
            st.toast("Please provide a valid YouTube video URL!", icon="😉")
If we run the script, this is what will happen:
We add a YouTube video URL and click on the “Get my summary” button.
If the URL is valid, we get the summary. If not, we get an error message.
Streamlit’s really easy to use; let me explain briefly.
st.title() and st.subheader() give you these nice, well, title and subheader:
st.form() gives you the whole form you see below; st.text_area() is where you add the video URL; and st.form_submit_button() is the “Get my summary” button:
st.spinner() is the loading message you see while the LLM summarizes the video:
st.toast() pops up a toast message if you provide an invalid YouTube video URL:
And st.info() displays the summary (a Kurzgesagt video in this example):
You can run your Streamlit app with:
python -m streamlit run main.py
Or simply:
streamlit run main.py
And voilà!
3. So what?
From now on you can summarize YouTube videos just like that, Chief.
I’m honestly curious how much time you manage to save with this solution. By all means, let me know! Comment here, DM me on Substack, or hit me up on LinkedIn!
And if you know someone who might find it useful, you can share this link with them; there they can summarize YouTube videos without having to code the script. (All they’ll need is an Anthropic API key.)
Anyways.
Have an awesome day! ✌️
For some reason, I had to separately install transformers for map_reduce to work. Possibly at this point you don’t know yet what I’m talking about; don’t worry, we’ll get there.
Okay. So tiktoken is a tokenizer created by OpenAI. It was designed to be used with OpenAI’s LLMs. Yet, in this post I pair it with an Anthropic model. Although there are Anthropic-specific solutions, I found it easier to implement tiktoken (without compromising the project’s outcome). Laziness? Yes. Am I ashamed? No.*
*A little bit, yes.
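(For the curious: one Anthropic-specific option is the token-counting endpoint in the anthropic Python SDK. A hedged sketch, since the exact method depends on your SDK version:)
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
count = client.messages.count_tokens(
    model="claude-3-5-sonnet-20240620",
    messages=[{"role": "user", "content": "Your video transcript here."}],
)
print(count.input_tokens)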