Why Claude and ChatGPT Still Can't Really Watch Videos, and What Works Instead
Large language models like Claude and ChatGPT cannot watch videos natively the way humans do. Uploading an MP4 file to Claude often returns an "unsupported file type" error, while ChatGPT mostly extracts captions rather than analyzing visual content frame-by-frame. Even Google's Gemini, which accepts video natively, typically samples frames at a fixed rate of roughly one per second, which works fine for a lecture but misses critical details in fast-cut videos like TikTok reels or music videos.
The practical pattern that has emerged in 2026 is not "give the model the video." Instead, the working approach is to give the model what matters from the video: speech converted to text, visuals extracted as deduplicated keyframes, and metadata organized as a manifest. Then ask your questions. This shift reflects a fundamental constraint of how vision language models (VLMs) work: they do not stream 24 frames per second into their context window. They receive a bounded set of images and text within a strict token budget, which limits how much visual information they can process at once.
What Are the Main Approaches to Video Analysis Right Now?
Several competing strategies have emerged for getting language models to reason about video content. Each has distinct tradeoffs in speed, cost, privacy, and visual fidelity. The choice depends on what you are trying to accomplish: quick analysis, detailed editing, compliance, or building a product.
- Native Multimodal APIs (Gemini, GPT-4o): Google's Gemini and OpenAI's GPT-4o accept video directly in their APIs and web interfaces. Gemini's default sampling is approximately one frame per second, which works well for meeting recordings and lectures but struggles with fast-paced content like sports highlights or music videos with rapid cuts. Cost is bundled into subscription or API vision pricing, and you have no control over which frames the model selects.
- Scene-Aware Local Preprocessing (claude-real-video): An open-source tool called claude-real-video (crv) detects scene changes and deduplicates similar frames before sending them to any language model. Instead of sending 600 near-duplicate frames from a 10-minute static screencast, it collapses them to a handful of unique shots. The tool runs locally on your machine, preserving privacy, and works with Claude, ChatGPT, Gemini, or local models.
- Transcript-First with Selective Frames: For open-ended video questions, the most effective approach combines a full transcript with strategically chosen keyframes. This avoids the failure mode of transcript-only analysis, which answers dialogue but remains blind to on-screen code, charts, or visual details never mentioned in speech.
- DIY Local Processing: For air-gapped or compliance-critical work, you can use ffmpeg for scene detection and Whisper for transcription without uploading anything to a vendor. You lose automated deduplication and manifest generation, but retain complete control. This approach is reasonable for internal compliance reviews.
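The DIY route can be sketched as two ffmpeg invocations built from Python. This is a minimal sketch under stated assumptions: the file names, the output layout, and the scene threshold of 0.3 are illustrative choices, not values from any particular tool. It relies on ffmpeg's `select` filter and its `scene` change score; `eq(n,0)` keeps the opening frame so the first shot is not lost. The commands are only constructed here, not executed.

```python
import shlex


def scene_frames_cmd(video, out_dir, threshold=0.3):
    """Build an ffmpeg command that writes one JPEG per detected scene change."""
    # Keep frame 0 plus any frame whose scene-change score exceeds the threshold.
    vf = f"select='eq(n,0)+gt(scene,{threshold})',showinfo"
    return [
        "ffmpeg", "-i", video,
        "-vf", vf,
        # Emit only the selected frames; newer ffmpeg builds offer
        # "-fps_mode vfr" as the replacement for "-vsync vfr".
        "-vsync", "vfr",
        f"{out_dir}/frame_%04d.jpg",
    ]


def audio_cmd(video, out_wav):
    """Build an ffmpeg command that extracts mono 16 kHz audio for transcription."""
    return ["ffmpeg", "-i", video, "-vn", "-ac", "1", "-ar", "16000", out_wav]


# Hypothetical file names, printed as copy-pasteable shell commands.
print(shlex.join(scene_frames_cmd("input.mp4", "frames")))
print(shlex.join(audio_cmd("input.mp4", "audio.wav")))
```

To run either command, pass the list to `subprocess.run(cmd, check=True)`. The extracted WAV file can then be handed to Whisper locally, so nothing leaves the machine.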
How to Prepare Video for Language Model Analysis
If you want to use a language model to analyze video content, the most practical workflow involves preprocessing the video locally before feeding it to the model. This approach can cut token costs by more than an order of magnitude and can improve accuracy by eliminating redundant frames.
- Scene Detection and Deduplication: Use ffmpeg to identify scene changes, then apply pixel-difference deduplication to remove near-identical frames. For a typical 60-second video at 30 frames per second, this reduces roughly 1,800 raw frames (60 seconds times 30 frames) to 25 to 40 unique keyframes. The tool claude-real-video automates this process with a single command: crv "https://www.youtube.com/watch?v=...".
- Audio Transcription: Extract speech as text. Use embedded subtitles when the video has them; otherwise transcribe with Whisper, which can struggle with domain-specific jargon and overlapping speakers. Optional audio preservation (--keep-audio flag) saves the audio track for models like Gemini and GPT-4o that accept audio, since a transcript alone loses tone and music.
- Manifest Generation: Create a structured metadata file listing frame filenames, timestamps, and transcript segments. This allows the language model to cite specific frames when answering questions, reducing hallucination. The manifest acts as a reference guide that keeps the model grounded in actual visual evidence.
- Token Budget Planning: A naive approach of sending 30 frames per second for 60 seconds at roughly 1,500 tokens per image would consume approximately 2.7 million tokens per minute of video, which is unusable. Scene-aware preprocessing reduces this to 40,000 to 100,000 tokens total for the same content, making analysis affordable.
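The manifest step in the list above might look like the following. This is a sketch, not the format any specific tool emits: the field names (`file`, `time_s`, `transcript`), the input shapes, and the sample data are all hypothetical. Each keyframe is paired with whatever transcript segment was being spoken at its timestamp, so the model can tie a visual claim to both a filename and a moment in the speech.

```python
import json


def build_manifest(source, frames, segments):
    """Combine keyframes and transcript segments into one JSON-ready dict.

    frames:   list of (filename, timestamp_seconds)
    segments: list of (start_seconds, end_seconds, text)
    """
    entries = []
    for filename, t in sorted(frames, key=lambda f: f[1]):
        # Attach the transcript text spoken while this frame was on screen.
        spoken = [text for start, end, text in segments if start <= t < end]
        entries.append({"file": filename, "time_s": t, "transcript": " ".join(spoken)})
    return {
        "source": source,
        "frame_count": len(entries),
        "frames": entries,
        "segments": [{"start_s": s, "end_s": e, "text": x} for s, e, x in segments],
    }


# Hypothetical sample data for illustration only.
manifest = build_manifest(
    "demo.mp4",
    frames=[("frame_0002.jpg", 12.0), ("frame_0001.jpg", 0.0)],
    segments=[(0.0, 5.0, "Welcome."), (10.0, 15.0, "Here is the chart.")],
)
print(json.dumps(manifest, indent=2))
```

Sending this JSON alongside the images gives the model a fixed list of filenames it is allowed to cite.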
Why Does Fixed-Rate Sampling Fail for Fast-Cut Video?
The default behavior of native video APIs like Gemini reveals a mismatch between how models process video and what users expect. One-frame-per-second sampling suits lecture recordings and slow-paced demos, where the visual content changes gradually. For other content types it creates predictable failure modes.
If frames are sampled only once per second, a 15-second reel with six rapid cuts will likely lose any shot that starts and ends between two samples. At that same rate, a 10-minute static presentation with a single slide generates approximately 600 near-duplicate frames, wasting tokens and context window space. Repeated shots in A-B-A editing patterns get sent twice, further inflating token costs. Scene-aware deduplication addresses these problems by detecting when the visual content actually changes, rather than sampling at a fixed interval.
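The reel and slideshow cases can be simulated with a few lines of arithmetic. This is an illustrative sketch: the cut timestamps are invented, and real APIs may place their samples differently than the evenly spaced times assumed here. A shot counts as missed when no sample time falls inside it.

```python
def sample_times(duration, fps=1.0):
    """Evenly spaced sample timestamps in [0, duration)."""
    return [k / fps for k in range(int(duration * fps) + 1) if k / fps < duration]


def shots_from_cuts(cut_times, duration):
    """Turn a list of cut timestamps into (start, end) shot intervals."""
    bounds = [0.0] + sorted(cut_times) + [duration]
    return list(zip(bounds[:-1], bounds[1:]))


def missed_shots(cut_times, duration, fps=1.0):
    """Shots that contain no sample time at the given sampling rate."""
    samples = sample_times(duration, fps)
    return [
        (start, end)
        for start, end in shots_from_cuts(cut_times, duration)
        if not any(start <= t < end for t in samples)
    ]


# Hypothetical 15-second reel with six cuts, two of them in quick bursts.
cuts = [2.0, 2.4, 2.9, 7.0, 7.3, 7.8]
print("shots:", len(shots_from_cuts(cuts, 15.0)))
print("missed at 1 fps:", missed_shots(cuts, 15.0, fps=1.0))

# A 10-minute static presentation sampled at 1 fps.
print("frames sent for a 600 s slideshow:", len(sample_times(600.0, 1.0)))
```

With these made-up cut times, two of the seven shots fall entirely between samples, while the static slideshow produces 600 frames of the same slide.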
The practical implication is that for any video with significant visual editing, you should preprocess locally before uploading to a language model. This is not a limitation of the models themselves, but rather a recognition that raw video is inefficient input for systems designed to process bounded sets of images and text.
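A minimal sketch of the deduplication idea, under loud assumptions: frames are represented as flat lists of grayscale values (hypothetical tiny thumbnails), similarity is the mean absolute pixel difference, and the threshold of 10 is arbitrary. This is not claude-real-video's actual algorithm, which the text here does not specify. Each new frame is compared against every frame already kept, so a return to shot A after shot B is recognized as a repeat.

```python
def mean_abs_diff(a, b):
    """Average absolute difference between two equal-length pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def dedupe(frames, threshold=10.0):
    """Return names of frames that differ from every previously kept frame.

    frames: list of (name, pixels) where pixels is a flat list of 0-255 ints.
    """
    kept = []
    for name, pixels in frames:
        # Compare against all kept frames, not just the previous one,
        # so A-B-A patterns do not send shot A twice.
        if all(mean_abs_diff(pixels, kp) > threshold for _, kp in kept):
            kept.append((name, pixels))
    return [name for name, _ in kept]


# Hypothetical 4x4 thumbnails: two near-identical dark frames, one bright
# frame, then a return to the dark shot.
frames = [
    ("a1.jpg", [10] * 16),
    ("a2.jpg", [12] * 16),
    ("b1.jpg", [200] * 16),
    ("a3.jpg", [11] * 16),
]
print(dedupe(frames))
```

Comparing against all kept frames costs more than comparing only against the previous frame, but the kept set stays small by construction, so the extra work is modest.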
What Are the Real-World Tradeoffs Between Approaches?
Choosing the right video analysis method requires understanding the specific tradeoffs in cost, privacy, speed, and control. A marketer wanting a quick summary of a webinar has different needs than a compliance officer reviewing sensitive footage or a video editor building an automated pipeline.
- Speed and Convenience: Native Gemini video upload is fastest for a one-off analysis. You paste a video URL or upload a file, and the model returns an answer within seconds. The tradeoff is that you have no control over frame selection, and the model may miss fast cuts or visual details.
- Privacy and Control: Local preprocessing with claude-real-video or DIY ffmpeg plus Whisper keeps video on your machine and never uploads to a vendor. The tradeoff is that you spend time running local tools and managing files. This approach is essential for compliance, medical, or legal video review.
- Token Cost: Local preprocessing cuts vision tokens sharply compared to raw frame dumps. A 60-second video that would cost about 2.7 million tokens sent raw (30 frames per second at roughly 1,500 tokens per image) costs 40,000 to 100,000 tokens after deduplication, a reduction of roughly 27 to 68 times. This difference is substantial when processing multiple videos or building a product.
- Accuracy on Visual Details: Scene-aware frame selection catches visual changes that fixed-interval sampling misses. For tasks like finding when a UI bug appears on screen or identifying the exact moment a speaker changes slides, local preprocessing with deduplication outperforms native APIs.
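The token figures quoted in the list above follow from simple arithmetic, sketched here. The inputs are the article's own numbers (30 frames per second, 60 seconds, roughly 1,500 tokens per image, 40,000 to 100,000 tokens after preprocessing); actual per-image token costs vary by model and image size, so treat the per-image figure as an assumption.

```python
def raw_tokens(duration_s, fps, tokens_per_image):
    """Tokens needed to send every frame of a clip as an image."""
    return duration_s * fps * tokens_per_image


def reduction_factor(raw, preprocessed):
    """How many times fewer tokens the preprocessed version uses."""
    return raw / preprocessed


raw = raw_tokens(duration_s=60, fps=30, tokens_per_image=1500)
print(f"raw frame dump: {raw:,} tokens")

# Range quoted for the same clip after scene-aware preprocessing.
for budget in (100_000, 40_000):
    print(f"{budget:,} tokens -> {reduction_factor(raw, budget):.1f}x fewer")
```

The same two functions can be reused to estimate whether a given clip fits a model's context window before any upload happens.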
This survey of available tools and approaches shows that no single solution is best for all use cases. For Q&A and research, frames plus transcript plus manifest is optimal. For video editing and automated cuts, transcript-first tools like video-use are appropriate. For the fastest cloud analysis, Gemini video is reasonable. For privacy-critical work, local preprocessing is required.
What Limitations Should You Know About?
Even the most effective approaches have real constraints. claude-real-video is young, currently at version 0.1.x with no formal releases, so verify it on your own content types before deploying it in production pipelines. Whisper can still struggle with domain-specific jargon and overlapping speakers, which makes embedded subtitles preferable when available. Language models can hallucinate when given sparse frames, so ask the model to cite frame filenames as evidence in its answers to reduce false claims.
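Frame citations can also be checked mechanically. The sketch below assumes frames are cited by filename with a .jpg, .jpeg, or .png extension and that the list of real frame filenames is at hand; the sample answer and filenames are hypothetical. Any cited file that was never sent to the model is a strong hint of a fabricated claim.

```python
import re

# Matches tokens such as frame_0003.jpg; the extension list is an assumption.
FRAME_PATTERN = re.compile(r"\b[\w-]+\.(?:jpg|jpeg|png)\b")


def cited_frames(answer):
    """Set of frame-like filenames mentioned in a model answer."""
    return set(FRAME_PATTERN.findall(answer))


def unknown_citations(answer, known_files):
    """Cited filenames that do not exist among the frames actually provided."""
    return sorted(cited_frames(answer) - set(known_files))


# Hypothetical model answer and frame list.
answer = "The bug appears in frame_0003.jpg and again in frame_0042.jpg."
known = ["frame_0001.jpg", "frame_0003.jpg"]
print(unknown_citations(answer, known))
```

An empty result does not prove the answer is right; it only confirms that every cited frame exists, which still leaves the content of the claim for a human to verify.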
Copyright considerations matter: only process content you have rights to analyze. The --cookies flag in preprocessing tools is for your own authenticated access, not for sharing credentials. Finally, no video analysis tool should replace human review on legal, medical, or safety-critical content. These applications require human judgment and accountability that automated systems cannot provide.
The broader lesson is that vision language models have real constraints that require thoughtful engineering to work around. Understanding these constraints and choosing the right preprocessing strategy can mean the difference between a usable system and one that is too slow, too expensive, or too inaccurate for practical work.