The Problem: Video Content Doesn’t Scale
Every day, thousands of hours of video are published across YouTube, TikTok, Instagram, X, and other platforms. Buried inside are competitor mentions, product reviews, pricing signals, customer pain points, expert insights, and buying intent — data that teams across your organization need. But video data extraction today is broken:
- Sales teams manually watch webinars to find lead signals.
- Market researchers hire interns to catalog competitor mentions.
- Content teams scrub through hours of footage to pull quotes.
Key Takeaways
- Define a custom schema (JSON) to extract exactly the data you need from any video.
- 2-phase AI pipeline: prompt compilation (cached) → structured extraction (Pydantic-enforced) guarantees consistent output.
- Works with online videos (/v1/extract/video) and uploaded files (/v1/extract/file), supporting YouTube, TikTok, Instagram, X, and many more.
- Prompt caching means repeat extractions are instant: define a schema once, extract from hundreds of videos.
- Cost-effective: at 100 extractions per credit, processing videos at scale is extremely cheap.
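To make the first takeaway concrete, here is what a custom schema might look like. This is an illustrative sketch: the field names are ours, and the exact top-level format is an assumption; the documented rule is that every field declares a type and a description.

```python
import json

# Illustrative custom extraction schema (field names are examples, not
# part of the API). Each field declares a "type" and a "description",
# which the Extract API uses to steer the model.
schema = {
    "main_topic": {
        "type": "String",
        "description": "Primary topic discussed in the video, in 5-10 words",
    },
    "competitor_mentions": {
        "type": "Array",
        "description": "Names of competitor products or companies mentioned",
    },
    "has_pricing_info": {
        "type": "Boolean",
        "description": "Whether the speaker discusses pricing",
    },
}

print(json.dumps(schema, indent=2))
```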
Why Not Just Use a Standard LLM?
You could paste a transcript into a standard LLM and ask for structured data. It works for one video, but it breaks at scale:
- Inconsistent output shape: Standard LLMs return slightly different JSON keys, structures, and formatting every time. You can’t reliably pipe it into a database or API.
- No schema enforcement: The Extract API uses Pydantic to enforce your exact schema. Every response is guaranteed to match your field names, types, and nesting.
- No transcript pipeline: You have to manually get the transcript, paste it, and copy the result. The Extract API handles transcript retrieval, caching, and extraction in a single call.
- No prompt caching: Every standard LLM call re-generates the prompt. VidNavigator caches the optimized extraction prompt, so repeat schemas are faster and cheaper.
- No batch automation: The Extract API is a REST endpoint. Loop over 1,000 video URLs, feed results into your pipeline. No copy-paste needed.
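The batch-automation point can be sketched in a few lines. This is a hedged example, not official client code: the base URL, the Bearer auth scheme, and the request-body field names (`video_url`, `schema`, `what_to_extract`) are assumptions; only the endpoint path comes from the docs above.

```python
import json
from urllib import request

API_BASE = "https://api.vidnavigator.com"  # assumed base URL
API_KEY = "YOUR_API_KEY"                   # assumed auth scheme

def build_body(video_url: str, schema: dict, what_to_extract: str = "") -> dict:
    """Request body for /v1/extract/video; field names are assumptions."""
    body = {"video_url": video_url, "schema": schema}
    if what_to_extract:
        body["what_to_extract"] = what_to_extract
    return body

def extract(video_url: str, schema: dict) -> dict:
    req = request.Request(
        f"{API_BASE}/v1/extract/video",
        data=json.dumps(build_body(video_url, schema)).encode(),
        headers={"Authorization": f"Bearer {API_KEY}",
                 "Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:  # real network call; needs a valid key
        return json.load(resp)

# Batch automation: loop over URLs and feed each result into your pipeline.
# for url in video_urls:
#     row = extract(url, schema)
#     pipeline.ingest(row)
```

Because the endpoint returns schema-shaped JSON every time, the results can go straight into a database insert with no per-video cleanup.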
How It Works: The 2-Phase Pipeline
Phase 1 — Prompt Compilation (one-time, cached)
The API takes your schema and optional what_to_extract instruction and generates an optimized pair of AI prompts. This compiled “extraction plan” is cached with a 2-hour TTL using a fingerprint of your schema. The next time you send the exact same schema within the cache window, the compilation step is skipped entirely.
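The fingerprint-plus-TTL behavior described above can be sketched as follows. The hash algorithm and cache layout here are assumptions; only the fingerprinting and the 2-hour expiry come from the text.

```python
import hashlib
import json
import time

TTL_SECONDS = 2 * 60 * 60  # compiled plans expire after 2 hours
_plan_cache: dict = {}     # fingerprint -> (created_at, compiled_plan)

def fingerprint(schema: dict) -> str:
    """Key order doesn't matter: identical schemas map to the same plan."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def get_plan(schema: dict) -> str:
    key = fingerprint(schema)
    hit = _plan_cache.get(key)
    if hit and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]  # cache hit: the compilation step is skipped
    plan = f"plan-{key[:12]}"  # stand-in for the real prompt compilation
    _plan_cache[key] = (time.time(), plan)
    return plan
```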
Phase 2 — Structured Extraction
The cached prompt template is filled with the video’s transcript text, then sent to the AI model with strict structured output enforcement. The result is validated JSON that exactly matches your custom schema.

Use Case Templates
1. Lead Generation
Built for sales and BD teams. Extract companies, decision-makers, pricing signals, pain points, buying intent, and calls-to-action from sales calls, webinars, or product demos.
2. Market Research
Competitive intelligence for product and strategy teams. Map competitor mentions, feature claims, pricing strategies, target audiences, and objections addressed in industry talks and reviews.
3. Content & Creator Analysis
Designed for marketing and content teams. Capture hooks, key quotes, content format, sponsored product mentions, and audience engagement cues from creator videos and branded content.
4. AI Pipeline / RAG Ingestion
For AI builders and data engineers. Produce vector-ready summaries, named entities, factual claims, topic labels, language codes, and sentiment.
5. Brand & E-Commerce Monitoring
Track brand mentions, promotional codes, creator recommendations, audience demographic cues, and purchase intent signals.
Online Videos vs. Uploaded Files
Extract from Online Videos
Use the /v1/extract/video endpoint to extract data directly from public video URLs (YouTube, TikTok, Instagram, X, etc.).
Note: The /v1/extract/video endpoint requires the video to already have a transcript. If the video doesn’t have native captions, call /v1/transcribe first to generate a transcript via speech-to-text.
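The note above implies a two-step flow for videos without captions. A minimal sketch; how you detect a missing transcript (an upfront check or an error from the extract endpoint) is up to your code:

```python
def plan_calls(has_transcript: bool) -> list:
    """Which endpoints to call, in order, for an online video."""
    calls = []
    if not has_transcript:
        calls.append("/v1/transcribe")   # generate a transcript via speech-to-text
    calls.append("/v1/extract/video")    # then run the structured extraction
    return calls
```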
Extract from Uploaded Files
The /v1/extract/file endpoint works identically to /v1/extract/video but takes a file_id instead of a URL. The file must be uploaded and transcribed first via the file upload endpoints.
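Since the two endpoints differ only in their input, a small helper can pick between them. The request-body field names are assumptions; the endpoint paths come from the docs.

```python
def choose_endpoint(schema: dict, video_url: str = "", file_id: str = ""):
    """Pick the extract endpoint and request body for one input."""
    if bool(video_url) == bool(file_id):
        raise ValueError("pass exactly one of video_url or file_id")
    if video_url:
        return "/v1/extract/video", {"video_url": video_url, "schema": schema}
    return "/v1/extract/file", {"file_id": file_id, "schema": schema}
```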
Schema Rules
To ensure high accuracy and strict adherence, your JSON schemas must follow these rules:
- Max 10 root fields
- Max 3 nesting levels (level 3 must be primitive types only)
- Max 10 subfields per Object
- Supported types: String, Number, Boolean, Integer, Array, Object, Enum
- Every field requires both type and description
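The rules above can be checked client-side before sending a request. A sketch, with one assumption: that nested Object fields are declared under a "fields" key.

```python
PRIMITIVES = {"String", "Number", "Boolean", "Integer", "Enum"}

def check_schema(fields: dict, level: int = 1) -> None:
    """Raise ValueError if the schema breaks the documented limits."""
    if level > 3:
        raise ValueError("max 3 nesting levels")
    if len(fields) > 10:
        raise ValueError("max 10 fields per object")
    for name, spec in fields.items():
        if "type" not in spec or "description" not in spec:
            raise ValueError(f"{name}: every field needs a type and a description")
        if level == 3 and spec["type"] not in PRIMITIVES:
            raise ValueError(f"{name}: level 3 must be a primitive type")
        if spec["type"] == "Object":
            check_schema(spec["fields"], level + 1)  # "fields" key is an assumption
```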
Prompt Caching & Performance
Every extraction schema you send is fingerprinted. The resulting hash is used to look up a previously compiled prompt plan.
- The first call with a new schema has ~2–3 s of compilation overhead.
- All subsequent calls with the same schema skip compilation entirely (effectively instant).
- Plans are cached for 2 hours (TTL) and are automatically recompiled when they expire.
Best Practices
- Write specific field descriptions: The better your descriptions, the more accurate the extraction. Instead of “topic”, write “Primary topic discussed in the video, in 5–10 words”.
- Use Enum types for classification fields instead of free-text Strings. Enums constrain the AI output to your predefined values, eliminating inconsistency.
- Start with a simple schema and add fields iteratively. Test with 2–3 fields first, verify accuracy, then expand.
- Use what_to_extract to guide the AI’s focus. This optional instruction steers the model toward specific parts of the transcript, improving relevance.
- Write descriptions in your target language. The output is returned in the same language as your schema descriptions (99+ languages supported).
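The Enum advice can look like this in practice. The "values" key is our assumption about how Enum options are declared; the point is that the allowed set is fixed up front.

```python
# A classification field as an Enum instead of a free-text String.
sentiment_field = {
    "sentiment": {
        "type": "Enum",
        "values": ["positive", "neutral", "negative"],  # assumed option syntax
        "description": "Overall sentiment of the speaker toward the product",
    }
}

ALLOWED = set(sentiment_field["sentiment"]["values"])

def is_valid(value: str) -> bool:
    """Enum output is constrained to the predefined values."""
    return value in ALLOWED
```

With a String field you might get "positive", "Positive", or "pretty upbeat" across runs; the Enum pins every response to one of three values.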
Pricing: Built for Scale
Each extraction counts as 1 video analysis. With VidNavigator:
- 1 credit = 100 video extractions/analyses
- Instant (0s) compilation on cached schemas.

