Drop a YouTube, Loom, Vimeo, or direct video URL. We download it, transcribe the audio, describe the frames, and let you ask questions in plain English.
YouTube, Loom, Vimeo, TikTok, X, Instagram, direct MP4 โ anything yt-dlp can pull.
Whisper large-v3-turbo on a private RTX 4090. Fast, accurate, no third-party API.
Local vision-language model summarizes what's on screen at each timestamp.
"Summarize this." "What was said about X?" "What's at 2:30?" Answers cite both audio and visuals.