Skyline-9/Visionary-Vids
Multi-modal transformer approach for natural-language-query-based joint video summarization and highlight detection
This project helps video content creators and editors quickly identify and extract the most important segments from long video footage. You provide a natural language query describing what you're looking for, and it outputs precise video clips that match your description and highlight key moments. This is ideal for anyone who needs to efficiently create shorter versions of videos or find specific events within them.
No commits in the last 6 months.
Use this if you need to rapidly summarize long videos or pinpoint specific highlights using simple text descriptions, without manually scrubbing through footage.
Not ideal if you primarily work with image data, require extremely granular frame-by-frame editing, or are not comfortable with command-line tools for setup and operation.
Stars: 17
Forks: 2
Language: Jupyter Notebook
License: —
Category: transformers
Last pushed: May 23, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/Skyline-9/Visionary-Vids"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
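If you'd rather query the endpoint from a script than shell out to curl, here is a minimal Python sketch using only the standard library. It targets the unauthenticated tier described above; the response schema isn't documented on this page, so it simply pretty-prints whatever JSON comes back.

import json
import urllib.request

# Endpoint for this repo's quality record, copied from the curl example above.
URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/Skyline-9/Visionary-Vids"

# Unauthenticated tier: 100 requests/day, no key needed.
with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.load(resp)

# The response fields aren't documented here, so just pretty-print the payload.
print(json.dumps(data, indent=2))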
Higher-rated alternatives
kyegomez/RT-X
Pytorch implementation of the models RT-1-X and RT-2-X from the paper: "Open X-Embodiment:...
kyegomez/PALI3
Implementation of PALI3 from the paper "PALI-3 Vision Language Models: Smaller, Faster, Stronger"
chuanyangjin/MMToM-QA
[🏆Outstanding Paper Award at ACL 2024] MMToM-QA: Multimodal Theory of Mind Question Answering
lyuchenyang/Macaw-LLM
Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration
Muennighoff/vilio
🥶Vilio: State-of-the-art VL models in PyTorch & PaddlePaddle