Skyline-9/Visionary-Vids
Multi-modal transformer approach for natural-language-query-based joint video summarization and highlight detection
This project helps video content creators and editors quickly identify and extract the most important segments from long video footage. You provide a natural language query describing what you're looking for, and it outputs precise video clips that match your description and highlight key moments. This is ideal for anyone who needs to efficiently create shorter versions of videos or find specific events within them.
No commits in the last 6 months.
Use this if you need to rapidly summarize long videos or pinpoint specific highlights using simple text descriptions, without manually scrubbing through footage.
Not ideal if you primarily work with image data, require extremely granular frame-by-frame editing, or are not comfortable with command-line tools for setup and operation.
Stars: 17
Forks: 2
Language: Jupyter Notebook
License: —
Category: transformers
Last pushed: May 23, 2024
Commits (30d): 0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/transformers/Skyline-9/Visionary-Vids"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
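If you'd rather query the endpoint from a script than shell out to curl, here is a minimal Python sketch using only the standard library. It targets the unauthenticated tier described above; the response schema isn't documented on this page, so it simply pretty-prints whatever JSON comes back.

import json
import urllib.request

# Endpoint for this repo's quality record, copied from the curl example above.
URL = "https://pt-edge.onrender.com/api/v1/quality/transformers/Skyline-9/Visionary-Vids"

# Unauthenticated tier: 100 requests/day, no key needed.
with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.load(resp)

# The response fields aren't documented here, so just pretty-print the payload.
print(json.dumps(data, indent=2))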
Higher-rated alternatives
kyegomez/RT-X
Pytorch implementation of the models RT-1-X and RT-2-X from the paper: "Open X-Embodiment:...
kyegomez/PALI3
Implementation of PALI3 from the paper "PALI-3 Vision Language Models: Smaller, Faster, Stronger"
chuanyangjin/MMToM-QA
[🏆Outstanding Paper Award at ACL 2024] MMToM-QA: Multimodal Theory of Mind Question Answering
lyuchenyang/Macaw-LLM
Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration
Muennighoff/vilio
🥶Vilio: State-of-the-art VL models in PyTorch & PaddlePaddle