etornam45/vl-jepa
This VL-JEPA implementation takes direct inspiration from the original VL-JEPA paper.
This project is aimed at machine learning researchers working with paired video and text data. It combines existing pretrained models (DINOv3 for video, Gemma for text) into a system that learns the relationships between the two modalities. The input is video with associated text, and the output is a trained predictor that captures these multimodal relationships, without retraining the underlying video and text models themselves.
Use this if you are a machine learning researcher focused on multimodal learning, specifically aiming to build models that predict relationships between video and text without extensive new model training.
Not ideal if you are looking for an off-the-shelf solution for video analysis or natural language processing without deep involvement in model architecture and training.
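The core JEPA idea described above, predicting in embedding space with the encoders kept frozen, can be sketched in a few lines. This is a hypothetical toy illustration, not the repository's actual code: the random linear projections stand in for the frozen DINOv3 and Gemma encoders, and only the linear predictor `P` is trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the frozen encoders (DINOv3 / Gemma in the repo):
# fixed random projections mapping raw features into a shared embedding size.
D_VIDEO, D_TEXT, D_EMB = 64, 32, 16
W_video = rng.normal(size=(D_VIDEO, D_EMB)) / np.sqrt(D_VIDEO)  # frozen
W_text = rng.normal(size=(D_TEXT, D_EMB)) / np.sqrt(D_TEXT)     # frozen

def encode_video(x):  # frozen "video encoder" (stand-in)
    return x @ W_video

def encode_text(t):   # frozen "text encoder" (stand-in)
    return t @ W_text

# The only trainable part: a predictor from video-embedding space to
# text-embedding space (JEPA predicts in latent space, not raw tokens/pixels).
P = rng.normal(size=(D_EMB, D_EMB)) * 0.01

# Toy paired data: each "video" is linearly related to its "text".
A = rng.normal(size=(D_VIDEO, D_TEXT)) / np.sqrt(D_VIDEO)
videos = rng.normal(size=(256, D_VIDEO))
texts = videos @ A

z_v = encode_video(videos)                       # frozen video embeddings
z_t = encode_text(texts)                         # frozen target embeddings
init_loss = ((z_v @ P - z_t) ** 2).mean()        # loss before training

lr = 0.5
for step in range(500):
    err = z_v @ P - z_t                          # prediction error in latent space
    loss = (err ** 2).mean()
    grad = z_v.T @ err * (2 / err.size)          # dLoss/dP only; encoders stay frozen
    P -= lr * grad

print(f"latent MSE: {init_loss:.4f} -> {loss:.4f}")
```

The design point this illustrates is the one the project description makes: because the gradient only flows into the predictor, training is cheap relative to fine-tuning the large video and text backbones.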
Stars
7
Forks
2
Language
Python
License
—
Category
Last pushed
Jan 18, 2026
Commits (30d)
0
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/ml-frameworks/etornam45/vl-jepa"
Open to everyone — 100 requests/day, no key needed. Get a free key for 1,000/day.
Higher-rated alternatives
open-mmlab/mmpretrain
OpenMMLab Pre-training Toolbox and Benchmark
facebookresearch/mmf
A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
adambielski/siamese-triplet
Siamese and triplet networks with online pair/triplet mining in PyTorch
HuaizhengZhang/Awsome-Deep-Learning-for-Video-Analysis
Papers, code and datasets about deep learning and multi-modal learning for video analysis
KaiyangZhou/pytorch-vsumm-reinforce
Unsupervised video summarization with deep reinforcement learning (AAAI'18)