Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders • Paper • 2603.19209 • Published 4 days ago • 2
V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning • Paper • 2603.14482 • Published 8 days ago • 15
Official Benchmarks Leaderboard 2026 • Running • Explore and compare AI model scores across official benchmarks
Omnilingual MT: Machine Translation for 1,600 Languages • Paper • 2603.16309 • Published 6 days ago • 13
Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections • Paper • 2603.12180 • Published 11 days ago • 63
VidEoMT: Your ViT is Secretly Also a Video Segmentation Model • Paper • 2602.17807 • Published Feb 19 • 6
Causal-JEPA: Learning World Models through Object-Level Latent Interventions • Paper • 2602.11389 • Published Feb 11 • 7
UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders • Paper • 2601.17950 • Published Jan 25 • 4
TCAndon-Router: Adaptive Reasoning Router for Multi-Agent Collaboration • Paper • 2601.04544 • Published Jan 8 • 6
CASA: Cross-Attention via Self-Attention for Efficient Vision-Language Fusion • Paper • 2512.19535 • Published Dec 22, 2025 • 12