Collections including paper arxiv:2601.00417

- mHC: Manifold-Constrained Hyper-Connections
  Paper • 2512.24880 • Published • 305
- Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process
  Paper • 2512.23988 • Published • 17
- SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time
  Paper • 2512.25075 • Published • 15
- Guiding a Diffusion Transformer with the Internal Dynamics of Itself
  Paper • 2512.24176 • Published • 8

- A Unified View of Attention and Residual Sinks: Outlier-Driven Rescaling is Essential for Transformer Training
  Paper • 2601.22966 • Published
- STEM: Scaling Transformers with Embedding Modules
  Paper • 2601.10639 • Published • 1
- Deep Delta Learning
  Paper • 2601.00417 • Published • 34
- mHC: Manifold-Constrained Hyper-Connections
  Paper • 2512.24880 • Published • 305

- Nuclear Norm Regularization for Deep Learning
  Paper • 2405.14544 • Published • 1
- Token embeddings violate the manifold hypothesis
  Paper • 2504.01002 • Published • 1
- Approximate Nullspace Augmented Finetuning for Robust Vision Transformers
  Paper • 2403.10476 • Published • 1
- ElaLoRA: Elastic & Learnable Low-Rank Adaptation for Efficient Model Fine-Tuning
  Paper • 2504.00254 • Published • 1

- Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM
  Paper • 2401.02994 • Published • 52
- MambaByte: Token-free Selective State Space Model
  Paper • 2401.13660 • Published • 60
- Repeat After Me: Transformers are Better than State Space Models at Copying
  Paper • 2402.01032 • Published • 24
- BlackMamba: Mixture of Experts for State-Space Models
  Paper • 2402.01771 • Published • 25

- Seedream 4.0: Toward Next-generation Multimodal Image Generation
  Paper • 2509.20427 • Published • 82
- Tree Search for LLM Agent Reinforcement Learning
  Paper • 2509.21240 • Published • 92
- SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models
  Paper • 2510.06917 • Published • 35
- Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
  Paper • 2510.04618 • Published • 129

- Forgetting Transformer: Softmax Attention with a Forget Gate
  Paper • 2503.02130 • Published • 32
- L^2M: Mutual Information Scaling Law for Long-Context Language Modeling
  Paper • 2503.04725 • Published • 21
- Transformers without Normalization
  Paper • 2503.10622 • Published • 170
- I-Con: A Unifying Framework for Representation Learning
  Paper • 2504.16929 • Published • 30