Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks
Abstract
Language models pre-trained with a framework combining standard next-token prediction and structured language learning tasks show enhanced linguistic competence without sacrificing general reasoning capabilities.
Language models (LMs) are pre-trained on raw text datasets to generate text sequences token-by-token. While this approach facilitates the learning of world knowledge and reasoning, it does not explicitly optimize for linguistic competence. To bridge this gap, we propose L2T, a pre-training framework integrating Language Learning Tasks alongside standard next-token prediction. Inspired by human language acquisition, L2T transforms raw text into structured input-output pairs to provide explicit linguistic stimulation. Pre-training LMs on a mixture of raw text and L2T data not only improves overall performance on linguistic competence benchmarks but also accelerates its acquisition, while maintaining competitive performance on general reasoning tasks.
Community
We propose L2T, a pre-training framework integrating Language Learning Tasks alongside standard next-token prediction. L2T establishes the structural scaffolding required for linguistic competence, complementing the world knowledge acquired through standard causal language modeling (CLM).
The code is available on GitHub: https://github.com/gucci-j/l2t
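As a rough illustration of the idea, the sketch below shows how raw sentences could be converted into structured input-output pairs and interleaved with plain next-token-prediction text. The specific tasks (word reordering, cloze fill-in), prompt wording, and the `l2t_ratio` mixing parameter are assumptions for illustration, not the authors' actual implementation; see the GitHub repository above for the real pipeline.

```python
import random

# Hypothetical sketch of an L2T-style data pipeline. The task types and
# prompt formats below are assumptions, not the paper's actual design.

def make_reorder_example(sentence: str, rng: random.Random) -> dict:
    """Scramble the words of a sentence and ask the model to restore the order."""
    words = sentence.split()
    shuffled = words[:]
    rng.shuffle(shuffled)
    return {
        "input": "Reorder the words into a grammatical sentence: " + " ".join(shuffled),
        "output": sentence,
    }

def make_cloze_example(sentence: str, rng: random.Random) -> dict:
    """Mask one word and ask the model to fill in the blank."""
    words = sentence.split()
    idx = rng.randrange(len(words))
    masked = words[:]
    target = masked[idx]
    masked[idx] = "____"
    return {
        "input": "Fill in the blank: " + " ".join(masked),
        "output": target,
    }

def build_mixture(raw_sentences: list[str], l2t_ratio: float = 0.3, seed: int = 0) -> list[dict]:
    """Mix plain causal-LM text with structured language-learning pairs.

    `l2t_ratio` (a hypothetical knob) controls the fraction of sentences
    converted into language-learning tasks; the rest stay as raw text for
    standard next-token prediction.
    """
    rng = random.Random(seed)
    mixture = []
    for sentence in raw_sentences:
        if rng.random() < l2t_ratio:
            task = rng.choice([make_reorder_example, make_cloze_example])
            mixture.append(task(sentence, rng))
        else:
            # Plain causal-LM example: the model simply continues the raw text.
            mixture.append({"input": "", "output": sentence})
    return mixture

if __name__ == "__main__":
    corpus = [
        "The cat sat on the mat.",
        "Language models learn from raw text.",
    ]
    for example in build_mixture(corpus, l2t_ratio=0.5):
        print(example)
```

In such a setup, both kinds of examples would be tokenized and trained with the same next-token objective, so the structured tasks act as additional supervision signals rather than a separate training stage.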
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- Cross-Lingual Interleaving for Speech Language Models (2025)
- Persian-Phi: Efficient Cross-Lingual Adaptation of Compact LLMs via Curriculum Learning (2025)
- HSKBenchmark: Modeling and Benchmarking Chinese Second Language Acquisition in Large Language Models through Curriculum Tuning (2025)
- From Scratch to Fine-Tuned: A Comparative Study of Transformer Training Strategies for Legal Machine Translation (2025)
- Sentence-Anchored Gist Compression for Long-Context LLMs (2025)
- SpidR: Learning Fast and Stable Linguistic Units for Spoken Language Models Without Supervision (2025)
- MiniLingua: A Small Open-Source LLM for European Languages (2025)