Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks
Abstract
Language models pre-trained with a framework combining standard next-token prediction and structured language learning tasks show enhanced linguistic competence without sacrificing general reasoning capabilities.
Language models (LMs) are pre-trained on raw text datasets to generate text sequences token-by-token. While this approach facilitates the learning of world knowledge and reasoning, it does not explicitly optimize for linguistic competence. To bridge this gap, we propose L2T, a pre-training framework integrating Language Learning Tasks alongside standard next-token prediction. Inspired by human language acquisition, L2T transforms raw text into structured input-output pairs to provide explicit linguistic stimulation. Pre-training LMs on a mixture of raw text and L2T data not only improves overall performance on linguistic competence benchmarks but also accelerates its acquisition, while maintaining competitive performance on general reasoning tasks.
Community
We propose L2T, a pre-training framework integrating Language Learning Tasks alongside standard next-token prediction. L2T establishes the structural scaffolding required for linguistic competence, complementing the world knowledge acquired through standard causal language modeling (CLM).
The code is available on GitHub: https://github.com/gucci-j/l2t
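As a rough illustration of the idea, the sketch below shows how raw sentences could be converted into structured input-output pairs and interleaved with plain next-token-prediction text. The specific tasks (word reordering, cloze fill-in), prompt wording, and the `l2t_ratio` mixing parameter are assumptions for illustration, not the authors' actual implementation; see the GitHub repository above for the real pipeline.

```python
import random

# Hypothetical sketch of an L2T-style data pipeline. The task types and
# prompt formats below are assumptions, not the paper's actual design.

def make_reorder_example(sentence: str, rng: random.Random) -> dict:
    """Scramble the words of a sentence and ask the model to restore the order."""
    words = sentence.split()
    shuffled = words[:]
    rng.shuffle(shuffled)
    return {
        "input": "Reorder the words into a grammatical sentence: " + " ".join(shuffled),
        "output": sentence,
    }

def make_cloze_example(sentence: str, rng: random.Random) -> dict:
    """Mask one word and ask the model to fill in the blank."""
    words = sentence.split()
    idx = rng.randrange(len(words))
    masked = words[:]
    target = masked[idx]
    masked[idx] = "____"
    return {
        "input": "Fill in the blank: " + " ".join(masked),
        "output": target,
    }

def build_mixture(raw_sentences: list[str], l2t_ratio: float = 0.3, seed: int = 0) -> list[dict]:
    """Mix plain causal-LM text with structured language-learning pairs.

    `l2t_ratio` (a hypothetical knob) controls the fraction of sentences
    converted into language-learning tasks; the rest stay as raw text for
    standard next-token prediction.
    """
    rng = random.Random(seed)
    mixture = []
    for sentence in raw_sentences:
        if rng.random() < l2t_ratio:
            task = rng.choice([make_reorder_example, make_cloze_example])
            mixture.append(task(sentence, rng))
        else:
            # Plain causal-LM example: the model simply continues the raw text.
            mixture.append({"input": "", "output": sentence})
    return mixture

if __name__ == "__main__":
    corpus = [
        "The cat sat on the mat.",
        "Language models learn from raw text.",
    ]
    for example in build_mixture(corpus, l2t_ratio=0.5):
        print(example)
```

In such a setup, both kinds of examples would be tokenized and trained with the same next-token objective, so the structured tasks act as additional supervision signals rather than a separate training stage.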
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- Cross-Lingual Interleaving for Speech Language Models (2025)
- Persian-Phi: Efficient Cross-Lingual Adaptation of Compact LLMs via Curriculum Learning (2025)
- HSKBenchmark: Modeling and Benchmarking Chinese Second Language Acquisition in Large Language Models through Curriculum Tuning (2025)
- From Scratch to Fine-Tuned: A Comparative Study of Transformer Training Strategies for Legal Machine Translation (2025)
- Sentence-Anchored Gist Compression for Long-Context LLMs (2025)
- SpidR: Learning Fast and Stable Linguistic Units for Spoken Language Models Without Supervision (2025)
- MiniLingua: A Small Open-Source LLM for European Languages (2025)