LLM course - training a causal language model from scratch

Hi, I’m working through the LLM course chapter on building a causal language model from scratch, and I have two problems.

  1. I’m running the notebook in Google Colab, and when I get to this cell:

    def tokenize(element):
        outputs = tokenizer(
            element["content"],
            truncation=True,
            max_length=context_length,
            return_overflowing_tokens=True,
            return_length=True,
        )
        input_batch = []
        for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
            if length == context_length:
                input_batch.append(input_ids)
        return {"input_ids": input_batch}

    tokenized_datasets = raw_datasets.map(
        tokenize, batched=True, remove_columns=raw_datasets["train"].column_names
    )
    tokenized_datasets

    Even though I have the GPU selected, it says this will take over an hour to process (is this a reasonable amount of time?). I keep getting a warning that I am not using the GPU. How can I get Dataset.map() to run on the GPU (I was under the impression that it should)?

  2. There is a “Try it out!” exercise about modifying the tokenize() function above so that it packs the sequences together before chunking them (so you throw away less of the training data). My approach is to:

    1. Add the EOS token to the end of each sequence.

    2. Tokenize all of the sequences.

    3. Concatenate the lists (using + or append()).

    4. Chunk by iterating through the list, like this:

      chunks = []
      for i in range(0, len(concatenated_sequences), context_length):
          chunks.append(concatenated_sequences[i:i+context_length])

    But I have a feeling that this is not the correct way to do it and will be incredibly slow. Does anyone have any pointers?

Thank you for any help!

Oh… Seems normal.


1) Why this takes so long in Colab, and why the GPU is idle

What that map(tokenize, ...) cell is doing

In the course section you linked, the dataset is large and this preprocessing step creates a much larger dataset than you started with.

  • You start with ~606,720 training files. (Hugging Face)
  • With context_length=128 and return_overflowing_tokens=True, each long file is split into many 128-token chunks. The course shows this becomes ~16.7 million training sequences after preprocessing. (Hugging Face)
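
As a back-of-the-envelope check, 16.7M sequences from ~606,720 files works out to roughly 27 retained 128-token chunks per file, so the preprocessed dataset is far larger than the raw file count suggests.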

So the time isn’t just “tokenize 600k strings”; it’s also “write tens of millions of rows to Arrow cache files”, which is often dominated by CPU + disk I/O on Colab.

Why you get “not using GPU”

Tokenization in the Hugging Face stack is CPU-side (including the “Fast” Rust tokenizers). HF maintainers explicitly state tokenization does not run on GPU. (Hugging Face Forums)

datasets.Dataset.map() only “uses GPU” if your mapped function itself performs CUDA work (e.g., model inference inside map()), and if you do that with multiprocessing you must use the spawn start method to avoid CUDA fork errors. (Hugging Face)
That’s not your situation: your tokenize() is pure CPU/string work.
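
For contrast, here is a minimal sketch of what “CUDA work inside map()” would look like (not something you need for this step; the model and column names are purely illustrative):

import torch

# Hypothetical: GPU inference inside map(). Only in this case does map() touch the GPU.
# Assumes `model` is a Hugging Face model already moved to "cuda" and that the
# dataset has a "content" text column, as in the course.
def embed(batch):
    enc = tokenizer(
        batch["content"],
        padding=True,
        truncation=True,
        max_length=context_length,
        return_tensors="pt",
    ).to("cuda")
    with torch.no_grad():
        out = model(**enc)
    # First-token hidden state as a stand-in embedding
    return {"embedding": out.last_hidden_state[:, 0].cpu().numpy()}

# Runs in a single process; with num_proc > 1 you would need the "spawn" start method.
embedded = raw_datasets["train"].map(embed, batched=True, batch_size=64)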

Is “> 1 hour” reasonable?

Yes—given the expansion to ~16.7M examples and heavy cache writing, “hour-scale” preprocessing is plausible on Colab. (Hugging Face)
It’s also common to see speed drop near the end as writing/merging becomes the bottleneck. (Hugging Face Forums)


1A) What actually speeds up Dataset.map() in your case (CPU + I/O tuning)

These are the knobs that matter for large tokenization jobs:

Use batched mapping and tune batch_size

Datasets defaults to batch_size=1000 when batched=True, and you can adjust it. (Hugging Face)
Larger batches often improve throughput until you hit RAM limits.
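
One pragmatic way to choose a value is to time a small slice; the slice size and the candidate values below are arbitrary starting points:

import time

sample = raw_datasets["train"].select(range(5_000))
for bs in (1_000, 2_000, 5_000):
    start = time.time()
    sample.map(tokenize, batched=True, batch_size=bs, load_from_cache_file=False)
    print(f"batch_size={bs}: {time.time() - start:.1f}s")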

Use CPU multiprocessing: num_proc

Tokenization is CPU-bound, so num_proc=2 or 4 can help on Colab (depending on available cores). The Datasets processing guide covers batched mapping and processing functions. (Hugging Face)
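
Check how many cores the runtime actually gives you before picking num_proc (free Colab CPU runtimes commonly report 2):

import os

print(os.cpu_count())  # keep num_proc <= this; extra workers mostly add overhead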

Reduce cache write overhead: writer_batch_size

writer_batch_size controls how many rows are written per operation. The docs state the default is 1000 and explain the speed/memory tradeoff. (Hugging Face)
HF staff also point to writer_batch_size as the parameter to reduce frequent flushing when mapping large datasets. (Hugging Face Forums)

Practical Colab configuration (good starting point)

import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # avoids common fork/parallelism warnings

tokenized_datasets = raw_datasets.map(
    tokenize,
    batched=True,
    batch_size=2000,        # try 1000, 2000, 5000
    num_proc=2,             # try 2 (then 4 if you have cores)
    writer_batch_size=5000, # try 2000–20000
    remove_columns=raw_datasets["train"].column_names,
    desc="Tokenizing",
)

Workflow tip that saves the most time

Use a subset to debug the pipeline end-to-end before you run the full preprocessing. The official course notebook is meant to be runnable in Colab, but full preprocessing can be expensive. (colab.research.google.com)
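
For example, a smoke test on a small slice first (a sketch assuming the course’s "train"/"valid" split names):

from datasets import DatasetDict

small = DatasetDict({
    "train": raw_datasets["train"].select(range(2_000)),
    "valid": raw_datasets["valid"].select(range(500)),
})
small_tokenized = small.map(
    tokenize, batched=True, remove_columns=small["train"].column_names
)
print(small_tokenized)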


2) Packing before chunking: your approach is conceptually correct (here’s the efficient version)

Background: why packing helps

Your original function discards the remainder of each document after chunking to context_length. Packing reduces this waste by:

  1. inserting EOS between documents,
  2. concatenating tokens into a stream,
  3. chunking the stream into fixed-size blocks.

This is the same overall strategy as the canonical CLM preprocessing pattern used in HF’s run_clm.py, which concatenates then chunks (group_texts). (GitHub)
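
For reference, the core of that pattern looks roughly like this (condensed from run_clm.py’s group_texts, not a verbatim copy):

from itertools import chain

def group_texts(examples):
    # Concatenate each column's lists into one long stream
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated["input_ids"])
    # Drop the tail shorter than context_length
    total_length = (total_length // context_length) * context_length
    return {
        k: [t[i : i + context_length] for i in range(0, total_length, context_length)]
        for k, t in concatenated.items()
    }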

The main performance pitfall

Don’t build one gigantic concatenated list for the entire dataset in memory. Pack within each batch (batched=True), which keeps memory bounded and is what Datasets is optimized for. (Hugging Face)

Also avoid repeated + concatenations in a loop (can become quadratic). Prefer extend() or itertools.chain.
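
For instance, flattening with extend() stays linear (this uses the same outputs/eos names as the function below):

stream = []
for ids in outputs["input_ids"]:
    stream.extend(ids)       # amortized O(len(ids)), unlike repeated `+`
    stream.append(eos)       # EOS marks the document boundary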

A fast “pack then chunk” tokenize() (minimal changes)

This replaces return_overflowing_tokens with explicit packing:

from itertools import chain

def tokenize(element):
    outputs = tokenizer(
        element["content"],
        truncation=False,
        add_special_tokens=False,
    )

    eos = tokenizer.eos_token_id
    # Add EOS after each document, then flatten efficiently
    stream = list(chain.from_iterable(ids + [eos] for ids in outputs["input_ids"]))

    # Chunk
    total_len = (len(stream) // context_length) * context_length
    if total_len == 0:
        return {"input_ids": []}

    input_ids = [stream[i:i+context_length] for i in range(0, total_len, context_length)]
    return {"input_ids": input_ids}

And map it (with the same speed knobs as above):

tokenized_datasets = raw_datasets.map(
    tokenize,
    batched=True,
    batch_size=2000,
    num_proc=2,
    writer_batch_size=5000,
    remove_columns=raw_datasets["train"].column_names,
)
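
A quick sanity check afterwards (assuming context_length=128 as in the course):

lengths = {len(ids) for ids in tokenized_datasets["train"][:1_000]["input_ids"]}
print(lengths)  # expect {128}: every packed example is exactly context_length tokens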

One subtle limitation (and how to mitigate it)

Packing inside map(batched=True) packs within each batch, so any leftover tokens at the end of a batch are dropped. Increase batch_size to reduce this boundary waste. (Hugging Face)
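
To put a rough number on it: if a batch of 2,000 files averages a few thousand tokens each, the packed stream holds millions of tokens per batch, of which at most context_length - 1 = 127 are dropped at the batch boundary, a vanishingly small fraction.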

“Is it okay that tokens can attend across document boundaries?”

With standard CLM pretraining, this is commonly accepted; EOS is the boundary signal. If you wanted strict isolation, you’d need special attention masking or a different packing strategy—more complex than what this exercise targets.


Quick checklist for your Colab run

  • Accept that GPU is not used during tokenization; GPU matters during training. (Hugging Face Forums)
  • Expect preprocessing to take a long time because it expands to millions of chunks. (Hugging Face)
  • Speed it up with batch_size, num_proc, writer_batch_size. (Hugging Face)
  • Implement packing by tokenizing with truncation=False, inserting EOS, concatenating per batch, then chunking (mirrors run_clm.py’s concatenate→chunk logic). (GitHub)

If you apply the packing function and the three map knobs (batch_size, num_proc, writer_batch_size), you typically get (a) less wasted data, and (b) more predictable preprocessing time on Colab, without trying to force GPU use where it doesn’t apply.