1) Why this takes so long in Colab, and why the GPU is idle
What that map(tokenize, ...) cell is doing
In the course section you linked, the dataset is large and this preprocessing step creates a much larger dataset than you started with.
- You start with ~606,720 training files. (Hugging Face)
- With context_length=128 and return_overflowing_tokens=True, each long file is split into many 128-token chunks. The course shows this becomes ~16.7 million training sequences after preprocessing. (Hugging Face)
So the time isn’t just “tokenize 600k strings”; it’s also “write tens of millions of rows to Arrow cache files”, which is often dominated by CPU + disk I/O on Colab.
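The expansion itself is easy to sanity-check with back-of-envelope arithmetic. The sketch below assumes a hypothetical average of 3,500 tokens per file (that average is my illustration, not a course figure; only the ~606,720 file count and 128-token context come from the course):

```python
# Rough arithmetic behind the row expansion
n_files = 606_720            # training files, per the course
context_length = 128
avg_tokens_per_file = 3_500  # assumed average, for illustration only

chunks_per_file = avg_tokens_per_file // context_length  # full 128-token chunks
total_rows = n_files * chunks_per_file
print(total_rows)  # on the order of 16 million, in line with the course's ~16.7M
```

Any plausible average in the low thousands lands you in the tens of millions of rows, which is why the Arrow cache writing dominates.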
Why you get “not using GPU”
Tokenization in the Hugging Face stack is CPU-side (including the “Fast” Rust tokenizers). HF maintainers explicitly state tokenization does not run on GPU. (Hugging Face Forums)
datasets.Dataset.map() only “uses GPU” if your mapped function itself performs CUDA work (e.g., model inference inside map()), and if you do that with multiprocessing you must use the spawn start method to avoid CUDA fork errors. (Hugging Face)
That’s not your situation: your tokenize() is pure CPU/string work.
Is “> 1 hour” reasonable?
Yes—given the expansion to ~16.7M examples and heavy cache writing, “hour-scale” preprocessing is plausible on Colab. (Hugging Face)
It’s also common to see speed drop near the end as writing/merging becomes the bottleneck. (Hugging Face Forums)
1A) What actually speeds up Dataset.map() in your case (CPU + I/O tuning)
These are the knobs that matter for large tokenization jobs:
Use batched mapping and tune batch_size
Datasets defaults to batch_size=1000 when batched=True, and you can adjust it. (Hugging Face)
Larger batches often improve throughput until you hit RAM limits.
Use CPU multiprocessing: num_proc
Tokenization is CPU-bound, so num_proc=2 or 4 can help on Colab (depending on available cores). The Datasets processing guide covers batched mapping and processing functions. (Hugging Face)
Reduce cache write overhead: writer_batch_size
writer_batch_size controls how many rows are written per operation. The docs state the default is 1000 and explain the speed/memory tradeoff. (Hugging Face)
HF staff also point to writer_batch_size as the parameter to reduce frequent flushing when mapping large datasets. (Hugging Face Forums)
Practical Colab configuration (good starting point)
```python
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # avoids common fork/parallelism warnings

tokenized_datasets = raw_datasets.map(
    tokenize,
    batched=True,
    batch_size=2000,          # try 1000, 2000, 5000
    num_proc=2,               # try 2 (then 4 if you have cores)
    writer_batch_size=5000,   # try 2000–20000
    remove_columns=raw_datasets["train"].column_names,
    desc="Tokenizing",
)
```
Workflow tip that saves the most time
Use a subset to debug the pipeline end-to-end before you run the full preprocessing. The official course notebook is meant to be runnable in Colab, but full preprocessing can be expensive. (colab.research.google.com)
2) Packing before chunking: your approach is conceptually correct (here’s the efficient version)
Background: why packing helps
Your original function discards the remainder of each document after chunking to context_length. Packing reduces this waste by:
- inserting EOS between documents,
- concatenating tokens into a stream,
- chunking the stream into fixed-size blocks.
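The three steps above can be traced on toy data (token ids, the EOS value, and the tiny context length are all made up for illustration):

```python
from itertools import chain

context_length = 4
eos = 0
docs = [[1, 2, 3], [4, 5, 6, 7, 8], [9]]  # toy tokenized documents

# 1) append EOS to each document, 2) concatenate into one stream
stream = list(chain.from_iterable(ids + [eos] for ids in docs))
# -> [1, 2, 3, 0, 4, 5, 6, 7, 8, 0, 9, 0]

# 3) chunk the stream into fixed-size blocks, dropping the ragged tail
total = (len(stream) // context_length) * context_length
blocks = [stream[i:i + context_length] for i in range(0, total, context_length)]
# -> [[1, 2, 3, 0], [4, 5, 6, 7], [8, 0, 9, 0]]
```

Note how the second block spans two documents: packing trades document isolation for zero padding and minimal waste.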
This is the same overall strategy as the canonical CLM preprocessing pattern used in HF’s run_clm.py, which concatenates then chunks (group_texts). (GitHub)
The main performance pitfall
Don’t build one gigantic concatenated list for the entire dataset in memory. Pack within each batch (batched=True), which keeps memory bounded and is what Datasets is optimized for. (Hugging Face)
Also avoid repeated + concatenations in a loop (can become quadratic). Prefer extend() or itertools.chain.
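A quick sketch of the three flattening styles on toy data (all three produce the same list, but the first copies the whole accumulated list on every iteration):

```python
from itertools import chain

docs = [[1, 2, 3]] * 5_000  # toy tokenized documents

# Quadratic: each `+` allocates and copies the entire accumulated list again
slow = []
for ids in docs:
    slow = slow + ids

# Linear: in-place extend(), or a single pass with chain.from_iterable
fast = []
for ids in docs:
    fast.extend(ids)
flat = list(chain.from_iterable(docs))

assert slow == fast == flat
```

At the batch sizes used here (thousands of documents, millions of tokens), the quadratic version is the difference between milliseconds and minutes per batch.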
A fast “pack then chunk” tokenize() (minimal changes)
This replaces return_overflowing_tokens with explicit packing:
```python
from itertools import chain

def tokenize(element):
    outputs = tokenizer(
        element["content"],
        truncation=False,
        add_special_tokens=False,
    )
    eos = tokenizer.eos_token_id
    # Add EOS after each document, then flatten efficiently
    stream = list(chain.from_iterable(ids + [eos] for ids in outputs["input_ids"]))
    # Chunk into fixed-size blocks, dropping the ragged tail
    total_len = (len(stream) // context_length) * context_length
    if total_len == 0:
        return {"input_ids": []}
    input_ids = [stream[i:i + context_length] for i in range(0, total_len, context_length)]
    return {"input_ids": input_ids}
```
And map it (with the same speed knobs as above):
```python
tokenized_datasets = raw_datasets.map(
    tokenize,
    batched=True,
    batch_size=2000,
    num_proc=2,
    writer_batch_size=5000,
    remove_columns=raw_datasets["train"].column_names,
)
```
One subtle limitation (and how to mitigate it)
Packing inside map(batched=True) packs within each batch, so any leftover tokens at the end of a batch are dropped. Increase batch_size to reduce this boundary waste. (Hugging Face)
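The boundary waste is bounded: at most context_length - 1 tokens are dropped per batch, so the wasted fraction shrinks as the batch grows. A rough sketch (the average tokens-per-document figure is assumed for illustration):

```python
context_length = 128
avg_tokens_per_doc = 3_500  # assumed average, for illustration only

for batch_size in (1_000, 2_000, 5_000):
    tokens_per_batch = batch_size * avg_tokens_per_doc
    max_waste = context_length - 1  # worst-case leftover tokens per batch
    print(batch_size, max_waste / tokens_per_batch)
```

Even at batch_size=1000 the worst case is a few tokens per hundred thousand, so boundary waste is negligible next to the remainder-dropping the original chunking approach incurred per document.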
“Is it okay that tokens can attend across document boundaries?”
With standard CLM pretraining, this is commonly accepted; EOS is the boundary signal. If you wanted strict isolation, you’d need special attention masking or a different packing strategy—more complex than what this exercise targets.
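For a sense of what "special attention masking" means, here is a hedged NumPy sketch (names and the EOS id are my own, and real trainers would build this as a batched tensor): it combines the usual causal mask with a "same document" constraint derived from EOS positions, so no token attends across a boundary:

```python
import numpy as np

def doc_isolation_mask(input_ids, eos_id):
    """Causal mask that additionally blocks attention across EOS boundaries."""
    ids = np.asarray(input_ids)
    # doc_id[i] = index of the document token i belongs to
    # (EOS itself stays with the document it terminates)
    doc_id = np.concatenate([[0], np.cumsum(ids[:-1] == eos_id)])
    causal = np.tril(np.ones((len(ids), len(ids)), dtype=bool))
    same_doc = doc_id[:, None] == doc_id[None, :]
    return causal & same_doc

# Two packed documents: [1, 2, EOS] and [3, 4]
mask = doc_isolation_mask([1, 2, 0, 3, 4], eos_id=0)
print(mask[3, 1])  # False: token 3 cannot see into the previous document
```

This is the per-sequence idea only; wiring such a mask into a Transformer forward pass (and handling position ids) is exactly the extra complexity the course sidesteps.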
Quick checklist for your Colab run
- Accept that GPU is not used during tokenization; GPU matters during training. (Hugging Face Forums)
- Expect preprocessing to take a long time because it expands to millions of chunks. (Hugging Face)
- Speed it up with batch_size, num_proc, writer_batch_size. (Hugging Face)
- Implement packing by tokenizing with truncation=False, inserting EOS, concatenating per batch, then chunking (mirrors run_clm.py’s concatenate→chunk logic). (GitHub)
If you apply the packing function and the three map knobs (batch_size, num_proc, writer_batch_size), you typically get (a) less wasted data, and (b) more predictable preprocessing time on Colab, without trying to force GPU use where it doesn’t apply.