Using a dataset in streaming mode causes increasing RAM usage


import gc

from datasets import Dataset, load_dataset

ds_stream = load_dataset(
    dataset_name,
    split=split,
    streaming=True,
    token=hf_token,
)

buffer = []

for example in ds_stream:
    # do some processing here
    buffer.append(processed_example)

    if len(buffer) == CHUNK_SIZE:
        chunk_ds = Dataset.from_list(
            buffer,
            features=FEATURES,
        )
        # do some processing here
        del chunk_ds
        buffer.clear()
        gc.collect()

  1. I am using the above code, but as I iterate, the RAM keeps increasing. Is there anything wrong in what I am doing, or is this an issue with the datasets library?

  2. I am trying to skip the first 100000 rows using ds_stream.skip(100000), but this also takes too long and RAM keeps increasing while it runs. Is there another method for this?


hmm… due to from_list?


Your code is “streaming”, but the step that dominates memory behavior is not streaming. It is the repeated materialization step:

  • You stream examples one by one.
  • You accumulate Python objects in buffer.
  • Every CHUNK_SIZE you call Dataset.from_list(buffer, features=...), which builds an in-memory Arrow table.

That combination often makes RAM look like it “keeps increasing”, even when nothing is strictly leaking.

Below is how I would reason about your case and what I would change.


1) Why RAM increases in your loop

1.1 You repeatedly allocate large Arrow buffers

Dataset.from_list(...) converts a Python list of dicts into Arrow arrays. That is a big allocation step, repeated many times.

Even if you del chunk_ds, the process RSS (what top/htop shows) can stay high due to allocator behavior and Arrow memory pooling. PyArrow explicitly documents a memory pool and even provides release_unused() as a debug method to attempt to return memory to the OS. It also warns it can be expensive and is mainly for benchmarking/debugging. (Apache Arrow)

What you see in practice is often:

  • RSS rises to a high-water mark set by the largest chunk seen so far.
  • RSS does not go back down, even if objects are freed.
  • RSS may keep rising if chunk sizes or average example sizes trend upward.

This can look like a leak, but it is commonly “high-water mark” + pooling.

1.2 Your FEATURES can cause extra casts and extra transient memory

Passing features=FEATURES can force casting into a strict schema. Casting variable-length fields (strings, lists of ints, nested features) can allocate additional buffers. You can see transient “double allocation” patterns in chunk conversions.
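
For example, here is a minimal sketch of how a strict schema forces a cast during conversion. The field names and schema are hypothetical, not from the original post:

from datasets import Dataset, Features, Sequence, Value

# Hypothetical strict schema; the real FEATURES comes from the original code
FEATURES = Features({"text": Value("string"), "ids": Sequence(Value("int32"))})

buffer = [{"text": "hello", "ids": [1, 2, 3]}] * 1000

# Builds an in-memory Arrow table and casts every column to FEATURES,
# which can allocate extra transient buffers for variable-length fields
chunk_ds = Dataset.from_list(buffer, features=FEATURES)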

1.3 There are also real reports of streaming memory growth

So you should keep a second hypothesis alive: there can be genuine growth in some streaming workflows. There are active issues reporting RAM increasing shard-by-shard with streaming=True. (GitHub)

That means your symptom could be:

  • mostly allocator pooling, or
  • a real retention issue (your code retains references), or
  • a library-level issue triggered by your format, your dataset, or your version.

2) The fastest way to tell “pooling” vs “leak”

You want one question answered: Does memory plateau or grow without bound?

2.1 Track Arrow pool bytes, not only RSS

RSS alone is misleading. Arrow can free memory internally while RSS stays high.

Use PyArrow pool stats:

import pyarrow as pa

pool = pa.default_memory_pool()
print(pool.bytes_allocated(), pool.max_memory())  # current and peak
# Debug only:
pool.release_unused()

release_unused() exists exactly for “try to give back memory” debugging. (Apache Arrow)

Interpretation:

  • If bytes_allocated() goes up and down but RSS only goes up, it is mostly pooling/allocator behavior.
  • If bytes_allocated() keeps climbing, something is truly being retained.
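
A sketch of per-chunk monitoring you could drop into the loop, assuming psutil is available for the RSS side:

import os

import psutil
import pyarrow as pa

proc = psutil.Process(os.getpid())
pool = pa.default_memory_pool()

def log_memory(chunk_idx):
    # Arrow pool: bytes currently held by Arrow, plus its peak
    # RSS: what top/htop reports for the whole process
    print(
        f"chunk {chunk_idx}: "
        f"arrow={pool.bytes_allocated() / 1e6:.1f} MB "
        f"(peak {pool.max_memory() / 1e6:.1f} MB), "
        f"rss={proc.memory_info().rss / 1e6:.1f} MB"
    )

Call log_memory(i) once per chunk. If the Arrow numbers oscillate while RSS flat-lines at its peak, that matches pooling; if both keep climbing, something is being retained.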

2.2 A brutal control test: remove Dataset.from_list

Do one run where you still stream, but you do not build chunk_ds at all. Only count examples or write minimal output. If RAM becomes stable, your growth is mainly from repeated materialization.
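
A minimal sketch of that control run, using the same load_dataset arguments as the original code and no chunk materialization:

from datasets import load_dataset

ds = load_dataset(dataset_name, split=split, streaming=True, token=hf_token)

count = 0
for example in ds:
    # no buffer, no Dataset.from_list: just touch the example and move on
    count += 1
    if count % 10_000 == 0:
        print(f"seen {count} examples")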


3) What I would change in your code (most impact first)

Change A: Stop building Dataset objects per chunk if you can

If your “do some processing here” after Dataset.from_list can be expressed as Python or batch logic, keep it streaming-native and skip the conversion step entirely.

The official streaming docs emphasize IterableDataset operations like take and skip, and they treat the dataset as a sequential stream. (Hugging Face)

A simple chunker without Dataset.from_list:

from itertools import islice
from datasets import load_dataset

ds = load_dataset(dataset_name, split=split, streaming=True, token=hf_token)

it = iter(ds)
while True:
    chunk = list(islice(it, CHUNK_SIZE))
    if not chunk:
        break

    processed = [process_example(x) for x in chunk]
    # do chunk processing on "processed" directly

This keeps memory bounded by:

  • CHUNK_SIZE worth of Python objects, plus
  • whatever your processing allocates.

No Arrow table creation per chunk.

Change B: If you truly need a Dataset, write shards to disk instead of keeping them in RAM

If you need Arrow/HF Dataset functionality (map, filtering, formatting), the scalable pattern is:

  • stream → process → write a shard (Parquet/Arrow) → discard → next shard

Then you can reload the written shards as a map-style dataset later and slice quickly.

This also solves your “skip 100000” problem because you can start from a shard boundary.
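
A sketch of this stream → process → write → discard loop, reusing CHUNK_SIZE and the hypothetical process_example from Change A; the output directory and Parquet shard naming are assumptions:

import os
from itertools import islice

from datasets import Dataset, load_dataset

os.makedirs("processed_shards", exist_ok=True)

ds = load_dataset(dataset_name, split=split, streaming=True, token=hf_token)
it = iter(ds)

shard_idx = 0
while True:
    chunk = list(islice(it, CHUNK_SIZE))
    if not chunk:
        break

    processed = [process_example(x) for x in chunk]

    # Materialize once, write to disk, then drop the reference immediately
    shard_path = os.path.join("processed_shards", f"shard-{shard_idx:05d}.parquet")
    Dataset.from_list(processed).to_parquet(shard_path)
    shard_idx += 1

Peak memory stays bounded by one chunk's worth of data, regardless of the total dataset size.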

Change C: If you must keep Dataset.from_list, treat RAM growth as “expected high-water mark” unless it is unbounded

In that case:

  • accept that RSS may not drop,
  • verify plateau using Arrow pool stats,
  • use pool.release_unused() only as a debug aid. (Apache Arrow)

4) Your .skip(100000) problem

4.1 Why it is slow

In streaming mode, you generally cannot “jump” to row 100000 without reading earlier rows. The stream is sequential.

This is a known pain point. There is a feature request to speed up .skip() by adding “examples per shard” metadata so the library can skip whole shards without downloading them. The fact that this request exists tells you the current behavior is often inherently slow. (GitHub)

There are also forum reports of large .skip() taking a very long time in streaming mode. (Hugging Face Forums)

4.2 Why RAM rises while skipping

Skipping still requires:

  • reading records,
  • decoding them,
  • allocating Python objects,
  • then discarding them.

So you can still see RSS rise from allocator pooling even though you “aren’t keeping data”.
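
Note that .skip() itself is lazy; the sequential reading and decoding only happen once you start iterating. A sketch to see where the time actually goes, assuming psutil is available:

import os
import time

import psutil

t0 = time.time()
skipped = ds_stream.skip(100_000)  # returns a new IterableDataset lazily, near-instant
first = next(iter(skipped))        # here the 100k records are actually read and discarded
rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1e6
print(f"first example after skip: {time.time() - t0:.1f}s, rss={rss_mb:.0f} MB")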

4.3 Practical alternatives

Pick one based on your goal:

Option 1: If you only need to "start later" once, don't stream; use split slicing.
Split slicing does not work with streaming: there is an issue explicitly stating that when streaming=True, split slicing is not supported and you should use .skip() / .take() instead. (GitHub)
So this option is only valid if you can load the dataset non-streaming and accept the storage/caching cost.
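
For reference, a minimal sketch of non-streaming split slicing ("train" is an example split name; your split variable may differ):

from datasets import load_dataset

# Downloads/caches the data, then gives a map-style dataset starting at row 100000
ds_tail = load_dataset(dataset_name, split="train[100000:]", token=hf_token)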

Option 2: Start at shard boundaries, not row boundaries.
If your dataset has many shards, starting from a later shard is much cheaper than skipping 100k examples. The streaming docs discuss shards and warn about how take and skip interact with shard order. (Hugging Face)
This is not an exact “row 100000”, but it can be fast.
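
One way to do this, assuming the dataset repository stores its data as Parquet shards under data/ (the shard file name below is hypothetical; check the repo's file listing first):

from datasets import load_dataset

# Stream only from a later shard instead of skipping rows one by one
ds_late = load_dataset(
    dataset_name,
    data_files="data/train-00042-of-00100.parquet",  # hypothetical shard name
    split="train",
    streaming=True,
    token=hf_token,
)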

Option 3: The robust solution. Pre-shard your own processed output.
Stream once, write processed shards of known size (example: 50k rows per Parquet file). Next time, “skip 100k” becomes “open shard 3”.

This is the only approach that gives you a true seek-like experience with large data.
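
Continuing the shard-writing sketch from Change B, resuming later could look like this, assuming 50k rows per shard as in the example above:

import glob

from datasets import load_dataset

# With 50k rows per shard, skipping the first 100k rows means dropping the first two shard files
files = sorted(glob.glob("processed_shards/shard-*.parquet"))[2:]
ds_resumed = load_dataset("parquet", data_files=files, split="train")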


5) Likely diagnosis for your exact code

Most likely:

  1. Your RAM increase is mostly high-water mark + Arrow pooling caused by Dataset.from_list repeated allocations. PyArrow explicitly acknowledges pooled memory and provides release_unused() for debugging. (Apache Arrow)
  2. .skip(100000) is slow because streaming is sequential and .skip() cannot generally jump. The skip-speed issue and forum thread match your symptom closely. (GitHub)
  3. If memory grows without plateau even after removing Dataset.from_list, you might be hitting a real issue similar to the streaming memory growth reports. (GitHub)

High-quality references (hand-picked)

  • HF streaming docs (take, skip, shard order notes): (Hugging Face)
  • “Split slicing not supported with streaming; use skip/take”: (GitHub)
  • Slow skip report (forum): (Hugging Face Forums)
  • Skip speed improvement request (why it is slow): (GitHub)
  • Streaming memory growth reports (issues): (GitHub)
  • PyArrow memory pool and release_unused() documentation: (Apache Arrow)

Summary

  • Your loop is not purely streaming. Dataset.from_list repeatedly materializes Arrow data and drives RAM behavior.
  • RSS rising is often pooling, not a leak. Validate with PyArrow pool stats and plateau behavior. (Apache Arrow)
  • .skip(100000) is usually O(n) in streaming. That slowness is known and documented by user reports and feature requests. (GitHub)
  • Best practical fix: stream in batches without building Dataset per chunk, or stream once and write fixed-size shards to disk for fast resume/skip later.

This is a great breakdown. One thing we’ve seen trip people up is assuming “streaming” automatically implies bounded memory, when in practice allocator behavior and repeated materialization still matter a lot.

Validating whether memory plateaus (vs grows unbounded) using Arrow pool stats is a useful sanity check.
