How can I store files in my project?

Personally, I think using a private dataset repository as storage is the most stable approach in your case.


Hugging Face Spaces is not “deleting your files after a few days.” What is happening is simpler and stricter:

  1. Your Space’s default disk is ephemeral. Anything you save to the container filesystem can vanish when the Space restarts or is stopped. Hugging Face states this explicitly. (Hugging Face)
  2. Gradio upload paths are cache artifacts. Gradio stores uploaded and generated files in a cache directory, can delete them on a schedule, and the cache is deleted entirely when the server restarts. (gradio.app)

Even if your Space “doesn’t sleep,” it can still restart for other reasons. For example, Hugging Face notes that any change to secrets or hardware triggers a restart. (Hugging Face)

So the fix is not “prevent deletion.” The fix is: persist uploads somewhere that survives restarts.

You have two Hugging Face-native options (plus a third, external-storage option).


The storage types you must not mix up

A) Space repo storage (Git/LFS) is not where user uploads go

Space repositories have a small Git/LFS cap (commonly hit around 1 GB), so pushing user uploads into the Space repo is a dead end. (Hugging Face Forums)

B) Runtime filesystem (what your code writes to by default) is ephemeral

Hugging Face’s “Disk usage on Spaces” doc states that ephemeral disk content is lost on restart or stop. (Hugging Face)

C) The Persistent Storage add-on mounts a durable disk at /data

If you pay for persistent storage, /data behaves like a normal disk and survives restarts. Hugging Face documents the mount path and the persistence behavior. (Hugging Face)

D) Dataset repos can act as a durable datastore

Hugging Face explicitly recommends “use a dataset as a data store” when you need data to outlive the Space. (Hugging Face)


Best-practice decision (what I would do for your case)

If you want persistence on free tier

Use a private dataset repo as your upload bucket.
This is the most reliable for “uploads must not disappear,” because it survives restarts automatically. Hugging Face documents upload APIs and this pattern. (Hugging Face)

If you are OK paying a small monthly add-on

Add persistent storage and write to /data/uploads/…, optionally also pushing to a dataset repo for backup or sharing.

If you expect heavy traffic or very large files

Use external object storage (S3, R2, GCS). Keep the Space stateless. Dataset repos are git-backed, which is great for moderate volume but not ideal for extremely high write rates.


Option 1: Store uploads on /data (persistent storage add-on)

Why it works

Hugging Face: persistent storage “acts like traditional disk storage mounted on /data” and “persists across restarts.” (Hugging Face)

What to do

  1. Enable persistent storage in the Space Settings. (This is required for /data to actually persist.) (Hugging Face)

  2. Save user files to /data/uploads/…

  3. Point caches to /data so you do not re-download models on every restart:

    • Set HF_HOME=/data/.huggingface as Hugging Face recommends. (Hugging Face)
  4. Point Gradio temp/cache to /data so uploads do not land on the ephemeral disk first:

    • Set GRADIO_TEMP_DIR=/data/gradio-tmp (Gradio documents this env var). (gradio.app)
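
A minimal sketch of steps 2 to 4, assuming the persistent storage add-on is enabled (the directory names under /data are just examples); you can also set the same variables in the Space Settings instead of in code:

# Top of app.py: set these before importing gradio/transformers so they take effect.
import os
os.environ.setdefault("HF_HOME", "/data/.huggingface")        # Hub cache survives restarts
os.environ.setdefault("GRADIO_TEMP_DIR", "/data/gradio-tmp")  # Gradio cache lands on /data

from pathlib import Path
Path("/data/uploads").mkdir(parents=True, exist_ok=True)      # where your handler writes files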

Pitfall to avoid

Do not assume /data is persistent unless you actually attached the persistent storage upgrade. (Hugging Face)


Option 2: Store uploads in a private dataset repo (recommended default)

This is the cleanest “Spaces-native” way to store user uploads long term.

Why it works

  • The Space is ephemeral, but the Hub repo is durable.
  • Hugging Face explicitly suggests using a dataset repo as the datastore for persistence. (Hugging Face)
  • The Hub Python library provides upload_file() and upload_folder() for dataset repos. (Hugging Face)

Setup steps

  1. Create a dataset repo on the Hub and set it to Private.

  2. Create a Hugging Face token with write access.

  3. Add it to your Space as a Secret named HF_TOKEN.

    • Hugging Face confirms secrets become environment variables inside the Space. (Hugging Face)
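
If you prefer to do step 1 from code, a one-time sketch with the Hub client (the repo name is an example):

from huggingface_hub import HfApi

api = HfApi()  # reads the token from the HF_TOKEN environment variable
# Create the private dataset repo once; exist_ok makes this safe to re-run.
api.create_repo(
    repo_id="yourname/private-user-uploads",  # example name
    repo_type="dataset",
    private=True,
    exist_ok=True,
)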

Best-practice repo layout

Use a predictable structure:

  • uploads/YYYY-MM-DD/<uuid>.<ext>
  • meta/YYYY-MM-DD/<uuid>.json

This gives you:

  • easy cleanup
  • easy per-day browsing
  • easy indexing later

Code pattern 1: Upload immediately (simple, safest)

Use this when upload volume is low to moderate.

# deps: huggingface_hub, gradio
import os, uuid, datetime, json, hashlib
from pathlib import Path
from huggingface_hub import HfApi
import gradio as gr

DATASET_REPO = os.environ["DATASET_REPO"]  # e.g. "yourname/private-user-uploads"
api = HfApi()

def sha256_file(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def save_upload(file_obj):
    # Copy the temporary Gradio upload into the private dataset repo,
    # then store a small JSON metadata record alongside it.
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN secret is missing.")

    src = Path(getattr(file_obj, "name", str(file_obj)))
    day = datetime.date.today().isoformat()
    uid = uuid.uuid4().hex
    ext = src.suffix.lower() or ".bin"

    remote_file = f"uploads/{day}/{uid}{ext}"
    remote_meta = f"meta/{day}/{uid}.json"

    meta = {
        "id": uid,
        "date": day,
        "original_name": src.name,
        "ext": ext,
        "sha256": sha256_file(src),
    }

    api.upload_file(
        repo_id=DATASET_REPO,
        repo_type="dataset",
        path_in_repo=remote_file,
        path_or_fileobj=str(src),
        token=token,
        commit_message=f"Add upload {uid}",
    )

    # Push the JSON metadata record; path_or_fileobj also accepts raw bytes.
    api.upload_file(
        repo_id=DATASET_REPO,
        repo_type="dataset",
        path_in_repo=remote_meta,
        path_or_fileobj=json.dumps(meta).encode("utf-8"),
        token=token,
        commit_message=f"Add metadata {uid}",
    )

    return f"Saved as {uid}"

demo = gr.Interface(save_upload, gr.File(label="Upload image or PDF"), gr.Textbox())
demo.launch()

This uses the official upload API and repo_type="dataset" support. (Hugging Face)
It uses HF_TOKEN via environment variable, which Hugging Face documents. (Hugging Face)

Code pattern 2: Batch commits with CommitScheduler (best practice at medium scale)

If you commit on every upload, your dataset repo history can become noisy. Hugging Face calls this out and provides CommitScheduler to push a local folder every N minutes. (Hugging Face)

Key rules Hugging Face documents for CommitScheduler:

  • It assumes append-only behavior. Overwriting or deleting can corrupt the repo. (Hugging Face)
  • Use at least ~5 minutes between commits to avoid polluting history. (Hugging Face)
  • Use unique file names (UUID) to avoid collisions across concurrent users and restarts. (Hugging Face)

You can apply the same pattern to uploaded files:

  • save uploads into staging/… locally
  • scheduler pushes staging/ into the dataset repo every 5 to 10 minutes
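
A sketch of that pattern, assuming the same private dataset repo as above (folder names and the 10-minute interval are examples; the token is read from the HF_TOKEN secret):

import shutil, uuid
from pathlib import Path
from huggingface_hub import CommitScheduler
import gradio as gr

STAGING = Path("staging")
STAGING.mkdir(exist_ok=True)

# Pushes everything under staging/ into uploads/ in the dataset repo every 10 minutes.
scheduler = CommitScheduler(
    repo_id="yourname/private-user-uploads",  # example name
    repo_type="dataset",
    folder_path=STAGING,
    path_in_repo="uploads",
    every=10,  # minutes; keep this at 5 or more to avoid polluting history
)

def save_upload(filepath: str) -> str:
    uid = uuid.uuid4().hex  # unique name avoids collisions across users and restarts
    dest = STAGING / f"{uid}{Path(filepath).suffix.lower() or '.bin'}"
    # Hold the scheduler lock while writing so a commit never sees a half-written file.
    with scheduler.lock:
        shutil.copy2(filepath, dest)
    return f"Queued as {uid}"

demo = gr.Interface(save_upload, gr.File(type="filepath", label="Upload"), gr.Textbox())
demo.launch()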

Gradio best practices (so you stop losing uploaded files)

1) Treat the upload path as temporary

Gradio is explicit about this: uploaded and generated files go to a cache directory, Gradio may delete them periodically, and the whole cache is deleted on server restart. (gradio.app)

So your handler should:

  • read the file immediately
  • copy it to /data or upload it to the dataset repo
  • not rely on the original temporary path later
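
A minimal handler sketch of that rule (PERSIST_DIR is illustrative; point it at /data/uploads or at a staging folder that gets pushed to a dataset repo):

import shutil
from pathlib import Path
import gradio as gr

PERSIST_DIR = Path("/data/uploads")  # example target; must be durable storage
PERSIST_DIR.mkdir(parents=True, exist_ok=True)

def handle(filepath: str) -> str:
    # The incoming path points into Gradio's cache and may be cleaned up later,
    # so copy the file out immediately and only keep the persistent path.
    dest = PERSIST_DIR / Path(filepath).name
    shutil.copy2(filepath, dest)
    return str(dest)

demo = gr.Interface(handle, gr.File(type="filepath", label="Upload"), gr.Textbox())
demo.launch()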

2) Control the cache location

Set GRADIO_TEMP_DIR if you have /data persistent storage. Gradio documents this env var. (gradio.app)

3) Be careful with delete_cache

If you set delete_cache=(frequency, age), Gradio will delete cached files older than age seconds every frequency seconds. (gradio.app)
That is good for privacy and disk usage, but it guarantees that uploaded files disappear unless you persist them elsewhere.
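
For example, this deletes cached files older than a day, checking every hour (the values are illustrative):

import gradio as gr

# Every 3600 s, delete cached files older than 86400 s (one day).
with gr.Blocks(delete_cache=(3600, 86400)) as demo:
    gr.File(label="Upload")

demo.launch()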


Security and product best practices for user uploads

Upload validation

  • Enforce size limits.
  • Restrict extensions and MIME types (PDF, PNG, JPG).
  • Consider hashing (sha256) to deduplicate repeated uploads.
  • Consider antivirus scanning if you allow public uploads.
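
A minimal validation sketch for the first three points (the limits and allowed types are examples to tune):

import hashlib
from pathlib import Path

ALLOWED_EXTS = {".pdf", ".png", ".jpg", ".jpeg"}  # example allow-list
MAX_BYTES = 20 * 1024 * 1024                      # example 20 MB cap

def validate_upload(filepath: str) -> str:
    src = Path(filepath)
    if src.suffix.lower() not in ALLOWED_EXTS:
        raise ValueError(f"Unsupported file type: {src.suffix}")
    if src.stat().st_size > MAX_BYTES:
        raise ValueError("File too large.")
    # Hash for deduplication; note that an extension check is not real MIME detection,
    # so use a content-sniffing library if you need stronger guarantees.
    return hashlib.sha256(src.read_bytes()).hexdigest()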

Access control

  • Keep uploads in a private dataset repo.
  • Never expose HF_TOKEN to the client.
  • Use Space Secrets for tokens. Hugging Face documents secrets-as-env-vars. (Hugging Face)

Retention and cleanup

If you store user files, define a policy:

  • delete after X days unless “saved”
  • per-user quota
  • periodic cleanup job (dataset-side or external)
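
A cleanup sketch against the dataset repo, assuming the uploads/YYYY-MM-DD/… layout from above (the 30-day window and repo name are examples):

import datetime
from huggingface_hub import HfApi

REPO = "yourname/private-user-uploads"  # example name
RETENTION_DAYS = 30                     # example policy
api = HfApi()                           # token read from HF_TOKEN

cutoff = datetime.date.today() - datetime.timedelta(days=RETENTION_DAYS)

for path in api.list_repo_files(REPO, repo_type="dataset"):
    parts = path.split("/")
    # Only touch files laid out as uploads/<date>/... or meta/<date>/...
    if len(parts) < 3 or parts[0] not in {"uploads", "meta"}:
        continue
    try:
        file_date = datetime.date.fromisoformat(parts[1])
    except ValueError:
        continue
    if file_date < cutoff:
        api.delete_file(path, repo_id=REPO, repo_type="dataset",
                        commit_message=f"Retention cleanup: {path}")

If you delete many files at once, batching the deletions into a single commit (create_commit with CommitOperationDelete) keeps the repo history cleaner.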

Avoid the Space repo for user data

The Space repo's Git/LFS cap is easy to hit, and the repo is meant for code, not user data. (Hugging Face Forums)


What I recommend you implement first

If you want “works reliably” with minimal moving parts:

  1. Create a private dataset repo as your upload bucket. (Hugging Face)
  2. Store HF_TOKEN as a Space Secret. (Hugging Face)
  3. On each upload, immediately upload_file() to the dataset repo. (Hugging Face)
  4. Add metadata and a stable ID so you can retrieve files later.
  5. If volume grows, switch to CommitScheduler and batch commits every 5–10 minutes, append-only. (Hugging Face)

Summary

  • Files disappear because the Space disk is ephemeral and Gradio upload paths are cache artifacts; both vanish on restart or cleanup. (Hugging Face)
  • Best default fix: store uploads in a private dataset repo using HF_TOKEN + upload_file(repo_type="dataset"). (Hugging Face)
  • Best practice for scale: use CommitScheduler batching, append-only, and avoid frequent commits. (Hugging Face)
  • If you pay for it, persistent storage mounted at /data is durable and you can also set HF_HOME and GRADIO_TEMP_DIR there. (Hugging Face)