Multiple Image Tokens
Hey,
thanks for the great work!
I am getting weird results, and in a debugging session I noticed that the LlavaProcessor produces multiple <image> tokens. According to this guide it seems like I should expect only one? https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LLaVa/Fine_tune_LLaVa_on_a_custom_dataset_(with_PyTorch_Lightning).ipynb
Any help is greatly appreciated; the code for processing the inputs is below.
For reference, the decoded input looks like this:
<s> USER: <image><image> ... <image> \nCaption this image. ASSISTANT: some animals are laying in the grass
The exact count of <image> tokens is 576. Perhaps that number suggests something I'm missing? (A minimal repro follows the collate code below.)
def llava_collate_fn(
    batch: list[ModelInput],
    processor: LlavaProcessor,
    train: bool = False,
) -> PreProcessedModelInput:
    """
    Collate function for LLaVA training. Can be used for the train, val and test dataloaders.

    Args:
        batch: A batch of ModelInput.
        processor: The LLaVA processor.
        train: If True, build training conversations that include the answer;
            otherwise build evaluation conversations with a generation prompt.
    """
    # we only feed the prompt to the model
    images: list = []
    texts: list[str] = []
    unsafe_answers = []
    safe_answers = []
    use_unsafes = []
    for example in batch:
        image, use_unsafe, unsafe, safe = example.image, example.use_unsafe, example.nsfw, example.safe
        images.append(image)
        unsafe_answers.append(unsafe)
        safe_answers.append(safe)
        use_unsafes.append(use_unsafe)
        prompt = unsafe if use_unsafe else safe
        texts.append(
            processor.apply_chat_template(
                conversation=get_train_conversation(prompt) if train else get_eval_conversation(),
                add_generation_prompt=not train,
            )
        )
    for i, img in enumerate(images):
        if not hasattr(img, "convert"):
            raise TypeError(f"Item {i} is not a PIL image. Got {type(img)}")
    processed_batch = processor(
        text=texts,
        images=images,
        padding=True,
        return_tensors="pt",
    )
    # mask padding tokens so they are ignored by the loss
    labels = processed_batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    processed_batch["labels"] = labels
    input_ids = processed_batch["input_ids"]
    attention_mask = processed_batch["attention_mask"]
    pixel_values = processed_batch["pixel_values"]
    labels = processed_batch["labels"]
    return PreProcessedModelInput(
        input_ids=input_ids,
        attention_mask=attention_mask,
        pixel_values=pixel_values,
        labels=labels,
        dict_labels={
            "nsfw": unsafe_answers,
            "safe": safe_answers,
            "use_unsafes": use_unsafes,
        },
    )
def get_train_conversation(caption: str):
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Caption this image."},
                {"type": "image"},
            ],
        },
        {
            "role": "assistant",
            "content": [
                {"type": "text", "text": caption},
            ],
        },
    ]
def get_eval_conversation():
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Caption this image."},
                {"type": "image"},
            ],
        }
    ]
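To rule out my collate function, here is a minimal repro of what I'm seeing (a sketch assuming the llava-hf/llava-1.5-7b-hf checkpoint from the notebook above; the exact count may depend on the transformers version):

from PIL import Image
from transformers import LlavaProcessor

processor = LlavaProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Caption this image."},
            {"type": "image"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image = Image.new("RGB", (336, 336))  # dummy image, just to drive the processor
inputs = processor(text=prompt, images=image, return_tensors="pt")

tokens = processor.tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(tokens.count("<image>"))  # prints 576 for me

So the expansion seems to happen inside the processor itself, not in my code.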
Hi, this model is pretty outdated and does not work well with multiple images.
At the time of writing, I recommend Qwen3-VL: https://huggingface.co/collections/Qwen/qwen3-vl. It comes in different sizes, from 2 billion parameters all the way to 235 billion parameters. It also comes in "instruct" vs. "thinking" versions, meaning it either directly generates a response or first produces chain-of-thought (CoT) reasoning before arriving at the answer. Additionally, it comes in different formats: there's a float16 version, an fp8 version, and a GGUF version.
See the docs on how to use it: https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct?library=transformers.
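A rough quick-start sketch along the lines of the snippet on that docs page (it assumes a recent transformers release with Qwen3-VL support; the image URL is just an example, replace it with your own):

from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Qwen/Qwen3-VL-4B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "Caption this image."},
        ],
    }
]

# the pipeline applies the chat template, processes the image and generates the caption
print(pipe(text=messages, max_new_tokens=64))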
Hi Niels,
thanks for the quick response!
I'm working on a research topic and need to stick with this version of the model.
Anyhow, I'm only providing a single image and still see multiple tokens. I was essentially wondering whether these <image> tokens represent patches. I see exactly 576 of them; with a 336x336 input and patch size 14, that means a 24x24 feature map, so it would all make sense, but I struggled to find answers online.
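For anyone else who lands here, this is the small sanity check I used to convince myself of the patch interpretation (assuming the llava-hf/llava-1.5-7b-hf checkpoint; attribute names may differ slightly across transformers versions). The vision tower is CLIP-L/14 at 336px, and the processor expands the single <image> placeholder to one token per patch:

from transformers import LlavaConfig

config = LlavaConfig.from_pretrained("llava-hf/llava-1.5-7b-hf")

image_size = config.vision_config.image_size   # 336
patch_size = config.vision_config.patch_size   # 14
num_patches = (image_size // patch_size) ** 2  # (336 // 14) ** 2 = 24 * 24 = 576
print(num_patches)  # matches the number of <image> placeholders in input_ids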