Multiple Image Tokens
Hey,
thanks for the great work!
I am getting weird results, and in a debugging session I noticed that the LlavaProcessor produces multiple <image> tokens. According to this guide it seems like I should expect only one? https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LLaVa/Fine_tune_LLaVa_on_a_custom_dataset_(with_PyTorch_Lightning).ipynb
Any help is greatly appreciated; the code for processing the inputs is below.
For reference, the decoded input looks like this:
<s> USER: <image><image> ... <image> \nCaption this image. ASSISTANT: some animals are laying in the grass
The exact count of <image> tokens is 576. Perhaps that number suggests something I'm missing? (A minimal repro follows the collate code below.)
def llava_collate_fn(
    batch: list[ModelInput],
    processor: LlavaProcessor,
    train: bool = False,
) -> PreProcessedModelInput:
    """
    Collate function for LLaVA training. Can be used for the train, val and test dataloaders.

    Args:
        batch: A batch of ModelInput.
        processor: The LLaVA processor.
        train: If True, build training conversations that include the answer;
            otherwise build evaluation conversations with a generation prompt.
    """
    # we only feed the prompt to the model
    images: list = []
    texts: list[str] = []
    unsafe_answers = []
    safe_answers = []
    use_unsafes = []
    for example in batch:
        image, use_unsafe, unsafe, safe = example.image, example.use_unsafe, example.nsfw, example.safe
        images.append(image)
        unsafe_answers.append(unsafe)
        safe_answers.append(safe)
        use_unsafes.append(use_unsafe)
        prompt = unsafe if use_unsafe else safe
        texts.append(
            processor.apply_chat_template(
                conversation=get_train_conversation(prompt) if train else get_eval_conversation(),
                add_generation_prompt=not train,
            )
        )
    for i, img in enumerate(images):
        if not hasattr(img, "convert"):
            raise TypeError(f"Item {i} is not a PIL image. Got {type(img)}")
    processed_batch = processor(
        text=texts,
        images=images,
        padding=True,
        return_tensors="pt",
    )
    # mask padding tokens so they are ignored by the loss
    labels = processed_batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    processed_batch["labels"] = labels
    input_ids = processed_batch["input_ids"]
    attention_mask = processed_batch["attention_mask"]
    pixel_values = processed_batch["pixel_values"]
    labels = processed_batch["labels"]
    return PreProcessedModelInput(
        input_ids=input_ids,
        attention_mask=attention_mask,
        pixel_values=pixel_values,
        labels=labels,
        dict_labels={
            "nsfw": unsafe_answers,
            "safe": safe_answers,
            "use_unsafes": use_unsafes,
        },
    )
def get_train_conversation(caption: str):
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Caption this image."},
                {"type": "image"},
            ],
        },
        {
            "role": "assistant",
            "content": [
                {"type": "text", "text": caption},
            ],
        },
    ]
def get_eval_conversation():
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Caption this image."},
                {"type": "image"},
            ],
        }
    ]
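To rule out my collate function, here is a minimal repro of what I'm seeing (a sketch assuming the llava-hf/llava-1.5-7b-hf checkpoint from the notebook above; the exact count may depend on the transformers version):

from PIL import Image
from transformers import LlavaProcessor

processor = LlavaProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Caption this image."},
            {"type": "image"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

image = Image.new("RGB", (336, 336))  # dummy image, just to drive the processor
inputs = processor(text=prompt, images=image, return_tensors="pt")

tokens = processor.tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(tokens.count("<image>"))  # prints 576 for me

So the expansion seems to happen inside the processor itself, not in my code.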
Hi, this model is pretty outdated and does not work well with multiple images.
At the time of writing, I recommend Qwen3-VL: https://huggingface.co/collections/Qwen/qwen3-vl. It comes in different sizes, from 2 billion parameters all the way to 235 billion parameters. It also comes in "instruct" vs. "thinking" versions, meaning it either directly generates a response or first produces chain-of-thought (CoT) reasoning before arriving at the answer. Additionally, it comes in different formats: there's a float16 version, an fp8 version, and a GGUF version.
See the docs on how to use it: https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct?library=transformers.
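A rough quick-start sketch along the lines of the snippet on that docs page (it assumes a recent transformers release with Qwen3-VL support; the image URL is just an example, replace it with your own):

from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Qwen/Qwen3-VL-4B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "Caption this image."},
        ],
    }
]

# the pipeline applies the chat template, processes the image and generates the caption
print(pipe(text=messages, max_new_tokens=64))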
Hi Niels,
thanks for the quick response!
I'm working on a research topic and need to stick with this version of the model.
Anyhow, I'm only providing a single image and still see multiple tokens. I was essentially wondering whether these <image> tokens represent patches. I see exactly 576 of them; with a 336x336 input and patch size 14, that means a 24x24 feature map, so it would all make sense, but I struggled to find answers online.
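For anyone else who lands here, this is the small sanity check I used to convince myself of the patch interpretation (assuming the llava-hf/llava-1.5-7b-hf checkpoint; attribute names may differ slightly across transformers versions). The vision tower is CLIP-L/14 at 336px, and the processor expands the single <image> placeholder to one token per patch:

from transformers import LlavaConfig

config = LlavaConfig.from_pretrained("llava-hf/llava-1.5-7b-hf")

image_size = config.vision_config.image_size   # 336
patch_size = config.vision_config.patch_size   # 14
num_patches = (image_size // patch_size) ** 2  # (336 // 14) ** 2 = 24 * 24 = 576
print(num_patches)  # matches the number of <image> placeholders in input_ids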