Commit d2dccbb (verified) · committed by Thirawarit · 1 Parent(s): 61b6e12

Create README.md

---
language:
- th
metrics:
- sacrebleu
base_model:
- Qwen/Qwen2-VL-7B-Instruct
pipeline_tag: visual-question-answering
---

# Pathumma-llm-vision-2.0.0-preview

## Model Overview
Pathumma-llm-vision-2.0.0-preview is a multi-modal large language model fine-tuned for Visual Question Answering (VQA) and Image Captioning. Built on Qwen2-VL-7B-Instruct, it combines image and text processing to understand and generate multi-modal content.

- **Model Name**: Pathumma-llm-vision-2.0.0-preview
- **Base Model**: Qwen/Qwen2-VL-7B-Instruct
- **Architecture**: Multi-modal LLM (Visual Language Model)
- **Parameters**: 7 Billion
- **Organization**: NECTEC
- **License**: [Specify License]

## Intended Use
- **Primary Use Cases**:
  - Visual Question Answering (VQA)
  - Image Captioning
- **Intended Users**: Developers, researchers, and AI practitioners working on multi-modal tasks.
- **Possible Applications**: Educational tools, accessibility applications, interactive visual content generation.

## Model Description
Pathumma-llm-vision-2.0.0-preview is designed to perform multi-modal tasks by integrating visual and textual information. The model is fine-tuned on diverse datasets to improve its ability to understand and generate content that aligns with both image and text inputs.

## Training Data
The model was fine-tuned on several datasets:
- **Thai Image Caption**: Data sourced from image captioning competitions on Kaggle.
- **Small-Thai-Wikipedia**: Articles in Thai from Wikipedia.

### Dataset Size
- **Training Dataset Size**: 132,946 examples
- **Validation Dataset Size**: - examples

## Training Details
- **Hardware Used**:
  - **HPC Cluster**: Lanta
  - **Number of Nodes**: 4 Nodes
  - **GPUs per Node**: 4 GPUs
  - **Total GPUs Used**: 16 GPUs
- **Fine-tuning Duration**: 20 hours, 34 minutes, and 43 seconds (excluding evaluation)

## Evaluation Results

| Model | Encoder | Decoder | IPU24-dataset (test)<br>(Sentence SacreBLEU) |
|-------|---------|---------|----------------------------------------------|
| Pathumma-llm-vision-beta-0.0.0 | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | 13.45412 |
| Pathumma-llm-vision-1.0.0 | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | 17.66370 |
| Pathumma-llm-vision-2.0.0-preview | Qwen2-VL-7B-Instruct | Qwen2-VL-7B-Instruct | **19.112962** |

**Note**: Models that were not specifically fine-tuned on the IPU24 dataset may be less representative of IPU24 performance.

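The table reports an averaged sentence-level SacreBLEU score. The exact evaluation script is not part of this card; the snippet below is only a minimal sketch of how such a score can be computed with the `sacrebleu` package, and the prediction/reference strings in it are illustrative, not taken from the IPU24 data.

```python
# Minimal sketch: average sentence-level SacreBLEU over a test set.
# The official IPU24 evaluation may use a different (e.g. Thai-specific) tokenizer;
# this uses sacrebleu's default settings.
import sacrebleu

predictions = ["อาคารสีน้ำตาลขนาดใหญ่ที่มีต้นไม้อยู่ด้านข้าง"]   # model outputs (illustrative)
references  = ["อาคารสีน้ำตาลขนาดใหญ่ริมถนนที่มีต้นไม้ล้อมรอบ"]  # gold captions (illustrative)

scores = [
    sacrebleu.sentence_bleu(pred, [ref]).score
    for pred, ref in zip(predictions, references)
]
print(sum(scores) / len(scores))
```
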
## Required Libraries

Before you start, ensure you have the following libraries installed:

```bash
pip install transformers==4.48.1 accelerate peft bitsandbytes qwen-vl-utils[decord]==0.0.8
```

The usage example below also relies on `Pillow` and `matplotlib` for image loading and display; install them as well if they are not already available in your environment.

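As an optional sanity check, you can confirm that the pinned versions are installed before running the example; this sketch uses only the standard library:

```python
# Optional: verify the installed versions of the pinned packages.
from importlib.metadata import version

for pkg in ["transformers", "peft", "bitsandbytes", "qwen-vl-utils"]:
    print(pkg, version(pkg))
```
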
## Usage
We provide an [inference tutorial](https://colab.research.google.com/drive/1URMEJr2P_9JK0BvBzFv4NN4824iAf0y4#scrollTo=_S-LoNKcv8ww).
To use the model with the Hugging Face `transformers` library:

```python
import re
import time

import torch
import matplotlib.pyplot as plt
from PIL import Image

from peft import get_peft_model, LoraConfig
from qwen_vl_utils import process_vision_info
from transformers import (
    BitsAndBytesConfig,
    Qwen2VLForConditionalGeneration,
    Qwen2VLProcessor,
)
```

```python
MODEL_ID = "nectec/Pathumma-llm-vision-2.0.0-preview"
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
USE_QLORA = True

lora_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.05,
    r=8,
    bias="none",
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

if USE_QLORA:
    bnb_config = BitsAndBytesConfig(
        load_in_8bit=True,
        # load_in_4bit=True,
        # bnb_4bit_use_double_quant=True,
        # bnb_4bit_quant_type="nf4",
        # bnb_4bit_compute_dtype=torch.bfloat16,
    )

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    device_map="auto",
    quantization_config=bnb_config if USE_QLORA else None,
    torch_dtype=torch.bfloat16,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Bound the number of visual tokens the processor produces per image.
MIN_PIXELS = 256 * 28 * 28
MAX_PIXELS = 1280 * 28 * 28

processor = Qwen2VLProcessor.from_pretrained(MODEL_ID, min_pixels=MIN_PIXELS, max_pixels=MAX_PIXELS)

def encode_via_processor(image, instruction, question):
    """Build generation inputs from a PIL image (or a local file path), a system instruction, and a question."""
    if isinstance(image, str):
        local_path = image
        image = Image.open(local_path)

    messages = [
        {"role": "system", "content": [{"type": "text", "text": instruction}]},
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": question},
            ],
        },
    ]

    text = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
    ).strip()

    def convert_img(image):
        # Upscale images whose width or height is smaller than one patch-merge cell.
        width, height = image.size
        factor = processor.image_processor.patch_size * processor.image_processor.merge_size
        if width < factor:
            image = image.copy().resize((factor, factor * height // width))
        elif height < factor:
            image = image.copy().resize((factor * width // height, factor))
        return image

    image_inputs = [convert_img(image)]

    encoding = processor(
        text=text,
        images=image_inputs,
        videos=None,
        return_tensors="pt",
    )

    ## Remove batch dimension
    # encoding = {k: v.squeeze(dim=0) for k, v in encoding.items()}
    encoding = {k: v.to(DEVICE) for k, v in encoding.items()}
    inputs = encoding
    return inputs


def encode_via_processor_extlib(local_path, instruction, question):
    """Like encode_via_processor, but lets qwen_vl_utils.process_vision_info load and resize the image."""
    img_path = "file://" + local_path
    messages = [
        {"role": "system", "content": [{"type": "text", "text": instruction}]},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": img_path},
                {"type": "text", "text": question},
            ],
        },
    ]

    text = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
    ).strip()

    image_inputs, video_inputs = process_vision_info(messages)

    encoding = processor(
        text=text,
        images=image_inputs,
        videos=video_inputs,
        return_tensors="pt",
    )

    ## Remove batch dimension
    # encoding = {k: v.squeeze(dim=0) for k, v in encoding.items()}
    encoding = {k: v.to(DEVICE) for k, v in encoding.items()}
    inputs = encoding
    return inputs


def inference(inputs):
    """Generate with the model and return (raw decoded text, assistant reply, latency in seconds)."""
    start_time = time.time()
    model.eval()
    with torch.inference_mode():
        # Generate
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=256,
            temperature=.1,
            # repetition_penalty=1.2,
            # top_k=2,
            # top_p=1,
        )
    generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
    end_time = time.time()

    ## Get latency_time...
    latency_time = end_time - start_time

    # Keep only the text after the final "assistant" marker in each decoded sequence.
    answer_prompt = [*map(
        lambda x: re.sub(r"assistant(:|\n)?", "<||SEP-ASSIST||>", x).split('<||SEP-ASSIST||>')[-1].strip(),
        generated_texts
    )]
    predict_output = generated_texts[0]
    response = re.sub(r"assistant(:|\n)?", "<||SEP-ASSIST||>", predict_output).split('<||SEP-ASSIST||>')[-1].strip()

    return predict_output, response, round(latency_time, 3)


instruction = "You are a helpful assistant."

def response_image(img_path, question, instruction=instruction):
    image = Image.open(img_path)
    _, response, latency_time = inference(encode_via_processor(image=image, instruction=instruction, question=question))
    print("RESPONSE".center(60, "="))
    print(response)
    print(latency_time, "sec.")
    print("IMAGE".center(60, "="))
    plt.imshow(image)
    plt.show()

# Output processing (depends on task requirements)
question = "อธิบายภาพนี้"  # "Describe this image."
img_path = "/content/The Most Beautiful Public High School in Every State in America.jpg"
response_image(img_path, question)
```

Expected output:

```
==========================RESPONSE==========================
อาคารสีน้ำตาลขนาดใหญ่ที่มีเสาไฟฟ้าอยู่ด้านหน้าและมีต้นไม้อยู่ด้านข้าง
7.987 sec.
===========================IMAGE============================
<IMAGE_MATPLOTLIB>
```

(The generated caption reads: "A large brown building with an electricity pole in front and trees to the side.")
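
The `encode_via_processor_extlib` helper defined above is not exercised in the example. A minimal sketch of calling it instead, where the image path and question are illustrative rather than taken from the tutorial, could look like this:

```python
# Hypothetical follow-up: encode via qwen_vl_utils.process_vision_info instead of PIL.
question = "ภาพนี้มีอะไรบ้าง"  # "What is in this image?"
img_path = "/content/example.jpg"  # illustrative local path

inputs = encode_via_processor_extlib(img_path, instruction, question)
_, response, latency_time = inference(inputs)
print(response, f"({latency_time} sec.)")
```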

## Limitations and Biases
- The model may exhibit biases due to the training data, which might not be fully representative of all contexts.
- Performance may degrade on unfamiliar images or non-standard question formats.

## Ethical Considerations
- The model should not be used to generate misleading information or in ways that violate privacy.
- Consider fairness and minimize bias when using the model for language and image processing tasks.

## Citation
If you use this model, please cite it as follows:

```bibtex
@misc{PathummaVision,
  author = {Thirawarit Pitiphiphat and NECTEC Team},
  title = {nectec/Pathumma-llm-vision-2.0.0-preview},
  year = {2025},
  url = {https://huggingface.co/nectec/Pathumma-llm-vision-2.0.0-preview}
}
```

```bibtex
@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}
```


## Contributors
**Vision Team**
Thirawarit Pitiphiphat ([email protected])<br>
Theerasit Issaranon ([email protected])

## Contact
For questions or support, contact the team on Discord: **https://discord.gg/3WJwJjZt7r**.
