bpiyush committed · verified

Commit 8f0466d · 1 Parent(s): dc09495

Upload README.md with huggingface_hub

Files changed (1): README.md (+49 -1)
README.md CHANGED
@@ -1,6 +1,36 @@
+---
+license: apache-2.0
+base_model: meta-llama/Llama-3.1-8B
+tags:
+- video-understanding
+- multimodal
+- vision-language
+- video-encoding
+- text-encoding
+- feature-extraction
+- video-text-alignment
+library_name: transformers
+pipeline_tag: feature-extraction
+language:
+- en
+datasets:
+- nli
+- ego4d
+metrics:
+- embedding-quality
+- video-text-alignment
+---
+
 # TARA Model
 
-TARA (Tarsier-based Audio-Visual Representation) is a multimodal model for video and text understanding.
+TARA (Time-Aware Retrieval Adaptation) is a multimodal model for video and text understanding. It can encode both videos and text into a shared embedding space, enabling tasks like video-text retrieval, video understanding, and cross-modal alignment.
+
+## Model Details
+
+- **Base Model**: Tarsier-7B (based on Llama-3.1-8B)
+- **Architecture**: Multimodal encoder with vision and language components
+- **Supported Modalities**: Video, Images, Text
+- **Max Video Frames**: 32 frames
 
 ## Installation
 
@@ -31,3 +61,21 @@ text = "someone is folding a paper"
 with torch.no_grad():
     text_emb = model.encode_text(text)
 ```
+
+## Usage
+
+The model provides two main encoding methods:
+
+- `encode_vision()`: Encodes video or image inputs into embeddings
+- `encode_text()`: Encodes text inputs into embeddings
+
+Both methods return embeddings in a shared space, enabling cross-modal tasks.
+
+## Citation
+
+If you use this model, please cite the original Tarsier work and this implementation.
+
+## License
+
+Apache 2.0
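The Usage section added by this commit says that `encode_vision()` and `encode_text()` return embeddings in a shared space, so cross-modal tasks like video-text retrieval reduce to ranking candidates by a similarity score. A minimal sketch of that ranking step, using cosine similarity over tiny dummy vectors (real embeddings would come from the TARA model and be much higher-dimensional):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Dummy stand-ins for model.encode_vision(video) and model.encode_text(text).
video_emb = [0.9, 0.1, 0.3]
text_embs = {
    "someone is folding a paper": [0.8, 0.2, 0.4],
    "a dog runs on the beach": [0.1, 0.9, 0.2],
}

# Retrieval: rank candidate captions by similarity to the video embedding.
ranked = sorted(
    text_embs,
    key=lambda t: cosine_similarity(video_emb, text_embs[t]),
    reverse=True,
)
print(ranked[0])  # best-matching caption
```

The same ranking works in the other direction (text query against a set of video embeddings), which is what "video-text alignment" in the model tags refers to.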
+