Upload README.md with huggingface_hub
README.md CHANGED
````diff
@@ -1,6 +1,36 @@
+---
+license: apache-2.0
+base_model: meta-llama/Llama-3.1-8B
+tags:
+- video-understanding
+- multimodal
+- vision-language
+- video-encoding
+- text-encoding
+- feature-extraction
+- video-text-alignment
+library_name: transformers
+pipeline_tag: feature-extraction
+language:
+- en
+datasets:
+- nli
+- ego4d
+metrics:
+- embedding-quality
+- video-text-alignment
+---
+
 # TARA Model
 
-TARA (
+TARA (Time-Aware Retrieval Adaptation) is a multimodal model for video and text understanding. It can encode both videos and text into a shared embedding space, enabling tasks like video-text retrieval, video understanding, and cross-modal alignment.
+
+## Model Details
+
+- **Base Model**: Tarsier-7B (based on Llama-3.1-8B)
+- **Architecture**: Multimodal encoder with vision and language components
+- **Supported Modalities**: Video, Images, Text
+- **Max Video Frames**: 32 frames
 
 ## Installation
 
@@ -31,3 +61,21 @@ text = "someone is folding a paper"
 with torch.no_grad():
     text_emb = model.encode_text(text)
 ```
+
+## Usage
+
+The model provides two main encoding methods:
+
+- `encode_vision()`: Encodes video or image inputs into embeddings
+- `encode_text()`: Encodes text inputs into embeddings
+
+Both methods return embeddings in a shared space, enabling cross-modal tasks.
+
+## Citation
+
+If you use this model, please cite the original Tarsier work and this implementation.
+
+## License
+
+Apache 2.0
+
````
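Because `encode_vision()` and `encode_text()` return embeddings in the same space, video-text retrieval reduces to a similarity search. Below is a minimal sketch of ranking candidate captions against a video embedding; the random tensors stand in for real model outputs, and the 768-dimensional shape is an assumption for illustration, not a documented size.

```python
import torch
import torch.nn.functional as F

# Stand-ins for real outputs of model.encode_vision(video) and
# model.encode_text(captions); the 768-dim shape is assumed.
video_emb = torch.randn(1, 768)   # one encoded video
text_embs = torch.randn(3, 768)   # three candidate captions

# Cosine similarity in the shared space scores video-text alignment;
# the highest-scoring caption is the retrieval result.
sims = F.cosine_similarity(video_emb, text_embs, dim=-1)
best = sims.argmax().item()
print(f"best caption index: {best}, similarity: {sims[best].item():.3f}")
```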
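The 32-frame cap in Model Details implies that longer clips must be subsampled before they are passed to the vision encoder. A minimal sketch of one common strategy, uniform sampling; the helper below is illustrative, not part of the model's API.

```python
import numpy as np

def sample_frame_indices(num_frames: int, max_frames: int = 32) -> np.ndarray:
    """Uniformly choose at most `max_frames` frame indices from a clip."""
    if num_frames <= max_frames:
        return np.arange(num_frames)
    # Evenly spaced indices from the first to the last frame.
    return np.linspace(0, num_frames - 1, num=max_frames).round().astype(np.int64)

# e.g. a 300-frame clip is reduced to 32 evenly spaced frames
print(sample_frame_indices(300))
```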