Upload README.md with huggingface_hub
README.md
CHANGED
````diff
@@ -20,6 +20,7 @@ Disclaimer: The team releasing Mask2Former did not write a model card for this m
 Mask2Former addresses instance, semantic and panoptic segmentation with the same paradigm: by predicting a set of masks and corresponding labels. Hence, all 3 tasks are treated as if they were instance segmentation. Mask2Former outperforms the previous SOTA,
 [MaskFormer](https://arxiv.org/abs/2107.06278) both in terms of performance and efficiency by (i) replacing the pixel decoder with a more advanced multi-scale deformable attention Transformer, (ii) adopting a Transformer decoder with masked attention to boost performance
 without introducing additional computation and (iii) improving training efficiency by calculating the loss on subsampled points instead of whole masks.
+
 In the paper [Mask2Former for Video Instance Segmentation
 ](https://arxiv.org/abs/2112.10764), the authors have shown that Mask2Former also achieves state-of-the-art performance on video instance segmentation without modifying the architecture, the loss or even the training pipeline.
 
@@ -34,9 +35,9 @@ You can use this particular checkpoint for instance segmentation. See the [model
 Here is how to use this model:
 
 ```python
-import requests
 import torch
-
+import torchvision
+from huggingface_hub import hf_hub_download
 from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation
 
 
@@ -46,7 +47,7 @@ model = Mask2FormerForUniversalSegmentation.from_pretrained("facebook/video-mask
 
 file_path = hf_hub_download(repo_id="shivi/video-demo", filename="cars.mp4", repo_type="dataset")
 video = torchvision.io.read_video(file_path)[0]
-video_frames = [image_processor(images=frame, return_tensors="pt"
+video_frames = [image_processor(images=frame, return_tensors="pt").pixel_values for frame in video]
 video_input = torch.cat(video_frames)
 
 with torch.no_grad():
````
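For reference, here is a minimal sketch of what the full updated example from the README plausibly looks like once these changes are applied. It only reuses lines visible in the diff plus a few assumptions: the checkpoint id is truncated to `facebook/video-mask...` in the hunk header, so a hypothetical placeholder is used; the `image_processor` loading line and the forward pass inside `torch.no_grad()` are not shown in the diff and are filled in as reasonable guesses; the printed output fields are the standard ones on the `transformers` Mask2Former output class.

```python
import torch
import torchvision
from huggingface_hub import hf_hub_download
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

# Hypothetical placeholder: the diff truncates the real checkpoint id ("facebook/video-mask...")
checkpoint = "<video-mask2former-checkpoint>"
image_processor = AutoImageProcessor.from_pretrained(checkpoint)  # assumed; not visible in the diff
model = Mask2FormerForUniversalSegmentation.from_pretrained(checkpoint)

# Download the demo clip and read it as a (num_frames, height, width, channels) uint8 tensor
file_path = hf_hub_download(repo_id="shivi/video-demo", filename="cars.mp4", repo_type="dataset")
video = torchvision.io.read_video(file_path)[0]

# Preprocess each frame separately, then stack the per-frame pixel values into one batch
video_frames = [image_processor(images=frame, return_tensors="pt").pixel_values for frame in video]
video_input = torch.cat(video_frames)

with torch.no_grad():
    outputs = model(pixel_values=video_input)  # assumed forward pass; hidden context in the diff

# Mask2Former predicts a set of masks and a class label for each mask query
print(outputs.class_queries_logits.shape)  # (batch_size, num_queries, num_labels + 1)
print(outputs.masks_queries_logits.shape)  # (batch_size, num_queries, mask_height, mask_width)
```

Stacking the per-frame `pixel_values` into a single batch lets the unchanged image model process the whole clip in one forward pass, which is consistent with the card's claim that video instance segmentation requires no changes to the architecture, the loss or the training pipeline.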