Java知识分享网 - 轻松学习从此开始!    

Java知识分享网

        
AI编程,程序员挑战年入30~100万高级指南 - 职业规划
SpringBoot+SpringSecurity+Vue权限系统高级实战课程        

IDEA永久激活

Java微信小程序电商实战课程(SpringBoot+VUe)

     

AI人工智能学习大礼包

     

PyCharm永久激活

66套java实战课程无套路领取

     

Cursor+Claude AI编程 1天快速上手视频教程

     
当前位置: 主页 > Java文档 > 人工智能AI >

【计算机视觉】基于Transformer的视频图像统一分割模型:SAM 2的流式记忆架构与大规模SA-V数据集


时间:2025-12-12 09:46来源:http://www.java1234.com 作者:转载  侵权举报
【计算机视觉】基于Transformer的视频图像统一分割模型:SAM 2的流式记忆架构与大规模SA-V数据集构建
失效链接处理
【计算机视觉】基于Transformer的视频图像统一分割模型:SAM 2的流式记忆架构与大规模SA-V数据集构建 PDF 下载

 
 
相关截图:
 
主要内容:
 
1 Introduction
 
Segment Anything (SA) introduced a foundation model for promptable segmentation in images (Kirillov et al.,
2023). However an image is only a static snapshot of the real world in which visual segments can exhibit
complex motion, and with the rapid growth of multimedia content, a significant portion is now recorded
with a temporal dimension, particularly in video data. Many important applications in AR/VR, robotics,
autonomous vehicles, and video editing require temporal localization beyond image-level segmentation. We
believe a universal visual segmentation system should be applicable to both images and videos.
Segmentation in video aims to determine the spatio-temporal extent of entities, which presents unique
challenges beyond those in images. Entities can undergo significant changes in appearance due to motion,
deformation, occlusion, lighting changes, and other factors. Videos often have lower quality than images due
to camera motion, blur, and lower resolution. Further, efficient processing of a large number of frames is a
key challenge. While SA successfully addresses segmentation in images, existing video segmentation models
and datasets fall short in providing a comparable capability to “segment anything in videos”.
We introduce the Segment Anything Model 2 (SAM 2), a unified model for video and image segmentation (we
consider an image as a single-frame video). Our work includes a task, model, and dataset (see Fig. 1).
We focus on the Promptable Visual Segmentation (PVS) task that generalizes image segmentation to the
video domain. The task takes as input points, boxes, or masks on any frame of the video to define a segment of
interest for which the spatio-temporal mask (i.e., a ‘masklet’) is to be predicted. Once a masklet is predicted,
it can be iteratively refined by providing prompts in additional frames.
Our model 4) produces segmentation masks of the object of interest, in single images and across video
frames. SAM 2 is equipped with a memory that stores information about the object and previous interactions,
which allows it to generate masklet predictions throughout the video, and also effectively correct these based
on the stored memory context of the object from previously observed frames. Our streaming architecture is a
natural generalization of SAM to the video domain, processing video frames one at a time, equipped with a
memory attention module to attend to the previous memories of the target object. When applied to images,
the memory is empty and the model behaves like SAM.


 

------分隔线----------------------------


锋哥推荐