[Computer Vision] A Unified Transformer-Based Model for Video and Image Segmentation: SAM 2's Streaming Memory Architecture and the Large-Scale SA-V Dataset (PDF Download)
Main content:
 
1 Introduction
 
Segment Anything (SA) introduced a foundation model for promptable segmentation in images (Kirillov et al.,
2023). However, an image is only a static snapshot of the real world in which visual segments can exhibit
complex motion, and with the rapid growth of multimedia content, a significant portion is now recorded
with a temporal dimension, particularly in video data. Many important applications in AR/VR, robotics,
autonomous vehicles, and video editing require temporal localization beyond image-level segmentation. We
believe a universal visual segmentation system should be applicable to both images and videos.
Segmentation in video aims to determine the spatio-temporal extent of entities, which presents unique
challenges beyond those in images. Entities can undergo significant changes in appearance due to motion,
deformation, occlusion, lighting changes, and other factors. Videos often have lower quality than images due
to camera motion, blur, and lower resolution. Further, efficient processing of a large number of frames is a
key challenge. While SA successfully addresses segmentation in images, existing video segmentation models
and datasets fall short in providing a comparable capability to “segment anything in videos”.
We introduce the Segment Anything Model 2 (SAM 2), a unified model for video and image segmentation (we
consider an image as a single-frame video). Our work includes a task, model, and dataset (see Fig. 1).
We focus on the Promptable Visual Segmentation (PVS) task that generalizes image segmentation to the
video domain. The task takes as input points, boxes, or masks on any frame of the video to define a segment of
interest for which the spatio-temporal mask (i.e., a ‘masklet’) is to be predicted. Once a masklet is predicted,
it can be iteratively refined by providing prompts in additional frames.
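To make the interaction pattern concrete, the sketch below walks through the PVS loop in Python. It is a minimal toy, not the paper's code: Prompt and predict_masklet are hypothetical names, and per-frame string labels stand in for real spatio-temporal masks. The shape of the loop is the point: an initial prompt on one frame defines the segment, a masklet is predicted over all frames, and later prompts on other frames refine it.

from dataclasses import dataclass

@dataclass(frozen=True)
class Prompt:
    frame_idx: int         # prompts may be placed on any frame of the video
    kind: str              # "point", "box", or "mask"
    positive: bool = True  # negative prompts remove regions from the segment

def predict_masklet(num_frames: int, prompts: list[Prompt]) -> dict[int, str]:
    """Toy stand-in: labels each frame instead of predicting a real mask."""
    prompted = {p.frame_idx for p in prompts}
    return {f: ("refined" if f in prompted else "propagated")
            for f in range(num_frames)}

# First interaction: a single click on frame 0 defines the object of interest.
masklet = predict_masklet(num_frames=6, prompts=[Prompt(0, "point")])

# Iterative refinement: a corrective (negative) click on frame 4 updates the
# masklet, which is re-predicted for the whole video.
masklet = predict_masklet(num_frames=6,
                          prompts=[Prompt(0, "point"),
                                   Prompt(4, "point", positive=False)])
print(masklet)  # frames 0 and 4 are "refined", the rest "propagated"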
Our model (§4) produces segmentation masks of the object of interest, in single images and across video
frames. SAM 2 is equipped with a memory that stores information about the object and previous interactions,
which allows it to generate masklet predictions throughout the video, and also effectively correct these based
on the stored memory context of the object from previously observed frames. Our streaming architecture is a
natural generalization of SAM to the video domain, processing video frames one at a time, equipped with a
memory attention module to attend to the previous memories of the target object. When applied to images,
the memory is empty and the model behaves like SAM.
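The streaming behaviour can be sketched in a few lines of Python. This is a toy illustration under loose assumptions, not the paper's implementation: random vectors stand in for the image encoder's features, and a single dot-product attention step stands in for the memory attention module. What it does show faithfully is the control flow: frames are processed one at a time, each frame's features are conditioned on a bank of memories from previous frames, and when the bank is empty the features pass through unchanged, matching the SAM-like image behaviour.

import numpy as np

rng = np.random.default_rng(0)
PROJ = rng.standard_normal((64, 256)).astype(np.float32)  # fixed toy mask head

def encode_frame(frame):
    # Stand-in for the image encoder: one 256-d feature vector per frame.
    return rng.standard_normal(256).astype(np.float32)

def memory_attention(feat, memory_bank):
    # With an empty memory bank the features pass through unchanged, so the
    # model behaves like image-only SAM, as described above.
    if not memory_bank:
        return feat
    mem = np.stack(memory_bank)               # (num_memories, 256)
    scores = mem @ feat / np.sqrt(feat.size)  # dot-product attention scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return feat + weights @ mem               # condition on past object memories

def decode_mask(feat):
    # Stand-in for the mask decoder: threshold a fixed linear projection.
    return (PROJ @ feat > 0).reshape(8, 8)

video = [rng.standard_normal((8, 8, 3)) for _ in range(5)]  # toy 5-frame clip
memory_bank, masklet = [], []
for frame in video:             # streaming: frames processed one at a time
    feat = memory_attention(encode_frame(frame), memory_bank)
    masklet.append(decode_mask(feat))
    memory_bank.append(feat)    # store this frame's object memory

Attending to a compact memory bank rather than re-running over every past frame is what keeps per-frame cost roughly constant, which matters for the efficiency challenge on long videos noted above.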