| 
      Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding
 
		1. Introduction
	
		Pretrained backbones with fine-tuning have been widely applied to various 2D vision and NLP tasks [13, 2, 10, 3], where a backbone network pretrained on a large dataset is followed by a task-specific back-end and then fine-tuned for different downstream tasks. This approach delivers superior performance and greatly reduces the workload of network design and training, as well as the amount of labeled data required for different vision tasks.
	
		* Interns at Microsoft Research Asia. † Contact person.
	
		In this work, we present a pretrained 3D backbone, named SWIN3D, for 3D indoor scene understanding tasks. Our method represents the 3D point cloud of an input 3D scene as sparse voxels in 3D space and adapts the Swin Transformer [30], designed for regular 2D images, to unorganized 3D points as the 3D backbone. We analyze the key issues that prevent the naïve 3D extension of Swin Transformer from exploring large models and achieving high performance, namely its high memory complexity and its ignorance of signal irregularity. Based on our analysis, we develop a novel 3D self-attention operator to compute the self-attention of sparse voxels within each local window. This operator reduces the memory cost of self-attention from quadratic to linear with respect to the number of sparse voxels within a window, computes efficiently, and enhances self-attention by capturing various signal irregularities through our generalized contextual relative positional embedding [48, 26].
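
		To make the data layout described above more concrete, the following is a minimal sketch, not the SWIN3D implementation. It assumes a point cloud given as a list of (x, y, z) tuples; the function names `voxelize`, `group_into_windows`, and `window_attention`, and the parameters `voxel_size` and `window_size`, are hypothetical and chosen only for illustration. The attention function is plain softmax self-attention with no learned projections and no positional embedding. It is evaluated one query row at a time only to show that the full N×N score matrix need not be stored; this is a generic trick and is not claimed to be the paper's memory-efficient operator.

```python
import math
from collections import defaultdict


def voxelize(points, voxel_size):
    """Bucket 3D points into integer voxel cells; only occupied cells are stored."""
    voxels = defaultdict(list)
    for x, y, z in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        voxels[key].append((x, y, z))
    return dict(voxels)


def group_into_windows(voxel_coords, window_size):
    """Assign each occupied voxel to the non-overlapping cubic window containing it."""
    windows = defaultdict(list)
    for vx, vy, vz in voxel_coords:
        key = (vx // window_size, vy // window_size, vz // window_size)
        windows[key].append((vx, vy, vz))
    return dict(windows)


def window_attention(features):
    """Softmax self-attention over the voxels of one window (illustrative only).

    `features` is a list of equal-length float lists that serve as query, key,
    and value alike. Scores are computed for one query at a time, so only a
    single row of N scores is held in memory at once.
    """
    dim = len(features[0])
    outputs = []
    for query in features:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in features
        ]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, features)) / total
            for d in range(dim)
        ])
    return outputs


# Example: four points, 0.5-unit voxels, windows of 4 voxels per side.
points = [(0.1, 0.1, 0.1), (0.2, 0.1, 0.1), (1.1, 0.1, 0.1), (2.6, 0.1, 0.1)]
voxels = voxelize(points, voxel_size=0.5)
windows = group_into_windows(voxels.keys(), window_size=4)
```

		Because only occupied voxels are stored, each window holds a variable number of voxels, which is one reason a dense 2D-style window layout does not carry over directly to 3D point data.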
	
		The novel design of our SWIN3D backbone enables us to scale up both the backbone model and the amount of data used for pretraining. To this end, we pretrained a large SWIN3D model with 60M parameters via a 3D semantic segmentation task on a synthetic 3D indoor scene dataset [60] that includes 21K rooms and is about ten times larger than the ScanNet dataset. After pretraining, we cascade the pretrained SWIN3D backbone with task-specific back-end decoders and fine-tune the models for various downstream 3D indoor scene understanding tasks.
	 | 
    




    


    