
Transformer original paper: Attention Is All You Need (PDF download)


Date: 2025-12-01 09:54  Source: http://www.java1234.com  Author: reposted


 
 
Related screenshots:
 
Main content:


1 Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation [35, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [38, 24, 15].
Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states h_t, as a function of the previous hidden state h_{t-1} and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [21] and conditional computation [32], while also improving model performance in case of the latter. The fundamental constraint of sequential computation, however, remains.
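
As a rough sketch (not taken from the paper itself), the sequential dependency can be written as a plain recurrent update in NumPy. Each hidden state depends on the previous one, so the loop over time steps cannot be parallelized within a single training example; the function name rnn_forward and the weight names W_h, W_x, b are illustrative placeholders:

import numpy as np

def rnn_forward(x, W_h, W_x, b):
    """x: (seq_len, d_in); W_h: (d_h, d_h); W_x: (d_h, d_in); b: (d_h,).
    Returns the hidden states, shape (seq_len, d_h)."""
    seq_len, _ = x.shape
    h = np.zeros(W_h.shape[0])
    states = []
    for t in range(seq_len):                     # strictly sequential: h_t needs h_{t-1}
        h = np.tanh(W_h @ h + W_x @ x[t] + b)
        states.append(h)
    return np.stack(states)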
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 19]. In all but a few cases [27], however, such attention mechanisms are used in conjunction with a recurrent network.
In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.

 

2 Background
The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in section 3.2.
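
A minimal sketch of scaled dot-product self-attention (the paper defines it formally later, in section 3.2) shows why relating any two positions costs a constant number of operations: a single matrix product compares every position with every other. The projection matrices W_q, W_k, W_v below are illustrative placeholders, not the paper's trained parameters:

import numpy as np

def self_attention(x, W_q, W_k, W_v):
    """x: (n, d_model); returns one attention output per position, shape (n, d_v)."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v          # queries, keys, values for all positions
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # all position pairs in one matmul
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                           # attention-weighted average of values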
Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 27, 28, 22].
End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [34].
To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [17, 18] and [9].

 

3 Model Architecture
Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the encoder maps an input sequence of symbol representations (x_1, ..., x_n) to a sequence of continuous representations z = (z_1, ..., z_n). Given z, the decoder then generates an output sequence (y_1, ..., y_m) of symbols one element at a time. At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input when generating the next.
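
A schematic sketch of this auto-regressive generation loop, with hypothetical encode and decode_step functions standing in for the encoder and a single decoder step (bos_id and eos_id are assumed start- and end-of-sequence token ids, not identifiers from the paper):

def generate(encode, decode_step, src_ids, bos_id, eos_id, max_len=100):
    """Greedy auto-regressive decoding: each new symbol is produced while
    conditioning on all previously generated symbols."""
    z = encode(src_ids)                  # continuous representations (z_1, ..., z_n)
    ys = [bos_id]
    for _ in range(max_len):
        next_id = decode_step(z, ys)     # decoder sees the source encoding and ys so far
        ys.append(next_id)
        if next_id == eos_id:
            break
    return ys[1:]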

 



 


