Self-Attention with Relative Position Representations (19.07.09)

The paper I read this time is Self-Attention with Relative Position Representations.

Abstract

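Unlike RNNs and CNNs, the original Transformer (Attention Is All You Need, Vaswani et al., 2017) does not explicitly model relative or absolute position information in its structure. Instead, it requires representations of absolute positions to be added to its inputs. This paper presents an alternative approach that extends the self-attention mechanism to efficiently consider representations of the relative positions, or distances, between sequence elements.

On the WMT 2014 English-to-German and English-to-French translation tasks, this approach improves over absolute position representations by 1.3 BLEU and 0.3 BLEU, respectively. Notably, the authors report that combining relative and absolute position representations yields no further improvement in translation quality.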

Introduction

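Recurrent neural networks typically compute the hidden state h_t as a function of the input at time t and the previous hidden state h_(t-1), so their sequential structure captures relative and absolute positions directly along the time dimension. Non-recurrent models, however, do not necessarily consider input elements sequentially, and may therefore require position information to be encoded explicitly before they can make use of sequence order.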
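To make the order dependence concrete, here is a minimal sketch of the recurrence. The scalar tanh cell and the weights w_x and w_h are illustrative assumptions, not anything taken from the paper.

```python
import math

def rnn_final_state(inputs, w_x=0.5, w_h=0.8):
    """Run h_t = tanh(w_x * x_t + w_h * h_(t-1)) over a scalar sequence.

    The tanh cell and the weight values are assumptions for illustration.
    """
    h = 0.0  # initial hidden state h_0
    for x_t in inputs:
        h = math.tanh(w_x * x_t + w_h * h)
    return h

# The same two inputs in a different order give a different final state,
# so the recurrence itself carries position information.
forward = rnn_final_state([1.0, 0.0])
backward = rnn_final_state([0.0, 1.0])
print(forward, backward)
```

Because each step feeds on the previous hidden state, no separate position signal is needed in this kind of model.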

One common approach is to use positional encodings, combined with the input elements, to expose position information to the model.

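For the Transformer, incorporating an explicit representation of position information is an especially important consideration, because the model is otherwise entirely invariant to the order of its input. Attention-based models have therefore used position encodings or attention weights biased by distance (Parikh et al., 2016).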
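The order invariance mentioned above can be checked with a small sketch. It assumes single-head scaled dot-product self-attention with identity query, key, and value projections, which is a simplification chosen for brevity and not the full Transformer layer.

```python
import math

def self_attention(xs):
    """Scaled dot-product self-attention over a list of vectors.

    Identity Q/K/V projections are an assumption made to keep the sketch short.
    """
    d = len(xs[0])
    outputs = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        m = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[i] for w, v in zip(weights, xs)) for i in range(d)])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]
shuffled = [tokens[p] for p in perm]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# Shuffling the input only shuffles the output rows in the same way:
# each token gets the same vector regardless of where it sits in the sequence.
for i, p in enumerate(perm):
    print(out_shuffled[i], out[p])
```

Each token's output is unchanged by the reordering, which is why position has to be injected separately.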

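⇒ This paper presents an efficient way to incorporate relative position representations into the Transformer's self-attention mechanism. Even when the absolute position encodings are replaced entirely, it shows significant improvements in translation quality on two machine translation tasks.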

Background

Transformer

Position encodings based on sinusoids of varying frequency are added to the encoder and decoder input elements before the first layer. Vaswani et al. hypothesized that, in contrast to learned absolute position representations, sinusoidal position encodings would let the model learn relative position information. They also expected this to help the model work well on sequences of lengths not seen during training.

The original Transformer paper (Attention Is All You Need, Vaswani et al., 2017) puts it as follows.
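A minimal sketch of these sinusoidal encodings follows, using the formula from Vaswani et al. (2017): PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The function names and the plain-list representation are assumptions made for illustration.

```python
import math

def sinusoidal_encoding(pos, d_model):
    """Return the d_model-dimensional sinusoidal encoding for position pos."""
    pe = []
    for i in range(0, d_model, 2):  # i runs over the even dimensions 0, 2, 4, ...
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))  # even dimension
        pe.append(math.cos(angle))  # odd dimension
    return pe[:d_model]

def add_position_encodings(embeddings):
    """Add the encoding for each position to the embedding at that position."""
    d_model = len(embeddings[0])
    return [
        [x + p for x, p in zip(vec, sinusoidal_encoding(pos, d_model))]
        for pos, vec in enumerate(embeddings)
    ]

# Hypothetical zero embeddings for a two-token input, so the printed rows
# are just the position encodings themselves.
encoded = add_position_encodings([[0.0] * 4, [0.0] * 4])
print(encoded)
```

The encoding depends only on the position index and the dimension, so it can be computed for any sequence length without learned parameters.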

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. ... We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE_(pos+k) can be represented as a linear function of PE_pos.

This property is shared by the relative position representations in this paper, which, in contrast to absolute position representations, are invariant to the total sequence length.
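The linear-function claim in the passage quoted from Vaswani et al. can be checked numerically. For each frequency w, the angle-addition identities give sin(w(pos+k)) = cos(wk) sin(w pos) + sin(wk) cos(w pos) and cos(w(pos+k)) = -sin(wk) sin(w pos) + cos(wk) cos(w pos), so each (sin, cos) pair is rotated by a matrix that depends only on the offset k. The sketch below assumes an even d_model, and the helper names are hypothetical.

```python
import math

def sinusoidal_encoding(pos, d_model):
    """Sinusoidal encoding from Vaswani et al. (2017); d_model assumed even."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe

def shift_encoding(pe, k, d_model):
    """Map PE_pos to PE_(pos+k) with a linear map that depends only on k."""
    shifted = []
    for i in range(0, d_model, 2):
        w = 1.0 / (10000 ** (i / d_model))  # frequency of this (sin, cos) pair
        s, c = pe[i], pe[i + 1]
        shifted.append(math.cos(w * k) * s + math.sin(w * k) * c)
        shifted.append(-math.sin(w * k) * s + math.cos(w * k) * c)
    return shifted

d_model, pos, k = 8, 7, 3
direct = sinusoidal_encoding(pos + k, d_model)
via_linear_map = shift_encoding(sinusoidal_encoding(pos, d_model), k, d_model)

# The two vectors should agree up to floating-point rounding.
max_gap = max(abs(a - b) for a, b in zip(direct, via_linear_map))
print(max_gap)
```

Note that shift_encoding never uses pos, only k, which is the sense in which the encoding makes relative offsets easy to express.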