LittleBird - Efficient Faster & Longer Transformer for Question Answering

Last updated Dec 23, 2022

Prerequisite concepts

  • sliding window attention from BigBird (Zaheer et al., 2020)
  • linear bias to attention from ALiBi (Press et al., 2021)
  • pack and unpack attention from LUNA (Ma et al., 2021)
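
Two of these ideas are easy to sketch in isolation. Below is a toy PyTorch illustration (my own sketch, not LittleBird's code) combining a sliding-window attention mask with an ALiBi-style linear bias; the symmetric |i − j| distance is just one simple bidirectional variant, not necessarily the exact BiALiBi formulation LittleBird proposes.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask that is True where attention is allowed: each token only
    sees neighbours within `window` positions on either side (the local
    attention pattern popularised by BigBird/Longformer)."""
    pos = torch.arange(seq_len)
    return (pos[None, :] - pos[:, None]).abs() <= window

def alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    """ALiBi-style linear bias: a fixed, head-specific slope times the
    query-key distance, added to the attention scores instead of using
    position embeddings. Slopes follow ALiBi's geometric sequence for
    power-of-two head counts; the symmetric |i - j| distance makes the
    bias usable in an encoder (plain ALiBi is causal)."""
    ratio = 2 ** (-8 / num_heads)
    slopes = torch.tensor([ratio ** (i + 1) for i in range(num_heads)])
    dist = (torch.arange(seq_len)[None, :] - torch.arange(seq_len)[:, None]).abs()
    return -slopes[:, None, None] * dist[None, :, :]

seq_len, heads, dim, window = 128, 8, 16, 4
q = k = torch.randn(heads, seq_len, dim)
scores = q @ k.transpose(-2, -1) / dim ** 0.5          # (heads, seq, seq)
scores = scores + alibi_bias(seq_len, heads)           # penalise distant pairs
scores = scores.masked_fill(~sliding_window_mask(seq_len, window), float("-inf"))
attn = scores.softmax(dim=-1)                          # local, distance-aware attention
```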

BERT has shown a lot of success in a wide variety of NLP tasks, but it struggles with long inputs because its full self-attention scales quadratically with sequence length.

(…) LittleBird, a novel model based on BigBird with improved speed and memory footprint while maintaining accuracy.

# Pretraining objectives for Question Answering

However, Masked LM (MLM) is suboptimal for the extractive QA task. Joshi et al. (2020) proposed SpanBERT, which is pretrained with a span-level masking scheme whose span lengths follow a geometric distribution; it outperformed BERT with MLM on most tasks, especially extractive QA. They showed that a training objective that predicts spans rather than individual tokens produces better representations, especially for span-selection tasks.
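
As a rough illustration of what span-level masking means, here is a toy sampler in which span lengths are drawn from a geometric distribution. The 15% mask ratio, p = 0.2 and length cap of 10 mirror the values reported for SpanBERT, but the placement loop itself is a simplification, not the paper's actual implementation.

```python
import numpy as np

def sample_span_mask(seq_len, mask_ratio=0.15, p=0.2, max_span=10, rng=None):
    """Toy span-level masking: span lengths are drawn from Geometric(p) and
    clipped at max_span, and spans are placed until roughly mask_ratio of the
    sequence is covered. Defaults mirror the values reported for SpanBERT."""
    if rng is None:
        rng = np.random.default_rng()
    mask = np.zeros(seq_len, dtype=bool)
    budget = int(seq_len * mask_ratio)
    while mask.sum() < budget:
        length = min(rng.geometric(p), max_span, seq_len)
        start = rng.integers(0, seq_len - length + 1)
        mask[start:start + length] = True
    return mask

print(sample_span_mask(64).astype(int))   # 1s mark masked (span) positions
```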

# LittleBird Architecture

# Bidirectional ALiBi

# Sliding window attention & Pack unpack attention

# Conclusion and Limitations

Note

Causal masking is a technique that ensures the decoder can only rely on tokens earlier in the sequence when making its prediction. LittleBird, however, uses a method called ‘pack and unpack attention’, which is incompatible with causal masking, so LittleBird cannot generate predictions one token at a time (i.e., it cannot be used as an autoregressive language model). A typical example of an autoregressive decoder is the one in a machine translation system, where the input text is encoded into an intermediate representation and the output is then generated one token at a time.
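
For reference, here is a generic toy illustration (PyTorch, unrelated to LittleBird's actual code) of what causal masking does: positions above the diagonal are set to −∞ before the softmax, so each query can only attend to keys at the same or earlier positions.

```python
import torch

def causal_attention(q, k):
    """Generic causal masking: future positions (above the diagonal) are
    filled with -inf before the softmax, so each query can only attend to
    keys at the same or earlier positions."""
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    seq_len = scores.size(-1)
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    return scores.masked_fill(future, float("-inf")).softmax(dim=-1)

q = k = torch.randn(1, 8, 16)   # (batch, seq_len, head_dim)
attn = causal_attention(q, k)
print(attn[0])                  # lower-triangular weights: zeros above the diagonal
```

Roughly speaking, pack and unpack attention first summarizes the whole sequence into a small ‘pack’ that every position then attends to, so information from later tokens can leak into earlier positions and a triangular mask like the one above can no longer guarantee causality.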