In a transformer architecture, there are several sources of efficiency loss that can affect both training and inference. The key ones are:
Self-Attention Complexity: The self-attention mechanism in transformers allows the model to capture relationships between all positions in the input sequence. However, because every token attends to every other token, it has O(n²) time and memory complexity in the sequence length n, so the computational cost grows quadratically as sequences get longer, making it less efficient for long inputs.
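To see where the quadratic term comes from, here is a minimal sketch of scaled dot-product attention in NumPy (the shapes and names are illustrative, not from any particular library): the score matrix alone is n × n, so doubling the sequence length quadruples its cost.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention. The scores matrix is (n, n),
    so time and memory grow quadratically with sequence length n."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) -- the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (n, d_v)

n, d = 1024, 64
Q = K = V = np.random.randn(n, d)
out = attention(Q, K, V)  # the (1024, 1024) score matrix dominates the cost
```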
Computation Time: Transformers generally require a large number of computations, especially in deep models with many layers. This can result in longer training and inference times than lighter architectures such as recurrent neural networks (RNNs) or convolutional neural networks (CNNs), particularly during autoregressive inference, where output tokens must be generated one at a time.
Memory Requirements: Transformers often demand large amounts of memory because intermediate activations must be stored during the forward pass for use in the backward pass. This is a challenge on GPUs with limited memory capacity, capping the batch size or model size that fits on a single device.
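A common way to trade compute for memory is activation (gradient) checkpointing, which recomputes activations during the backward pass instead of storing them. A minimal sketch, assuming PyTorch's torch.utils.checkpoint:

```python
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8)
x = torch.randn(128, 4, 512, requires_grad=True)  # (seq, batch, d_model)

# Without checkpointing, every intermediate activation inside the layer
# is kept alive for the backward pass. With it, they are recomputed on
# the fly during backward, cutting peak memory at the cost of extra compute.
y = checkpoint(layer, x, use_reentrant=False)
y.sum().backward()
```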
Parameter Size: Transformers can have a vast number of parameters, especially in large models like GPT-3 (roughly 175 billion). This raises memory requirements during training and makes the models harder to deploy on devices with limited resources.
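A back-of-the-envelope count shows how quickly parameters accumulate. This sketch ignores biases and layer norms, and the GPT-3-scale settings below are approximate published values:

```python
def transformer_params(n_layers, d_model, d_ff, vocab_size):
    """Approximate parameter count for a standard transformer stack,
    ignoring biases and layer-norm parameters."""
    attn = 4 * d_model * d_model   # Q, K, V and output projections
    ffn = 2 * d_model * d_ff       # two feed-forward weight matrices
    embed = vocab_size * d_model   # token embedding table
    return n_layers * (attn + ffn) + embed

# Roughly GPT-3-scale settings (approximate published values):
print(transformer_params(n_layers=96, d_model=12288, d_ff=4 * 12288,
                         vocab_size=50257))  # ~1.75e11 parameters
```

At 4 bytes per parameter in float32, those weights alone occupy on the order of 700 GB before any optimizer state is counted.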
Attention Masking Overhead: When processing sequences of varying lengths, padding is typically used to make all sequences in a batch the same length. The attention mechanism must then mask out the padded tokens, and compute is still spent on positions that contribute nothing to the output.
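Concretely, the full n × n score matrix is computed and only then masked: padded key positions get a score of -inf so the softmax assigns them zero weight. A small NumPy sketch (names and shapes are illustrative):

```python
import numpy as np

def masked_scores(scores, lengths):
    """Set attention scores for padded key positions to -inf so that
    softmax gives them zero weight. Note the (n, n) score matrix is
    still computed in full -- padding is masked out, not skipped."""
    batch, n, _ = scores.shape
    key_pos = np.arange(n)
    pad = key_pos[None, None, :] >= np.asarray(lengths)[:, None, None]
    return np.where(pad, -np.inf, scores)

scores = np.random.randn(2, 5, 5)            # two sequences padded to length 5
out = masked_scores(scores, lengths=[3, 5])  # first sequence has 2 pad tokens
```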
Warmup Steps: During training, a warmup phase is often used in which the learning rate is gradually increased to stabilize optimization. These extra steps add to the overall training time.
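The schedule from the original transformer paper ("Attention Is All You Need") illustrates this: the learning rate rises linearly for a fixed number of warmup steps, then decays with the inverse square root of the step count.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Schedule from the original transformer paper: linear warmup
    for `warmup_steps` steps, then inverse square-root decay."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1, 2000, 4000, 20000):
    print(s, transformer_lr(s))  # rises until step 4000, then decays
```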
Communication Overhead: For distributed training of large transformers across multiple GPUs or devices, the communication needed to synchronize gradients and activations between devices can be a significant source of efficiency loss.
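In the simplest data-parallel setup, every worker all-reduces every gradient tensor each step, so communication volume scales with the parameter count rather than the batch size. A minimal sketch, assuming PyTorch's torch.distributed with a process group already initialized (production wrappers such as DistributedDataParallel additionally overlap this communication with the backward pass):

```python
import torch
import torch.distributed as dist

def sync_gradients(model):
    """Naive data-parallel gradient sync: each parameter's gradient is
    all-reduced across workers every step, so communication volume
    scales with the total parameter count."""
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size  # average across workers
```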
Researchers and engineers have been continuously working to address these efficiency concerns. Techniques such as sparsity, quantization, knowledge distillation, and model pruning reduce computational and memory requirements while preserving model performance. In addition, transformer variants like Longformer, BigBird, and Reformer replace full self-attention with sparse or hash-based approximations to tackle the quadratic attention complexity, making transformers more efficient for longer sequences.
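As one concrete instance of these techniques, post-training dynamic quantization stores linear-layer weights in int8 and dequantizes them on the fly, shrinking the model roughly 4x, often with only modest accuracy loss. A sketch assuming PyTorch's torch.ao.quantization:

```python
import torch

model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=512, nhead=8), num_layers=6)

# Weights of every Linear layer are stored in int8 and dequantized
# on the fly at inference time; activations stay in floating point.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)
```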