Fill in the Middle: Improve the Infilling Capability of Large Language Models
This article introduces a simple and effective method for giving language models infilling capability.
Why current LLMs have limited infilling capability
Currently, the most powerful generative language models, such as GPT-4[1], PaLM[2], and Llama 2[3], are causal decoder-based models. These models have simpler architectures and usually do not require fine-tuning for specific tasks, making them more appealing in terms of inference and deployment.
However, all left-to-right models are limited when it comes to infilling: generating text at a specific location within a prompt while conditioning on both a prefix and a suffix.
This is because a left-to-right model can only condition on the prefix. That is unfortunate, because in many generation applications context naturally occurs both before and after the point of generation. For example, in a coding assistant, infilling can be used to generate docstrings, import statements, or to complete partially written functions. Encoder-only and encoder-decoder models can condition on the suffix, but the masked spans they see during training are typically much shorter than the regions that need to be filled in real applications.
Solution
The solution presented in the paper “Efficient Training of Language Models to Fill in the Middle”[4] is to add a capability called fill-in-the-middle (FIM) to the current state-of-the-art large language models, specifically causal decoder-based language models, thereby addressing their limited infilling ability.
The FIM method achieves improved infilling capability in causal decoder-based autoregressive language models by making simple modifications to the training data without altering the model architecture. This enhancement is achieved without weakening the left-to-right generation ability.
Since its introduction, the FIM method has been adopted by many models such as SantaCoder[5], StarCoder[6], and Code Llama[7].
Mechanism of Fill-in-the-Middle (FIM)
The key to this approach is a transformation applied to a fraction of the dataset, in which documents are split into three pieces at random and the middle piece is moved to the end:
document → (prefix, middle, suffix) → (prefix, suffix, middle)
We then concatenate the three pieces using sentinel tokens.
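To make this concrete, here is a minimal Python sketch of the data transformation, not the paper’s actual preprocessing code. It splits at the character level for simplicity and uses illustrative sentinel strings <PRE>, <SUF>, <MID>; in practice these are dedicated special tokens added to the vocabulary, and the fraction of documents transformed (the FIM rate shown here as 0.5) is a training hyperparameter.

```python
import random

# Illustrative sentinel strings; in a real setup these are dedicated special
# tokens added to the tokenizer's vocabulary rather than plain text.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def fim_transform(document: str, fim_rate: float = 0.5) -> str:
    """With probability fim_rate, split a document into (prefix, middle, suffix)
    at two random positions and reorder it as prefix-suffix-middle (PSM)."""
    if random.random() > fim_rate:
        # Leave the remaining documents as ordinary left-to-right text,
        # which helps preserve the model's normal generation ability.
        return document

    # Pick two random cut points to form the three pieces.
    i, j = sorted(random.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]

    # Concatenate with sentinels. Training stays purely left-to-right:
    # the model learns to generate `middle` conditioned on both prefix and suffix.
    return PRE + prefix + SUF + suffix + MID + middle
```

At inference time the prompt is built the same way up to and including the <MID> sentinel, so whatever the model generates next is treated as the infilled middle.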
FIM achieves its goal simply by introducing special sentinel tokens and restructuring the training samples, then training the model autoregressively from left to right as usual. Compared with designing multiple training tasks to give the model different capabilities, this approach is far more lightweight.
The FIM mechanism comes in two variants: PSM, which uses a prefix-suffix-middle ordering, and SPM, which uses a suffix-prefix-middle ordering.
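The two variants differ only in how the pieces are ordered in the training example. Below is a simplified sketch reusing the illustrative sentinel strings from above; note that the paper’s actual SPM sentinel layout differs slightly, for reasons related to prompt caching at inference time.

```python
# Same illustrative sentinel strings as in the sketch above.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def format_psm(prefix: str, suffix: str, middle: str) -> str:
    # PSM: prefix, then suffix, then the middle the model must predict.
    return PRE + prefix + SUF + suffix + MID + middle

def format_spm(prefix: str, suffix: str, middle: str) -> str:
    # SPM: suffix first, then prefix, then the middle the model must predict.
    # The sentinel placement shown here is a simplification of the paper's
    # SPM layout, which is arranged to work better with key-value caching.
    return SUF + suffix + PRE + prefix + MID + middle
```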
Evaluation
To evaluate the performance of FIM, the paper conducts numerous experiments. We won’t list them all here; instead, we present a few intuitive, successful infilling examples.
Based on the context before and after the code snippet, the intermediate content (highlighted in green) was successfully generated, as shown in Figure 1:
Similarly, based on the natural-language context before and after the gap, the intermediate content (highlighted in green) was successfully generated, as shown in Figure 2:
Conclusion
This article introduced a simple and effective method for giving models infilling capability. The method only changes the format of the training data, without altering the model architecture, and does not compromise the original left-to-right generation ability. Many large models have since adopted it.
References
[1]: OpenAI. GPT-4 Technical Report (2023). arXiv preprint arXiv:2303.08774.
[2]: A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, H. Chung, C. Sutton, S. Gehrmann, P. Schuh, et al. PaLM: Scaling language modeling with Pathways (2022). arXiv preprint arXiv:2204.02311.
[3]: H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models (2023). arXiv preprint arXiv:2307.09288.
[4]: M. Bavarian, H. Jun, N. Tezak, J. Schulman, C. McLeavey, J. Tworek, M. Chen. Efficient Training of Language Models to Fill in the Middle (2022). arXiv preprint arXiv:2207.14255.
[5]: L. Allal, R. Li, D. Kocetkov, C. Mou, C. Akiki, et al. SantaCoder: don’t reach for the stars (2023). arXiv preprint arXiv:2301.03988.
[6]: R. Li, L. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, et al. StarCoder: may the source be with you! (2023). arXiv preprint arXiv:2305.06161.
[7]: B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, et al. Code Llama: Open Foundation Models for Code (2023). arXiv preprint arXiv:2308.12950.