Paper #2
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018)
AI-generated
BERT showed that pre-training a Transformer to understand language bidirectionally (looking at words both before and after a given position) creates a versatile model that can be fine-tuned for a wide range of language tasks with minimal task-specific modification.
BERT (Bidirectional Encoder Representations from Transformers) is pre-trained on two tasks: masked language modeling (predicting a randomly selected 15% of the input tokens that have been hidden or corrupted) and next sentence prediction (determining whether two sentences actually follow each other in the corpus). This bidirectional approach means BERT conditions on context from both directions simultaneously, unlike earlier models such as GPT-1 that only read left to right.
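The masking procedure from the paper can be sketched in a few lines. Of the 15% of tokens selected for prediction, 80% are replaced with a [MASK] token, 10% with a random token, and 10% are left unchanged (so the model cannot rely on [MASK] always marking the prediction targets). The toy vocabulary and token strings below are illustrative stand-ins, not the paper's WordPiece vocabulary:

```python
import random

MASK = "[MASK]"
VOCAB = ["cat", "dog", "tree", "run", "blue"]  # toy vocabulary, for illustration only

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """BERT-style masking: select ~15% of tokens as prediction targets;
    of those, 80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    rng = rng or random.Random(0)
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok                    # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                masked[i] = MASK               # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = rng.choice(VOCAB)  # 10%: replace with a random token
            # else (10%): keep the original token unchanged
    return masked, labels

masked, labels = mask_tokens(["the", "dog", "ran", "up", "the", "tree"])
```

During pre-training, the loss is computed only at positions where `labels` is set; all other positions are ignored.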
After pre-training on a large corpus, BERT can be fine-tuned on specific tasks like question answering, sentiment analysis, or named entity recognition by adding a simple output layer.
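The "simple output layer" is typically just a linear layer (plus softmax) applied to the final hidden state of the [CLS] token for classification tasks. A minimal sketch of that head, using a made-up 4-dimensional [CLS] vector in place of BERT's real 768-dimensional output (all values here are hypothetical):

```python
import math
import random

def classify(cls_vector, weights, bias):
    """Task-specific head: one linear layer + softmax over the
    final hidden state of the [CLS] token."""
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(weights, bias)]
    m = max(logits)                              # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]             # class probabilities

# Toy setup: hidden size 4 (BERT-Base uses 768), two sentiment classes.
rng = random.Random(0)
hidden, num_classes = 4, 2
cls = [rng.uniform(-1, 1) for _ in range(hidden)]  # stand-in for BERT's [CLS] output
W = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(num_classes)]
b = [0.0] * num_classes
probs = classify(cls, W, b)
```

During fine-tuning, both the head's parameters and all of BERT's pre-trained weights are updated end to end on the downstream task's labels.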
BERT demonstrated the power of the "pre-train then fine-tune" paradigm that dominates modern AI. It showed that a single pre-trained model could achieve state-of-the-art results across 11 different language understanding benchmarks, often by a significant margin. This was a pivotal moment: instead of building custom architectures for each task, you could train one general model and adapt it.
BERT's influence extends beyond its own architecture. The pre-training approach it popularized is now standard practice for modern foundation models.
Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova (Google AI Language).
Model sizes: BERT-Base (110M parameters) and BERT-Large (340M parameters).
Link to paper: https://arxiv.org/abs/1810.04805
The Illustrated BERT (Jay Alammar) - https://jalammar.github.io/illustrated-bert/
Google AI Blog: "Open Sourcing BERT" - https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html