Explore BERT, including an overview of how this language model is used, how it works, and how it's trained.
BERT is a deep learning language model designed to improve the efficiency of natural language processing (NLP) tasks.
Google researchers introduced the BERT model in a 2018 paper titled “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” [1].
The BERT model relies on bidirectional pretraining, which helps the model better understand the relationships between words by analyzing both preceding and following words in a sentence.
You can use the BERT model for biomedical text mining, downstream tasks in scientific domains, patent classification, and financial sentiment analysis.
Discover what was so revolutionary about the emergence of the BERT model in helping machines understand and generate human language, as well as the architecture, use cases, and training methods of BERT. If you’re ready to start building expertise in training AI models to understand human language, enroll in the Natural Language Processing Specialization from DeepLearning.AI. You’ll have the opportunity to gain experience with sentiment analysis, text generation, and dynamic programming, as well as technologies like LSTMs, recurrent neural networks, and Markov models, in as little as three months. Upon completion, you’ll have earned a career certificate for your resume.
BERT (Bidirectional Encoder Representations from Transformers) is a deep learning language model designed to improve the efficiency of natural language processing (NLP) tasks. It is famous for its ability to consider context by analyzing the relationships between words in a sentence bidirectionally. It was introduced by Google researchers in a 2018 paper titled “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” [1]. Since then, the BERT model has been fine-tuned for use in a variety of fields, including biology, data science, and medicine.
Read more: How Does Natural Language Processing Work?
Large language model (LLM) is a broad term for large-scale models designed for NLP tasks. BERT is one example of an LLM; OpenAI’s GPT models are another notable example.
Two initial BERT model sizes, BERT-Large and BERT-Base, were compared in Google’s 2018 paper [1].
BERT-Base was built with the same model size as OpenAI’s GPT so that the two could be compared directly. Both were trained on an enormous data set of more than three billion words drawn from English Wikipedia and the BooksCorpus [1]. Training at this scale can be time-consuming, but 64 of Google’s custom-built tensor processing unit (TPU) chips trained BERT-Large in just four days [1]. BERT’s pretraining differs from that of other language models (LMs) because it is bidirectional, meaning the model processes data both forward and backward.
Bidirectional pretraining helps the model better understand the relationships between words by analyzing both the preceding and following words in a sentence. This type of bidirectional pretraining relies on masked language models (MLMs). MLMs facilitate bidirectional learning by masking a word in a sentence and forcing BERT to infer it from the context to the left and right of the hidden word.
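The masking step can be illustrated with a short sketch. This is a simplified toy, not the actual pretraining code: it masks whole words at the paper’s 15 percent rate, while real BERT operates on subword tokens and, of the selected tokens, replaces 80 percent with [MASK], swaps 10 percent for random tokens, and leaves 10 percent unchanged [1].

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=1):
    """Toy version of BERT's masked-language-model objective:
    hide a fraction of tokens and record the targets the model
    must predict from the context on both sides."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = tok   # the label BERT must recover
        else:
            masked.append(tok)
    return masked, targets

sentence = "the cat sat on the mat because it was tired".split()
masked, targets = mask_tokens(sentence)
```

During pretraining, the model sees only the masked sequence and is scored on how well it recovers the hidden targets using context from both directions at once.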
Hear more about transformer architecture and the BERT model in this course from Google Cloud:
BERT stands for Bidirectional Encoder Representations from Transformers. We’ve already discussed how bidirectional pretraining with MLMs enables BERT to function, so let’s cover the remaining letters in the acronym to get a better understanding of its architecture.
Encoder Representations: Encoders are neural network components that translate input text into numerical representations that machine learning algorithms can process. As the encoder reads input text, it generates a hidden state vector for each token: a list of numbers that encodes the token together with its surrounding context. This packaged representation of information is then passed through the rest of the transformer.
Transformer: The transformer uses the representations above to infer patterns and make predictions. A transformer is a deep learning architecture that transforms an input sequence into an output sequence. Most modern NLP applications build on transformers; if you’ve ever used ChatGPT, you’ve seen transformer architecture in action. A transformer typically consists of an encoder and a decoder, but BERT uses only the encoder half.
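The mechanism that makes the encoder bidirectional is self-attention, which lets every token weigh every other token in the sentence at once. Below is a minimal NumPy sketch of single-head scaled dot-product attention; the dimensions and random weight matrices are illustrative stand-ins for learned parameters, and real BERT stacks many such layers with multiple heads.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention, the core computation
    inside each transformer encoder layer: every token attends
    to every other token in both directions."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # queries, keys, values
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # token-to-token affinities
    # Row-wise softmax turns affinities into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                   # 5 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, attn = self_attention(X, Wq, Wk, Wv)
```

Each row of the attention matrix sums to one, so every output vector is a context-weighted mixture of all the tokens in the sentence, both before and after it.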
Vision transformer (ViT) models and BERT models share some similar features but have very different outputs. While BERT uses sentences as inputs and outputs for natural language tasks, ViTs use images.
In 2021, Google Research released a paper describing ViT models, which divide images into small patches and encode each patch as a vector representation, much as BERT encodes tokens [2]. A key component of what makes this possible is the same self-attention mechanism used in BERT.
BERT is widely used in AI for language processing and pretraining. For example, it can be used to discern context for better results in search queries. BERT outperforms many other architectures in a variety of token-level and sentence-level NLP tasks:
Token-level task examples: Tokens are semantically meaningful units of text, such as words or subwords, and token-level tasks assign a label to each token. Examples include part-of-speech (POS) tagging and named entity recognition (NER).
Sentence-level task examples: Sentence-level tasks assign a label or score to an entire sentence rather than to each individual token, which spares the model from processing every token’s context separately. Examples include semantic search and sentiment analysis.
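A common sentence-level pattern, as in semantic search, is to compare whole-sentence vectors with cosine similarity. The sketch below keeps that pipeline shape but uses bag-of-words count vectors as a deliberate stand-in for pooled BERT embeddings, which a real system would produce from the encoder’s hidden states.

```python
from collections import Counter
import math

def embed(sentence):
    # Stand-in for a BERT sentence embedding: a bag-of-words
    # count vector. A real system would pool BERT hidden states.
    return Counter(sentence.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Rank candidate documents against a query, most similar first.
query = embed("how do transformers process language")
docs = ["transformers process language bidirectionally",
        "the stock market closed higher today"]
ranked = sorted(docs, key=lambda d: cosine(query, embed(d)), reverse=True)
```

Swapping the toy `embed` function for real sentence embeddings is what turns this from keyword matching into semantic search: two sentences can then score as similar even when they share no words.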
From industry to industry, BERT is being fine-tuned for specific needs. Here are a few examples of specialized pre-trained BERT models:
BioBERT: Used for biomedical text mining, BioBERT is a pre-trained biomedical language representation model.
SciBERT: Similar to BioBERT, this model is pretrained on a wide range of high-quality scientific publications to perform downstream tasks in a variety of scientific domains.
PatentBERT: This BERT model version is used to perform patent classification.
VideoBERT: VideoBERT is a visual-linguistic model used to leverage the abundance of unlabeled data on platforms such as YouTube.
FinBERT: General-purpose models struggle to conduct financial sentiment analysis due to the field's specialized language. This BERT model is pretrained on financial texts to perform NLP tasks in the domain.
BERT is open-source and accessible via GitHub. According to Google, users can train a sophisticated question-answering system within hours on a graphics processing unit (GPU) and within at most one hour on a cloud tensor processing unit (TPU) [1].
Discover fresh insights into your career or learn about trends in your industry by subscribing to our LinkedIn newsletter, Career Chat. Or if you want to keep learning more about AI concepts like LLMs and machine learning, check out these free resources:
Discover learning guidance: How to Start Learning Machine Learning: A Custom Course Guide
Watch on YouTube: How GPT Turns Text Into Answers
Bookmark for later: Artificial Intelligence Glossary: Learn AI Vocabulary
Whether you want to develop a new skill, get comfortable with an in-demand technology, or advance your abilities, keep growing with a Coursera Plus subscription. You’ll get access to over 10,000 flexible courses.
arXiv. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” https://arxiv.org/pdf/1810.04805. Accessed February 27, 2026.
arXiv. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” https://arxiv.org/abs/2010.11929. Accessed February 27, 2026.
SEO Content Manager I
Jessica is a technical writer who specializes in computer science and information technology.
This content has been made available for informational purposes only. Learners are advised to conduct additional research to ensure that courses and other credentials pursued meet their personal, professional, and financial goals.