
Fine-tuning Protein Language Model



Updated: | PURL: gxy.io/GTN:S00135

Plain-text slides

Tip: press P to view the presenter notes | Use arrow keys to move between slides
1 / 20

Presenter notes contain extra information which might be useful if you intend to use these slides for teaching.

Press P again to switch presenter notes off

Press C to create a new window where the same presentation will be displayed. This window is linked to the main window. Changing slides on one will cause the slide to change on the other.

Useful when presenting.

Requirements

Before diving into this slide deck, we recommend that you have a look at:

2 / 20

Questions

  • How to load large protein AI models?

  • How to fine-tune such models on downstream tasks such as post-translational modification (PTM) site prediction?

3 / 20

Objectives

  • Learn to load and use large protein models from HuggingFace

  • Learn to fine-tune them on specific tasks such as predicting dephosphorylation sites

4 / 20

Language Models (LM)

  • Powerful LMs "understand" language like humans
  • LMs are trained to understand and generate human language
  • Popular models: GPT-3, Llama2, Gemini, …
  • Trained on vast datasets with billions of parameters
  • Self-supervised learning
    • Masked language modeling
    • Next word/sentence prediction
  • Can we train such language models on protein/DNA/RNA sequences?
  • LMs for life sciences: DNABERT, ProtBert, ProtT5, CodonBERT, RNA-FM, ESM/ESM-2, BioGPT, …
  • Many are available on HuggingFace
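To make the last bullet concrete, here is a minimal sketch of loading a protein language model from HuggingFace and querying it with masked language modeling. The checkpoint name Rostlab/prot_bert and the space-separated sequence format are assumptions taken from the ProtBert model card, not from these slides.

    # Minimal sketch (assumed checkpoint "Rostlab/prot_bert"): load a protein LM
    # from HuggingFace and ask it to fill in a masked residue.
    from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

    model_name = "Rostlab/prot_bert"
    tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=False)
    model = AutoModelForMaskedLM.from_pretrained(model_name)

    # ProtBert expects residues separated by spaces; [MASK] marks the residue to predict.
    unmasker = pipeline("fill-mask", model=model, tokenizer=tokenizer)
    print(unmasker("M K T A Y [MASK] A K Q R"))  # top amino-acid predictions for the masked position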
5 / 20

Bidirectional Encoder Representations from Transformers (BERT)

BERT

6 / 20

Language Models (LMs)

Rise of language models and their sizes

7 / 20

Protein Language Model (pLM)

  • Models trained on large protein databases
    • Big Fantastic Database (BFD, > 2.4 billion sequences), UniProt, …
  • Popular architectures such as BERT, Albert, T5, T5-XXL …
  • Key challenges in training such models
    • Large number of GPUs needed for training: (Prot)TXL needs > 5000 GPUs
    • Expertise needed in large scale AI training
  • Most labs and researchers don't have access to such resources
  • Solution: fine-tune pre-trained models on downstream tasks such as protein family classification
  • Benefits: requires less data, expertise, training time and compute resources
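As a hedged sketch of the "fine-tune on a downstream task" idea above, the snippet below puts a fresh classification head on top of a pre-trained protein encoder for protein family classification. The checkpoint name and the number of families (10) are illustrative assumptions.

    # Minimal sketch: reuse a pre-trained protein LM for protein family classification.
    # Checkpoint name and num_labels are illustrative assumptions.
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    model_name = "Rostlab/prot_bert"  # assumed pre-trained encoder
    tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=False)

    # The encoder keeps its pre-trained weights; a small, randomly initialised
    # classification head is added, and the model is then fine-tuned on labeled families.
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=10)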
8 / 20

Architecture: Protein Language Model (pLM)

Architecture of protein language model

9 / 20

T-SNE embedding projections

TSNE embeddings of proteins and amino acids

10 / 20

Challenges for downstream tasks

  • Key tasks
    • Fine-tuning
    • Embedding extraction (see the sketch after this list)
  • Training challenges
    • ProtT5 has 1.2 billion parameters
    • Longer training time
    • Full training or fine-tuning needs ~26 GB of memory and cannot fit on a GPU with 15 GB
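As a sketch of the "embedding extraction" task listed above, the snippet below runs the ProtT5 encoder once and averages the per-residue embeddings into a per-protein vector. The encoder-only checkpoint Rostlab/prot_t5_xl_half_uniref50-enc and the preprocessing (mapping rare residues to X) are assumptions taken from the ProtTrans model cards.

    # Minimal sketch: extracting per-residue and per-protein embeddings with ProtT5.
    # Checkpoint name and preprocessing are assumptions from the ProtTrans model cards.
    import re
    import torch
    from transformers import T5Tokenizer, T5EncoderModel

    model_name = "Rostlab/prot_t5_xl_half_uniref50-enc"  # assumed encoder-only checkpoint
    tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
    model = T5EncoderModel.from_pretrained(model_name)
    model.eval()

    sequence = "MKTAYIAKQR"
    spaced = " ".join(re.sub(r"[UZOB]", "X", sequence))  # map rare residues to X
    inputs = tokenizer(spaced, return_tensors="pt")

    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state      # shape (1, seq_len + 1, 1024)

    per_residue = hidden[0, : len(sequence)]            # drop the trailing special token
    per_protein = per_residue.mean(dim=0)               # one 1024-d vector per protein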
11 / 20

Low-rank adaptation (LoRA)

  • Reduces the number of trainable parameters
    • from 1.2 billion to 3.5 million
  • Fine-tuning fits on GPUs with < 15 GB memory
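Below is a minimal sketch of attaching LoRA adapters to a pre-trained protein model with the peft library. The rank, scaling, dropout and target module names are illustrative assumptions, not values prescribed by these slides.

    # Minimal sketch: wrap a pre-trained protein LM with LoRA adapters via peft.
    # Rank, alpha, dropout and target modules are illustrative assumptions.
    from transformers import T5EncoderModel
    from peft import LoraConfig, get_peft_model

    base_model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50")

    lora_config = LoraConfig(
        r=8,                                  # low-rank dimension of the adapter matrices
        lora_alpha=16,                        # scaling factor
        lora_dropout=0.05,
        target_modules=["q", "k", "v", "o"],  # attention projections in T5 blocks
        bias="none",
    )

    model = get_peft_model(base_model, lora_config)
    model.print_trainable_parameters()        # only a small fraction of weights is trainable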

TSNE embeddings of proteins and amino acids

12 / 20

Use case: dephosphorylation (post-translational modification, PTM) site prediction

  • PTM: chemical modification of a protein after synthesis
  • Crucial for biological processes such as protein regulation, gene expression, the cell cycle, …
  • Dephosphorylation
    • Removal of a phosphate group from a molecule
    • Less studied, and the publicly available labeled datasets are small
    • Hard to train a large deep learning model
    • Fine-tuning might improve site classification accuracy
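To make the fine-tuning idea concrete, here is a minimal sketch that frames dephosphorylation site prediction as per-residue (token) classification. The toy sequences, the 0/1 labels (1 = dephosphorylation site) and the training settings are illustrative assumptions rather than the tutorial's actual dataset or hyperparameters.

    # Minimal sketch: fine-tune ProtBert for per-residue dephosphorylation site prediction.
    # Toy data and hyperparameters are illustrative assumptions.
    import torch
    from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                              TrainingArguments, Trainer)

    model_name = "Rostlab/prot_bert"  # assumed checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=False)
    model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=2)

    # Toy example: label each residue (1 = dephosphorylation site, 0 = other).
    sequences = ["M K T A Y I A K Q R", "G S H M S D T A V L"]
    labels = [[0, 0, 0, 0, 1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]]

    class SiteDataset(torch.utils.data.Dataset):
        def __init__(self, seqs, labs):
            self.enc = tokenizer(seqs, padding=True, return_tensors="pt")
            # Pad labels to the tokenized length; -100 is ignored by the loss.
            self.labels = [
                [-100] + lab + [-100] * (self.enc["input_ids"].shape[1] - len(lab) - 1)
                for lab in labs
            ]

        def __len__(self):
            return len(self.labels)

        def __getitem__(self, idx):
            item = {k: v[idx] for k, v in self.enc.items()}
            item["labels"] = torch.tensor(self.labels[idx])
            return item

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="ptm_model", num_train_epochs=1,
                               per_device_train_batch_size=2),
        train_dataset=SiteDataset(sequences, labels),
    )
    trainer.train()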
13 / 20

For additional references, please see the tutorial's References section

15 / 20

Screenshot of the GTN stats page with 21 topics, 170 tutorials, 159 contributors, 16 scientific topics, and a growing community

16 / 20
  • If you would like to learn more about Galaxy, there are a large number of tutorials available.
  • These tutorials cover a wide range of scientific domains.

Getting Help

Galaxy Help

17 / 20
  • If you get stuck, there are ways to get help.
  • You can ask your questions on the help forum.
  • Or you can chat with the community on Gitter.

Join an event

Event schedule

18 / 20
  • There are frequent Galaxy events all around the world.
  • You can find upcoming events on the Galaxy Event Horizon.

Key points

  • Training a very large deep learning model from scratch on a large dataset requires expertise and compute power

  • Large models such as ProtTrans are trained using millions of protein sequences

  • They contain significant knowledge about context in protein sequences

  • These models can be used in multiple ways for learning on a new dataset such as fine tuning, embedding extraction, ...

  • Fine-tuning using LoRA requires much less time and compute power

  • Downstream tasks such as protein sequence classification can be performed using these resources

19 / 20

Thank You!

This material is the result of collaborative work. Thanks to the Galaxy Training Network and all the contributors!

Author(s): Anup Kumar
Reviewers: Björn Grüning, Teresa Müller, Martin Čech, Armin Dadras
Galaxy Training Network

Tutorial Content is licensed under Creative Commons Attribution 4.0 International License.

20 / 20
