Protein : Paper Review : Highly accurate protein structure prediction with AlphaFold

AI바라기 2025. 2. 8. 16:49

Overall Summary

AlphaFold presents a novel deep learning approach that enables protein structure prediction at atomic-level accuracy, even in challenging cases with limited homologous information. By combining an architecture that incorporates physical and biological knowledge (the Evoformer and the structure module) with training on both labelled and unlabelled data, AlphaFold substantially advanced the state of the art in protein structure prediction. This breakthrough has substantial implications for structural bioinformatics and a wide range of biological applications, because it makes accurate modelling of protein structures possible at an unprecedented scale.

Plain-language explanation: AlphaFold is like an AI that teaches itself how to assemble LEGO bricks (amino acids). Earlier AI systems struggled badly without an instruction booklet for the model (homologous structures). AlphaFold instead draws on the physical and chemical relationships between bricks (physical and biological knowledge) and consults many related booklets at once (multi-sequence alignments) through new mechanisms (Evoformer, IPA), so it can assemble the LEGO model (the protein structure) very accurately even without a booklet for that exact model. It improves further by repeating the assembly process (recycling) and by learning again from models it built itself (self-distillation). Like a seasoned LEGO master, it can put together even complex structures with ease.

AlphaFold: Highly accurate protein structure prediction with AlphaFold (Study Notes)

Purpose of the Paper

Problem: Existing computational methods for protein structure prediction struggled to reach atomic accuracy, and their performance was even more limited when no homologous structure was available, which made them hard to use in biological applications.
Goal: Develop a computational method that predicts protein structure at atomic accuracy even when no similar (homologous) structure is known.
Novel approach: Unlike earlier methods, AlphaFold incorporates physical and biological knowledge about protein structure into the design of the deep learning algorithm and leverages multi-sequence alignments to tackle this problem.

Key Contributions

Novel Neural Network Architecture (Evoformer):
- Introduces the Evoformer block, which jointly embeds MSAs (multiple sequence alignments) and pairwise features.
- This enables direct reasoning about spatial and evolutionary relationships.
- Introduces new mechanisms for exchanging information within the MSA and pair representations.
End-to-End Structure Prediction: The neural network directly predicts the 3D coordinates of all heavy atoms of the protein.
Equivariant Attention (Invariant Point Attention, IPA):
- Uses a new geometry-aware attention operation (IPA) that updates the neural activations without changing the 3D positions.
- An equivariant update is then applied to the residue gas.
Iterative Refinement (Recycling): Strengthens iterative refinement by repeatedly applying the final loss to the outputs and feeding those outputs back into the same modules ("recycling").
Training with Labelled and Unlabelled Data: Makes effective use of unlabelled sequence data through self-distillation and a BERT-style objective, which considerably improves accuracy.
Structure Module: Introduces an explicit 3D structure in the form of a rotation and a translation for each residue.

Novelty

The core novelty lies in the architectural design of the Evoformer and the structure module, which makes joint refinement using the MSA possible.

Experimental Highlights

CASP14 Validation: In the majority of cases AlphaFold reached accuracy competitive with experimental structures and greatly outperformed other methods.
- Median backbone accuracy: 0.96 Å r.m.s.d.95 (Cα root-mean-square deviation at 95% residue coverage).
- Next best method: 2.8 Å r.m.s.d.95.
High Accuracy on Recent PDB Structures: The high accuracy carries over to recently released PDB structures.
Reliable Confidence Estimates: The predicted local-distance difference test (pLDDT) reliably predicts the Cα local-distance difference test (lDDT-Cα) accuracy.
Zinc binding site: Correctly predicted.
2,180-residue single chain: Predicted with correct domain packing (prediction made after CASP).

Limitations and Future Work

MSA Depth Dependence: Accuracy drops substantially when the median alignment depth is below about 30 sequences.
- Why important? MSA information is needed to find the correct structure in the early stages of the network, but refinement of that structure does not depend heavily on it.
Weakness in Heterotypic Contact-Rich Proteins: AlphaFold performs worse on proteins that have few intra-chain or homotypic contacts relative to their number of heterotypic contacts.
- Why important? The shape of such proteins is determined mostly by interactions with other chains in the complex.
Future Work: Applying AlphaFold's ideas to the prediction of full hetero-complexes is an important direction for future work.
- How does it overcome the limitations? Predicting the full complex should address the difficulty with protein chains that have many hetero-contacts.

Abstract

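Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort, the structures of around 100,000 unique proteins have been determined, but this represents a small fraction of the billions of known protein sequences. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence (the structure prediction component of the 'protein folding problem') has been an important open research problem for more than 50 years. Despite recent progress, existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14), demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.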

Introduction

The development of computational methods to predict three-dimensional (3D) protein structures from the protein sequence has proceeded along two complementary paths that focus on either the physical interactions or the evolutionary history. The physical interaction programme heavily integrates our understanding of molecular driving forces into either thermodynamic or kinetic simulation of protein physics or statistical approximations thereof. Although theoretically very appealing, this approach has proved highly challenging for even moderate-sized proteins because of the computational intractability of molecular simulation, the context dependence of protein stability and the difficulty of producing sufficiently accurate models of protein physics. The evolutionary programme has provided an alternative in recent years, in which the constraints on protein structure are derived from bioinformatics analysis of the evolutionary history of proteins, homology to solved structures and pairwise evolutionary correlations. This bioinformatics approach has benefited greatly from the steady growth of experimental protein structures deposited in the Protein Data Bank (PDB), the explosion of genomic sequencing and the rapid development of deep learning techniques to interpret these correlations. Despite these advances, contemporary physical and evolutionary-history-based approaches produce predictions that fall far short of experimental accuracy in the majority of cases in which a close homologue has not been solved experimentally, and this has limited their utility for many biological applications.

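In this study, we develop the first, to our knowledge, computational approach capable of predicting protein structures to near experimental accuracy in a majority of cases. The neural network AlphaFold that we developed was entered into the CASP14 assessment (May to July 2020; entered under the team name 'AlphaFold2' and a completely different model from our CASP13 AlphaFold system). The CASP assessment is carried out biennially using recently solved structures that have not been deposited in the PDB or publicly disclosed, so that it is a blind test for the participating methods, and it has long served as the gold-standard assessment for the accuracy of structure prediction.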

In CASP14, AlphaFold structures were vastly more accurate than those of competing methods. AlphaFold structures had a median backbone accuracy of 0.96 Å r.m.s.d.95 (Cα root-mean-square deviation at 95% residue coverage) (95% confidence interval = 0.85–1.16 Å), whereas the next best performing method had a median backbone accuracy of 2.8 Å r.m.s.d.95 (95% confidence interval = 2.7–4.0 Å) (measured on CASP domains; see Fig. 1a for backbone accuracy and Supplementary Fig. 14 for all-atom accuracy). As a point of comparison for this accuracy, the width of a carbon atom is approximately 1.4 Å. In addition to very accurate domain structures (Fig. 1b), AlphaFold is able to produce highly accurate side chains (Fig. 1c) when the backbone is highly accurate, and it improves considerably over template-based methods even when strong templates are available. The all-atom accuracy of AlphaFold was 1.5 Å r.m.s.d.95 (95% confidence interval = 1.2–1.6 Å), compared with 3.5 Å r.m.s.d.95 (95% confidence interval = 3.1–4.2 Å) for the best alternative method. Our methods are scalable to very long proteins with accurate domains and domain packing (see Fig. 1d for the prediction of a 2,180-residue protein with no structural homologues). Finally, the model is able to provide precise, per-residue estimates of its reliability that should enable the confident use of these predictions.

We demonstrate in Fig. 2a that the high accuracy that AlphaFold showed in CASP14 extends to a large sample of recently released PDB structures. In this dataset, all structures were deposited in the PDB after our training data cut-off and are analysed as full chains (see Methods, Supplementary Fig. 15 and Supplementary Table 6 for more details). Furthermore, we observe high side-chain accuracy when the backbone prediction is accurate (Fig. 2b), and we show that our confidence measure, the predicted local-distance difference test (pLDDT), reliably predicts the Cα local-distance difference test (lDDT-Cα) accuracy of the corresponding prediction (Fig. 2c). We also find that the global superposition metric template modelling score (TM-score) can be accurately estimated (Fig. 2d). Overall, these analyses validate that the high accuracy and reliability of AlphaFold on CASP14 proteins also transfers, as expected, to an uncurated collection of recent PDB submissions (see Supplementary Methods 1.15 and Supplementary Fig. 11 for confirmation that this high accuracy extends to new folds).

AlphaFold2 Paper, Introduction Section: Key Takeaways (for AI Researchers)

Key Points

Problem definition: Existing protein structure prediction methods struggle to reach atomic accuracy when no homologous structure is available. For physics-based approaches this stems from the computational intractability of molecular simulation, the context dependence of protein stability, and the lack of sufficiently accurate models of protein physics.
Proposed method:
- AlphaFold2: a deep learning-based neural network model.
- Uses a new machine learning approach that draws on both physical interactions and evolutionary history.
- Leverages multi-sequence alignments.
CASP14 results:
- Median backbone accuracy: 0.96 Å r.m.s.d.95 (next best method: 2.8 Å).
- All-atom accuracy: 1.5 Å r.m.s.d.95 (next best method: 3.5 Å).
- Accurate domain structures, and accurate side-chain prediction when the backbone is accurate.
- Long protein sequences can be predicted with accurate domain packing.
- Per-residue reliability estimates are provided.
Additional validation on PDB structures:
- High accuracy and reliability similar to CASP14.
- High side-chain accuracy when the backbone prediction is accurate.
- The pLDDT confidence measure reliably predicts lDDT-Cα accuracy.

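What Sets It Apart

The first computational approach to reach atomic-accuracy protein structure prediction "in the majority of cases".
A new machine learning approach that builds knowledge of physical interactions and evolutionary history into the design of the deep learning architecture.
Overcomes the limits of earlier methods with a decisive lead at CASP14.
Scalability: predictions remain possible for long protein sequences.
Reliability: per-residue confidence estimates make it possible to judge how far each part of a prediction can be trusted.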

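Plain-language explanation:

AlphaFold2 is an AI model that predicts the three-dimensional structure of proteins. Several methods for protein structure prediction already existed, but their accuracy dropped sharply when no similar, previously known structure (homologous structure) was available. To address this, AlphaFold2 uses a new approach that draws not only on the physical and chemical properties of proteins but also on information accumulated over the course of evolution (multi-sequence alignments).

As a result, AlphaFold2 outperformed every other method by a wide margin at CASP14, a protein structure prediction competition. It can predict protein structures at close to atomic accuracy, and it remains accurate even for very long proteins. AlphaFold2 also provides a confidence score for each part (residue) of its prediction (per-residue estimates of its reliability), so users can judge which parts are more trustworthy.

In short, AlphaFold2 is an AI model that dramatically improved on the previously hard problem of protein structure prediction, and it is a technology that can have a large impact on life science research.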

The AlphaFold network

AlphaFold greatly improves the accuracy of structure prediction by incorporating novel neural network architectures and training procedures that are based on the evolutionary, physical and geometric constraints of protein structures. In particular, we demonstrate a new architecture to jointly embed multiple sequence alignments (MSAs) and pairwise features, a new output representation and associated loss that enable accurate end-to-end structure prediction, a new equivariant attention architecture, the use of intermediate losses to achieve iterative refinement of predictions, a masked MSA loss to jointly train with the structure, learning from unlabelled protein sequences using self-distillation, and self-estimates of accuracy.

The AlphaFold network directly predicts the 3D coordinates of all heavy atoms for a given protein using the primary amino acid sequence and aligned sequences of homologues as inputs (Fig. 1e; see Methods for details of the inputs, including databases, MSA construction and the use of templates). A description of the most important ideas and components is provided below. The full network architecture and training procedure are provided in the Supplementary Methods.

The network comprises two main stages. First, the trunk of the network processes the inputs through repeated layers of a novel neural network block that we term Evoformer to produce an Nseq × Nres array (Nseq, number of sequences; Nres, number of residues) that represents a processed MSA and an Nres × Nres array that represents residue pairs. The MSA representation is initialized with the raw MSA (although see Supplementary Methods 1.2.7 for details of handling very deep MSAs). The Evoformer blocks contain a number of attention-based and non-attention-based components. We show evidence in 'Interpreting the neural network' that a concrete structural hypothesis arises early within the Evoformer blocks and is continuously refined. The key innovations in the Evoformer block are new mechanisms to exchange information within the MSA and pair representations that enable direct reasoning about the spatial and evolutionary relationships.

The trunk of the network is followed by the structure module, which introduces an explicit 3D structure in the form of a rotation and translation for each residue of the protein (global rigid body frames). These representations are initialized in a trivial state with all rotations set to the identity and all positions set to the origin, but they rapidly develop and refine a highly accurate protein structure with precise atomic details. Key innovations in this section of the network include breaking the chain structure to allow simultaneous local refinement of all parts of the structure, a novel equivariant transformer that allows the network to implicitly reason about the unrepresented side-chain atoms, and a loss term that places substantial weight on the orientational correctness of the residues. Both within the structure module and throughout the whole network, we reinforce the notion of iterative refinement by repeatedly applying the final loss to outputs and then feeding the outputs recursively into the same modules. The iterative refinement using the whole network (which we term 'recycling' and which is related to approaches in computer vision) contributes markedly to accuracy with minor extra training time (see Supplementary Methods 1.8 for details).
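The control flow of recycling can be pictured with a small toy. The sketch below is not the AlphaFold network: `toy_model`, `predict_with_recycling` and the default `num_recycles=3` are hypothetical names and values chosen for illustration. It only shows the idea described above, namely that the same module (with the same weights) is applied repeatedly and each pass receives the previous pass's output as an extra input.

```python
import numpy as np

def toy_model(features, prev):
    """Hypothetical stand-in for the whole network: moves halfway toward the target."""
    return 0.5 * (features + prev)

def predict_with_recycling(features, num_recycles=3):
    """Run the same module num_recycles + 1 times, feeding each output back in."""
    prev = np.zeros_like(features)      # the first pass sees an all-zero previous output
    outputs = []
    for _ in range(num_recycles + 1):
        prev = toy_model(features, prev)    # same function, same weights on every pass
        outputs.append(prev)
    return outputs

features = np.array([1.0, 2.0, 4.0])    # toy "target" the model is refining toward
outs = predict_with_recycling(features)
errors = [float(np.abs(o - features).max()) for o in outs]
print(errors)
```

In this toy the error halves on every pass, which is only meant to convey why feeding outputs back into the same modules can refine a prediction without adding new parameters.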

The AlphaFold Network Section: Key Takeaways (for AI Researchers)

Overview

The AlphaFold network substantially improves prediction accuracy through new neural network architectures and training procedures that exploit the evolutionary, physical and geometric constraints of protein structure.

Core Components & Techniques

Evoformer Block:
- Input: raw MSA, residue pairs.
- Output: processed MSA representation (Nseq × Nres array), residue pair representation (Nres × Nres array).
- Contains attention-based and non-attention-based components.
- Key innovation: new mechanisms for exchanging information within the MSA and pair representations, which enable direct reasoning about spatial and evolutionary relationships.
- A structural hypothesis forms early within the Evoformer blocks and is continuously refined.
Structure Module:
- Input: the output of the Evoformer.
- Output: an explicit 3D structure in the form of a rotation and translation for each residue (global rigid body frames).
- Starts from a trivial state (identity rotation, origin position) and is progressively refined into a highly accurate protein structure.
- Key innovations:
  - Breaking the chain structure, which allows simultaneous local refinement of the whole structure.
  - Equivariant transformer: allows implicit reasoning about the unrepresented side-chain atoms.
  - A loss term that places substantial weight on the orientational correctness of residues.
Recycling:
- The whole network is applied repeatedly for iterative refinement.
- The final loss is applied repeatedly to the outputs, and the outputs are fed back into the same modules.
- Contributes markedly to accuracy with minor extra training time.
Other Techniques:
- Masked MSA loss (trained jointly with the structure).
- Self-distillation (learning from unlabelled protein sequences).
- Self-estimates of accuracy.

What Sets It Apart

Joint Embedding: a new architecture (Evoformer) that embeds MSAs and pairwise features together.
Equivariant Attention: a new attention mechanism for spatial reasoning.
End-to-end Structure Prediction: 3D coordinates are predicted directly from the input (amino acid sequence, aligned sequences).
Iterative Refinement: the structure is improved repeatedly through recycling.

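Plain-language explanation:

The AlphaFold2 network has two main parts. The first is a block called the "Evoformer", which processes the input amino acid sequence together with many similar sequences (the MSA) and extracts the information that matters for the protein's structure. In particular, the Evoformer uses new ways of capturing the relationships between sequences and between amino acid residues, which helps it predict the protein's three-dimensional structure more accurately.

The second part is the "Structure Module", which builds the actual three-dimensional structure of the protein from the information produced by the Evoformer. It starts from a trivial initial state and refines it step by step into a more detailed and accurate structure. The Structure Module also has a special component (the equivariant transformer) that lets it reason about certain parts of the protein (the side chains) without representing them explicitly.

AlphaFold2 also uses a technique called "Recycling", which repeats the whole process several times to raise the accuracy of the prediction. It is similar to making a sketch, refining it little by little, and then checking it again from the start to improve the finished drawing.

In conclusion, AlphaFold2 is an AI model that dramatically raised the accuracy of protein structure prediction by combining several new techniques.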

Evoformer

The key principle of the building block of the network, named Evoformer (Figs. 1e, 3a), is to view the prediction of protein structures as a graph inference problem in 3D space in which the edges of the graph are defined by residues in proximity. The elements of the pair representation encode information about the relation between the residues (Fig. 3b). The columns of the MSA representation encode the individual residues of the input sequence, whereas the rows represent the sequences in which those residues appear. Within this framework, we define a number of update operations that are applied in each block in sequence.

The MSA representation updates the pair representation through an element-wise outer product that is summed over the MSA sequence dimension. In contrast to previous work, this operation is applied within every block rather than once in the network, which enables continuous communication from the evolving MSA representation to the pair representation.
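The outer-product update can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the network's actual layer: the function name `outer_product_update`, the projection matrices `w_a`, `w_b`, `w_out` and all dimensions are hypothetical, and normalisation layers are omitted. For every residue pair (i, j), the projected MSA columns i and j are combined by an outer product and summed over sequences, as described above.

```python
import numpy as np

def outer_product_update(msa, w_a, w_b, w_out):
    """msa: (n_seq, n_res, c_m) -> pair update of shape (n_res, n_res, c_z)."""
    a = msa @ w_a                                   # (n_seq, n_res, c)
    b = msa @ w_b                                   # (n_seq, n_res, c)
    # Outer product of column i and column j, summed over the sequence axis s.
    outer = np.einsum("sic,sjd->ijcd", a, b)        # (n_res, n_res, c, c)
    n_res = msa.shape[1]
    return outer.reshape(n_res, n_res, -1) @ w_out  # project c*c channels to c_z

rng = np.random.default_rng(0)
n_seq, n_res, c_m, c, c_z = 5, 7, 8, 4, 6           # illustrative sizes only
msa = rng.normal(size=(n_seq, n_res, c_m))
w_a, w_b = rng.normal(size=(2, c_m, c))
w_out = rng.normal(size=(c * c, c_z))

pair_update = outer_product_update(msa, w_a, w_b, w_out)
# Because of the sum over sequences, the order of MSA rows does not matter.
shuffled_update = outer_product_update(msa[::-1], w_a, w_b, w_out)
print(pair_update.shape)
```

Summing over the sequence axis is what turns an Nseq × Nres object into an Nres × Nres one, and it makes the update independent of the order of the aligned sequences.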

Within the pair representation, there are two different update patterns. Both are inspired by the necessity of consistency of the pair representation: for a pairwise description of amino acids to be representable as a single 3D structure, many constraints must be satisfied, including the triangle inequality on distances. On the basis of this intuition, we arrange the update operations on the pair representation in terms of triangles of edges involving three different nodes (Fig. 3c). In particular, we add an extra logit bias to axial attention to include the 'missing edge' of the triangle, and we define a non-attention update operation, the 'triangle multiplicative update', that uses two edges to update the missing third edge (see Supplementary Methods 1.6.5 for details). The triangle multiplicative update was developed originally as a more symmetric and cheaper replacement for the attention, and networks that use only the attention or only the multiplicative update are both able to produce high-accuracy structures. However, the combination of the two updates is more accurate.
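As a rough illustration of how two edges can update the third, the sketch below implements a simplified "outgoing" triangle multiplicative update in NumPy. It is an assumption-laden toy: the function name, the weight matrices `w_a`, `w_b`, `w_gate`, `w_out`, the single sigmoid gate and the sizes are all hypothetical, and normalisation layers are left out. The update for edge (i, j) multiplies projections of edges (i, k) and (j, k) and sums over every third node k.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def triangle_multiply_outgoing(pair, w_a, w_b, w_gate, w_out):
    """Update every edge (i, j) from the edge pairs (i, k) and (j, k) over all k."""
    a = pair @ w_a                              # (n_res, n_res, c)
    b = pair @ w_b                              # (n_res, n_res, c)
    tri = np.einsum("ikc,jkc->ijc", a, b)       # combine the two other edges, sum over k
    gate = sigmoid(pair @ w_gate)               # per-edge gate, (n_res, n_res, c_z)
    return gate * (tri @ w_out)                 # (n_res, n_res, c_z)

rng = np.random.default_rng(0)
n_res, c_z, c = 6, 8, 4                         # illustrative sizes only
pair = rng.normal(size=(n_res, n_res, c_z))
w_a, w_b = rng.normal(size=(2, c_z, c))
w_gate = rng.normal(size=(c_z, c_z))
w_out = rng.normal(size=(c, c_z))

update = triangle_multiply_outgoing(pair, w_a, w_b, w_gate, w_out)
new_pair = pair + update                        # residual update of the pair representation
print(new_pair.shape)
```

The einsum contains no softmax, which is one way to see why this operation is cheaper than attention while still letting every triangle that contains edge (i, j) contribute to it.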

We also use a variant of axial attention within the MSA representation. During the per-sequence attention in the MSA, we project additional logits from the pair stack to bias the MSA attention. This closes the loop by providing information flow from the pair representation back into the MSA representation, which ensures that the overall Evoformer block is able to fully mix information between the pair and MSA representations and prepare for structure generation within the structure module.
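The pair-to-MSA direction can be sketched as ordinary row-wise attention with one extra additive term. The code below is a single-head toy under stated assumptions: `row_attention_with_pair_bias`, the weights `w_q`, `w_k`, `w_v`, `w_bias` and all sizes are hypothetical, and gating and output projections are omitted. The bias projected from the pair representation has shape Nres × Nres and is shared by every sequence in the MSA.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def row_attention_with_pair_bias(msa, pair, w_q, w_k, w_v, w_bias):
    """Attention along residues within each sequence, biased by the pair representation."""
    q, k, v = msa @ w_q, msa @ w_k, msa @ w_v                 # each (n_seq, n_res, c)
    logits = np.einsum("sic,sjc->sij", q, k) / np.sqrt(q.shape[-1])
    bias = pair @ w_bias                                      # (n_res, n_res), one logit per pair
    weights = softmax(logits + bias[None, :, :])              # same bias for every sequence
    return np.einsum("sij,sjc->sic", weights, v), weights

rng = np.random.default_rng(0)
n_seq, n_res, c_m, c_z, c = 3, 5, 8, 6, 4                     # illustrative sizes only
msa = rng.normal(size=(n_seq, n_res, c_m))
pair = rng.normal(size=(n_res, n_res, c_z))
w_q, w_k, w_v = rng.normal(size=(3, c_m, c))
w_bias = rng.normal(size=c_z)

out, weights = row_attention_with_pair_bias(msa, pair, w_q, w_k, w_v, w_bias)

# A pair representation that strongly favours residue 0 pulls all attention there.
strong_pair = np.zeros_like(pair)
strong_pair[:, 0, :] = 1e3 * w_bias / np.dot(w_bias, w_bias)
_, steered = row_attention_with_pair_bias(msa, strong_pair, w_q, w_k, w_v, w_bias)
print(out.shape, steered[0, :, 0])
```

The second call shows the information flow described above in its most extreme form: when the pair representation says that every residue should look at residue 0, the MSA attention follows.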

End-to-end structure prediction

The structure module (Fig. 3d) operates on a concrete 3D backbone structure using the pair representation and the original sequence row (single representation) of the MSA representation from the trunk. The 3D backbone structure is represented as Nres independent rotations and translations, each with respect to the global frame (residue gas) (Fig. 3e). These rotations and translations, which represent the geometry of the N-Cα-C atoms, prioritize the orientation of the protein backbone so that the location of the side chain of each residue is highly constrained within that frame. Conversely, the peptide bond geometry is completely unconstrained, and the network is observed to frequently violate the chain constraint during the application of the structure module, because breaking this constraint enables the local refinement of all parts of the chain without solving complex loop closure problems. Satisfaction of the peptide bond geometry is encouraged during fine-tuning by a violation loss term. Exact enforcement of peptide bond geometry is only achieved in the post-prediction relaxation of the structure by gradient descent in the Amber force field. Empirically, this final relaxation does not improve the accuracy of the model as measured by the global distance test (GDT) or lDDT-Cα, but it does remove distracting stereochemical violations without the loss of accuracy.
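A small NumPy sketch can make the "residue gas" concrete. Everything here is illustrative: the helper names and the local template coordinates in `BACKBONE_LOCAL` are rough assumed values, not numbers quoted in the text. Each residue carries its own rotation and translation; the N, Cα and C atoms are obtained by placing a fixed local template into the global frame, so geometry inside a residue is fixed while the gap between consecutive residues is free to vary.

```python
import numpy as np

# Rough, illustrative local coordinates (in Å) of N, CA, C with CA at the frame origin.
# These numbers are an assumption for the sketch, not values taken from the text.
BACKBONE_LOCAL = np.array([[-0.5, 1.4, 0.0],    # N
                           [0.0, 0.0, 0.0],     # CA
                           [1.5, 0.0, 0.0]])    # C

def init_frames(n_res):
    """Trivial starting state: identity rotations, all translations at the origin."""
    return np.tile(np.eye(3), (n_res, 1, 1)), np.zeros((n_res, 3))

def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def backbone_atoms(rot, trans):
    """Place the local template in the global frame: x = R_i x_local + t_i."""
    return np.einsum("nij,aj->nai", rot, BACKBONE_LOCAL) + trans[:, None, :]

rot, trans = init_frames(4)
start = backbone_atoms(rot, trans)              # all residues stacked at the origin

# Move each residue independently; nothing ties residue i to residue i + 1.
rot = np.stack([rot_z(0.4 * i) for i in range(4)])
trans = np.array([[3.8 * i, 0.0, 0.0] for i in range(4)])
atoms = backbone_atoms(rot, trans)              # (n_res, 3 atoms, 3 coordinates)

n_ca = np.linalg.norm(atoms[:, 0] - atoms[:, 1], axis=-1)             # fixed by the template
c_to_next_n = np.linalg.norm(atoms[:-1, 2] - atoms[1:, 0], axis=-1)   # peptide bond, unconstrained
print(n_ca, c_to_next_n)
```

The N to Cα distance is identical for every residue, whereas the C to next-N distances differ, which is the sense in which the representation leaves peptide bond geometry unconstrained until a separate loss or relaxation step enforces it.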

The residue gas representation is updated iteratively in two stages (Fig. 3d). First, a geometry-aware attention operation that we term 'invariant point attention' (IPA) is used to update an Nres set of neural activations (single representation) without changing the 3D positions; then an equivariant update operation is performed on the residue gas using the updated activations. IPA augments each of the usual attention queries, keys and values with 3D points that are produced in the local frame of each residue, such that the final value is invariant to global rotations and translations (see Methods 'IPA' for details). The 3D queries and keys also impose a strong spatial/locality bias on the attention, which is well suited to the iterative refinement of the protein structure. After each attention operation and element-wise transition block, the module computes an update to the rotation and translation of each backbone frame. Applying these updates within the local frame of each residue makes the overall attention and update block an equivariant operation on the residue gas.
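The invariance and equivariance claims can be checked numerically on a stripped-down toy. The sketch below keeps only the 3D-point part of the attention (no scalar queries and keys, no pair bias, one head, a fixed weight of 0.5 on squared distances) and all function names are hypothetical, so it should be read as an illustration of the geometric argument rather than as IPA itself. Points are produced in local frames, compared in the global frame, and the mixed values are mapped back into each residue's local frame.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def random_rotation(rng):
    """Random proper rotation via QR decomposition (helper for this demo only)."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

def to_global(rot, trans, pts):
    """Local points (n_res, n_pts, 3) -> global frame: x = R_i p + t_i."""
    return np.einsum("nij,npj->npi", rot, pts) + trans[:, None, :]

def to_local(rot, trans, pts):
    """Global points (n_res, n_pts, 3) -> frame i: p = R_i^T (x - t_i)."""
    return np.einsum("nji,npj->npi", rot, pts - trans[:, None, :])

def point_attention(rot, trans, q_pts, k_pts, v_pts):
    q, k, v = (to_global(rot, trans, p) for p in (q_pts, k_pts, v_pts))
    sq_dist = np.sum((q[:, None] - k[None, :]) ** 2, axis=(-1, -2))   # (n_res, n_res)
    weights = softmax(-0.5 * sq_dist)           # spatially close residues attend more
    mixed = np.einsum("ij,jpc->ipc", weights, v)
    return to_local(rot, trans, mixed), weights

def update_frames(rot, trans, d_rot, d_trans):
    """Compose each frame with an update expressed in that residue's local frame."""
    return rot @ d_rot, trans + np.einsum("nij,nj->ni", rot, d_trans)

rng = np.random.default_rng(0)
n_res, n_pts = 5, 4
rot = np.stack([random_rotation(rng) for _ in range(n_res)])
trans = rng.normal(size=(n_res, 3))
q_pts, k_pts, v_pts = rng.normal(size=(3, n_res, n_pts, 3))
out, w = point_attention(rot, trans, q_pts, k_pts, v_pts)

# Move the whole residue gas rigidly and repeat.
g_rot, g_trans = random_rotation(rng), rng.normal(size=3)
rot2, trans2 = g_rot @ rot, trans @ g_rot.T + g_trans
out2, w2 = point_attention(rot2, trans2, q_pts, k_pts, v_pts)

# Updates computed from invariant activations are the same in both copies.
d_rot = np.stack([random_rotation(rng) for _ in range(n_res)])
d_trans = rng.normal(size=(n_res, 3))
new_rot, new_trans = update_frames(rot, trans, d_rot, d_trans)
new_rot2, new_trans2 = update_frames(rot2, trans2, d_rot, d_trans)
print(np.allclose(out, out2), np.allclose(new_rot2, g_rot @ new_rot))
```

The attention output is unchanged by the rigid motion (invariance), while the updated frames move along with it (equivariance), which mirrors the two-stage argument in the paragraph above.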

Predictions of side-chain χ angles, as well as the final, per-residue accuracy of the structure (pLDDT), are computed with small per-residue networks on the final activations at the end of the network. The estimate of the TM-score (pTM) is obtained from a pairwise error prediction that is computed as a linear projection from the final pair representation. The final loss, which we term the frame-aligned point error (FAPE) (Fig. 3f), compares the predicted atom positions to the true positions under many different alignments. For each alignment, defined by aligning the predicted frame (Rk, tk) to the corresponding true frame, we compute the distance of all predicted atom positions xi from the true atom positions. The resulting Nframes × Natoms distances are penalized with a clamped L1 loss. This creates a strong bias for atoms to be correct relative to the local frame of each residue and hence correct with respect to its side-chain interactions, and it provides the main source of chirality for AlphaFold (Supplementary Methods 1.9.3 and Supplementary Fig. 9).
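A minimal NumPy sketch of a FAPE-style loss follows. It is a simplified illustration: the function names, the toy four-atom residues, the Gram-Schmidt frame construction and the clamp value of 10 Å are assumptions made for this example, and the scaling and small stabilising constants a training implementation would need are left out. Every atom is expressed in every frame for both the prediction and the truth, and the resulting Nframes × Natoms distances are clamped and averaged.

```python
import numpy as np

def frames_from_backbone(n, ca, c):
    """Right-handed local frame per residue from N, CA, C positions (Gram-Schmidt)."""
    e1 = (c - ca) / np.linalg.norm(c - ca, axis=-1, keepdims=True)
    u = n - ca
    e2 = u - np.sum(u * e1, axis=-1, keepdims=True) * e1
    e2 = e2 / np.linalg.norm(e2, axis=-1, keepdims=True)
    e3 = np.cross(e1, e2)
    return np.stack([e1, e2, e3], axis=-1), ca      # rotation columns are the axes

def to_local(rot, trans, atoms):
    """Express every atom in every frame: x_local[k, a] = R_k^T (x_a - t_k)."""
    return np.einsum("kji,kaj->kai", rot, atoms[None, :, :] - trans[:, None, :])

def fape(pred_frames, pred_atoms, true_frames, true_atoms, clamp=10.0):
    diff = to_local(*pred_frames, pred_atoms) - to_local(*true_frames, true_atoms)
    dist = np.linalg.norm(diff, axis=-1)            # (n_frames, n_atoms)
    return float(np.mean(np.minimum(dist, clamp)))  # clamped L1 penalty

rng = np.random.default_rng(0)
true_xyz = rng.normal(scale=3.0, size=(6, 4, 3))    # 6 residues x (N, CA, C, CB), toy data

def score(pred_xyz):
    pred_frames = frames_from_backbone(pred_xyz[:, 0], pred_xyz[:, 1], pred_xyz[:, 2])
    true_frames = frames_from_backbone(true_xyz[:, 0], true_xyz[:, 1], true_xyz[:, 2])
    return fape(pred_frames, pred_xyz.reshape(-1, 3), true_frames, true_xyz.reshape(-1, 3))

theta = 0.7
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta), np.cos(theta), 0.0],
                  [0.0, 0.0, 1.0]])
rotated = true_xyz @ rot_z.T + np.array([5.0, -2.0, 1.0])   # rigidly moved copy
mirrored = true_xyz * np.array([1.0, 1.0, -1.0])            # mirror image

loss_same = score(true_xyz)
loss_rotated = score(rotated)
loss_mirrored = score(mirrored)
print(loss_same, loss_rotated, loss_mirrored)
```

A rigidly moved copy of the true structure scores essentially zero, while its mirror image does not, because the local frames stay right-handed and the out-of-plane atoms end up on the wrong side. That is the sense in which a frame-aligned loss can act as a source of chirality.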

Evoformer & End-to-end Structure Prediction Sections: Summary Notes (for AI Researchers)

Evoformer

Core idea: Treat protein structure prediction as a graph inference problem in 3D space, where the edges of the graph are defined by residues in proximity.
Input: MSA, residue pairs.
Output: Processed MSA representation, pair representation.
Pair Representation Updates:
- Element-wise outer product summed over the MSA sequence dimension (MSA representation -> pair representation). Applied in every block, which enables continuous communication.
- Triangle-based update operations:
  - Triangle multiplicative update: uses two edges to update the missing third edge.
  - Axial attention with extra logit bias: includes the "missing edge" of the triangle.
MSA Representation Updates:
- Uses a variant of axial attention.
- Additional logits are projected from the pair stack to bias the MSA attention (pair representation -> MSA representation, which closes the information loop).
Goal: Mix information between the pair and MSA representations and prepare for structure generation in the structure module.

End-to-end Structure Prediction (Structure Module)

Input: Pair representation, and the original sequence row of the MSA representation (single representation).
3D Backbone Structure Representation:
- Nres independent rotations and translations (residue gas), each relative to the global frame.
- Represents the geometry of the N-Cα-C atoms. The side-chain location is highly constrained within each frame.
- Peptide bond geometry is initially unconstrained (to allow local refinement). A violation loss term encourages it during fine-tuning.
Iterative Update:
1. Invariant Point Attention (IPA):
  - Updates the Nres set of neural activations (single representation) without changing the 3D positions.
  - Augments the attention queries, keys and values with 3D points (produced in the local frame of each residue).
  - The final value is invariant to global rotations and translations.
  - Imposes a strong spatial/locality bias.
2. Equivariant Update Operation:
  - Performed on the residue gas using the updated activations.
  - Updates the rotation and translation of each backbone frame.
  - Updates are applied within the local frame, so the overall attention and update block is an equivariant operation on the residue gas.
Output:
- Side-chain χ angle predictions.
- Final, per-residue accuracy (pLDDT).
- TM-score estimate (pTM), from a pairwise error prediction (a linear projection of the final pair representation).
Frame-Aligned Point Error (FAPE) Loss:
- Compares predicted atom positions with the true positions under many alignments.
- Penalizes the Nframes × Natoms distances with a clamped L1 loss.
- Biases atoms to be correct relative to each residue's local frame and therefore with respect to side-chain interactions.
- The main source of chirality for AlphaFold.

What Sets It Apart

Graph Inference Problem: approaches protein structure prediction as graph inference.
Triangle-based Updates: triangle-based operations that promote consistency of the pair representation.
Invariant Point Attention (IPA): geometry-aware attention, global invariance, spatial/locality bias.
Residue Gas Representation: an explicit representation of the 3D backbone structure with iterative refinement.
FAPE Loss: accounts for local frames, side-chain interactions and chirality.

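Plain-language explanation:

Evoformer

The Evoformer treats protein structure prediction as a problem of connecting points (residues) in three-dimensional space (graph inference). The connections (edges) between these points represent residues that are close to each other. The Evoformer takes the MSA (multiple sequence alignment, an alignment of many protein sequences) and residue pair information as input, then fuses and processes them.

Two important kinds of update take place inside the Evoformer.

Pair Representation update: MSA information is used to update the relationships between residue pairs. This uses a "triangle" rule: among any three residues, the relations on two of the edges constrain the relation on the third.
MSA Representation update: Information from the pair representation is fed back into the MSA representation, which enables a circular exchange of information between the two sources.

End-to-end Structure Prediction

The Structure Module builds the actual three-dimensional protein structure from the information processed by the Evoformer. It uses a representation called the "residue gas", which describes the position and orientation (rotation and translation) of each residue.

The Structure Module improves the structure by repeating the following two steps.

Invariant Point Attention (IPA): Performs attention that takes the three-dimensional information around each residue into account. The result of this attention does not change (it is invariant) no matter how the whole protein is rotated or translated, so the prediction does not depend on an arbitrary choice of global coordinate frame.
Equivariant Update: Updates the residue gas based on the IPA result to make the protein structure more precise.

Finally, the network predicts the side-chain angles, a per-residue confidence (pLDDT), and an estimate of overall structural accuracy (pTM, the predicted TM-score).

A dedicated loss function called Frame-Aligned Point Error (FAPE) computes the error by comparing the predicted atom positions with the true positions under many different alignments. Because this loss takes each residue's local frame and side-chain interactions into account, it pushes the network toward an accurate structure.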

Overall Process

The input (amino acid sequence, MSA) is fed into the Evoformer.
The Evoformer processes the MSA and residue pair information to produce refined representations.
The Structure Module uses the Evoformer outputs to build an explicit 3D backbone structure and improves it through iterative refinement.
The final predictions (3D structure, pLDDT, pTM) are output.
During training, the frame-aligned point error (FAPE) loss compares the predicted atom positions with the true positions.
Finally (optional), the predicted structure is relaxed by gradient descent in the Amber force field.

The main components of AlphaFold2 and how they work can be summarized as follows:

1. Input:

Primary amino acid sequence: the amino acid sequence of the protein to be predicted.
Multiple Sequence Alignment (MSA): an alignment of protein sequences that are evolutionarily related (homologous) to the target protein.
Template (optional): when a structure similar to the target protein is already known, information from that template structure is used.

2. Embedding:

Two initial representations are built from the MSA and the (optional) template information:
- MSA Representation: initialized from the raw MSA data (Nseq × Nres).
- Pair Representation: built by combining MSA features and template information (Nres × Nres).

3. Evoformer:

The core block that repeatedly updates the MSA representation and the pair representation.
Outer Product Update: passes information from the MSA representation to the pair representation.
Triangle Multiplicative Update, Axial Attention with Extra Logit Bias: triangle-based updates that promote consistency of the pair representation by using two edges of a triangle to inform the missing third edge.
Axial Attention (within MSA Representation): feeds information from the pair representation back into the MSA representation.
Output: the updated MSA representation (its original sequence row serves as the single representation) and the updated pair representation.

4. Structure Module:

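Takes the Evoformer output (the single representation from the updated MSA representation, and the updated pair representation) and explicitly predicts the protein's 3D structure.
Residue Rotation and Translation: predicts the orientation and position of each residue in three-dimensional space.
Chain Breaking: relaxes the chain constraint so that all parts of the structure can be refined locally at the same time.
Equivariant Transformer: reasons implicitly about the side-chain atoms that are not explicitly represented.
Loss Function:
- FAPE (Frame Aligned Point Error): compares predicted and true atom positions under many frame alignments and is used to train the network.
- Because FAPE scores atom positions relative to each residue's local frame, it also biases atoms toward being correct with respect to side-chain interactions; side-chain χ angles are predicted by a small per-residue network.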

5. Iterative Refinement (Recycling):

The outputs of the network, including those of the Structure Module, are fed back in as inputs to the same modules, and the refinement is repeated.

6. Output:

Predicted 3D coordinates of all heavy atoms in the protein.
Per-residue confidence score (pLDDT).