Interactive thesis summary

Unveiling Protein Connections through Graph Neural Networks and ProtT5 Embeddings

A web-native summary of my bachelor’s thesis: predicting missing protein-protein interactions by combining protein language-model embeddings with graph neural networks.

Graph Neural Networks · ProtT5 · STRING v12 · Link prediction · Bioinformatics

From protein sequences to interaction scores

The idea is simple: proteins become nodes, known interactions become edges, ProtT5 provides sequence-based node features, and the GNN learns graph-aware protein embeddings that can be scored pairwise.

  1. Sequences: amino-acid strings.
  2. ProtT5: 1024-dimensional node features.
  3. PPI graph: proteins as nodes, PPIs as edges.
  4. GNN encoder: message passing over the network.
  5. Decoder: pairwise interaction score.

Protein sequences

The starting point is the amino-acid sequence of each protein. This gives the model biological information about every protein before it sees the graph.

Dataset: STRING v12 human PPI graph

I used the STRING v12 human protein-protein interaction network and kept only very high-confidence edges. Filtering at a confidence score ≥ 950 (on STRING's 0 to 1000 scale) reduces noise and makes the graph more reliable for training, at the cost of removing many lower-confidence interactions.

  • Source: STRING v12, human PPI network.
  • Filtering: confidence score ≥ 950.
  • Final graph: 10,430 proteins and roughly 120 thousand interactions.
  • Training setup: positive edges plus 10× randomly sampled negative pairs.
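The filtering and negative-sampling steps above can be sketched in plain Python. This is a minimal illustration, not the thesis code: the function names, the toy protein identifiers, and the rejection-sampling strategy are assumptions, and the toy example uses a 2× ratio only so that it fits on a five-node graph.

```python
import random

def filter_edges(scored_edges, min_score=950):
    """Keep undirected edges whose score meets the threshold, deduplicated."""
    kept = set()
    for a, b, score in scored_edges:
        if a != b and score >= min_score:
            kept.add((a, b) if a < b else (b, a))
    return kept

def sample_negatives(nodes, positives, ratio=10, seed=0):
    """Draw ratio * len(positives) random node pairs that are not known edges."""
    rng = random.Random(seed)
    nodes = sorted(nodes)
    target = ratio * len(positives)
    max_pairs = len(nodes) * (len(nodes) - 1) // 2 - len(positives)
    if target > max_pairs:
        raise ValueError("not enough non-edges to sample from")
    negatives = set()
    while len(negatives) < target:
        a, b = rng.sample(nodes, 2)
        pair = (a, b) if a < b else (b, a)
        if pair not in positives:  # reject pairs that are known interactions
            negatives.add(pair)
    return negatives

# Toy input: (protein_a, protein_b, score). Identifiers are made up.
toy = [("P1", "P2", 999), ("P2", "P3", 960), ("P3", "P4", 400),
       ("P4", "P5", 951), ("P2", "P1", 999)]
positives = filter_edges(toy)
nodes = {p for a, b, _ in toy for p in (a, b)}
negatives = sample_negatives(nodes, positives, ratio=2)
print(sorted(positives))
print(len(negatives), "negative pairs")
```

Randomly sampled negatives are only assumed non-interactions: a sampled pair may be a real interaction that is simply missing from the filtered graph.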

Architecture: graph encoder + pair decoder

Message passing intuition

TAGConv aggregates multi-hop neighborhood information. A center protein u can receive information from its 1-hop, 2-hop, and 3-hop neighborhoods.

[Figure: center protein u surrounded by four 1-hop, four 2-hop, and two 3-hop neighbors.]

Encoder layers

  • TAGConv K=3: aggregates 1-hop, 2-hop, and 3-hop neighborhood information in one layer.
  • TransformerConv: attention-style graph convolution with multiple heads to modulate neighbor influence.
  • GINConv: final nonlinear refinement layer to improve representational power.
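To make the "multiple hops in one layer" idea concrete, here is a small sketch of a TAGConv-style polynomial filter, which sums powers of the normalized adjacency matrix applied to the node features. It is a simplification under stated assumptions: features are scalars, each hop has a single hypothetical scalar weight instead of a learned weight matrix, and the graph is a toy path. It is not the encoder used in the thesis.

```python
import math

def normalized_adjacency(n, edges):
    """Return D^{-1/2} A D^{-1/2} as a dense list-of-lists matrix."""
    adj = [[0.0] * n for _ in range(n)]
    for a, b in edges:
        adj[a][b] = adj[b][a] = 1.0
    deg = [sum(row) for row in adj]
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) if adj[i][j] else 0.0
             for j in range(n)] for i in range(n)]

def tag_layer(a_hat, x, hop_weights):
    """Compute sum_k hop_weights[k] * (a_hat^k x) for scalar node features."""
    out = [0.0] * len(x)
    power = list(x)  # a_hat^0 x
    for w in hop_weights:
        out = [o + w * p for o, p in zip(out, power)]
        # Advance to the next power: a_hat^(k+1) x
        power = [sum(a * p for a, p in zip(row, power)) for row in a_hat]
    return out

# Path graph 0 - 1 - 2 - 3, with a signal only on node 3 (three hops from node 0).
a_hat = normalized_adjacency(4, [(0, 1), (1, 2), (2, 3)])
x = [0.0, 0.0, 0.0, 1.0]
out_k2 = tag_layer(a_hat, x, [1.0, 1.0, 1.0])       # hops 0..2
out_k3 = tag_layer(a_hat, x, [1.0, 1.0, 1.0, 1.0])  # hops 0..3
print(out_k2[0], out_k3[0])
```

With K=2 the signal on node 3 never reaches node 0, while with K=3 it does within a single layer. That is the reason a single K=3 layer can stand in for three stacked 1-hop layers.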

Pair decoder

Once the encoder produces graph-aware embeddings, the decoder scores pairs locally. This means the graph can be encoded once, then many candidate pairs can be ranked.

TransformerConv attention

Each neighbor sends a message, but attention decides how much each message should matter before the center protein updates its embedding.

[Figure: center protein u with neighbors a, b, c, and d. Example attention weights: a -> u 0.42, b -> u 0.28, c -> u 0.14.]
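The weighting step can be sketched as single-head scaled dot-product attention over the neighbors of u. This is an illustration only: the query, key, and value vectors are made up, the learned projections and multiple heads are omitted, and the resulting weights are not meant to reproduce the numbers in the figure.

```python
import math

def attention_weights(query, keys):
    """Softmax over scaled dot products between one query and neighbor keys."""
    scale = math.sqrt(len(query))
    scores = {n: sum(q * k for q, k in zip(query, key)) / scale
              for n, key in keys.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exp = {n: math.exp(s - top) for n, s in scores.items()}
    total = sum(exp.values())
    return {n: e / total for n, e in exp.items()}

def aggregate(weights, values):
    """Attention-weighted sum of the neighbor value vectors."""
    dim = len(next(iter(values.values())))
    return [sum(weights[n] * values[n][d] for n in weights) for d in range(dim)]

# Hypothetical 2-dimensional vectors for center u and neighbors a, b, c, d.
query_u = [1.0, 0.0]
keys = {"a": [2.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
values = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0], "d": [0.0, 0.0]}

weights = attention_weights(query_u, keys)
message = aggregate(weights, values)
print({n: round(w, 3) for n, w in weights.items()})
print([round(m, 3) for m in message])
```

The weights over all neighbors sum to 1, so a neighbor whose key aligns with the query of u contributes more to the aggregated message than one that does not.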

Pair decoder features

After message passing, the graph is encoded once. The decoder only receives two node embeddings, builds pair features, and maps them to an interaction probability.

  • z_i: protein i embedding
  • z_j: protein j embedding
  • z_i - z_j: directional difference
  • z_i × z_j: shared signal
  • (z_i - z_j)^2: distance signal
  • MLP layer 1: mix pair evidence
  • MLP layer 2: refine nonlinear score
  • p: output scalar
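A minimal sketch of such a decoder follows, which also shows the "encode once, rank many pairs" pattern. Several details are assumptions rather than facts from the thesis: the five feature blocks are concatenated, the product and square are taken element-wise, the MLP has one hidden ReLU layer followed by a sigmoid output, and the weights are random instead of trained.

```python
import math
import random

def pair_features(z_i, z_j):
    """Concatenate [z_i, z_j, z_i - z_j, z_i * z_j, (z_i - z_j)^2]."""
    diff = [a - b for a, b in zip(z_i, z_j)]
    prod = [a * b for a, b in zip(z_i, z_j)]
    sq = [d * d for d in diff]
    return list(z_i) + list(z_j) + diff + prod + sq

def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def decode(z_i, z_j, params):
    """Map two node embeddings to an interaction probability in (0, 1)."""
    hidden = [max(0.0, h) for h in
              linear(pair_features(z_i, z_j), params["w1"], params["b1"])]
    logit = linear(hidden, params["w2"], params["b2"])[0]
    return 1.0 / (1.0 + math.exp(-logit))

# Untrained, randomly initialized parameters (hypothetical sizes).
rng = random.Random(0)
dim, hidden_size = 3, 4
params = {
    "w1": [[rng.uniform(-0.5, 0.5) for _ in range(5 * dim)] for _ in range(hidden_size)],
    "b1": [0.0] * hidden_size,
    "w2": [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]],
    "b2": [0.0],
}

# Embeddings would come from the encoder; these values are made up.
embeddings = {"P1": [0.2, -0.1, 0.7], "P2": [0.3, 0.0, 0.6], "P3": [-0.9, 0.4, 0.1]}
candidates = [("P1", "P2"), ("P1", "P3"), ("P2", "P3")]
scores = {pair: decode(embeddings[pair[0]], embeddings[pair[1]], params)
          for pair in candidates}
ranked = sorted(candidates, key=scores.get, reverse=True)
print(ranked)
```

Because the decoder only reads from the embedding table, scoring a new candidate pair never requires another message-passing pass. Note that z_i - z_j changes sign when the pair is swapped, so with this feature set the score for (i, j) is not guaranteed to equal the score for (j, i).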

Results

The model achieved strong global separability and very high early precision. The early precision result is especially important because in a biological workflow the goal is often to produce a short list of strong candidates for further validation.

  • AUROC 0.96: strong ranking ability between true and negative pairs.
  • AUPRC 0.89: robust performance under the 1:10 positive-negative skew.
  • P@500 = 1.00: all top-500 predictions in the test setup were true positives.
  • Precision: 82% at threshold 0.5.
  • Recall: about 81% at threshold 0.5.
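For readers less familiar with these metrics, the sketch below shows how Precision@k and thresholded precision and recall are computed from scores and labels. The scores and labels are toy values, not thesis results, and treating a score equal to the threshold as a positive prediction is an assumption.

```python
def precision_at_k(scores, labels, k):
    """Fraction of true positives among the k highest-scoring pairs."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)

def precision_recall(scores, labels, threshold=0.5):
    """Precision and recall when pairs scoring >= threshold are called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy predictions: 1 = true interaction, 0 = sampled negative.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

p_at_2 = precision_at_k(scores, labels, k=2)
precision, recall = precision_recall(scores, labels, threshold=0.5)
print(p_at_2, precision, recall)
```

Precision@k looks only at the top of the ranking, which is why it can reach 1.00 even when precision at a fixed threshold is lower.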

Topology bias

The model was strongest in dense regions of the graph. This makes sense: GNNs rely on neighborhood information, so hub proteins and clustered regions provide more signal.

  • Recovered test edges: involved proteins with higher-than-average degree.
  • Dense regions: more shared neighbors and stronger contextual evidence.
  • Low-degree proteins: harder to score confidently because their neighborhoods contain less signal.
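One way to check for this kind of bias is to compare the average endpoint degree of recovered and missed test edges. The sketch below assumes degrees are counted on the training graph and uses a made-up toy graph. It illustrates the comparison, not the exact analysis from the thesis.

```python
from collections import Counter

def degrees(edges):
    """Count how many edges touch each node."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg

def mean_endpoint_degree(edges, deg):
    """Average degree over both endpoints of every edge in the list."""
    values = [deg[n] for edge in edges for n in edge]
    return sum(values) / len(values) if values else 0.0

# Toy training graph: H is a hub, X and Y sit in a sparse region.
train_edges = [("H", "A"), ("H", "B"), ("H", "C"), ("H", "D"),
               ("A", "B"), ("C", "D"), ("X", "Y")]
deg = degrees(train_edges)

# Hypothetical split of held-out edges by whether the model recovered them.
recovered = [("A", "C"), ("B", "D")]
missed = [("X", "D")]

print(mean_endpoint_degree(recovered, deg))
print(mean_endpoint_degree(missed, deg))
```

A consistently higher mean for recovered edges than for missed ones would point to the degree-related bias described above.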

Practical interpretation

The model is not only learning sequence compatibility. It is also learning how interaction likelihood behaves inside the observed graph topology. This is useful, but it introduces a blind spot for sparse regions.

NOTCH2 case study

The NOTCH2 analysis compared a denser lower-confidence graph with the high-confidence graph used for training. The main point is that graph context changes model confidence dramatically.

Dense LC-250 graph

In the denser graph, recovered edges showed a sharp high-confidence region near p = 0.95. The model became bolder because NOTCH2 had richer local topology.

[Figure: bars comparing graph context density, high-confidence predictions, and model caution.]

Limitations and future work

Limitations

  • Topology bias: less reliable on low-degree proteins.
  • Single data source: trained only on STRING@950.
  • No external baseline: hard to compare directly with other studies.
  • Scalability limits: very large graphs still need optimized batching.

Future work

  • Generalization: train and evaluate across multiple PPI sources.
  • Explainability: expose which paths, motifs, or sequence features drive predictions.
  • Scalability: use subgraph batching and curriculum schedules for larger graphs.