Interactive thesis summary
A web-native summary of my bachelor’s thesis: predicting missing protein-protein interactions by combining protein language-model embeddings with graph neural networks.
The idea is simple: proteins become nodes, known interactions become edges, ProtT5 provides sequence-based node features, and the GNN learns graph-aware protein embeddings that can be scored pairwise.
Amino-acid strings.
1024-dimensional node features.
Proteins as nodes, PPIs as edges.
Message passing over the network.
Pairwise interaction score.
The starting point is the amino-acid sequence of each protein. This gives the model biological information before it sees the graph.
I used the STRING v12 human protein-protein interaction network and kept only very high-confidence edges. Filtering at a combined score ≥ 950 (on STRING's 0 to 1000 scale) reduces noise and makes the graph more reliable for training, at the cost of removing many lower-confidence interactions.
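A rough sketch of this kind of filtering step, not the thesis code. It assumes the STRING links format with one pair per line as `protein1 protein2 combined_score`. The function name `filter_edges` and the protein identifiers are made up for illustration.

```python
from typing import Iterable, Set, Tuple


def filter_edges(lines: Iterable[str], min_score: int = 950) -> Set[Tuple[str, str]]:
    """Keep undirected edges whose combined score is at least min_score."""
    edges: Set[Tuple[str, str]] = set()
    for line in lines:
        parts = line.split()
        # Skip the header row and anything that is not a 3-column record.
        if len(parts) != 3 or parts[0] == "protein1":
            continue
        a, b, score = parts[0], parts[1], int(parts[2])
        if score >= min_score and a != b:
            # Store each pair in sorted order so A-B and B-A collapse to one edge.
            edges.add((min(a, b), max(a, b)))
    return edges


# Hypothetical records; real STRING identifiers look like "9606.ENSP...".
sample = [
    "protein1 protein2 combined_score",
    "9606.PROT_A 9606.PROT_B 999",
    "9606.PROT_B 9606.PROT_A 999",  # same interaction, listed in reverse
    "9606.PROT_A 9606.PROT_C 700",  # below the threshold, dropped
    "9606.PROT_C 9606.PROT_D 950",  # exactly at the threshold, kept
]
high_conf = filter_edges(sample)
print(sorted(high_conf))
```

Lowering `min_score` in the same function gives the denser, noisier graph that the threshold trades away.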
TAGConv aggregates multi-hop neighborhood information. Use the buttons to see how a center protein can receive information from 1-hop, 2-hop, and 3-hop neighborhoods.
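To make the hop idea concrete, here is a small sketch that lists which proteins sit exactly 1, 2, and 3 hops from a center node. It only illustrates the receptive field, not the TAGConv layer itself. The toy graph and the name `hop_neighborhoods` are assumptions for illustration.

```python
from collections import deque
from typing import Dict, List, Set


def hop_neighborhoods(adj: Dict[str, List[str]], center: str, max_hops: int = 3) -> Dict[int, Set[str]]:
    """Breadth-first search that groups nodes by their exact hop distance from center."""
    dist = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand past the requested radius
        for neighbor in adj.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    rings: Dict[int, Set[str]] = {k: set() for k in range(1, max_hops + 1)}
    for node, d in dist.items():
        if d >= 1:
            rings[d].add(node)
    return rings


# Toy undirected graph: a chain P0-P1-P2-P3-P4 plus one extra neighbor P5 of P0.
toy_graph = {
    "P0": ["P1", "P5"],
    "P1": ["P0", "P2"],
    "P2": ["P1", "P3"],
    "P3": ["P2", "P4"],
    "P4": ["P3"],
    "P5": ["P0"],
}
rings = hop_neighborhoods(toy_graph, "P0", max_hops=3)
print(rings)
```

In this toy graph P4 is four hops from P0, so a 3-hop receptive field never sees it.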
Once the encoder produces graph-aware embeddings, the decoder scores pairs locally. This means the graph can be encoded once, then many candidate pairs can be ranked.
Each neighbor sends a message, but attention decides how much each message should matter before the center protein updates its embedding.
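A generic sketch of attention-weighted aggregation, assuming each neighbor already has a raw attention score. How those scores are computed depends on the specific layer and is not shown here. The messages and scores are invented numbers.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_aggregate(messages: List[List[float]], scores: List[float]) -> Tuple[List[float], List[float]]:
    """Weighted sum of neighbor messages, with weights from a softmax over the scores."""
    weights = softmax(scores)
    dim = len(messages[0])
    combined = [sum(w * msg[i] for w, msg in zip(weights, messages)) for i in range(dim)]
    return combined, weights


# Three hypothetical neighbor messages; the first neighbor gets the highest raw score.
messages = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
combined, weights = attention_aggregate(messages, scores=[2.0, 0.0, 0.0])
uniform, uniform_weights = attention_aggregate(messages, scores=[0.0, 0.0, 0.0])
print(weights, combined)
```

With equal scores the result reduces to a plain mean over neighbors, so attention only changes the update when some neighbors score higher than others.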
After message passing, the graph is encoded once. The decoder only receives two node embeddings, builds pair features, and maps them to an interaction probability.
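A minimal sketch of a pairwise decoder along these lines. The pair features used here (element-wise product and absolute difference) and the single logistic layer are assumptions chosen for illustration, as are the embeddings and weights. The thesis decoder may build its features differently.

```python
import math
from typing import List, Sequence


def pair_features(h_u: Sequence[float], h_v: Sequence[float]) -> List[float]:
    """Symmetric pair features: element-wise product followed by absolute difference."""
    product = [a * b for a, b in zip(h_u, h_v)]
    abs_diff = [abs(a - b) for a, b in zip(h_u, h_v)]
    return product + abs_diff


def score_pair(h_u: Sequence[float], h_v: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Map pair features to a probability with a linear layer and a sigmoid."""
    feats = pair_features(h_u, h_v)
    logit = sum(w * f for w, f in zip(weights, feats)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical 3-dimensional embeddings, standing in for the encoder output.
embeddings = {
    "P1": [0.5, -1.0, 0.2],
    "P2": [0.4, -0.8, 0.1],
    "P3": [-0.9, 0.7, 0.3],
}
# Made-up parameters: reward agreement (product), penalize distance (difference).
weights = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]
bias = 0.0

candidates = [("P1", "P2"), ("P1", "P3"), ("P2", "P3")]
ranked = sorted(
    candidates,
    key=lambda pair: score_pair(embeddings[pair[0]], embeddings[pair[1]], weights, bias),
    reverse=True,
)
print(ranked)
```

Because the decoder only reads stored embeddings, ranking more candidate pairs costs one cheap function call each and never re-runs message passing.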
The model achieved strong global separability and very high early precision (precision among the top-ranked predictions). The early-precision result is especially important because in a biological workflow the goal is often to produce a short list of strong candidates for further validation.
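One common way to quantify early precision is precision at k: the fraction of true interactions among the k highest-scored pairs. The sketch below uses invented scores and labels, not thesis results, and `precision_at_k` is an illustrative helper name.

```python
from typing import Sequence


def precision_at_k(scores: Sequence[float], labels: Sequence[int], k: int) -> float:
    """Fraction of positives (label 1) among the k highest-scored candidates."""
    if k <= 0 or k > len(scores):
        raise ValueError("k must be between 1 and the number of candidates")
    # Indices of candidates, sorted from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in order[:k]) / k


# Invented example: 5 candidate pairs with model scores and ground-truth labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
p_at_2 = precision_at_k(scores, labels, k=2)
p_at_3 = precision_at_k(scores, labels, k=3)
print(p_at_2, p_at_3)
```

A short list for wet-lab validation corresponds to a small k, which is why this end of the ranking matters more than performance averaged over all pairs.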
The model was strongest in dense regions of the graph. This makes sense: GNNs rely on neighborhood information, so hub proteins and clustered regions provide more signal.
The model is not only learning sequence compatibility. It is also learning how interaction likelihood behaves inside the observed graph topology. This is useful, but it introduces a blind spot for sparse regions.
The NOTCH2 analysis compared a denser lower-confidence graph with the high-confidence graph used for training. The main point is that graph context changes model confidence dramatically.
In the denser graph, recovered edges showed a sharp high-confidence region near p = 0.95. The model became bolder because NOTCH2 had richer local topology.