Predicting Who Starts the Next Buying Spree!

Tags: Computer Science, ML, Graph Theory, Finance
Published: November 29, 2025
Author: Baris Ozakar
Check out the full thesis here:

Predicting Who Starts the Next Buying Spree: Influence in Socio-Financial Networks

If a handful of traders on a social investing app start buying the same stock, how far does that wave travel through the network? And more importantly: can we predict who is likely to trigger the next wave?
That is the core idea behind my thesis: “Time-Aware Centralities and Embeddings of Nodes for Influence Prediction in Evolving Socio-Financial Networks.”
In plain terms: I built a graph-based, time-aware ML pipeline to forecast how influential individual traders will be in triggering cascades of trades in a real social trading platform.

The setting: a real socio-financial network

The data comes from Invstr’s “Fantasy Finance” app, a gamified social network for trading and financial literacy with over 1.5M downloads globally. I work on an anonymized subset focused on trade posts: when a user executes a trade and shares it with followers, that creates a time-stamped interaction.
I model each trade post as a directed temporal edge:
(TradedBy, ReceivedBy, ReceivedOn)
Over a 3-month window at the peak of the 2021 bull market (October to December 2021), that yields:
  • 39 million shared trade posts
  • 72k active traders
  • 200k post-receiving users
I slice this evolving network into 6-hour snapshots and study how information and behavior propagate through it over time.
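The edge model and the 6-hour slicing described above can be sketched in a few lines. This is a minimal illustration, not the thesis code: the `TradePost` type, the helper names, the window start of 1 October 2021 (UTC), and the toy user IDs are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import NamedTuple

SNAPSHOT_SECONDS = 6 * 60 * 60  # 6-hour snapshots, as in the text


class TradePost(NamedTuple):
    """One directed temporal edge: (TradedBy, ReceivedBy, ReceivedOn)."""
    traded_by: str
    received_by: str
    received_on: datetime


def snapshot_index(ts: datetime, window_start: datetime) -> int:
    """Return the 0-based index of the 6-hour snapshot that contains ts."""
    return int((ts - window_start).total_seconds() // SNAPSHOT_SECONDS)


def slice_into_snapshots(posts, window_start):
    """Group temporal edges by the snapshot they fall into."""
    snapshots = defaultdict(list)
    for post in posts:
        snapshots[snapshot_index(post.received_on, window_start)].append(post)
    return dict(snapshots)


# Toy data: user IDs, timestamps, and the window start are made up.
start = datetime(2021, 10, 1, tzinfo=timezone.utc)
posts = [
    TradePost("u1", "u2", datetime(2021, 10, 1, 1, 30, tzinfo=timezone.utc)),
    TradePost("u1", "u3", datetime(2021, 10, 1, 5, 59, tzinfo=timezone.utc)),
    TradePost("u2", "u4", datetime(2021, 10, 1, 6, 0, tzinfo=timezone.utc)),
    TradePost("u3", "u1", datetime(2021, 10, 2, 0, 0, tzinfo=timezone.utc)),
]
snapshots = slice_into_snapshots(posts, start)
for idx in sorted(snapshots):
    print(idx, [(p.traded_by, p.received_by) for p in snapshots[idx]])
```

Each snapshot is then a small directed graph of its own, and the sequence of snapshots is what the temporal models consume.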

Four pillars of the approach

The thesis is built around four conceptual pillars that eventually feed into one prediction task: node-level influence forecasting.
  1. Centrality indices
     Classic and temporal network measures for “how important” a node is:
      • Static: in-degree, out-degree, eigenvector, PageRank, closeness, betweenness
      • Temporal: causal (time-respecting) closeness and betweenness, computed on time-respecting paths rather than static shortest paths
  2. A causal measure of influence
     I define causal cascade size: starting from a trader’s post, I follow all time-respecting causal paths through the network (under a time delta constraint), build causal trees, and count how many unique users are reached downstream. That count is the trader’s influence score in that time window.
  3. Node embeddings
     I learn representations of traders in a latent space:
      • Static: node2vec-style embeddings on aggregated graphs
      • Time-aware: embeddings produced by spatio-temporal GNNs that see both graph structure and its evolution over time
  4. Spatio-temporal graph neural networks
     I treat influence prediction as a spatio-temporal node regression problem:
      • Inputs per snapshot: centrality features + node embeddings
      • Target: the node’s future causal cascade size in an extended window (current + next snapshot), capturing current influence as well as continued relevance and influence.
Concretely, I compare several architectures:
  • Baseline: feed-forward network (FFNN) with centralities only
  • Static embedding model: FFNN + node2vec
  • Spatio-temporal GNNs:
    • GCRN-GRU, GCRN-LSTM
    • T-GCN
    • EvolveGCN-H and EvolveGCN-O
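To make the causal cascade size from the second pillar more concrete, here is a simplified sketch of the idea. It is an assumption-laden illustration, not the thesis implementation: the function name, the numeric timestamps, the rule that a follow-on hop must happen strictly later and within `delta` of the previous hop, and the choice to count direct recipients are all mine.

```python
from collections import defaultdict, deque


def causal_cascade_size(edges, root, t0, delta):
    """Count unique users reached from root's post at time t0.

    edges: iterable of (traded_by, received_by, received_on), numeric times.
    A hop (a -> b at t1) may be followed by (b -> c at t2) only if
    t1 < t2 <= t1 + delta, i.e. the path is time-respecting and each
    reaction falls inside the delta window (assumed rule).
    """
    out_edges = defaultdict(list)
    for src, dst, t in edges:
        out_edges[src].append((dst, t))

    # Seed the search with the deliveries of the root's post itself.
    queue = deque((dst, t) for dst, t in out_edges[root] if t == t0)
    seen = set(queue)          # (user, arrival time) states already queued
    reached = set()

    while queue:
        user, t_in = queue.popleft()
        reached.add(user)
        for nxt, t_out in out_edges[user]:
            state = (nxt, t_out)
            if t_in < t_out <= t_in + delta and state not in seen:
                seen.add(state)
                queue.append(state)

    reached.discard(root)      # the root does not count as reached by itself
    return len(reached)


# Toy edges with times in minutes; all values are made up.
edges = [
    ("a", "b", 0), ("a", "c", 0),   # a's post reaches b and c
    ("b", "d", 5),                  # b reacts within the window
    ("d", "e", 9),
    ("c", "f", 40),                 # too late for delta = 10
    ("e", "a", 12),                 # loops back to the root
]
print(causal_cascade_size(edges, "a", t0=0, delta=10))  # -> 4
print(causal_cascade_size(edges, "a", t0=0, delta=3))   # -> 2
```

The search tracks (user, arrival time) pairs rather than users alone, because with a delta constraint a later arrival at the same user can unlock hops that an earlier arrival cannot.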

First: what does trading actually look like?

Before predicting influence, I answer two descriptive questions about the network.

RQ1 - What traits does trading activity exhibit?

Some key patterns in the 3-month window:
  • Weekly cycles
    • Trading peaks on Mondays and declines toward the weekend
    • Weekends have consistently low activity
  • Intraday cycles
    • Peak activity around 14:00–15:00 UTC, aligning with the US market open and execution of queued orders
    • Very low activity between 21:00–12:00 UTC, aligning with market close and typical sleep hours
  • Concentration of assets
    • Out of 4,165 tradable instruments, a small fraction drives most trading volume
      Asset symbol    # of trades    % of trades
      [All trades]    1,629,882      100%
      [Top 10]        690,477        42.4%
      BTCUSD          202,083        12.4%
      ETHUSD          118,164        7.3%
      Silver          75,936         4.7%
      Gold            67,671         4.2%
      TSLA            67,256         4.1%
      AAPL            64,599         4.0%
      SNAP            46,022         2.8%
      AMZN            45,638         2.8%
  • Macro link
    • Trading activity is positively correlated with BTC absolute daily returns, BTC trading volume, and BTC volume momentum
📈 So the network is not just noisy social chatter. It is tightly coupled to both market microstructure (timing) and crypto market conditions.

RQ2 - How do individual users behave?

Looking at the time between subsequent trades, per user and globally:
  • Inter-trade durations are well modeled by a Weibull distribution with shape k < 1, which implies a decreasing trading rate over time.
  • This is driven by the app’s mechanics: users get a one-time virtual 1M USD allocation with no recurring top-ups, so many become “holders” rather than constant churners.
  • The same decreasing-trade-rate pattern appears when aggregating across all users, so individual behavior scales up to the global level.
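One way to see why a Weibull shape k < 1 corresponds to a decreasing trading rate is to look at the Weibull hazard function, h(t) = (k / λ) · (t / λ)^(k − 1), which gives the instantaneous rate of the next trade given that t time units have passed since the last one. The sketch below evaluates it for a few shapes; the parameter values are purely illustrative and are not the fitted values from the thesis.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard h(t) = (k / lam) * (t / lam) ** (k - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


# Illustrative parameters only; units of time are arbitrary.
gaps = [1.0, 2.0, 4.0, 8.0]   # time elapsed since the last trade
for k in (0.5, 1.0, 1.5):
    rates = [weibull_hazard(t, shape=k, scale=1.0) for t in gaps]
    print(k, [round(r, 3) for r in rates])
```

For k = 0.5 the rate falls as the gap grows, for k = 1 it stays constant (the exponential case), and for k = 1.5 it rises. The k < 1 case matches the "holder" behavior described above: the longer a user has gone without trading, the less likely a trade becomes in the next instant.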
For cascades themselves:
  • The majority of trade posts generate no cascade at all.
  • The distribution of positive cascade sizes follows an approximate power law:
    • Most cascades are tiny
    • A small minority grow large, reaching up to size 24 and depth 36 generations in the causal tree.
🔥 In other words: most posts are duds, but a few behave like viral memes.
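A quick way to put a number on an approximate power law is the standard maximum-likelihood estimate of the tail exponent, α̂ = 1 + n / Σ ln(x_i / x_min). The sketch below applies it to synthetic data drawn from a known power law, since the real cascade sizes are not reproduced here. Note the assumptions: this is the continuous-data estimator, so for integer cascade sizes it is only a rough approximation, and the function name, `x_min`, and the true exponent of 2.5 are invented for the example.

```python
import math
import random


def powerlaw_alpha_mle(samples, x_min):
    """Continuous power-law MLE: alpha = 1 + n / sum(ln(x / x_min))."""
    tail = [x for x in samples if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    return 1.0 + len(tail) / log_sum


# Synthetic sample via inverse-transform sampling; not thesis data.
rng = random.Random(0)
true_alpha, x_min = 2.5, 1.0
synthetic = [
    x_min * (1.0 - rng.random()) ** (-1.0 / (true_alpha - 1.0))
    for _ in range(20000)
]
alpha_hat = powerlaw_alpha_mle(synthetic, x_min)
print(round(alpha_hat, 2))
```

On a sample of this size the estimate should land close to the true exponent; on real cascade data the choice of `x_min` matters a lot and deserves its own goodness-of-fit check.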

Do classic centralities actually predict influence?

This leads to the next pair of questions:

RQ3: Can centrality measures be used as node features to predict influence?

RQ4: Which centralities are effective?

To probe this, I treat each trader’s centrality time series and causal cascade size time series as signals and run Granger causality tests.
Key result:
  • All explored static and temporal centralities (in-degree, out-degree, eigenvector, PageRank, closeness, betweenness, temporal closeness, temporal betweenness) Granger-cause future cascade sizes at the 5% significance level.
    Type        Centrality     Lag 1     Lag 2     Lag 3
    Static      In-degree      46%       47%       50.4%
    Static      Out-degree     100%      100%      99.9%
    Static      PageRank       91.2%     92.1%     90.8%
    Static      Closeness      39.4%     38.2%     43.4%
    Static      Betweenness    86.7%     80.7%     78.6%
    Static      Eigenvector    27.6%     28.6%     29.3%
    Temporal    Closeness      45.7%     46.4%     50.1%
    Temporal    Betweenness    76.2%     71.4%     69.4%
  • When I temporally shuffle the centrality series (destroying time structure), this predictive signal disappears and p-values jump back to around 5 percent.
💡 So classic centrality is not just a descriptive toy here. It carries real predictive information about who will spark larger cascades.

Adding time: temporal centralities and time-aware embeddings

The last two research questions ask whether explicit time-awareness buys us anything beyond static graphs.

RQ5 - Do temporal centralities help beyond static ones?

I compare models trained with:
  • Only static centralities, vs.
  • Static + temporal centralities
Result: adding temporal centralities improves at least one metric for 5 out of 7 models.
Notable MAE improvements when adding temporal centralities:
  • FFNN: 27.9%
  • node2vec + FFNN: 20.0%
  • GCRN-GRU: 7.33%
So knowing when a node sits on time-respecting paths (not just where it sits in the static graph) helps the model.
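For readers who want to reproduce this kind of comparison, the sketch below shows how a relative MAE improvement can be computed. It assumes the percentages above are relative reductions in MAE against the static-only model, which is my reading rather than something stated explicitly; the predictions are toy numbers.

```python
def mae(y_true, y_pred):
    """Mean absolute error between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_improvement(baseline_mae, new_mae):
    """Percentage reduction in MAE relative to the baseline."""
    return 100.0 * (baseline_mae - new_mae) / baseline_mae


# Toy cascade sizes and predictions; all values are made up.
y_true = [0, 2, 5, 1]
pred_static = [1, 1, 3, 3]             # model with static centralities only
pred_static_temporal = [0, 1, 4, 2]    # model with static + temporal

mae_static = mae(y_true, pred_static)
mae_both = mae(y_true, pred_static_temporal)
print(mae_static, mae_both, relative_improvement(mae_static, mae_both))
# -> 1.5 0.75 50.0
```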

RQ6 - Do time-aware node embeddings improve influence prediction?

Here I move from centralities into representation learning and ask whether spatio-temporal embeddings beat static ones.
Benchmarks:
  • No embeddings: FFNN + centralities only
  • Static embeddings: node2vec + FFNN
  • Time-aware embeddings: spatio-temporal GNNs (GCRN variants, T-GCN, EvolveGCN-H/O)
Two layers of comparison:
  1. All embedding-based models vs FFNN (no embeddings)
      • Every node2vec and spatio-temporal GNN model significantly outperforms the FFNN in MAE, with improvements like:
        • node2vec: ~28–35%
        • T-GCN: ~36–68%
        • EvolveGCN-H: ~39–61%
        • EvolveGCN-O: ~34–61%
  2. Spatio-temporal GNNs vs node2vec (static)
      • Almost all spatio-temporal models significantly improve MAE over node2vec:
        • GCRN-LSTM: up to 56% better
        • T-GCN: ~10–50% better
        • EvolveGCN-H/O: ~15–40% better in many setups
Overall conclusion: time-aware node embeddings that learn both graph structure and its evolution in time materially improve influence prediction over static embeddings.

Why this matters

From a research perspective, this work shows that:
📈
  • Peer effects in financial markets can be quantified through causal cascades on real social trading data.
  • Classic network tools still matter, but their temporal variants and learned embeddings amplify predictive power.
🚗
  • Spatio-temporal graph neural networks are not just for traffic prediction and sensor networks; they are effective for socio-financial influence modeling too.
From a practical perspective, this type of modeling could support:
🫧
  • Early detection of herding and bubble-like cascades
🧠
  • Better understanding of who to monitor as potential “super-spreaders” of trading sentiment
  • Risk management and platform design that actively account for network-driven contagion, not just fundamentals

Limitations and next steps

There are some important caveats:
  • The analysis uses only 3 months out of 3.5 years of data, chosen for computational tractability and because they coincide with a bull market. The dynamics may differ in bear or sideways regimes.
  • Time is rescaled from millisecond to minute precision, which may miss ultra-fast reactions, especially from bots.
A natural next step is a contagion study: explicitly seeding influential users and simulating how far and how fast influence spreads under different market regimes, and contrasting simple vs complex contagion in financial peer influence.

If you work at the intersection of ML, graph theory, and finance, the takeaway is simple:
📈 Do not just ask what people are trading. Ask who is connected to whom, when they act, and how that behavior propagates through the network.
Because the next buying spree might already be visible as a pattern of influence in your socio-financial graph.