On 16.04.20 at 02:54, Deklan Webster wrote:
I think I discovered the issue, totally by accident! While setting up the comparison between the reconstructive method, the `get_edges_prob` method, and the simple stacked topological similarity classifier, I accidentally set the graph to undirected without setting it back to directed. The results finally appeared to be reasonable. I've gone back and tested on the graph I've been testing on, and indeed:
Directed: ~13/1000 correctly identified
Undirected: 488/1000 correctly identified
A giant improvement! Of course, I'm relieved, since this solves my problem (at the least, I can just set the graph to undirected to compute this).
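For concreteness, the toggle amounts to something like this (a minimal sketch; the graph and the reconstruction step are placeholders):

```python
import graph_tool.all as gt

g = gt.collection.data["polblogs"]  # placeholder: any directed graph
g.set_directed(False)               # treat the graph as undirected
# ... run the reconstruction / edge-prediction code here ...
g.set_directed(True)                # restore the original directedness
```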
Let's pause and appreciate the learning opportunity. After a considerable amount of unsubstantiated speculation about bugs in the code, and even about the underlying soundness of the algorithm, the problem turned out to be a basic misuse, one which was impossible to guess from the information given. It's also a lesson for me, in fact about something I thought I already knew: it's a pointless exercise to try to troubleshoot something without a _minimal_ and _complete_ example at hand.
But now I'm wondering what's going on. Is there some predictable reason why the directed case is performing so much worse? Something to do with directed graphs and equilibration? But if it were to do with equilibration, why did this issue never affect `get_edges_prob`'s performance? Could there be a bug in how graph-tool handles the directed case? In the paper it seems you've worked with the directed case before, so I assume you would have noticed if there were.
I'm done with the wild guesses; this is not productive.
To be sure, I tested this on the "polblogs" graph as well. Undirected: reconstructive performance is great; directed: terrible.
I have actually investigated the polblogs network a bit in the past. Despite being directed, most edges of that network are reciprocal, so in fact it is mostly an undirected network. The directed SBM does not generate reciprocal edges very well, hence it is not a good predictor in this case. The undirected model is a better fit.
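This is easy to check directly; a minimal sketch using graph-tool's edge_reciprocity():

```python
import graph_tool.all as gt

g = gt.collection.data["polblogs"]
# Fraction of directed edges (u, v) for which the reverse edge (v, u)
# also exists; values close to 1 mean the network is mostly reciprocal,
# i.e. effectively undirected.
print(gt.edge_reciprocity(g))
```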
I've read the documentation. The docstring says "Inverse temperature". The relevant section of the docs says:
An alternative is to use a greedy algorithm based on a merge-split MCMC with zero temperature [peixoto-merge-split-2020] <https://graph-tool.skewed.de/static/doc/demos/inference/inference.html#peixoto-merge-split-2020>, which requires a single global state, and thus can reduce memory usage.
Just from this I don't understand. Not setting beta to infinity has already reduced memory usage for me. I don't understand the significance of this being greedy, or why it doesn't apply to my situation. If the explanation of all this is in one of the papers, please point to it and I'll read it.
The documentation assumes a basic understanding of how MCMC works, without which a proper use of the methods described is not possible, unfortunately. The MCMC algorithm (for reconstruction) attempts to sample from a distribution \propto P(A,b)^beta, where A is the network, b is a partition, and beta is the inverse temperature parameter. Setting beta=1 means sampling from P(A,b), whereas beta=infinity means finding a single (A,b) that maximizes the distribution. The choice beta=infinity also breaks ergodicity and detailed balance, making it a greedy heuristic rather than a proper MCMC. To compute marginal probabilities you need beta=1, i.e. you must sample from the joint distribution, not maximize it. The point about reducing memory applies only when comparing with minimize_blockmodel_dl().
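To make the distinction concrete, here is a minimal sketch using plain partition sampling for simplicity (the graph and the sweep counts are arbitrary placeholders; the same beta semantics applies to the reconstruction states):

```python
import graph_tool.all as gt
import numpy as np

g = gt.collection.data["football"]  # placeholder graph
state = gt.BlockState(g)

# beta = infinity: zero temperature. Each sweep greedily maximizes
# P(A,b); detailed balance is broken, so this is a heuristic that
# returns a single high-probability fit, not samples.
for _ in range(100):
    state.multiflip_mcmc_sweep(beta=np.inf, niter=10)

# beta = 1 (the default): proper MCMC that samples from P(A,b),
# which is what is needed to collect marginal probabilities.
bs = []
def collect(s):
    bs.append(s.b.a.copy())   # store a sampled partition

gt.mcmc_equilibrate(state, force_niter=100, mcmc_args=dict(niter=10),
                    callback=collect)
```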
This kind of comparison has been done already in https://arxiv.org/abs/1802.10582 and https://arxiv.org/abs/1909.07578. The SBM approach is the single best classifier among the more than one hundred they consider, and is marginally beaten only by a stacking of around 40 other predictors.
I've read both of these papers, and referred to the latter twice in this thread. We've already discussed that these use neither `get_edges_prob` nor the full reconstructive marginal edge probability.
On page 8 of the former, they're using this as their similarity for SBM: "s_ij = θ_i θ_j l_{g_i,g_j}"
That is in fact very similar to the conditional probability computed by get_edges_prob(). The latter is more accurate, yields an actual normalized probability, and includes information from the priors, but for sufficiently sparse and large graphs, both computations should coincide.
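For reference, a minimal sketch of scoring candidate edges with get_edges_prob() (the graph and the candidate pairs are placeholders):

```python
import graph_tool.all as gt

g = gt.collection.data["polblogs"]    # placeholder graph
state = gt.minimize_blockmodel_dl(g)  # fitted SBM

# Conditional log-probability of each candidate edge, given the rest of
# the network; higher values indicate more plausible missing edges.
candidates = [(0, 1), (2, 3)]         # placeholder vertex pairs
for u, v in candidates:
    logp = state.get_edges_prob([(u, v)], entropy_args=dict(partition_dl=False))
    print(u, v, logp)
```

Best,
Tiago

--
Tiago de Paula Peixoto <tiago@skewed.de>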