On 16.04.20 at 02:54, Deklan Webster wrote:
I think I discovered the issue, totally by accident! While setting up the comparison between the reconstructive method, the `get_edges_prob` method, and the simple stacked topological similarity classifier, I accidentally set the graph to undirected without setting it back to directed. The results finally appeared to be reasonable. I've gone back and tested on the graph I've been testing on, and indeed:
Directed: ~13/1000 correctly identified
Undirected: 488/1000 correctly identified
A giant improvement! Of course, I'm relieved, since this solves my problem (at the least, I can just set the graph to undirected to compute this).
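For concreteness, the toggle amounts to something like this (a minimal sketch; the graph and the reconstruction step are placeholders):

```python
import graph_tool.all as gt

g = gt.collection.data["polblogs"]  # placeholder: any directed graph
g.set_directed(False)               # treat the graph as undirected
# ... run the reconstruction / edge-prediction code here ...
g.set_directed(True)                # restore the original directedness
```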
Let's pause and appreciate the learning opportunity. After a considerable amount of unsubstantiated speculation about bugs in the code, and even about the underlying soundness of the algorithm, the problem turned out to be a basic misuse, one which was impossible to guess from the information given. It's also a lesson for me, in fact about something I thought I already knew: it's a pointless exercise to try to troubleshoot something without a _minimal_ and _complete_ example at hand.
But now I'm wondering what's going on. Is there some predictable reason why the directed case is performing so much worse? Something to do with directed graphs and equilibration? But if it were to do with equilibration, why did this issue never affect `get_edges_prob`'s performance? Could there be a bug in how graph-tool handles the directed case? In the paper it seems you've worked with the directed case before, so I assume you would have noticed if there were.
I'm done with the wild guesses; this is not productive.
To be sure, I tested this on the "polblogs" graph as well. Undirected: reconstructive performance is great; directed: terrible.
I have actually investigated the polblogs network a bit in the past. Despite being directed, most edges of that network are reciprocal, so in fact it is mostly an undirected network. The directed SBM does not generate reciprocal edges very well, hence it is not a good predictor in this case. The undirected model is a better fit.
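This is easy to check directly; a minimal sketch using graph-tool's edge_reciprocity():

```python
import graph_tool.all as gt

g = gt.collection.data["polblogs"]
# Fraction of directed edges (u, v) for which the reverse edge (v, u)
# also exists; values close to 1 mean the network is mostly reciprocal,
# i.e. effectively undirected.
print(gt.edge_reciprocity(g))
```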
I've read the documentation. The docstring says "Inverse temperature". The relevant section of the docs says:
An alternative is to use a greedy algorithm based on a merge-split MCMC with zero temperature [peixoto-merge-split-2020] <https://graph-tool.skewed.de/static/doc/demos/inference/inference.html#peixoto-merge-split-2020>, which requires a single global state, and thus can reduce memory usage.
Just from this I don't understand. Not setting beta to infinity has already reduced memory usage for me. I don't understand the significance of this being greedy, or why it doesn't apply to my situation. If the explanation of all this is in one of the papers, please point to it and I'll read it.
The documentation assumes a basic understanding of how MCMC works, without which a proper use of the methods described is not possible, unfortunately. The MCMC algorithm (for reconstruction) attempts to sample from a distribution \propto P(A,b)^beta, where A is the network, b is a partition, and beta is the inverse temperature parameter. Setting beta=1 means sampling from P(A,b), whereas beta=infinity means finding a single (A,b) that maximizes the distribution. The choice beta=infinity also breaks ergodicity and detailed balance, making it a greedy heuristic rather than a proper MCMC. To compute marginal probabilities you need beta=1, i.e. you must sample from the joint distribution, not maximize it. The point about reducing memory applies only when comparing with minimize_blockmodel_dl().
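To make the distinction concrete, here is a minimal sketch using plain partition sampling for simplicity (the graph and the sweep counts are arbitrary placeholders; the same beta semantics applies to the reconstruction states):

```python
import graph_tool.all as gt
import numpy as np

g = gt.collection.data["football"]  # placeholder graph
state = gt.BlockState(g)

# beta = infinity: zero temperature. Each sweep greedily maximizes
# P(A,b); detailed balance is broken, so this is a heuristic that
# returns a single high-probability fit, not samples.
for _ in range(100):
    state.multiflip_mcmc_sweep(beta=np.inf, niter=10)

# beta = 1 (the default): proper MCMC that samples from P(A,b),
# which is what is needed to collect marginal probabilities.
bs = []
def collect(s):
    bs.append(s.b.a.copy())   # store a sampled partition

gt.mcmc_equilibrate(state, force_niter=100, mcmc_args=dict(niter=10),
                    callback=collect)
```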
This kind of comparison has been done already in https://arxiv.org/abs/1802.10582 and https://arxiv.org/abs/1909.07578. The SBM approach is the single best classifier among the more than one hundred they consider, and is marginally beaten only by a stacking of around 40 other predictors.
I've read both of these papers, and referred to the latter twice in this thread. We've already discussed that these use neither `get_edges_prob` nor the full reconstructive marginal edge probability.
On page 8 of the former, they're using this as their similarity for SBM: "s_ij = θ_i θ_j l_{g_i,g_j}"
That is in fact very similar to the conditional probability computed by get_edges_prob(). The latter is more accurate, yields an actual normalized probability, and includes information from the priors, but for sufficiently sparse and large graphs, both computations should coincide.
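For reference, a minimal sketch of scoring candidate edges with get_edges_prob() (the graph and the candidate pairs are placeholders):

```python
import graph_tool.all as gt

g = gt.collection.data["polblogs"]    # placeholder graph
state = gt.minimize_blockmodel_dl(g)  # fitted SBM

# Conditional log-probability of each candidate edge, given the rest of
# the network; higher values indicate more plausible missing edges.
candidates = [(0, 1), (2, 3)]         # placeholder vertex pairs
for u, v in candidates:
    logp = state.get_edges_prob([(u, v)], entropy_args=dict(partition_dl=False))
    print(u, v, logp)
```

Best,
Tiago

--
Tiago de Paula Peixoto <tiago@skewed.de>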