Thanks for the quick reply
Note that link prediction itself is a valid criterion for model selection, so you might just want to focus on that.
Not sure how to interpret this. Are you saying that even if the non-Poisson model has lower entropy, the Poisson model may still be 'better' for community detection if it's better at link prediction (which you suggest is most likely the case)?
I'm not sure what you mean. The site seems fine.

Searching on this page, https://git.skewed.de/count0/graph-tool, has always given me a 500 error.
You should use logsumexp():

Wouldn't that be non-incremental? Wouldn't the incremental version of this be accumulating the exp() of the log-probs at each step, and then taking the log at the end?
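To make this concrete, here is the kind of accumulation I have in mind (just a sketch; `sample_log_probs` is a placeholder name for whatever yields the per-sample log-probabilities):

```
import numpy as np

# Keep a running log-sum-exp so small probabilities don't underflow,
# then divide by the sample count (in log-space) at the end.
running = -np.inf            # log(0)
nsamples = 0
for lp in sample_log_probs:  # placeholder iterable of log-probabilities
    running = np.logaddexp(running, lp)
    nsamples += 1

log_mean_prob = running - np.log(nsamples)  # log of the mean probability
```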
The problem with this is that it will not resolve very small probabilities, since the edge would never be seen in such a case. But the whole approach would be much faster.
Yes, I ran into this problem. I'll paste the code to make sure I'm not doing anything wrong:

```
n = G.new_ep('int', 1)  # each observed edge was measured once...
x = G.new_ep('int', 1)  # ...and seen in that single measurement
state = gt.MeasuredBlockState(G, n=n, x=x, n_default=1, x_default=0,
                              nested=True, self_loops=True,
                              state_args={'deg_corr': True})

# equilibrate before collecting anything
gt.mcmc_equilibrate(state, wait=1000, epsilon=1e-5,
                    mcmc_args=dict(niter=10),
                    multiflip=True, verbose=True)

u = None  # marginal graph, accumulated across samples

def collect_marginals(s):
    global u
    u = s.collect_marginal(u)

gt.mcmc_equilibrate(state, force_niter=1000,
                    mcmc_args=dict(niter=10),
                    multiflip=True, verbose=True,
                    callback=collect_marginals)

eprob = u.ep.eprob
non_edge_found_c = 0
edge_not_found_c = 0
for v1 in G.vertices():
    for v2 in G.vertices():
        edge = G.edge(v1, v2)
        u_edge = u.edge(v1, v2)
        if not edge:
            if u_edge:
                non_edge_found_c += 1
                print("Non-edge in original, found in marginal graph.")
                print(u_edge, eprob[u_edge])
                print()
        else:
            if not u_edge:
                edge_not_found_c += 1
                print("Edge in original graph, but not an edge in marginal graph.")
                print(edge)
                print()

print(non_edge_found_c, edge_not_found_c)
```

I'm assuming that `u = s.collect_marginal(u)` takes care of counting the non-edge appearances over every sample (and not just the last one), and that if `u.edge(v1, v2)` is None the non-edge was never observed in any sample. Is that correct?

I ran this just now on a smaller graph to test: it is directed, with 475 nodes and 20,911 edges. Intuitively I would expect a reasonable number of non-edges to be observed given that there are ~21k edges, maybe at least ~500. Running the code above, `non_edge_found_c` is 13. Only 13 non-edges are observed with non-zero probability? The largest of those 13 probabilities is 0.07. And `edge_not_found_c` is 2. How should I interpret the situation where an edge is in the original graph but not in any of the marginal samples? Am I doing something wrong here? Do I need to adjust n, x, n_default, x_default?

Thanks for your help, as always.

On Tue, Apr 7, 2020 at 5:15 PM Tiago de Paula Peixoto <tiago@skewed.de> wrote:
Am 07.04.20 um 22:03 schrieb Deklan Webster:
To make sure I understand the model selection, I ran each of the four possibilities three times, each for a fixed 1000 iterations:
```
Directed: True
Num vertices 3672
Num edges 312779

Entropy for degr_corr=True, poisson=True: 1044304.4179465043
Entropy for degr_corr=True, poisson=True: 1044708.7346586153
Entropy for degr_corr=True, poisson=True: 1044863.5150718079

Entropy for degr_corr=False, poisson=True: 1094026.9370843305
Entropy for degr_corr=False, poisson=True: 1093860.5158280218
Entropy for degr_corr=False, poisson=True: 1095110.0929428462

Entropy for degr_corr=True, poisson=False: 943149.5746761584
Entropy for degr_corr=True, poisson=False: 943741.5558787254
Entropy for degr_corr=True, poisson=False: 943772.2769395269

Entropy for degr_corr=False, poisson=False: 1000768.068855249
Entropy for degr_corr=False, poisson=False: 998721.4409976124
Entropy for degr_corr=False, poisson=False: 999301.5197368631
```
So, is the following interpretation valid? Degree correction improves the result in both cases, but the Poisson multigraph doesn't make an improvement, so for community detection I should just stick with a regular degree-corrected NestedBlockState.
The Poisson multigraph will rarely improve things when doing minimization, i.e. finding the most likely partition. This is because the mode of the distribution tends to be on simple graphs. However, it will improve things when averaging over partitions. But in that case you need to compute the evidence, not the description length, to compare models, which is more difficult to do (a simple approximation is shown in the documentation).
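Schematically, the averaging side looks something like this (only a sketch, with `state` the block state being sampled; note that the posterior mean of the description length is just one of the two terms, and the evidence also needs the entropy of the partition posterior, which is the part the documentation approximates):

```
import numpy as np
import graph_tool.all as gt

dls = []  # description lengths S = -ln P(A,b) of the sampled partitions

def collect_dl(s):
    dls.append(s.entropy())

# sample from the posterior instead of minimizing
gt.mcmc_equilibrate(state, force_niter=1000, mcmc_args=dict(niter=10),
                    multiflip=True, callback=collect_dl)

# ln P(A) = <ln P(A,b)> + H(b|A); the first term is -mean(dls), and the
# second (the partition posterior entropy) still has to be estimated separately.
print("<ln P(A,b)> ~=", -np.mean(dls))
```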
Note that link prediction itself is a valid criterion for model selection, so you might just want to focus on that.
Does this have any implications for what is optimal for link prediction?
Averaging over partitions is always better for link prediction, and you need a model that works better when averaged. I would be very surprised if the Poisson model did not work better for link prediction in general.
Also, I gave `MeasuredBlockState.get_edges_prob` a shot. To try to get the run-time down I only considered vertex pairs at a distance of 1 or 2 (in the directed sense). That gives about 6 million pairs. On my reasonably fast laptop each sampling iteration took 4 minutes, so I only did 100 iterations total, over about 7 hours. Is it possible to speed this up?
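For reference, this is roughly the shape of what I ran (names are mine; `pairs` holds the precomputed (source, target) candidates, and I'm assuming `get_edges_prob` accepts a single-pair list):

```
import numpy as np

log_prob_sums = np.zeros(len(pairs))  # running sum of log-probs per pair
nsamples = 0

def collect_pair_probs(s):
    global nsamples
    for i, (src, tgt) in enumerate(pairs):
        log_prob_sums[i] += s.get_edges_prob([(src, tgt)])
    nsamples += 1

gt.mcmc_equilibrate(state, force_niter=100, mcmc_args=dict(niter=10),
                    multiflip=True, callback=collect_pair_probs)

mean_log_probs = log_prob_sums / nsamples  # arithmetic mean of the log-probs
```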
An approach which is linear in the number of nodes, rather than quadratic, is to collect the edge marginals as explained in the documentation, rather than the actual conditional probabilities. In other words, you just count how often an edge is seen or not in the posterior distribution.
The problem with this is that it will not resolve very small probabilities, since the edge would never be seen in such a case. But the whole approach would be much faster.
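Schematically, with `u` the marginal graph accumulated via collect_marginal() (as in the documentation) and `g` the observed graph, the scoring step is just:

```
# u.ep.eprob[e] is the fraction of samples in which edge e appeared; edges
# of u that are absent from the observed graph are the predicted candidates.
candidates = [e for e in u.edges()
              if g.edge(e.source(), e.target()) is None]
candidates.sort(key=lambda e: u.ep.eprob[e], reverse=True)

for e in candidates[:10]:  # top-ranked predictions
    print(int(e.source()), int(e.target()), u.ep.eprob[e])
```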
I tried to search for the code on Gitlab to check how it's implemented but I'm getting a 500 error on every search (it's been like this for me forever).
I'm not sure what you mean. The site seems fine.
Am I likely to get substantially improved results with more than 100 iterations?
I have no idea.
To try to save memory I just added the log probs together at each step and took their arithmetic mean at the end (as you can see). Is there a more meaningful mean to use here that can be computed incrementally?
You should use logsumexp():
https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.misc.logsu...
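I.e., something along these lines (with `log_probs` the array of sampled log-probabilities); note that averaging the log-probs gives the log of their geometric mean, whereas logsumexp() gives the log of the arithmetic mean of the probabilities themselves:

```
import numpy as np
from scipy.special import logsumexp

# log of the average probability, computed stably in log-space
log_mean_prob = logsumexp(log_probs) - np.log(len(log_probs))
```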
Anyway, using only 100 sampling iterations and the arithmetic mean of the log probs, this ended up being the most important feature (according to SHAP) by a pretty large margin. Seems to confirm what you were speculating on Twitter. Here are two images illustrating that:
On every test instance: https://i.imgur.com/Cx9tLJu.png
On top 100 most likely test instances: https://i.imgur.com/bfopyQj.png
For reference, the precision@100 here is 96%.
So, pretty good even though it's by far the most time-costly feature to compute. Superposed random walks and Katz index take < 1 minute each, for instance.
Well, there is no free lunch.
Best, Tiago
--
Tiago de Paula Peixoto <tiago@skewed.de>

_______________________________________________
graph-tool mailing list
graph-tool@skewed.de
https://lists.skewed.de/mailman/listinfo/graph-tool