On 09.05.2016 16:33, Andrea Briega wrote:
The minimize_blockmodel_dl algorithm takes an hour to finish on my network with 21000 nodes (like the hierarchical version), but two and a half days with overlap. However, I have also run a hierarchical analysis with overlap, and it has been running for 14 days now. So my first question is: is this time normal, or could there be a problem? Do you know how long it usually takes?
The overlapping model is indeed much more time consuming than the non-overlapping one. This is because in the overlapping case we need to partition the _half-edges_ instead of the nodes, as we do in the non-overlapping case. The number of half-edges is twice the number of edges, which is typically one or two orders of magnitude larger than the number of nodes. Because of this, and because group mixtures are more expensive to represent algorithmically, the overlapping model tends to be slower. So the time difference you are observing is expected, although it is difficult to say how long it will in fact take, since the running time depends on the number of edges, not nodes.

In my experience, there are comparatively few networks that are better modelled by the overlapping model (see http://dx.doi.org/10.1103/PhysRevX.5.011033). Usually, the non-overlapping model provides a shorter description length, indicating that it has better explanatory power. Because of this, I generally tend to work with the non-overlapping model, unless there is a very specific reason not to.
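To compare the two on your own data, a minimal sketch could look like the following (the dataset is a stand-in, and the overlap keyword follows the graph-tool API of this period; newer releases may expose the overlapping model differently, so check your version's documentation):

import graph_tool.all as gt

g = gt.collection.data["polbooks"]  # stand-in for your own network

# Non-overlapping fit (the default)
state = gt.minimize_blockmodel_dl(g)

# Overlapping fit; this partitions half-edges, hence the longer running time
state_ov = gt.minimize_blockmodel_dl(g, overlap=True)

# The model with the smaller description length (in nats) is preferred
print("non-overlapping DL:", state.entropy())
print("overlapping DL:    ", state_ov.entropy())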
Secondly, I have repeated some of these analyses with exactly the same options, but I get different solutions (similar, but different), so I wonder if the algorithm is heuristic (I thought it was exact).
No, it is not exact. This type of inference problem is NP-hard, hence exact algorithms are not feasible. In the default parametrization, the algorithm implemented is an agglomerative heuristic. The algorithm is stochastic, so it will indeed return a slightly different fit every time. Thus, the typical practice is to perform several fits and choose the one with the smallest description length. If more precision is desired, the algorithm can also be configured via the appropriate parameters to run a unity-temperature MCMC, Gibbs sampling, or simulated annealing. These, of course, are numerically more expensive.
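For instance, the multiple-fit practice could look like this minimal sketch (the dataset and the number of repetitions are arbitrary placeholders):

import graph_tool.all as gt

g = gt.collection.data["football"]  # stand-in for your own network

# Perform several independent fits and keep the one with the
# smallest description length
best = None
for _ in range(10):
    state = gt.minimize_blockmodel_dl(g)
    if best is None or state.entropy() < best.entropy():
        best = state

print("best description length:", best.entropy())

The resulting fit can then be refined further with unity-temperature MCMC sweeps via mcmc_sweep, whose exact calling convention depends on the graph-tool version.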
My last question regards bipartite analysis. I have two types of nodes in my network, and I wonder if there is any analytical difference when running these algorithms with the bipartite option (clabel=True, with different labels in each group of nodes) or without it, because it seems that the program “knows” my network is bipartite in any case. If there are differences between bipartite and “unipartite” analysis (clabel=False), is it possible to compare their description lengths for model selection?
The clabel parameter should not be Boolean; it needs to be a property map that marks the partition constraints for each node. Namely, vertices with different label values will not be clustered in the same group. This does not affect the computation of the description length; it simply guides the algorithm so that violating moves are not attempted. Indeed, as you noticed, the algorithm will always learn that you have a bipartite network if that is the case, hence the 'clabel' parameter does not need to be used for that purpose. It is useful only in situations where there are other types of forbidden partitions that the algorithm would otherwise run into. Note that if the agglomerative heuristic is turned off, and MCMC is used instead, the algorithm may in fact put nodes belonging to different partitions of a bipartite network in the same group. The probability of this happening will depend on how "strongly" bipartite the network is. In this case, the clabel parameter is useful to forbid those non-bipartite divisions, if they are undesired.

Note that there is also a 'pclabel' parameter. This is almost identical to 'clabel', with the only difference that it _is_ in fact used to compute the description length. Namely, it assumes that this division is known a priori, and hence the description length for the partition will include only subdivisions thereof. This is useful, for example, when you know from context that your network is bipartite, and this is not a fact you wish to learn. This generalizes also to other types of known divisions. I recommend using 'pclabel', if this is the case.

Best,
Tiago

--
Tiago de Paula Peixoto <tiago@skewed.de>
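P.S. To illustrate the difference between the two parameters, here is a minimal sketch on a toy bipartite graph (recent graph-tool versions pass these constraints to the state via state_args, as below; older versions accepted clabel/pclabel as direct keywords, so check your version's documentation):

import graph_tool.all as gt

# Toy bipartite graph: vertices 0-2 on one side, 3-5 on the other
g = gt.Graph(directed=False)
g.add_vertex(6)
g.add_edge_list([(0, 3), (0, 4), (1, 4), (1, 5), (2, 3), (2, 5)])

side = g.new_vertex_property("int")
for v in g.vertices():
    side[v] = 1 if int(v) >= 3 else 0

# clabel: hard constraint only; does not enter the description length
state_c = gt.minimize_blockmodel_dl(g, state_args=dict(clabel=side))

# pclabel: the bipartition is assumed known a priori, and the description
# length accounts only for subdivisions of it
state_p = gt.minimize_blockmodel_dl(g, state_args=dict(pclabel=side))

print(state_c.entropy(), state_p.entropy())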