On 09.05.2016 16:33, Andrea Briega wrote:
The minimize_blockmodel_dl algorithm takes an hour to finish on my network with 21000 nodes (like the hierarchical version), but two and a half days with overlap. However, I have also run a hierarchical analysis with overlap, and it has been running for 14 days now. So my first question is: is this time normal, or could there be a problem? Do you know how long it usually takes?
The overlapping model is indeed much more time consuming than the non-overlapping one. This is because in the overlapping case we need to partition the _half-edges_ instead of the nodes, as we do in the non-overlapping case. The number of half-edges is twice the number of edges, which is typically one or two orders of magnitude larger than the number of nodes. Because of this, and because group mixtures are more expensive to represent algorithmically, the overlapping model tends to be slower. So the time difference you are observing is expected, although it is difficult to say how long it will in fact take, since the running time depends on the number of edges, not nodes.

In my experience, there are comparatively few networks that are better modelled by the overlapping model (see http://dx.doi.org/10.1103/PhysRevX.5.011033). Usually, the non-overlapping model provides a shorter description length, indicating that it has better explanatory power. Because of this, I generally tend to work with the non-overlapping model, unless there is a very specific reason not to.
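To compare the two on your own data, a minimal sketch could look like the following (the dataset is a stand-in, and the overlap keyword follows the graph-tool API of this period; newer releases may expose the overlapping model differently, so check your version's documentation):

import graph_tool.all as gt

g = gt.collection.data["polbooks"]  # stand-in for your own network

# Non-overlapping fit (the default)
state = gt.minimize_blockmodel_dl(g)

# Overlapping fit; this partitions half-edges, hence the longer running time
state_ov = gt.minimize_blockmodel_dl(g, overlap=True)

# The model with the smaller description length (in nats) is preferred
print("non-overlapping DL:", state.entropy())
print("overlapping DL:    ", state_ov.entropy())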
Secondly, I have repeated some of these analyses with exactly the same options, but I get different solutions (similar, but different), so I wonder if the algorithm is heuristic (I thought it was exact).
No, it is not exact. This type of inference problem is NP-hard, hence exact algorithms are not feasible. In the default parametrization, the algorithm implemented is an agglomerative heuristic. The algorithm is stochastic, so it will indeed return a slightly different fit every time. Thus, the typical practice is to perform several fits and choose the one with the smallest description length. If more precision is desired, the algorithm can also be configured via the appropriate parameters to run a unity-temperature MCMC, Gibbs sampling, or simulated annealing. These, of course, are numerically more expensive.
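For instance, the multiple-fit practice could look like this minimal sketch (the dataset and the number of repetitions are arbitrary placeholders):

import graph_tool.all as gt

g = gt.collection.data["football"]  # stand-in for your own network

# Perform several independent fits and keep the one with the
# smallest description length
best = None
for _ in range(10):
    state = gt.minimize_blockmodel_dl(g)
    if best is None or state.entropy() < best.entropy():
        best = state

print("best description length:", best.entropy())

The resulting fit can then be refined further with unity-temperature MCMC sweeps via mcmc_sweep, whose exact calling convention depends on the graph-tool version.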
My last question regards bipartite analysis. I have two types of nodes in my network, and I wonder if there is any analytical difference when running these algorithms with the bipartite option (clabel=True, with different labels in each group of nodes) or without it, because it seems that the program “knows” my network is bipartite in any case. If there are differences between bipartite and “unipartite” analysis (clabel=False), is it possible to compare their description lengths for model selection?
The clabel parameter should not be Boolean; it needs to be a property map that marks the partition constraints for each node. Namely, vertices with different label values will not be clustered in the same group. This does not affect the computation of the description length; it simply guides the algorithm so that violating moves are not attempted. Indeed, as you noticed, the algorithm will always learn that you have a bipartite network if that is the case, hence the 'clabel' parameter does not need to be used for that purpose. It is useful only in situations where there are other types of forbidden partitions that the algorithm would otherwise run into. Note that if the agglomerative heuristic is turned off, and MCMC is used instead, the algorithm may in fact put nodes belonging to different partitions of a bipartite network in the same group. The probability of this happening will depend on how "strongly" bipartite the network is. In this case, the clabel parameter is useful to forbid those non-bipartite divisions, if they are undesired.

Note that there is also a 'pclabel' parameter. This is almost identical to 'clabel', with the only difference that it _is_ in fact used to compute the description length. Namely, it assumes that this division is known a priori, and hence the description length for the partition will include only subdivisions thereof. This is useful, for example, when you know from context that your network is bipartite, and this is not a fact you wish to learn. This generalizes also to other types of known divisions. I recommend using 'pclabel', if this is the case.

Best,
Tiago

--
Tiago de Paula Peixoto <tiago@skewed.de>
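P.S. To illustrate the difference between the two parameters, here is a minimal sketch on a toy bipartite graph (recent graph-tool versions pass these constraints to the state via state_args, as below; older versions accepted clabel/pclabel as direct keywords, so check your version's documentation):

import graph_tool.all as gt

# Toy bipartite graph: vertices 0-2 on one side, 3-5 on the other
g = gt.Graph(directed=False)
g.add_vertex(6)
g.add_edge_list([(0, 3), (0, 4), (1, 4), (1, 5), (2, 3), (2, 5)])

side = g.new_vertex_property("int")
for v in g.vertices():
    side[v] = 1 if int(v) >= 3 else 0

# clabel: hard constraint only; does not enter the description length
state_c = gt.minimize_blockmodel_dl(g, state_args=dict(clabel=side))

# pclabel: the bipartition is assumed known a priori, and the description
# length accounts only for subdivisions of it
state_p = gt.minimize_blockmodel_dl(g, state_args=dict(pclabel=side))

print(state_c.entropy(), state_p.entropy())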