Speeding up minimize_nested_blockmodel_dl on large networks?
Hi,

I am having some trouble reproducing the performance of minimize_nested_blockmodel_dl on large networks, as reported in the publication "Hierarchical block structures and high-resolution model selection in large networks". Namely, for large graphs, even with verbose=True, no output is generated, but the CPU usage stays at 100% for hours to days. My network contains 1.5 million nodes and a sorted adjacency list per node, such that I can choose the top K edges per node as a parameter. I have tried sampling down to 200,000 nodes and K=5, but minimize_nested_blockmodel_dl does not seem to be making progress: the CPU is being used at 100% while running, and there is plenty of free memory.

Are there options that I can use in minimize_nested_blockmodel_dl to improve performance? Other than sampling and limiting K, are there other strategies that I can try? I compiled graph-tool with the parallel computing option and am using AWS EC2 instances. The largest network for which I have been able to get a result within 24 compute hours on C3-sized EC2 instances has been 100,000 nodes with K=5 (undirected, average degree about 8).
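For reference, a minimal sketch of how such a top-K graph could be assembled with graph-tool; the adjacency-list format and the helper name top_k_edges are hypothetical, purely for illustration:

    import graph_tool.all as gt

    def top_k_edges(adjacency, k):
        # adjacency: hypothetical dict mapping an integer node id to its
        # neighbour list, pre-sorted by decreasing relevance; we keep only
        # the top-k neighbours per node.
        edges = set()
        for v, neighbours in adjacency.items():
            for u in neighbours[:k]:
                # store undirected edges canonically to avoid duplicates
                edges.add((min(u, v), max(u, v)))
        return list(edges)

    g = gt.Graph(directed=False)
    # add_edge_list() creates any vertices it has not seen before
    g.add_edge_list(top_k_edges(adjacency, k=5))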
On 12.08.2014 08:07, ayates wrote:
> Are there options that I can use in minimize_nested_blockmodel_dl to
> improve performance? Other than sampling and limiting K, are there
> other strategies that I can try? [...]
There are several options in the algorithm that affect performance. The most important ones are 'epsilon', 'nsweeps', 'nmerge_sweeps' and 'r' (see the documentation for their meanings). The default values should be fine for large networks, except for 'epsilon', which has a default value of 0. You should try a value of 1e-3, or even 1e-2, depending on the size of your network.

I recommend experimenting with the algorithm on small networks first (~10000 nodes, something you can easily try on a laptop) to get a feeling for how the parameters affect the performance and the quality of the results.

I would also recommend trying minimize_blockmodel_dl(), since it is more verbose than minimize_nested_blockmodel_dl() and will give you a better idea of how the algorithm is progressing. If you use 'verbose=True' you should be able to see how fast each sweep over the network is performed.

I have no experience with AWS EC2, but I was able to obtain results for very large networks on a regular dedicated computer cluster, as I describe in the paper. However, for networks with many millions of edges it does take a while, perhaps even a day or two.

Best,
Tiago

--
Tiago de Paula Peixoto <tiago@skewed.de>
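A minimal sketch of the calls described above, assuming the 2014-era graph-tool API discussed in this thread (the parameter names 'epsilon' and 'verbose' come from the message above; later graph-tool releases changed this signature, so treat this as illustrative rather than definitive):

    import graph_tool.all as gt

    # g is assumed to be an undirected graph_tool.Graph built beforehand.
    # epsilon defaults to 0; raising it to ~1e-3 or 1e-2 trades some
    # quality for a large speed-up on big networks, per the advice above.
    state = gt.minimize_nested_blockmodel_dl(g, epsilon=1e-3, verbose=True)

    # The flat variant is more verbose, which helps gauge how long each
    # sweep over the network takes before committing to the nested run:
    flat_state = gt.minimize_blockmodel_dl(g, epsilon=1e-3, verbose=True)

Per the suggestion above, timing these calls first on a ~10,000-node subsample gives a feel for how 'epsilon' trades speed against result quality before attempting the full 1.5-million-node graph.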
This worked! Thanks.
participants (2)
- ayates
- Tiago de Paula Peixoto