minimize_nested_block_model_dl speed and memory tweaks
G'day! I'm struggling with a network with millions of nodes and billions of edges (follow network of Australian Twitter users), trying to apply gt.inference.minimize_nested_block_model_dl(). I'm reducing the size by analysing k-cores and increasing k step-wise to test the limits. With 19k nodes and 12m edges I'm already exceeding the 64GB RAM my machine has at the moment, so given its O(n ln^2(n)) I don't expect the whole network to fit, but I'd like to push the limit at least as far as possible. The only parameter set so far is `mcmc_equilibrate_args={'epsilon': 1e-3}`. I have four questions: 1. I've tried mcmc_args={'sequential': False, 'parallel': True}, but this did not seem to have a great impact regarding speed on my smaller k-cores, could it have on bigger ones? 2. What does the warning here (https://graph-tool.skewed.de/static/doc/inference.html#graph_tool.inference....) regarding parallel sampling mean in worst case and when is it likely that the warning actually applies? 3. I am considering either to try to fund a machine with more (expensive) RAM or to just swap memory on freely available disk space. What would you expact would be the impact on speed? Checking the behaviour of the algorithm so far I don't see it being I/O bound but rather by available memory and the sequential parts of the algo that are running on one core only. Is swapping a considerable option or should I just straight away rent a bigger machine? 4. Is there any rule of thumb for epsilon? And what is its impact if it's too large? Thanks for any help (maybe also hints how to reduce the network size while minimising the impact on the results of the nested SBM with something more sophisticated than k-cores). Cheers, Felix -- Sent from: http://main-discussion-list-for-the-graph-tool-project.982480.n3.nabble.com/
On 21.09.2017 07:24, flxvctr wrote:
1. I've tried mcmc_args={'sequential': False, 'parallel': True}, but this did not seem to have a great impact regarding speed on my smaller k-cores, could it have on bigger ones?
It should give you a 2-3x speedup, but not much more, as part of the algorithm (the actual node updates) has to be serial.
2. What does the warning here (https://graph-tool.skewed.de/static/doc/inference.html#graph_tool.inference....) regarding parallel sampling mean in worst case and when is it likely that the warning actually applies?
I don't know, this is why the warning is there. I generally advise against using the parallel code, since it does not make things that much faster to begin with. I'm even considering removing it in a future version.
3. I am considering either to try to fund a machine with more (expensive) RAM or to just swap memory on freely available disk space. What would you expact would be the impact on speed? Checking the behaviour of the algorithm so far I don't see it being I/O bound but rather by available memory and the sequential parts of the algo that are running on one core only. Is swapping a considerable option or should I just straight away rent a bigger machine?
The memory access during the algorithm is random, so using swap will probably be prohibitively slow.
4. Is there any rule of thumb for epsilon? And what is its impact if it's too large? If it is too large, your estimate of the ground state (the best hierarchical partition) will be worse. Simply, it gives you a quality/speed trade-off.
Best, Tiago -- Tiago de Paula Peixoto <tiago@skewed.de>
participants (2)
-
flxvctr -
Tiago de Paula Peixoto