On 19.06.2018 at 17:29, P-M wrote:
I am trying to outsource some of my calculations to the University's cluster as the compute times for some of my datasets are getting very lengthy. I had a couple of questions arising out of this and was wondering whether you had any thoughts/advice on them:
- The University normally limits runtimes of a given job to 12h and
suggests checkpointing to get around this so that the calculation can simply resume as a new job. Apparently the cluster has "DMTCP: Distributed MultiThreaded CheckPointing" installed in order to allow checkpointing "for some applications which do not have their own support for this". Is this compatible with graph-tool, or are you aware of any other ways in which I could stop and restart a job? (What I am after is stopping and restarting a graph-tool function which takes more than 12h to complete, as I can obviously pickle results of a calculation straightforwardly already.) If there is currently no way of doing this, is this something you might consider adding at some point in the future?
DMTCP is the right solution to this problem, and it should work. I use it myself without any issues.
- As far as I understand using OpenMP I can only ever use a single node at
a time and am thus limited by how many CPUs and how much RAM this node supplies. To be able to use more CPUs I would need to use e.g. MPI. Are there any plans to implement this at some point? (I realise this may not be straightforward but was interested in your thoughts.)
There are no plans to implement MPI in graph-tool. It would require a major redesign of the algorithms, and I have no interest in doing this.
- I would sometimes find it useful to be able to get a progress update for
some of the functions to carry out a rough order-of-magnitude estimate of required compute time. Two use cases: a) If calculating the betweenness centrality for a large network this can take a long time. Having an idea of how many nodes have been covered already would be useful to extrapolate time roughly remaining to see whether the calculation is even feasible in the time available.
It would be feasible to implement this, but I'd rather keep the code simple. It would also interfere with OpenMP, since the progress would need to be computed for each thread but reported synchronously. A lot of work, for just a minor convenience of having a progress bar... Unless there is a very elegant way of doing this, I'd rather not.
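
A rough order-of-magnitude estimate may still be possible without a progress bar, by timing the calculation on a small sample of source vertices and scaling up linearly. The sketch below is only an illustration: extrapolate_runtime() and estimate_betweenness_seconds() are hypothetical helper names, the linear scaling is an assumption (it ignores fixed overhead and OpenMP effects), and the second helper assumes a graph-tool version whose betweenness() accepts a "pivots" argument and an unfiltered graph with vertex indices 0..N-1.

```python
import time


def extrapolate_runtime(run_on_sample, n_sample, n_total):
    """Time one call of run_on_sample() and scale the result linearly.

    run_on_sample is assumed to process n_sample source vertices; the
    return value is a rough estimate, in seconds, for n_total sources.
    """
    start = time.perf_counter()
    run_on_sample()
    elapsed = time.perf_counter() - start
    return elapsed * n_total / n_sample


def estimate_betweenness_seconds(g, n_sample=100):
    """Hypothetical helper: estimate the full betweenness runtime for g."""
    # Imported lazily so extrapolate_runtime() stays usable without graph-tool.
    import numpy as np
    from graph_tool.centrality import betweenness

    n_total = g.num_vertices()
    n_sample = min(n_sample, n_total)
    # Random sample of source vertices, without repetition.
    pivots = np.random.choice(n_total, size=n_sample, replace=False)
    return extrapolate_runtime(lambda: betweenness(g, pivots=pivots),
                               n_sample, n_total)
```

Calling estimate_betweenness_seconds(g) on the actual network should then indicate whether the full run is likely to take minutes, hours, or days, which is all that is needed to decide whether it fits in the time available.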
b) If collecting data using mcmc_equilibrate it would be potentially
less meaningful as the process is stochastic; however, information on how many sweeps have been completed would still be useful to give a rough estimate of whether completion will take hours, days, weeks, or months.
This you can get by simply passing "verbose=True".
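
As a minimal sketch of what that looks like in practice (run_equilibration() is a hypothetical wrapper name, and the wait and niter values are placeholders rather than recommendations):

```python
def run_equilibration(g, wait=1000, niter=10):
    """Fit a block model to g and equilibrate it with progress output."""
    # Imported lazily so this sketch can be loaded without graph-tool.
    import graph_tool.all as gt

    # Starting point for the MCMC: a minimum description length fit.
    state = gt.minimize_blockmodel_dl(g)

    # verbose=True makes mcmc_equilibrate() print status information as it
    # runs, which is enough to judge how quickly the sweeps are progressing.
    gt.mcmc_equilibrate(state, wait=wait, mcmc_args=dict(niter=niter),
                        verbose=True)
    return state
```

Redirecting the job's standard output to a log file on the cluster keeps this output available for inspection while the job is still running.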