Hi,

I am trying to create a very large graph using the random_graph function. My computer has 128 GB of physical memory and a 16 GB swap partition. When I try to create a graph with 50 million vertices and 80 for both the in-degree and the out-degree, it consumes all available memory and then crashes.

Do you have any idea how I can overcome the memory limit? Is there any way to make the random_graph function more memory efficient?

Thanks,
Arash
On 02/01/2013 05:04 PM, Arash wrote:
Hi,
I am trying to create a very large graph using the random_graph function. My computer has 128 GB of physical memory and a 16 GB swap partition. When I try to create a graph with 50 million vertices and 80 for both the in-degree and the out-degree, it consumes all available memory and then crashes.
Do you have any idea how I can overcome the memory limit? Is there any way to make the random_graph function more memory efficient?
Try passing the option "random=False". In this case the edges will not be randomized, but it should need less memory. If the graph fits in your memory, you can then call random_rewire() with the option strat="erdos", provided you don't need a prescribed degree sequence. This should also need less memory, since a list of the edges does not have to be built internally.

But note that you are dealing with a very large graph indeed, with 4 billion edges. If you use two 64-bit integers to specify an edge, this already amounts to 80 * 50 * 1e6 * 64 * 2 / (8 * 1024 ** 3) ≈ 60 GB. Since the edge indexes are also needed, this increases by another 30 GB. Thus, simply storing such a graph would require at least 90 GB, and probably even more. If any temporary data structure has size O(E), it will cross your 128 GB limit pretty easily. The usage could be halved by using 32-bit integers instead, but the library would need to be modified for that.

But I can't help but wonder if you really need a random graph this large...

Cheers,
Tiago
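For concreteness, here is a minimal sketch of the two-step approach described above. It assumes the graph-tool API as discussed in this thread (random_graph's "random" keyword and random_rewire's strat="erdos"; newer graph-tool releases appear to call the latter keyword model), and it uses the 50 million vertices and the randint-based degree sampler mentioned elsewhere in the thread. Treat it as an illustration, not a tested recipe.

    # Sketch only: build the graph without shuffling the edges first,
    # then rewire it in place with the "erdos" strategy.
    from numpy.random import randint
    from graph_tool.generation import random_graph, random_rewire

    N = 50 * 10**6  # 50 million vertices, as in the original question

    def deg_sampler():
        # (in-degree, out-degree) pair for each vertex; the thread later
        # settles on randint(0, 80) for each, i.e. an average total degree
        # of roughly 80 per vertex
        return randint(0, 80), randint(0, 80)

    # Step 1: random=False skips the edge randomization, which avoids
    # building the internal edge list and should need less memory.
    g = random_graph(N, deg_sampler, directed=True, random=False)

    # Step 2: rewire in place. The "erdos" strategy ignores the prescribed
    # degree sequence (as noted above), so skip this step if the sampled
    # degrees must be preserved exactly.
    random_rewire(g, strat="erdos")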
Hi Tiago,

Thanks for your useful hints. I do research on distributed algorithms for subgraph pattern matching on very large datasets; that is why I need such a large graph for performance measurements.

Before trying your suggestions, I added a very large swap file to the system. Now graph-tool has been running for more than 5 hours. Initially it kept increasing its memory usage while its CPU usage stayed low, but now the memory usage has stabilized (about 125 GB of physical memory and 55 GB of swap) and the average CPU usage is decently high. I hope the process will finish in the next few hours; otherwise, I will try your suggestions.

I wonder if you know of any other tool that might be more efficient for creating very large graphs?

I appreciate your help.
Arash
IMHO, build it in C and use the Boost graph libraries natively. The graphs you're talking about are big enough that I don't think the choice of library will make much of a difference, since the memory usage comes from the graph itself rather than from any internal structures, IIRC... @Tiago?

You might also want to try memory mapping, since your graph is big enough not to fit in RAM, or at least have part of it memory mapped.
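To make the memory-mapping suggestion concrete, here is a rough sketch (an illustration only, not a graph-tool feature): keep the raw (source, target) edge list in a file-backed numpy array, so that only the pages currently being touched have to live in RAM. The file name, the int32 layout, the chunk size and the uniformly random endpoints are all assumptions, and building an actual graph-tool Graph from such a list would still require graph-tool's own in-memory structures.

    import numpy as np

    N = 50 * 10**6   # vertices
    E = 80 * N       # ~4 billion edges, as in the original 80/80 setup

    # 2 x int32 per edge: roughly 32 GB on disk instead of resident memory
    edges = np.memmap("edges.dat", dtype=np.int32, mode="w+", shape=(E, 2))

    # fill with uniformly random endpoints, in chunks so the temporary
    # random draws stay small (this only shows where the bytes live)
    chunk = 10**7
    for start in range(0, E, chunk):
        stop = min(start + chunk, E)
        edges[start:stop] = np.random.randint(0, N, size=(stop - start, 2))
    edges.flush()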
I don't know about IMHO! My Python task is still not finished after 11 hours.

I forgot to mention that we halved the numbers of incoming and outgoing edges. We have defined a function which returns a pair of random integers between 0 and 80 for the in-degree and out-degree, so we expect the average degree of the vertices to be 80.

Monitoring the processes on our system with top, I can see that all of the RAM and 80 GB of swap are used, and the Python program's CPU usage is usually higher than 90%. So it seems that memory is perhaps not the main issue for the run time at this moment. I am not sure how much the two randint(0,80) calls we make for the edge counts are responsible for this CPU load!

Thanks for your advice,
Arash
On 02/02/2013 07:24 AM, Arash Fard wrote:
I don't know about IMHO! My Python task is still not finished after 11 hours. I forgot to mention that we halved the numbers of incoming and outgoing edges. We have defined a function which returns a pair of random integers between 0 and 80 for the in-degree and out-degree, so we expect the average degree of the vertices to be 80.
Don't you mean 40?
Monitoring the processes on our system with top, I can see that all of the RAM and 80 GB of swap are used, and the Python program's CPU usage is usually higher than 90%. So it seems that memory is perhaps not the main issue for the run time at this moment. I am not sure how much the two randint(0,80) calls we make for the edge counts are responsible for this CPU load!
If the memory usage has stabilized, then it's possible the graph has been created and is now being randomly rewired, so randint() should no longer be called.

You should not expect any sort of decent performance if you are using swap. I guess the best approach for you is either to work with smaller graphs, or, if you can't do that, to implement your own data structure which uses less memory. You can use the Boost Graph Library itself, which has several different graph implementations and allows you to create your own. If you are not generating graphs but reading them from disk, a very compact representation is the compressed sparse row format:

http://www.boost.org/doc/libs/1_52_0/libs/graph/doc/compressed_sparse_row.ht...

You may also look at the web graph people, since they are used to working with huge graphs:

http://webgraph.di.unimi.it/

Cheers,
Tiago
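As an aside on why a CSR layout is so compact, here is a tiny illustration using scipy.sparse rather than the BGL class linked above (scipy is a substitution chosen purely for illustration): CSR stores one column index per edge plus one row offset per vertex, so the structure of a directed graph costs roughly E + N + 1 integers, at the price of being effectively immutable.

    import numpy as np
    from scipy.sparse import csr_matrix

    # a toy directed graph: 0->1, 0->2, 1->2, 3->0
    src = np.array([0, 0, 1, 3])
    dst = np.array([1, 2, 2, 0])
    N = 4

    A = csr_matrix((np.ones(len(src), dtype=np.int8), (src, dst)), shape=(N, N))

    print(A.indptr)    # [0 2 3 3 4] -- N + 1 row offsets
    print(A.indices)   # [1 2 2 0]   -- one column index per edge

    # For the 4-billion-edge graph in this thread, 32-bit column indices
    # would take about 4e9 * 4 bytes ~= 15 GiB, plus ~0.2 GiB of offsets --
    # far below the adjacency-list estimates above, but the structure
    # cannot be modified cheaply afterwards.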
Just a suggestion: I tested WebGraph for my purposes after meeting one of the developers, and I can say that its memory usage is quite amazing (in a good sense: very low). However, at least a year ago WebGraph was unsuited to dynamic graphs, so if you plan to remove nodes and so on, I don't think it's the best choice.

Giuseppe
On 02/02/2013 11:30 AM, Giuseppe Profiti wrote:
Just a suggestion: I tested WebGraph for my purposes after meeting one of the developers, and I can say that its memory usage is quite amazing (in a good sense: very low). However, at least a year ago WebGraph was unsuited to dynamic graphs, so if you plan to remove nodes and so on, I don't think it's the best choice.
The same is true for the compressed sparse row (CSR) graph in BGL: the graph is essentially immutable. I've been planning to include optional CSR support in graph-tool, but I haven't yet found the time to do so.

However, memory usage in graph-tool can be improved fairly easily by a factor of 2 by using 32-bit integers instead of 64-bit ones. I'll keep this in mind for future releases.

Cheers,
Tiago
Thank you all for the information. My Python process is still running after about 17 hours!

@Tiago: I call randint(0,80) separately for the incoming and the outgoing edges; hence I expect an average degree of 80 (incoming + outgoing) per vertex.
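A quick arithmetic check of the 40-versus-80 point (an illustration, assuming randint(0, 80) is inclusive at both ends, as with Python's random.randint): each half of the degree averages 40, so the expected total degree per vertex is indeed about 80, but the expected number of edges is only about 40 * N, i.e. roughly 2 billion, half of the original 80-in/80-out setup.

    N = 50 * 10**6
    mean_in = mean_out = sum(range(81)) / 81.0   # E[randint(0, 80)] = 40.0
    print(mean_in + mean_out)                    # average total degree: 80.0
    print(int(mean_out * N))                     # expected edge count: 2000000000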
On 02/02/2013 02:27 PM, Arash Fard wrote:
Thank you all for the information. My Python process is still running after about 17 hours!
I've modified the internal representation of graph-tool so that less memory is now used to represent a graph. I simply replaced Boost's adjacency_list with a lightweight custom-made implementation. In my tests I get a factor-of-2 improvement (!). You can try the current git version to see if it helps in your case.

Cheers,
Tiago
It is very kind of you. Thanks.
Participants (5):
- Arash
- Arash Fard
- Giuseppe Profiti
- Ronnie Ghose
- Tiago de Paula Peixoto