Hi, I've been using graph-tool for the last year or so to calculate shortest-path trees on large-scale road networks. We used to do this in a Postgres database with the pgRouting extension, but were continually confronted with unacceptable startup costs. The switch to a Python module using graph-tool has sped up our routing queries considerably, but the service now uses too much memory. I have the feeling I might be using graph-tool the wrong way, but before I dive into that, it would be good to know the expected memory footprint for my use case.
Take for example a road network with 30 million edges and 31 million nodes (the combined road network of Belgium, the Netherlands, France and Germany in OSM). For this road network, I need to calculate shortest paths using different edge weights (edge property maps). What would be a very rough estimate of how much memory this would use, for the network alone and per edge property map? In our setup, there would be one edge property map with edge weights per country. We're currently seeing usage of over 50 GB easily, spiking even higher when we're loading extra cost structures or networks. Is that expected, or am I experiencing memory leaks somewhere?
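For what it's worth, my own naive back-of-envelope looks like this — but the per-edge and per-vertex overheads below are pure guesses on my part, not graph-tool's documented internals:

    # Naive estimate; every constant here is an assumption on my part,
    # not a measured or documented graph-tool figure.
    E = 30_000_000            # edges
    V = 31_000_000            # vertices
    adj_entry = 16            # guess: (vertex, edge-index) pair, two 64-bit ints
    graph_bytes = E * 2 * adj_entry + V * 8   # in- + out-adjacency, vertex overhead
    prop_bytes = E * 4                        # one int32 edge property map
    print(f"graph ~{graph_bytes / 1e9:.1f} GB, ~{prop_bytes / 1e9:.2f} GB per int32 property")
    # -> roughly 1.2 GB for the graph and 0.12 GB per property map

If those guesses are anywhere near right, the graph plus a handful of cost maps should fit in a few GB, which makes the 50 GB we're seeing all the more puzzling.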
How I'm using graph-tool right now:
*1) loading the network*

    import graph_tool.all as gt
    import numpy as np
    import pandas as pd

    # nw = pandas dataframe with edges in the structure:
    #      startnode-id, endnode-id, edge-id, country
    G = gt.Graph(directed=True)
    G.ep["edge_id"] = G.new_edge_property("int")
    G.ep["country_id"] = G.new_edge_property("int16_t")
    eprops = [G.ep["edge_id"], G.ep["country_id"]]

    # hashed=True because our node ids are arbitrary; the returned map n
    # translates graph-tool's internal vertex indices back to the original ids
    n = G.add_edge_list(nw.to_numpy(), hashed=True, eprops=eprops)
    G.vertex_properties["n"] = n
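A quick sanity check after loading (n is the map graph-tool returns for the hashed ids, so it should translate internal vertex indices back to our node ids):

    v = G.vertex(0)
    print("internal vertex 0 corresponds to original node id", n[v])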
*2) loading edge costs: I'm using GraphViews*
    # countries = list of country codes
    edge_filter = np.in1d(G.ep["country_id"].a, [get_country_id(c) for c in countries])
    GV = gt.GraphView(G, efilt=edge_filter)

    edges = GV.get_edges([GV.edge_index])
    sources = G.vertex_properties["n"].a[edges[:, 0]]
    targets = G.vertex_properties["n"].a[edges[:, 1]]
    idxs = edges[:, 2]

    # db_costs = pandas dataframe with structure: source, target, cost
    sti_df = pd.DataFrame({'idx': idxs, 'source': sources, 'target': targets})
    m = pd.merge(sti_df, db_costs, on=['source', 'target'], how='left', sort=False)[['idx', 'c']]
    wgts_list = m.sort_values(by=['idx'])['c'].to_numpy()
    # np.isnan() is needed here: "x == np.nan" is always False, so a plain
    # comparison would never flag the edges that got no cost from the merge
    wgts_list = np.where(np.isnan(wgts_list), np.iinfo(np.int32).max, wgts_list)

    wgts = GV.new_edge_property("int32_t")
    wgts.fa = wgts_list
    # any NaN that still slips through becomes INT_MIN when cast to int32,
    # so remap that sentinel to "effectively unreachable" as well
    wgts.fa = np.where(wgts.fa == np.iinfo(np.int32).min, np.iinfo(np.int32).max, wgts.fa)
    GV.edge_properties[cs_ids_str] = wgts

    # filter out edges without a cost; note int32 can never equal np.inf,
    # so the comparison has to be against the int32 sentinel
    GV2 = gt.GraphView(GV, efilt=wgts.fa != np.iinfo(np.int32).max)
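As a quick check that the view really is restricted (same objects as above):

    print(GV2.num_edges(), "of", G.num_edges(), "edges kept for cost structure", cs_ids_str)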
*3) I then use GV2 for calculating Dijkstra and such...*
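Concretely, a stripped-down version of such a call (source and cutoff are placeholders; as I understand the docs, max_dist makes the search stop early):

    # Dijkstra over the cost-filtered view, cut off at `cutoff`
    dist, pred = gt.shortest_distance(GV2, source=source, weights=wgts,
                                      max_dist=cutoff, pred_map=True)
    # dist is a vertex property map; vertices not reached within the cutoff
    # end up at the weight type's maximum value, as far as I can tell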
I could of course work on an MWE of some sort. But it would be very nice to get an estimate of the memory footprint, and to see if I'm doing something really silly in the code above.
Thx!
Hi all. Can anyone provide me with some insights here? I know it's quite an open question, and it might take some effort of course. Would anyone be available/willing to do an actual code audit of the code that I have? This would be compensated, of course. Feel free to contact me to discuss.
Kind regards
Dear Mathias,
It is not reasonable to expect us to make this kind of evaluation just from partial code. As with anything, we need a minimal working example to be able to say something concrete.
I would recommend trying to separate the pandas dataframe manipulation from the graph-tool side, in order to determine which is consuming more memory.
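Even something as crude as printing the peak RSS between the two stages would already tell you a lot. A minimal sketch, using only the standard library (on Linux, ru_maxrss is reported in KiB):

    import resource

    def peak_rss_gib():
        # peak resident set size of this process so far (Linux: value in KiB)
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024**2

    # ... pandas: load and merge the cost dataframes ...
    print("after pandas:", peak_rss_gib(), "GiB")
    # ... graph-tool: build the graph and property maps ...
    print("after graph-tool:", peak_rss_gib(), "GiB")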
Best, Tiago
Fair point. I managed to find some time to make an MWE. Here's the link to a zip file containing both the input data (network and costs as CSV dumps from our Postgres DB) and the code: https://www.dropbox.com/s/3iby8nil7zyi7q3/MWE.zip?dl=0
To use it, just run "python mwe_main.py". The last line calculates a shortest-path tree up to 60 minutes for cost structure "49" (these are costs associated with car travel; the specific id has no meaning). Because this is only Belgium, memory usage when I run this is manageable (under 1 GB).
In general my question is: can this code be made more performant, both in terms of memory and speed? Am I right in using GraphViews for each cost structure, where the view only contains the edges which have costs associated with them? Does this make gt.shortest_distance() faster?
The most production-critical part of the code here is point_drivetimes(). Any gain there in calculation time would be very valuable. The loading of the network and cost structures only needs to happen once, so that can be a bit slower.
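To give an idea without opening the zip: point_drivetimes() boils down to one cutoff Dijkstra per input point, roughly like this paraphrase (not the literal MWE code; snap_to_vertex stands in for our own coordinate-to-vertex lookup):

    # Paraphrased hot path, not the literal MWE code.
    def point_drivetimes(GV2, wgts, points, cutoff):
        dists = []
        for p in points:
            v = snap_to_vertex(GV2, p)  # placeholder: nearest network vertex
            d = gt.shortest_distance(GV2, source=v, weights=wgts,
                                     max_dist=cutoff)
            dists.append(d)
        return dists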
Thx for helping me!
Is anyone able to check whether I'm using the library correctly? Or any other tips? The MWE is provided through the download link above.
Much appreciated!