Fast method for duplicated edge removal
Hello, I'm using a cvsreader to parse a very big file and create an undirected graph accordingly. The file can contain duplicated edges (i.e. A B in one row, B A in another one), so I'm checking if g.edge(v1,v2)==None: e = g.add_edge(v1,v2) in order to discard them (v1 and v2 are vertices created from what it's read from the file). However the graph contains a lot of edges (few millions) and vertices (many thousands), with a potentially high degree for the vertices, and it takes a lot to process the data. As far as I read in the soruce code, the Graph.edge() method checks all the outgoing edges of the source node, but even if I check which node has the highest degree, it takes a lot of time to build the graph. Is there any other way to remove the duplicated edges? Maybe an edge filter based on some lambda wizardry? Thanks in advance, Giuseppe -- Researcher at University of Bologna, Italy
Hi Giuseppe, On 09/20/2011 04:36 PM, Giuseppe Profiti wrote:
Hello, I'm using a cvsreader to parse a very big file and create an undirected graph accordingly. The file can contain duplicated edges (i.e. A B in one row, B A in another one), so I'm checking if g.edge(v1,v2)==None: e = g.add_edge(v1,v2)
in order to discard them (v1 and v2 are vertices created from what it's read from the file). However the graph contains a lot of edges (few millions) and vertices (many thousands), with a potentially high degree for the vertices, and it takes a lot to process the data. As far as I read in the soruce code, the Graph.edge() method checks all the outgoing edges of the source node, but even if I check which node has the highest degree, it takes a lot of time to build the graph.
Is there any other way to remove the duplicated edges? Maybe an edge filter based on some lambda wizardry?
The library includes a 'remove_parallel_edges()' function which does what you want, and is fast. If you want to mask the parallel edges temporarily, you can use filtering, as such: l = label_parallel_edges(g) g = GraphView(g, efilt=lambda e: l[e] == 0) Cheers, Tiago -- Tiago de Paula Peixoto <tiago@skewed.de>
Hi Tiago, 2011/9/20 Tiago de Paula Peixoto <tiago@skewed.de>
Hi Giuseppe,
[cut]
The library includes a 'remove_parallel_edges()' function which does what you want, and is fast.
thanks for the hint, postponing the check after loading the data speeded up the process by 10 times, however the remove_parallel_edges() does not work in parallel (at least on my machine) and then it's a bit too slow for my purposes. I'm trying a different approach in order to speed things up.
If you want to mask the parallel edges temporarily, you can use filtering, as such:
l = label_parallel_edges(g) g = GraphView(g, efilt=lambda e: l[e] == 0)
Has label_parallel_edges the same complexity of remove_parallel_edges? Thanks again for your help, Giuseppe
participants (2)
-
Giuseppe Profiti -
Tiago de Paula Peixoto