dfs/bfs seem to have overhead depending on total graph size
Hi all, I’ve got a very large undirected graph (407 mil vertices, 522 mil edges, 2 vertex properties) that consists of multiple connected components (ccs). I noticed that when I call e.g. dfs_iterator or dfs_search on a source vertex, it takes around 1 – 2 seconds to return. The upper bound is depending on the component’s size, but the lower bound seems to be the same for all components. I have created a test graph with only a subset of the ccs of the large graph. Iterating the same cc in the test graph takes only a couple of milliseconds instead of ≥ 1 s. This tells me, that the dfs/bfs iterators have some kind of overhead depending on the complete graph size. I wrote a DFSVisitor that collects some timings during iteration to better see which steps consume time. Here are the results (rounded for readability): In [30]: test_dfs(graph, graph.vertex(0)) # first time the function is entered visitor.start_vertex_t 2.3e-05 s visitor.first_discover_vertex_t 0.18 s visitor.first_examine_edge_t 0.18 s visitor.first_tree_edge_t 0.63 s visitor.first_finish_vertex_t 0.63 s # average time between last 2 calls of the function visitor.discover_vertex_t 0.002 s visitor.examine_edge_t 0.001 s visitor.tree_edge_t 0.001 s visitor.finish_vertex_t 0.001 s # last time finished() is called visitor.finished 1.3 s # number of times the functions were called visitor.discovered_vertices 565 visitor.examined_edges 1978 visitor.tree_edges 564 visitor.finished_vertices 565 took 1.38 s As you can see, start_vertex is called immediately, but then it takes a very long time until the other Visitor functions are called for the first time after which the calls are faster again, but still quite slow. On the test graph I think I can see the same trend with smaller numbers because the graph is smaller: In [30]: test_dfs(test_graph, test_graph.vertex(0)) # first time the function is entered visitor.start_vertex_t 2e-05 s visitor.first_discover_vertex_t 0.0007 s visitor.first_examine_edge_t 0.0007 s visitor.first_tree_edge_t 0.0016s visitor.first_finish_vertex_t 0.0015 s # average time between last 2 calls of the function visitor.discover_vertex_t 1.6e-05 s visitor.examine_edge_t 3.7e-06 s visitor.tree_edge_t 1.4e-05 s visitor.finish_vertex_t 1.4e-05 s # last time finished() is called visitor.finished 0.01 s # number of times the functions were called visitor.discovered_vertices 565 visitor.examined_edges 1978 visitor.tree_edges 564 visitor.finished_vertices 565 took 0.01 s If I iterate the same cc in the large graph in python, it takes me only a couple of milliseconds: def dfs(graph, vertex_idx): t = time.time() todo = {vertex_idx} visited = set() while len(todo) > 0: next_vertex = todo.pop() visited.add(next_vertex) todo |= set(graph.iter_all_neighbors(graph.vertex(next_vertex))) - visited print(f'dfs took {time.time() - t} s') return visited In [38]: d = dfs(graph, 0) dfs took 0.01653146743774414 s Because of graph_tool’s slowness for small ccs I ended up writing a heuristic that always attempts a naive python dfs first, but aborts if 1 second is exceeded and only then does the graph_tool dfs. Anyone know what’s going on here? Best My-Tien
Am 26.02.21 um 12:08 schrieb My-Tien Nguyen:
I noticed that when I call e.g. dfs_iterator or dfs_search on a source vertex, it takes around 1 – 2 seconds to return. The upper bound is depending on the component’s size, but the lower bound seems to be the same for all components.
DFS creates and initializes a property map keeping track of the vertices that have been visited, which is an O(N) operation. It's possible to change the code such that the property map can be passed to the algorithm, and then shared between multiple calls, removing this overhead. Please open an issue in the website, and I'll add this for the next release. Best, Tiago -- Tiago de Paula Peixoto <tiago@skewed.de>
participants (2)
-
My-Tien Nguyen -
Tiago de Paula Peixoto