[graph-tool] dfs/bfs seem to have overhead depending on total graph size

26 Feb 2021

      Hi all,

I’ve got a very large undirected graph (407 mil vertices, 522 mil edges, 
2 vertex properties) that  consists of multiple connected components (ccs).

I noticed that when I call e.g. dfs_iterator or dfs_search on a source 
vertex, it takes around 1 – 2 seconds to return. The upper bound is 
depending on the component’s size, but the lower bound seems to be the 
same for all components.

I have created a test graph with only a subset of the ccs of the large 
graph. Iterating the same cc in the test graph takes only a couple of 
milliseconds instead of ≥ 1 s. This tells me, that the dfs/bfs iterators 
have some kind of overhead depending on the complete graph size.

I wrote a DFSVisitor that collects some timings during iteration to 
better see which steps consume time. Here are the results (rounded for 
readability):

    In [30]: test_dfs(graph, graph.vertex(0))

    # first time the function is entered
    visitor.start_vertex_t             2.3e-05 s
    visitor.first_discover_vertex_t       0.18 s
    visitor.first_examine_edge_t          0.18 s
    visitor.first_tree_edge_t             0.63 s
    visitor.first_finish_vertex_t         0.63 s

    # average time between last 2 calls of the function
    visitor.discover_vertex_t             0.002 s
    visitor.examine_edge_t                0.001 s
    visitor.tree_edge_t                   0.001 s
    visitor.finish_vertex_t               0.001 s

    # last time finished() is called
    visitor.finished                      1.3 s

    # number of times the functions were called
    visitor.discovered_vertices           565
    visitor.examined_edges                1978
    visitor.tree_edges                    564
    visitor.finished_vertices             565

    took 1.38 s

As you can see, start_vertex is called immediately, but then it takes a 
very long time until the other Visitor functions are called for the 
first time after which the calls are faster again, but still quite slow. 
On the test graph I think I can see the same trend with smaller numbers 
because the graph is smaller:

In [30]: test_dfs(test_graph, test_graph.vertex(0))

# first time the function is entered
visitor.start_vertex_t             2e-05 s
visitor.first_discover_vertex_t 0.0007 s
visitor.first_examine_edge_t 0.0007 s
visitor.first_tree_edge_t             0.0016s
visitor.first_finish_vertex_t 0.0015 s

# average time between last 2 calls of the function
visitor.discover_vertex_t             1.6e-05 s
visitor.examine_edge_t 3.7e-06 s
visitor.tree_edge_t 1.4e-05 s
visitor.finish_vertex_t 1.4e-05 s

# last time finished() is called
visitor.finished 0.01 s

# number of times the functions were called
visitor.discovered_vertices           565
visitor.examined_edges                1978
visitor.tree_edges                    564
visitor.finished_vertices             565

took 0.01 s

If I iterate the same cc in the large graph in python, it takes me only 
a couple of milliseconds:

    def dfs(graph, vertex_idx):
         t = time.time()
         todo = {vertex_idx}
         visited = set()
         while len(todo) > 0:
             next_vertex = todo.pop()
             visited.add(next_vertex)
             todo |=
    set(graph.iter_all_neighbors(graph.vertex(next_vertex))) - visited
         print(f'dfs took {time.time() - t} s')
         return visited

    In [38]: d = dfs(graph, 0)
    dfs took 0.01653146743774414 s

Because of graph_tool’s slowness for small ccs I ended up writing a 
heuristic that always attempts a naive python dfs first, but aborts if 1 
second is exceeded and only then does the graph_tool dfs. Anyone know 
what’s going on here?

Best

My-Tien

[graph-tool] dfs/bfs seem to have overhead depending on total graph size

My-Tien Nguyen