Hi guys, I'm reaching out because I wrote a script which worked on a smaller scale, but is not working at full scale for calculating PageRank. I have a 656 GB csv file of "edges" on the internet, like so:

Reddit.com Google.com
Reddit.com imgur.com
BuzzFeed.com cnn.com
etc.

Originally, I used the pandas library to prepare the data to send over to graph-tool. It worked by:

1) read the CSV
2) convert everything to lower case
3) create a dataframe of unique vertices (e.g. Google.com, BuzzFeed.com, etc.) with integer indexes
4) in an effort to speed up and save memory, replace the edges dataframe cells with their integer representations, e.g.

Reddit.com Google.com
Reddit.com imgur.com

becomes

3 1
3 69

5) THEN send the edges dataframe over to graph-tool
6) graph-tool does its magic, calculating PageRank
7) pandas takes every returned PageRank score and replaces the integer index with its matching domain, e.g.

3 0.878896437

becomes

Reddit.com 0.878896437

The problem I'm having is that the pandas library can't even finish reading my csv file, even after 3 hours, on the largest AWS instance. I know this isn't related to pandas, but I was wondering if there is a better way to prepare the data to send over to graph-tool that could work at my scale. Is there anything I can do to read the csv file, calculate PageRank, and output all the domain vertices with their respective PageRank scores?

I'm scared my only option is using Spark in some distributed fashion... If that is the case, how do I still get the edges data as integers into graph-tool anyway?

Any feedback is greatly appreciated. Thank you!
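For concreteness, here is a minimal sketch of the pipeline above, assuming a space-separated two-column file at a hypothetical path edges.csv; the column names and the hand-off to graph-tool via add_edge_list are also assumptions about steps 5-7:

    import pandas as pd
    import graph_tool.all as gt

    # 1) read the CSV (two space-separated domains per line; the
    #    separator and file name are assumptions)
    edges = pd.read_csv("edges.csv", sep=r"\s+", header=None,
                        names=["src", "dst"], dtype=str)

    # 2) convert everything to lower case
    edges["src"] = edges["src"].str.lower()
    edges["dst"] = edges["dst"].str.lower()

    # 3+4) build the table of unique vertices with integer indexes,
    #      and replace the edge endpoints with those integers
    codes, domains = pd.factorize(
        pd.concat([edges["src"], edges["dst"]], ignore_index=True))
    n = len(edges)

    # 5+6) hand the integer edge list to graph-tool and run PageRank
    g = gt.Graph(directed=True)
    g.add_edge_list(zip(codes[:n], codes[n:]))
    pr = gt.pagerank(g)

    # 7) map the integer indexes back to their matching domains
    for v in g.vertices():
        print(domains[int(v)], pr[v])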
On 24.05.2017 04:27, B wrote:
> The problem I'm having is that the pandas library can't even finish reading my csv file, even after 3 hours, on the largest AWS instance.
> I know this isn't related to pandas, but I was wondering if there is a better way to prepare the data to send over to graph-tool that could work at my scale. Is there anything I can do to read the csv file, calculate PageRank, and output all the domain vertices with their respective PageRank scores?
> I'm scared my only option is using Spark in some distributed fashion... If that is the case, how do I still get the edges data as integers into graph-tool anyway?
I think the simplest approach is to drop pandas completely, and work with the file directly. You should avoid loading the entire file into memory, and instead use the iterator interface of Python's csv module. As the edges are read, you process them and add them to your graph one by one.

If you look closely, you will see that graph-tool already provides a "load_graph_from_csv" function:

https://graph-tool.skewed.de/static/doc/graph_tool.html#graph_tool.load_grap...

This automates the process and performs some basic processing, like hashing the vertex names. You can create some intermediary iterator that converts things to lower case.

Now, for a file of size 600 GB, this will still be quite slow. Maybe you should take a look at some of the fast CSV parsers out there, e.g.:

http://www.wise.io/tech/paratext

Best,
Tiago

--
Tiago de Paula Peixoto <tiago@skewed.de>
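As an illustration of the streaming approach suggested above, here is a minimal sketch assuming a space-separated file at a hypothetical path edges.csv; Graph.add_edge_list with hashed=True performs the same name-to-integer hashing that load_graph_from_csv does, so no pandas bookkeeping is needed:

    import csv
    import graph_tool.all as gt

    def edge_iter(path):
        # stream rows with the csv module's iterator interface, so
        # the full file never has to fit in memory; lower-case the
        # domains as they are read (delimiter is an assumption)
        with open(path, newline="") as f:
            for row in csv.reader(f, delimiter=" "):
                yield (row[0].lower(), row[1].lower())

    g = gt.Graph(directed=True)
    # hashed=True makes graph-tool map the string vertex names to
    # integer vertex indexes internally; the returned vertex
    # property map holds each vertex's original name
    name = g.add_edge_list(edge_iter("edges.csv"), hashed=True)

    pr = gt.pagerank(g)
    for v in g.vertices():
        print(name[v], pr[v])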