Hi guys, I'm reaching out because I wrote a script which worked on a smaller scale, but is not working at full scale for calculating PageRank. I have a 656 GB csv file of "edges" on the internet, like so:

Reddit.com Google.com
Reddit.com imgur.com
BuzzFeed.com cnn.com
etc.

Originally, I used the pandas library to prepare the data to send over to graph-tool. It worked by:

1) read the CSV
2) convert everything to lower case
3) create a dataframe of unique vertices (e.g. Google.com, BuzzFeed.com, etc.) with integer indexes
4) in an effort to speed up and save memory, replace the edges dataframe cells with their integer representations, e.g.

Reddit.com Google.com
Reddit.com imgur.com

becomes

3 1
3 69

5) THEN send the edges dataframe over to graph-tool
6) graph-tool does its magic, calculating PageRank
7) pandas takes every returned PageRank score and replaces the integer index with its matching domain, e.g.

3 0.878896437

becomes

Reddit.com 0.878896437

The problem I'm having is that the pandas library can't even finish reading my csv file, even after 3 hours, on the largest AWS instance. I know this isn't related to pandas, but I was wondering if there is a better way to prepare the data to send over to graph-tool that could work at my scale. Is there anything I can do to read the csv file, calculate PageRank, and output all the domain vertices with their respective PageRank scores?

I'm scared my only option is using Spark in some distributed fashion... If that is the case, how do I still get the edges data as integers into graph-tool anyway?

Any feedback is greatly appreciated. Thank you!
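For concreteness, here is a minimal sketch of the pipeline above, assuming a space-separated two-column file at a hypothetical path edges.csv; the column names and the hand-off to graph-tool via add_edge_list are also assumptions about steps 5-7:

    import pandas as pd
    import graph_tool.all as gt

    # 1) read the CSV (two space-separated domains per line; the
    #    separator and file name are assumptions)
    edges = pd.read_csv("edges.csv", sep=r"\s+", header=None,
                        names=["src", "dst"], dtype=str)

    # 2) convert everything to lower case
    edges["src"] = edges["src"].str.lower()
    edges["dst"] = edges["dst"].str.lower()

    # 3+4) build the table of unique vertices with integer indexes,
    #      and replace the edge endpoints with those integers
    codes, domains = pd.factorize(
        pd.concat([edges["src"], edges["dst"]], ignore_index=True))
    n = len(edges)

    # 5+6) hand the integer edge list to graph-tool and run PageRank
    g = gt.Graph(directed=True)
    g.add_edge_list(zip(codes[:n], codes[n:]))
    pr = gt.pagerank(g)

    # 7) map the integer indexes back to their matching domains
    for v in g.vertices():
        print(domains[int(v)], pr[v])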
On 24.05.2017 04:27, B wrote:
> The problem I'm having is that the pandas library can't even finish reading my csv file, even after 3 hours, on the largest AWS instance.
> I know this isn't related to pandas, but I was wondering if there is a better way to prepare the data to send over to graph-tool that could work at my scale. Is there anything I can do to read the csv file, calculate PageRank, and output all the domain vertices with their respective PageRank scores?
> I'm scared my only option is using Spark in some distributed fashion... If that is the case, how do I still get the edges data as integers into graph-tool anyway?
I think the simplest approach is to drop pandas completely, and work with the file directly. You should avoid loading the entire file into memory, and instead use the iterator interface of Python's csv module. As the edges are read, you process them and add them to your graph one by one.

If you look closely, you will see that graph-tool already provides a "load_graph_from_csv" function:

https://graph-tool.skewed.de/static/doc/graph_tool.html#graph_tool.load_grap...

This automates the process and performs some basic processing, like hashing the vertex names. You can create some intermediary iterator that converts things to lower case.

Now, for a file of size 600 GB, this will still be quite slow. Maybe you should take a look at some of the fast CSV parsers out there, e.g.:

http://www.wise.io/tech/paratext

Best,
Tiago

--
Tiago de Paula Peixoto <tiago@skewed.de>
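As an illustration of the streaming approach suggested above, here is a minimal sketch assuming a space-separated file at a hypothetical path edges.csv; Graph.add_edge_list with hashed=True performs the same name-to-integer hashing that load_graph_from_csv does, so no pandas bookkeeping is needed:

    import csv
    import graph_tool.all as gt

    def edge_iter(path):
        # stream rows with the csv module's iterator interface, so
        # the full file never has to fit in memory; lower-case the
        # domains as they are read (delimiter is an assumption)
        with open(path, newline="") as f:
            for row in csv.reader(f, delimiter=" "):
                yield (row[0].lower(), row[1].lower())

    g = gt.Graph(directed=True)
    # hashed=True makes graph-tool map the string vertex names to
    # integer vertex indexes internally; the returned vertex
    # property map holds each vertex's original name
    name = g.add_edge_list(edge_iter("edges.csv"), hashed=True)

    pr = gt.pagerank(g)
    for v in g.vertices():
        print(name[v], pr[v])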