Dear Tiago,
I have a directed graph of about half a million nodes and approximately a million edges, following scale-free behaviour with a power-law degree distribution. To test some of my hypotheses, I would like to generate smaller random graphs (about 50 to 200 nodes) that are representative of the big one.
When I used a sample function that samples directly from the real degree distribution of the big network, I had the following problems:
- I generate unconnected nodes with both zero in- AND out-degree.
- I generate small sub-parts of a few nodes that are not connected to the main graph.
- If I only sample from nodes with at least degree 1, the generated graph is coherent, but no longer representative, as I need a large portion of nodes with either only one in- or one out-degree.
Here is the part of my script I used for that, where samples are drawn from dictionaries of the degrees:
import numpy as np
import graph_tool.all as gt

# in_degrees / out_degrees: dicts mapping a node index to its degree in the big graph;
# out_zero_zeros: out-degrees of the nodes whose in-degree is 0;
# num / num_out: the sizes of those dictionaries.

def sample_in():
    a = np.random.randint(num)
    k_in = in_degrees[a]
    return k_in

def sample_out():
    if sample_in() == 0:
        b = np.random.randint(num_out)
        k_out = list(out_zero_zeros.values())[b]  # list() needed, dict views are not indexable
        return k_out
    else:
        b = np.random.randint(num)
        k_out = out_degrees[b]
        return k_out

N = 200
g = gt.random_graph(N, lambda: (sample_in(), sample_out()),
                    model="constrained-configuration", directed=True)
I also tried sampling from a list of tuples, as you have mentioned before on the forum, but I didn't get any results, as the tuples randomly drawn from my list might not be combinable into a valid degree sequence.
degs = [(7, 1), (4, 3), (5, 6), (2, 4), (6, 8), (2, 0), (3, 5), (0, 3), (2, 7), (2, 1)]
g = gt.random_graph(4, lambda i: degs[i], directed=True)
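For completeness, here is a minimal sketch of what I mean by drawing the tuples from the big network itself (the file name and variable names are just placeholders):

import numpy as np
import graph_tool.all as gt

big_g = gt.load_graph("example.gt")  # placeholder for the big network

# joint (in, out) degree sequence of the big graph
vs = big_g.get_vertices()
joint_degs = [(int(k_in), int(k_out))
              for k_in, k_out in zip(big_g.get_in_degrees(vs), big_g.get_out_degrees(vs))]

# draw N tuples uniformly at random; random_graph will complain if the drawn
# sequence is not graphical (e.g. total in- and out-degrees do not match)
N = 200
sample = [joint_degs[i] for i in np.random.choice(len(joint_degs), size=N)]
g = gt.random_graph(N, lambda i: sample[i], directed=True)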
- Is there any option I could activate that would help me in the cases described above?
- Is there a better way to create small, representative networks?
Any help on that issue will be much appreciated.
Best wishes, Jana
On 06.09.18 at 12:37, JM wrote:
I have a directed graph of about half a million nodes and approximately a million edges, following scale-free behaviour with a power-law degree distribution. To test some of my hypotheses, I would like to generate smaller random graphs (about 50 to 200 nodes) that are representative of the big one.
This is an open problem in network science, with no satisfactory solution. Nobody knows how to subsample sparse graphs in a meaningful way. Naive approaches like randomly subsampling nodes and/or edges lead to trivial and highly biased samples.
The code you sent, however, seems to subsample only the degree distribution, not the graph, which is something else entirely, and will not be representative of a larger graph in any other way. I'll not comment further on it, since as I always say, I need a minimal and *self-contained* program that shows whatever problem you might be encountering. Analyzing code fragments like this, decoupled from their context in the larger program, is not a good use of our time.
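Just to make the distinction concrete, subsampling the graph itself would mean something like taking the induced subgraph on a random set of nodes, as in the minimal sketch below (using a dataset from graph-tool's collection; this is precisely the kind of naive approach that produces biased samples):

import numpy as np
import graph_tool.all as gt

g = gt.collection.data["polblogs"]   # any large graph, used only for illustration

# pick 200 nodes uniformly at random and keep only the edges among them
vs = np.random.choice(g.get_vertices(), size=200, replace=False)
vfilt = g.new_vertex_property("bool", val=False)
vfilt.a[vs] = True
u = gt.GraphView(g, vfilt=vfilt)

print(u.num_vertices(), u.num_edges())  # the induced subgraph is much sparser than g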
Best, Tiago
Dear Tiago,
Many thanks for your fast reply! Sorry about the code.
Here is a complete example that will show you my problem if you substitute the loaded "example.gt" with any of your own networks.
In it, I tried to adjust the nodes manually at the end, but I still encounter the problem of small disconnected islands of nodes.
I uploaded the file here:
generation1.py http://main-discussion-list-for-the-graph-tool-project.982480.n3.nabble.com/file/t496114/generation1.py
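For illustration, the islands can be seen by looking at the weakly connected components of the generated graph, e.g. (a sketch; g is the small generated graph from the script):

import graph_tool.all as gt

# g is the small generated graph produced by the script above
comp, hist = gt.label_components(g, directed=False)   # weakly connected components
print(len(hist), "components, with sizes", list(hist))

# keeping only the largest component removes the islands,
# but of course it also changes the degree statistics
g_lcc = gt.extract_largest_component(g, directed=False, prune=True)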
Best wishes, Jana
Following up on this post: I have large datasets for which I already have an NSBM fit. I wanted to speed things up by using an approximate model, so I subsampled a fraction of the nodes (chosen at random), performed NSBM inference on the subsample, and then performed a sort of label transfer to the original graph. Except for the fact that the partitions at level 0 are now larger than the original ones (as expected), I noticed a general concordance between the communities obtained from the subsampled and the full graph. Do you have any literature, ideas or hints about the analysis of subsamples?
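Roughly, the procedure I have in mind looks like this (a minimal sketch with a stand-in dataset; the label-transfer step is only hinted at):

import numpy as np
import graph_tool.all as gt

g = gt.collection.data["celegansneural"]   # stand-in for the full graph

# subsample a fraction of the nodes uniformly at random
frac = 0.5
vs = np.random.choice(g.get_vertices(), size=int(frac * g.num_vertices()), replace=False)
vfilt = g.new_vertex_property("bool", val=False)
vfilt.a[vs] = True
sub = gt.Graph(gt.GraphView(g, vfilt=vfilt), prune=True)

# fit a nested SBM on the subsample only
state = gt.minimize_nested_blockmodel_dl(sub)
b_sub = state.get_levels()[0].get_blocks()

# the label transfer back to the full graph would go here, e.g. assigning each
# unsampled node to the most common group among its sampled neighbours (omitted)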
On 01.09.21 at 19:24, Davide Cittaro wrote:
Following up on this post: I have large datasets for which I already have an NSBM fit. I wanted to speed things up by using an approximate model, so I subsampled a fraction of the nodes (chosen at random), performed NSBM inference on the subsample, and then performed a sort of label transfer to the original graph. Except for the fact that the partitions at level 0 are now larger than the original ones (as expected), I noticed a general concordance between the communities obtained from the subsampled and the full graph. Do you have any literature, ideas or hints about the analysis of subsamples?
If we assume that the big graph is sampled from a SBM, then the sub-sampled graph would also be sampled from a SBM, but not from the same one, if we are dealing with sparse networks. The sub-sampled SBM would be sparser (smaller average degree), and have a deformed degree distribution in the case of the DC-SBM.
The intuition here is that the evidence for the underlying structure will become weaker after sub-sampling, according to how much sparser the network becomes. With the MDL/Bayesian approach in graph-tool, you should see fewer groups in the sub-sampled network, but they should otherwise be similar to the full network.
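For example, this effect can be checked directly with something like the following sketch (using one of the datasets shipped with graph-tool):

import numpy as np
import graph_tool.all as gt

g = gt.collection.data["polbooks"]

# fit on the full graph
full = gt.minimize_nested_blockmodel_dl(g)

# fit on a 50% node subsample
vs = np.random.choice(g.get_vertices(), size=g.num_vertices() // 2, replace=False)
vfilt = g.new_vertex_property("bool", val=False)
vfilt.a[vs] = True
sub = gt.Graph(gt.GraphView(g, vfilt=vfilt), prune=True)
part = gt.minimize_nested_blockmodel_dl(sub)

# the subsample will typically show fewer nonempty groups at the bottom level
print(full.get_levels()[0].get_nonempty_B(), part.get_levels()[0].get_nonempty_B())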
Tiago de Paula Peixoto wrote:
On 01.09.21 at 19:24, Davide Cittaro wrote:
If we assume that the big graph is sampled from a SBM, then the sub-sampled graph would also be sampled from a SBM, but not from the same one, if we are dealing with sparse networks. The sub-sampled SBM would be sparser (smaller average degree), and have a deformed degree distribution in the case of the DC-SBM.
Since my graphs are kNN graphs, would you suggest recomputing them on the subsampled data? To be fair, I've tried, and the results do not change dramatically.
The intuition here is that the evidence for the underlying structure will become weaker after sub-sampling, according to how much sparser the network becomes. With the MDL/Bayesian approach in graph-tool, you should see fewer groups in the sub-sampled network, but they should otherwise be similar to the full network.
This is indeed what I observe. I'm thinking of using this approach to "sniff" the data and, if needed, one can (and should) still use the full network. Also, I'm aware that subsampling will generally wipe out small communities, which won't be identified.
Thanks again
d