Evaluating "goodness of fit" of stochastic block model
Hello,

I am trying to fit a stochastic block model to a network in order to ultimately do link prediction. I am reasonably happy comparing different versions of the SBM to each other by computing and comparing posterior odds ratios, so I can determine which configuration is best at describing my data. What I am not sure about yet, however, is how well the model actually describes the data, and whether there is some metric or procedure I could use to get an idea of this. (After all, if all of the models fit my particular network very poorly, then the fact that one is better than the rest doesn't guarantee that it is actually any good at describing my data, and my later predictions based on it will also be off.) Does anybody have any advice on this?

Thank you for your help in advance!

With best wishes,
Philipp
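[Editor's note: a minimal sketch of what this kind of comparison can look like in graph-tool, assuming a placeholder example graph and two candidate variants (a degree-corrected versus a non-degree-corrected fit); the exact keyword for switching degree correction depends on the graph-tool version, so treat this as an illustration rather than a recipe.]

import numpy as np
import graph_tool.all as gt

g = gt.collection.data["football"]   # placeholder; substitute your own network

# Two candidate model variants. Newer graph-tool versions take state_args;
# older ones accepted deg_corr directly as a keyword of minimize_blockmodel_dl().
state_dc  = gt.minimize_blockmodel_dl(g, state_args=dict(deg_corr=True))
state_ndc = gt.minimize_blockmodel_dl(g, state_args=dict(deg_corr=False))

# entropy() returns the description length Sigma = -ln P(A, b); the posterior
# odds ratio between the two fits is Lambda = exp(-(Sigma_1 - Sigma_2)).
S_dc, S_ndc = state_dc.entropy(), state_ndc.entropy()
print("Delta Sigma:", S_dc - S_ndc)
print("log10 posterior odds (DC vs. non-DC):", -(S_dc - S_ndc) / np.log(10))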
I'm not aware of any usable "absolute" goodness-of-fit criterion for SBMs. One can of course use a test statistic based on any particular metric (say, the clustering coefficient, spectral gap, etc.) and test whether networks generated from the fitted model match the data, by computing a p-value, for example. But this should not be used as a criterion for model selection, since models that overfit will have a high p-value.

Prediction quality is also a relative measure, since no probabilistic model will predict edges with 100% precision, even if it is the actual model that generated the data. In the end, we can only compare models to each other (via posterior odds, description length, prediction quality, etc.). We can never really compare them to the "truth", since we have no access to it. Remember that the data is never the "truth" either, since it contains noise.

--
Tiago de Paula Peixoto <tiago@skewed.de>
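[Editor's note: a rough sketch of the predictive check described above, assuming an undirected placeholder graph and the global clustering coefficient as the test statistic. The construction of the group-to-group edge-count matrix (with the within-group counts doubled on the diagonal) and the degree correction follow the conventions assumed here for generate_sbm(); check the documentation of your graph-tool version before relying on them.]

import numpy as np
import graph_tool.all as gt

g = gt.collection.data["football"]        # placeholder; substitute your own network
state = gt.minimize_blockmodel_dl(g)      # fitted (degree-corrected) SBM

b = state.get_blocks().a                  # group membership of each node
B = int(b.max()) + 1
onehot = np.eye(B)[b]                     # n x B membership indicator matrix
A = gt.adjacency(g)                       # sparse adjacency matrix
ers = onehot.T @ (A @ onehot)             # edge counts between groups
                                          # (diagonal counts within-group edges twice)
degs = g.degree_property_map("total").a   # degrees, for the degree-corrected model

obs = gt.global_clustering(g)[0]          # observed test statistic

# Sample surrogate networks from the fitted model and record the statistic.
stats = []
for _ in range(200):
    u = gt.generate_sbm(b, ers, out_degs=degs, directed=False)
    stats.append(gt.global_clustering(u)[0])
stats = np.array(stats)

# Two-sided Monte Carlo p-value: how extreme is the data relative to the model?
p = 2 * min((stats >= obs).mean(), (stats <= obs).mean())
print("observed:", obs, "model mean:", stats.mean(), "p-value:", p)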
participants (2)
- P-M
- Tiago de Paula Peixoto