On 11.04.20 at 07:24, Deklan Webster wrote:
It's a sum in log-space, so it's incremental.
Can you show me in code how to use logsumexp in such a way that I don't need to store 6,000,000 * k floats in memory, where k is the number of sampling iterations?
It is so utterly simple, it is *exactly* like doing sums. Here is the log sum of the values from 1 to 1000, without keeping anything in memory:

    import numpy
    from scipy.special import logsumexp

    res = -numpy.inf                      # "zero", i.e. log(0)
    for x in range(1, 1001):              # the values 1 to 1000
        res = logsumexp([res, numpy.log(x)])
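Applied to the question above, the same trick works per edge: keep one running log-sum per candidate edge and fold each sampling iteration into it, so memory stays at 6,000,000 floats instead of 6,000,000 * k. A minimal sketch (sample_log_probs() is a hypothetical stand-in for whatever produces the per-edge log-probabilities of the current sample):

    import numpy as np

    n_candidates = 6_000_000
    running = np.full(n_candidates, -np.inf)   # log(0) for every candidate edge

    k = 100                                    # number of sampling iterations
    for _ in range(k):
        lp = sample_log_probs()                # hypothetical: shape (n_candidates,), log-space
        running = np.logaddexp(running, lp)    # elementwise running log-sum

    log_mean = running - np.log(k)             # log of the average over the k samples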
Most other link prediction methods only provide a ranking or scores for the edges. In order to actually predict edges, all these methods require you to determine how many edges should actually be placed.
In your setup you still end up with a probability in the end. The probability cutoff above which you consider something a "real missing edge" is still arbitrary, correct?
Not quite, at least not up to the choice of a loss function. For binary classification with the 0-1 loss (there are not many alternatives), the optimal threshold is 1/2. (I discuss this in the PRX paper, which I get the impression you did not yet bother to read.)
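As a quick sanity check of that claim (not graph-specific, just numpy): simulate labels drawn from calibrated probabilities and scan thresholds; the empirical 0-1 loss bottoms out at 1/2:

    import numpy as np

    rng = np.random.default_rng(0)
    p = rng.uniform(size=1_000_000)            # calibrated probabilities
    y = rng.uniform(size=p.size) < p           # labels drawn from those probabilities

    thresholds = np.linspace(0.05, 0.95, 19)
    loss = [np.mean((p > t) != y) for t in thresholds]
    print(thresholds[np.argmin(loss)])         # ~0.5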
Using a binary classifier with whatever link prediction features, you can also get probabilities in the end.
Not without making an _arbitrary_ choice of normalization, which results in an _arbitrary_ number of edges being predicted.
For instance, and relating to your point about choosing that f ratio, I counted the edges above several probability cutoffs to get an estimate of the number of "actually missing" edges. I've confirmed that these probabilities are fairly well calibrated by removing edges and predicting them, etc.
39 with probability above 0.80
143 with probability above 0.70
318 with probability above 0.60
Without saying how you normalized the scores to begin with, these values are meaningless.
You should also consider using a different framework, where you leave the missing edges "unobserved" instead of transforming them into non-edges. The distinction is important, and changes the reconstruction problem. You can achieve this with MeasuredBlockState by setting the corresponding n value to zero.
Do you mean setting n_default=0? Just tried that, and it appears that almost all of my existing edges are missing from the marginal graph. Not sure what's going on there. I tried setting mu and nu to extreme values to control this(?), but it doesn't appear to help much. With alpha=1, beta=1, mu=1, nu=9999: 20733 out of 20911 of my original edges are missing. What's happening here?
That is not what I meant. Setting n_default=0 means saying that every non-edge is non-observed. You should set n=0 only for the 'removed' edges, and for a comparable number of actual non-edges.
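For concreteness, a sketch of that setup along the lines of the MeasuredBlockState pattern in the graph-tool docs (here `g` is the observed graph, `removed_pairs` the held-out edges, and `control_pairs` a comparable sample of true non-edges; all three names are assumptions of this example):

    import graph_tool.all as gt

    n = g.new_ep("int", val=1)       # each observed edge was measured once...
    x = g.new_ep("int", val=1)       # ...and seen as present

    # mark the held-out pairs and the control non-edges as *unobserved* (n = 0)
    for u, v in list(removed_pairs) + list(control_pairs):
        e = g.add_edge(u, v)
        n[e] = 0
        x[e] = 0

    state = gt.MeasuredBlockState(g, n=n, x=x, n_default=1, x_default=0)

    # sample from the posterior and collect the marginal inferred network
    u_marg = None
    def collect(s):
        global u_marg
        u_marg = s.collect_marginal(u_marg)

    gt.mcmc_equilibrate(state, force_niter=1000, mcmc_args=dict(niter=10),
                        callback=collect)
    eprob = u_marg.ep.eprob          # posterior probability of each candidate edge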
Okay, so to summarize what's happening here: if alpha and beta are both 1, then p is uniform on [0,1], and it appears the model is favoring very low p for my graph. By setting alpha and beta I can control the distribution of p. By maintaining the ratio alpha/(alpha+beta) while increasing alpha and beta, I can decrease the variance, which forces p to stay higher (at the same mean), since otherwise it drifts toward the lower tail. Is that correct?
Right.
Is there any reason not to set something very aggressive like alpha=500, beta=500? I just tested this as a feature in my binary classifier with 5000 sampling iterations, and it performed much worse than the conditional prob setup (with only 100 sampling iterations). I expected this marginal prob setup to work comparably to the conditional prob setup. Thoughts?
Choosing alpha=beta=infinity means being 100% sure that the value of r (probability of missing edges) is 1/2. Does that value make sense to you? Did you remove exactly 1/2 of the edges? Given the true value of r, the optimal choice should be beta = [(1-r)/r] * alpha, and making alpha -> infinity (in practice, just a large-enough value).

Best,
Tiago

--
Tiago de Paula Peixoto <tiago@skewed.de>
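A small sketch illustrating that closing recipe (r = 0.1 and the alpha values are purely illustrative; scipy is assumed): with beta = [(1-r)/r] * alpha the prior mean stays at r, while increasing alpha shrinks the spread around it:

    from scipy.stats import beta as beta_dist

    r = 0.1                                       # assumed true fraction of missing edges
    for alpha in (1, 10, 100, 1000):
        b = (1 - r) / r * alpha                   # beta = [(1-r)/r] * alpha
        prior = beta_dist(alpha, b)
        print(alpha, prior.mean(), prior.std())   # mean stays at r, std shrinks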