This notebook explains p-value in a couple simple cases, and then shows the p-value of the distribution used in the direct AGASC-Gaia cross-match.

It also motivates the choice of p-value threshold used for the final selection.

In [1]:
import sys
sys.path.insert(0, '../')
In [2]:
from agasc_gaia import cross_match as xm
from agasc_gaia import datasets
import scipy.interpolate
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Matching two star catalogs amounts to repeatedly testing whether, given one star from catalog A and one from catalog B, the stars are the same or not. I will take a frequentist approach to hypothesis testing, where the null hypothesis will be that the stars are the same, and the test statistic used is a function of the two stars' magnitude difference and angular separation (including the effects of proper motion). I'll assume that these two factor out:

$ p(s_A, s_B) = p_{mag}(m_A - m_B) p_{dist}(r_{A \rightarrow B}) $

This is what I call the match probability, and I usually denote it $p_{match}$.

The first step is to find the best matches from each catalog: for each star in catalog A, find the star with the best match probability in catalog B. The second step is to test whether the stars are actually a record of the same star. The vast majority of stars will be in both catalogs, and the best match will be correct, but catalog B can be incomplete, or some entries in either catalog could have errors, in which case the best match from catalog B could be incorrect.

To make the decision, one makes a cut in $p_{match}$. To decide the value of the cut, we will use the concept of p-value.

Given a test statistic, and one sample from the test statistic, the p-value of the sample is the probability of randomly getting a more extreme value, where "more extreme" depends on the test being made.

The 1d Gaussian Case¶

Let's consider a test statistic that follows a normal distribution with mean $m$ and standard deviation $\sigma$, and let's consider a two-sided test, in which "more extreme" values can be drawn from either the negative and positive tails of the distribution.

$ p_{m, \sigma}(s) = \frac{1}{\sqrt{2} \sigma}e^{-\frac{(s - m)^2}{2 \sigma^2}} $

Note that this probability density is normalized, in other words:

$ \int_{-\infty}^{\infty} p(s) ds = 1 $

and the probability itself is not necessarilly constrained to the [0,1] interval. For example, $p_{0, 0.1}(0) \simeq 3.99 $

If we draw one value $s_0$ from the distribution, we can then calculate its p-value:

$ p_{value} = \int_{p(s)>p(s_0)} p(s) ds = 1 - \frac{1}{2}\left(\text{erf}\left(\frac{\left| s_0 - m \right| }{\sigma \sqrt{2}}\right) - \text{erf}\left(-\frac{\left| s_0 - m \right|}{\sigma \sqrt{2}}\right)\right) $

In [3]:
# a quick verification of the above, using example values and a simple Riemann sum.
sigma = 0.1
s = np.linspace(-3, 3, 12001)
ds = np.diff(s)
s = s[:-1] + ds/2
p = 1/(np.sqrt(2*np.pi)*sigma) * np.exp(-s**2/(2*sigma**2))
integral = np.sum(p * ds)
print(f"{integral=}, max_p={np.max(p):.3f}")
integral=1.0, max_p=3.989

We can generate a sample following this distribution, to see how the p-values look like.

It should not surprise us that the p-values follow a uniform distribution in [0, 1]

In [4]:
values = np.random.choice(s, p=p * ds, size=10000)
p_values = 1 - 0.5*(scipy.special.erf(np.abs(values)/sigma/np.sqrt(2)) - scipy.special.erf(-np.abs(values)/sigma/np.sqrt(2)))

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plt.sca(axes[0])
_, bins, _ = plt.hist(
    values,
    bins=np.linspace(-3, 3, 601),
    histtype='step',
    density=True,
)
plt.plot(s, p)
plt.xlim((-3*sigma, 3*sigma))
plt.sca(axes[1])
plt.hist(
    p_values,
    histtype='step',
);
No description has been provided for this image

Now the question is how we do this numerically and in more dimensions.

Note that both $p_{value}$ and $p_0$ are functions of $s_0$. This means there is an implicit function mapping $p_0$ to $p_{value}$ ($p_{value} = f(p_0)$). There are some conditions that need to be met for this to be true, which amounts to requiring that functions should be single-valued. This will be fine in this case.

We want to find this function so we can apply it to a large sample in more dimensions.

To see how the function should look like in this case, we make a scatter plot:

In [5]:
prob = 1/(np.sqrt(2*np.pi)*sigma) * np.exp(-values**2/(2*sigma**2))

fig, axes = plt.subplots(1, 1, figsize=(6, 4))
plt.plot(prob, p_values, '.')
plt.xlabel('Probability')
plt.ylabel('p-value')
Out[5]:
Text(0, 0.5, 'p-value')
No description has been provided for this image
In [6]:
# this is one way to draw a single value and calculate its p_value using the same example
indices = np.arange(s.size)
idx = np.random.choice(indices, p=p * ds)
s_0 = s[idx]
p_0 = p[idx]
p_value = np.sum(p[p < p_0] * ds[p < p_0])
p_value_2 = 1- 0.5*(scipy.special.erf(-s_0/sigma/np.sqrt(2)) - scipy.special.erf(s_0/sigma/np.sqrt(2)))
print(f"value: {s_0}")
print(f"{p_value=} (numerical)")
print(f"p_value={p_value} (analytical)")
value: -0.04125000000000001
p_value=0.6781415866630691 (numerical)
p_value=0.6781415866630691 (analytical)

The 2d Gaussian Case¶

In the 2d case, we will follow three approaches.

The first one the same procedure, using a match probability that is Gaussian with two different $\sigma$:

  • calculate $p_{match}$ on a grid
  • generate a random sample distributed according to $p_{match}$
  • choose a 1d grid for the values of $p_0$ (which generally would be within (0,1) if $p_{match}$ is normalized)
  • calculate the fraction of points in the sample larger than $p_0$ for each $p_0$
  • interpolate this number as a function of $p_0$

The second one will be an improvement on the first one. The difficulty of generating the $p_{value}$ function this way is that it is expensive to populate the tails of the distribution. One way to deal with this is to oversample the tails using a statistical weight.

The third one is the analytical solution which results from integrating the 2d-gaussian, which is very easy.

Note how much faster the weighted function is, and how it matches the analytical solution all the way.

In [7]:
def generate_sample(probability, range_1, range_2, n_sample=10000000, n_grid=100):
    """
    Generate a sample of n_sample values from the given distribution.
    
    This function is not great. The accuracy depends on having a small grid size, given by the
    number of points `n_grid`.
    """
    # we make a grid with sampling probabilities according to the given function
    d_mag = np.linspace(range_1[0], range_1[1], n_grid+1)
    d2d = np.linspace(range_2[0], range_2[1], n_grid+1)
    d_d_mag, d_d2d = np.diff(d_mag), np.diff(d2d)
    d_mag = (d_mag[1:] + d_mag[:-1]) / 2
    d2d = (d2d[1:] + d2d[:-1]) / 2
    d_mag, d2d = np.meshgrid(d_mag, d2d)
    d_d_mag, d_d2d = np.meshgrid(d_d_mag, d_d2d )
    d_area = d_d_mag * d_d2d


    p_match = probability(d_mag, d2d)
    norm_p_match = p_match / np.sum(p_match * d_area) 

    idx = np.random.choice(
        np.arange(len(norm_p_match.flatten())),
        size=n_sample,
        p=(norm_p_match / np.sum(norm_p_match)).flatten()
    )
    p_match_sample = p_match.flatten()[idx]
    return p_match_sample

def get_p_value_function_mc(sigma_1, sigma_2, n_grid=10000, n_sample=10000000):
    # generate a sample according to that distribution
    # this distribution has all weights equal to 1
    # as a result, the tails are not very well sampled.
    p_match_sample = generate_sample(
        lambda x, y: gaussian(x, y, sigma_1=sigma_1, sigma_2=sigma_2),
        (-20 * sigma_1, 20 * sigma_1),
        (-20 * sigma_2, 20 * sigma_2),
        n_sample=n_sample,
        n_grid=n_grid
    )

    # calculate the CDF
    eps = np.finfo(p_match_sample.dtype).eps
    bins = np.logspace(np.log(p_match_sample[p_match_sample>0].min()*(1-eps)), np.log(p_match_sample.max()*(1+eps)), 1000000)
    vals, bins = np.histogram(
        p_match_sample,
        bins=bins,
    )
    n = np.cumsum(vals)/np.sum(vals)
    # and make an interpolator to get the CDF value given the match probability
    x = (bins[1:] + bins[:-1]) / 2
    return scipy.interpolate.interp1d(
        x,
        n,
        fill_value=(np.min(n), np.max(n)),
        bounds_error=False
    )
In [8]:
def get_p_value_function_mc_weighted(sigma_1, sigma_2, n_grid=10000, n_sample=10000000, p_sample_min=1e-3):
    # we make a grid with sampling probabilities according to the given function
    n = n_grid
    m = 30 * sigma_1
    r = 30 * sigma_2
    d_mag, d2d = np.linspace(-m, m, 3*n+1), np.linspace(-r, r, n+1)
    d_d_mag, d_d2d = np.diff(d_mag), np.diff(d2d)
    d_mag = (d_mag[1:] + d_mag[:-1]) / 2
    d2d = (d2d[1:] + d2d[:-1]) / 2
    d_mag, d2d = np.meshgrid(d_mag, d2d)
    d_d_mag, d_d2d = np.meshgrid(d_d_mag, d_d2d )
    d_area = d_d_mag * d_d2d


    p_match = gaussian(d_mag, d2d, sigma_1=sigma_1, sigma_2=sigma_2)
    norm_p_match = p_match / np.sum(p_match * d_area)

    # now generate a sample according to that distribution
    # up-sample the tails, so the probability never goes below the given value
    p_sample_min = p_sample_min * norm_p_match.max()

    sample_prob = np.where(norm_p_match > p_sample_min, norm_p_match, p_sample_min)
    sample_weights = (norm_p_match / sample_prob)
    sample_prob = sample_prob / np.sum(sample_prob)
    idx = np.random.choice(
        np.arange(len(sample_prob.flatten())),
        size=n_sample,
        p=(sample_prob).flatten()
    )
    p_match_sample = p_match.flatten()[idx]
    p_match_sample_weights = sample_weights.flatten()[idx]
    # calculate the CDF
    eps = np.finfo(p_match_sample.dtype).eps
    # bins = np.linspace(p_match_sample.min()*(1-eps), p_match_sample.max()*(1+eps), 1000000)
    bins = np.logspace(np.log(p_match_sample[p_match_sample>0].min()*(1-eps)), np.log(p_match_sample.max()*(1+eps)), 1000000)
    vals, bins = np.histogram(
        p_match_sample,
        bins=bins,
        weights=p_match_sample_weights,
    )
    n = np.cumsum(vals)/np.sum(vals)
    # and make an interpolator to get the CDF value given the match probability
    x = (bins[1:] + bins[:-1]) / 2
    return scipy.interpolate.interp1d(
        x,
        n,
        fill_value=(np.min(n), np.max(n)),
        bounds_error=False
    )
In [9]:
# the 2d gaussian case is easy because we can integrate it,
# and the p-value is the integral from r to infinity, where
# r**2 = (x/sigma_1)**2 + (y/sigma_2)**2
# so p_value = np.exp(-0.5 * ((x/sigma)**2 + (y/sigma)**2))
# which happens to be the same function
# In other words: the function p_value = f(p_match) is the identity in the 2d-gaussian case
def gaussian(x, y, sigma_1=1, sigma_2=2):
    return np.exp(-0.5 * ((x/sigma_1)**2 + (y/sigma_2)**2))
In [10]:
sigma = 1
In [11]:
f = get_p_value_function_mc(sigma, sigma, n_grid=10000, n_sample=100000)
f2 = get_p_value_function_mc(sigma, sigma, n_grid=10000, n_sample=1000000)
f3 = get_p_value_function_mc_weighted(sigma, sigma, n_grid=1000, n_sample=10000)

NOTE: an important detail is that this test case is trivial. The function $p_{value} = f(p_{match})$ is the identity! Which means we are trying to numerically reproduce the identity function.

In the following figure, we plot the p-value as a function of p-match and as a function of displacement from the origin along a given axis out to 10-sigma.

In [12]:
x = np.linspace(0, 10, 50)
y = x
p_0 = gaussian(x, y, sigma_1=sigma, sigma_2=sigma)


fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plt.sca(axes[0])
plt.plot(p_0, p_0, '-', label='identity')
plt.plot(p_0, f3(p_0), '.', label='weight (100000)')
plt.plot(p_0, f2(p_0), '.', label='no weight, 1000000')
plt.plot(p_0, f(p_0), '.', label='no weight, 100000')

plt.xscale('log')
plt.yscale('log')
plt.title(r'p-value function ($p_{value} = f(p_{match})$)')
plt.xlabel(r'$p_{match}$')
plt.ylabel(r'$p_{value}$')
plt.legend()

plt.sca(axes[1])
plt.plot(x, p_0, label='analytical')
plt.plot(x, f3(p_0), '.', label='weight (100000)')
plt.plot(x, f2(p_0), '.', label='no weight, 1000000')
plt.plot(x, f(p_0), '.', label='no weight, 100000')
plt.yscale('log')

plt.title(r'p-value Vs displacement')
plt.xlabel(r'$\Delta x$')
plt.ylabel(r'$p_{value}$')

plt.legend()
# plt.ylim((1e-6, 2))
Out[12]:
<matplotlib.legend.Legend at 0x13609fd90>
No description has been provided for this image

Let's check the distribution of p-values in a sample drawn from the 2d-gaussian. It should not surprise us that the distribution is uniform.

In [13]:
sample = generate_sample(
    lambda x, y: gaussian(x, y, sigma_1=sigma, sigma_2=sigma),
    (-5 * sigma, 5 * sigma),
    (-5 * sigma, 5 * sigma),
    n_sample=1000000,
    n_grid=1000,
)

vals, bins, _ = plt.hist(
    sample,
    histtype='step',
)
print(np.mean(vals), np.std(vals) / np.sqrt(np.mean(vals)))
100000.0 0.7367808357985433
No description has been provided for this image

The Actual Function Used¶

The actual probability distribution use as $p_{match}$ is not Gaussian. You will see that as a result, the function $p_{value} = f(p_{match})$ is not the identity, although it is still a monotonically increasing function from 0 to 1.

Here we have a function that calculates p_value, and we compare that with the function in version control.

In [14]:
def get_p_value_function_mc():
    print("Getting p-value function")
    print("- Defining sampling space")
    n_grid = 3000  # this drives the time
    max_d_mag = 20
    d_mag = np.linspace(-max_d_mag, max_d_mag, 2 * n_grid + 1)
    # d2d = np.logspace(-4, 1, n_grid+1)  # this should not affect the result, but it does
    d2d = np.linspace(0, 10, n_grid + 1)
    d_d_mag, d_d2d = np.diff(d_mag), np.diff(d2d)
    d_mag = (d_mag[1:] + d_mag[:-1]) / 2
    d2d = (d2d[1:] + d2d[:-1]) / 2
    d_mag, d2d = np.meshgrid(d_mag, d2d)
    d_d_mag, d_d2d = np.meshgrid(d_d_mag, d_d2d)
    d_area = d_d_mag * d_d2d * d2d  # d2d is the 2d radius, so the area is approx r dr)

    print("- Calculating match probability")
    p_match = xm.agasc_gaia_match_probability(d_mag, d2d)
    norm_p_match = p_match / np.sum(p_match * d_area)

    print("- Generating Random sample")
    # now generate a sample according to that distribution
    # up-sample the tails, so the probability never goes below the given value
    p_sample_min = 1e-3
    p_sample_min = p_sample_min * norm_p_match.max()

    sample_prob = np.where(norm_p_match > p_sample_min, norm_p_match, p_sample_min)
    sample_weights = norm_p_match / sample_prob
    sample_prob = sample_prob / np.sum(sample_prob)  # need to normalize for np.random

    # in general, the number of samples determines how much we explore the tails and how close we
    # get to zero, but thanks to up-sampling, the tails will be sampled, so we do not need millions
    # of samples
    idx = np.random.choice(
        np.arange(len(norm_p_match.flatten())), size=100000, p=(sample_prob).flatten()
    )
    # d_mag_sample = d_mag.flatten()[idx]
    # d2d_sample = d2d.flatten()[idx]
    p_match_sample = p_match.flatten()[idx]
    p_match_sample_weights = sample_weights.flatten()[idx]
    eps = np.finfo(p_match_sample.dtype).eps
    print("- Calculating CDF")
    bins = np.logspace(
        np.log(p_match_sample[p_match_sample > 0].min() * (1 - eps)),
        np.log(p_match_sample.max() * (1 + eps)),
        1000000,  # the number of bins directly affects the lowest p_value we will get
    )
    vals, bins = np.histogram(p_match_sample, bins=bins, weights=p_match_sample_weights)
    x = (bins[1:] + bins[:-1]) / 2
    n = np.cumsum(vals) / np.sum(vals)
    return scipy.interpolate.interp1d(
        x, n, fill_value=(np.min(n), np.max(n)), bounds_error=False
    )
In [15]:
get_p_value_sample = get_p_value_function_mc()
Getting p-value function
- Defining sampling space
- Calculating match probability
- Generating Random sample
- Calculating CDF
In [16]:
get_p_value_1p8 = xm.get_p_value_function()
In [17]:
# Note that this probability can be larger than one, because this a probability density function
p_max = xm.agasc_gaia_match_probability(0.,0.)
print(p_max, get_p_value_sample(p_max))
3.2217620507337608 0.9994103897131031
In [18]:
def plot_p_value_contours():
    d_mag = np.linspace(-8, 8, 1001)
    d_angle = np.linspace(0, 4, 1001)

    d_mag, d_angle = np.meshgrid(d_mag, d_angle)
    p_match = xm.agasc_gaia_match_probability(d_mag, d_angle)
    p_value = get_p_value_1p8(p_match)

    contour = plt.contour(
        d_mag,
        d_angle,
        np.log(p_value),
        levels=np.log(np.array([0.00001, 0.0001, 0.001, 0.01, 0.02, 0.1, 0.99])),
        cmap='winter',
        vmax=3
    )

    def fmt(x):
        x = 100*np.exp(x)
        if x < 0.01:
            return f"{x:.3f}%"
        if x < 0.1:
            return f"{x:.2f}%"
        if x < 1:
            return f"{x:.1f}%"
        return f"{x:.0f}%"

    plt.clabel(contour, inline=1, fontsize=8, fmt=fmt)
    plt.xlabel("$\Delta$Mag")
    plt.ylabel("radial offset (arcsec)")
    plt.title("p-value contours")

def plot_p_value_v_mag(delta_angle=0):
    delta_mag = np.linspace(0, 8, 1000)
    p_match = xm.agasc_gaia_match_probability(delta_mag, delta_angle)
    plt.plot(delta_mag, get_p_value_sample(p_match))
    plt.yscale("log")
    plt.xlabel('Magnitude difference')
    plt.ylabel('p-value')
    plt.title('p-value Vs magnitude difference')


def plot_p_value_v_angle(delta_mag=0):
    delta_angle = np.linspace(0, 4, 1000)
    p_match = xm.agasc_gaia_match_probability(delta_mag, delta_angle)
    plt.plot(delta_angle, get_p_value_sample(p_match))
    plt.yscale("log")
    plt.xlabel('Angular separation (arcsec)')
    plt.ylabel('p-value')
    plt.title('p-value Vs angular separation')
In [19]:
fix, axes = plt.subplot_mosaic(
    [
        ["function", "2d"],
        ["angle", "mag"]
    ],
    figsize=(12, 8)
)
plt.sca(axes["function"])
x = np.linspace(0, p_max, 100)
plt.plot(x, get_p_value_sample(x), label="this notebook")
plt.plot(x, get_p_value_1p8(x), '--', label="1.8")
plt.xlabel('Match probability')
plt.ylabel('p-value')
plt.title('p-value function')
plt.legend()

plt.sca(axes["angle"])
plot_p_value_v_angle()

plt.sca(axes["mag"])
plot_p_value_v_mag()

plt.sca(axes["2d"])
plot_p_value_contours()
plt.tight_layout()
No description has been provided for this image

Again, let's look at the distribution of p-values (and again we should not be surprised that it is flat):

In [20]:
p_match = generate_sample(
    xm.agasc_gaia_match_probability,
    (-8, 8),
    (-3, 3),
    n_sample=10000,
    n_grid=5000,
)


p_value = get_p_value_sample(p_match)

vals, bins, _ = plt.hist(
    p_value,
    histtype='step',
)
print(np.mean(vals), np.std(vals) / np.sqrt(np.mean(vals)))
1000.0 0.8749857141690944
No description has been provided for this image

p-values in Data¶

The match probability¶

This is the match probability, and the plot showing where it came from, although it is not exactly right. The assumption is that the probability is the product of the radial and magnitude probabilities:

$p_{match}(\Delta mag, \Delta r) = p_{mag}(\Delta mag) p_{r}(\Delta r)$

The magnitude distribution used comes from the agasc-mag-models notebook, and the radial one was developed as follows, adjusting the $\Delta r$ distribution using the indirect method.

The result is not too far off, and will cause no issues, but strictly speaking it is not correct. A section will be added at the end of this notebook to show that.

In [21]:
# agasc_direct_2 = xm.get_agasc_gaia_x_match()
agasc_direct = xm.get_agasc_gaia_x_match_difficult_fixed()
agasc_direct = agasc_direct[agasc_direct["best_match"]]
In [22]:
agasc_summary = datasets.get_agasc_summary()
In [23]:
i = np.searchsorted(agasc_summary["agasc_id"], agasc_direct["agasc_id"])
cols = [
    "class"
]
for col in cols:
    agasc_direct[col] = agasc_summary[col][i]
In [24]:
agasc_direct = agasc_direct[np.in1d(agasc_direct["class"], [0, 2, 6])]
In [25]:
# agasc_direct = agasc_direct[(~agasc_direct['d2d'].mask) & (~agasc_direct['d_mag'].mask)]
agasc_direct['p_match'] = xm.agasc_gaia_match_probability(agasc_direct['d_mag'], agasc_direct['d2d'])
agasc_direct['p_value'] = get_p_value_1p8(agasc_direct['p_match'])
agasc_direct['p_value_notebook'] = get_p_value_sample(agasc_direct['p_match'])

agasc_direct = agasc_direct[['d2d', 'd_mag', 'p_match', 'p_value', 'p_value_notebook']]
In [26]:
agasc_indirect = xm.get_agasc_tycho_gsc_gaia_x_match()

i = np.searchsorted(agasc_summary["agasc_id"], agasc_indirect["agasc_id"])
cols = [
    "class"
]
for col in cols:
    agasc_indirect[col] = agasc_summary[col][i]

agasc_indirect = agasc_indirect[np.in1d(agasc_indirect["class"], [0, 2, 6])]

# agasc_indirect = agasc_indirect[(~agasc_indirect['d2d'].mask) & (~agasc_indirect['d_mag'].mask)]
agasc_indirect['p_match'] = xm.agasc_gaia_match_probability(agasc_indirect['d_mag'], agasc_indirect['d2d'])
agasc_indirect['p_value'] = get_p_value_1p8(agasc_indirect['p_match'])
agasc_indirect['p_value_notebook'] = get_p_value_sample(agasc_indirect['p_match'])

agasc_indirect = agasc_indirect[['d2d', 'd_mag', 'p_match', 'p_value', 'p_value_notebook']]
In [27]:
bins = np.linspace(0, 5, 51)
# dx = np.diff(bins)
dx = 2 * np.pi * (bins[1:]**2 - bins[:-1]**2)
x = bins[:-1] + dx/2

plt.hist(
    agasc_direct['d2d'],
    bins=bins,
    density=True,
    histtype='step',
    label='direct'
)

plt.hist(
    agasc_indirect['d2d'],
    bins=bins,
    density=True,
    histtype='step',
    label='indirect'
)
x2 = np.linspace(0, 4, 1001)
dx2 = np.diff(x2)
x2 = x2[:-1] + dx2 / 2
p_d2d = xm.d2d_probability(x2)
p_d2d /= np.sum(dx2 * p_d2d)
plt.plot(x2, p_d2d)
plt.legend()
plt.yscale('log')
# plt.xlim((0, 1))
No description has been provided for this image
In [28]:
bins = np.linspace(0, 5, 51)
# dx = np.diff(bins)
dx = 2 * np.pi * (bins[1:]**2 - bins[:-1]**2)
x = bins[:-1] + dx/2

p = [0.89020772, 0.09891197, 0.00791296, 0.00296736]

plt.hist(
    agasc_indirect['d2d'],
    bins=bins,
    density=True,
    histtype='step',
    label='AGASC-Tycho2-GSC2.3-Gaia'
)
x2 = np.linspace(0, 4, 1001)
dx2 = np.diff(x2)
x2 = x2[:-1] + dx2 / 2
p_d2d = xm.d2d_probability(x2)
p_d2d /= np.sum(dx2 * p_d2d)
plt.plot(x2, p_d2d, color='r')
y = p[0] * xm.gaussian_d2d_probability_(x2, sigma_d2d=0.25)
plt.plot(x2[y>1e-5], y[y>1e-5], ":", color="r")
y = p[1] * xm.gaussian_d2d_probability_(x2, sigma_d2d=0.5)
plt.plot(x2[y>1e-5], y[y>1e-5], ":", color="r")
y = p[2] * xm.gaussian_d2d_probability_(x2, sigma_d2d=0.8)
plt.plot(x2[y>1e-5], y[y>1e-5], ":", color="r")
y = p[3] * xm.gaussian_d2d_probability_(x2, sigma_d2d=1.2)
plt.plot(x2[y>1e-5], y[y>1e-5], ":", color="r")
plt.legend()
plt.yscale('log')
plt.title("AGASC-Gaia Angular Separation")
plt.xlabel("Angular Separation (arcsec)")
# plt.xlim((0, 1))
Out[28]:
Text(0.5, 0, 'Angular Separation (arcsec)')
No description has been provided for this image

The p-value distribution¶

In [29]:
plt.hist(
    agasc_direct['p_value'],
    bins=np.linspace(0, 1, 101),
    density=False,
    histtype='step',
    label='direct (AGASC-Gaia)'
)
plt.hist(
    agasc_indirect['p_value'],
    bins=np.linspace(0, 1, 101),
    density=False,
    histtype='step',
    label='indirect (AGASC-Tycho2-GSC2.3-Gaia)'
)
plt.legend()
# plt.axvline(0.006, color='r')
plt.xlabel('p-value')
plt.title("p-value distribution")
# plt.yscale('log')
Out[29]:
Text(0.5, 1.0, 'p-value distribution')
No description has been provided for this image

Zooming in on the small-p-value range, near the cutoff:

In [30]:
plt.hist(
    agasc_direct['p_value'],
    bins=np.linspace(0, 0.05, 51),
    density=False,
    histtype='step',
    label='direct (AGASC-Gaia)'
)
plt.hist(
    agasc_indirect['p_value'],
    bins=np.linspace(0, 0.05, 51),
    density=False,
    histtype='step',
    label='indirect (AGASC-Tycho2-GSC2.3-Gaia)'
)
plt.legend()
# plt.axvline(0.02, color='r')
plt.axvline(0.006, linestyle='--', color='r')
# plt.axvline(0.002, linestyle=':', color='r')
plt.xlabel('p-value')
plt.title("p-value distribution")
plt.yscale('log')
No description has been provided for this image
In [31]:
print(f"""
len(agasc_direct),
p_value < 0.02:  {np.count_nonzero(agasc_direct['p_value'] < 0.02)/len(agasc_direct)*100:.2}%
p_value < 0.006: {np.count_nonzero(agasc_direct['p_value'] < 0.006)/len(agasc_direct)*100:.2}%
p_value < 0.002: {np.count_nonzero(agasc_direct['p_value'] < 0.002)/len(agasc_direct)*100:.2}%
""")
len(agasc_direct),
p_value < 0.02:  2.0%
p_value < 0.006: 1.0%
p_value < 0.002: 0.59%

Here is a histogram of all matches, binned by $\Delta mag$ and $d_{2d}$.

One can clearly see that the assumed distribution is not the same as the actual one. Most notably, there are two extra populations:

  • one extending to larger angular separation, up to 4 arcsec, with a magnitude difference of ~1 mag.
  • Another with very small angular separation, and a magnitude difference > 2.

The important questions are:

  • do the stars in these populations are the true matches (Was the original catalog that far off?)
  • if we update AGASC according to the matches in these populations, will it improve performance in some way?

Upon inspection, the first population is mostly made up of pairs of Gaia stars with a corresponding AGASC star placed right in between. In this case, the Gaia stars tend to be fainter than the AGASC star. Updating the magnitude an position of these stars would essentially ignore the second star in the pair, and arguably make the catalog worse. The best action to take would be to replace the one AGASC star with two new ones, but that is beyond the scope of this update.

The second population could be stars that changed magnitude since last observed. If that is the case, it would be worthwhile to update these.

In [32]:
#
sns.histplot(
    # x=agasc_direct['d2d'][agasc_direct["p_value"] > 0.02],
    # y=agasc_direct['d_mag'][agasc_direct["p_value"] > 0.02],
    x=agasc_direct['d2d'],
    y=agasc_direct['d_mag'],
    bins=[
        np.linspace(0, 5, 1000),
        np.linspace(-6, 6, 1000)
    ]
)

d_mag = np.linspace(-6, 6, 101)
d_angle = np.linspace(0, 3.5, 101)

d_mag, d_angle = np.meshgrid(d_mag, d_angle)
p_match = xm.agasc_gaia_match_probability(d_mag, d_angle)
p_value = get_p_value_1p8(p_match)

contour = plt.contour(
    d_angle,
    d_mag,
    p_value,
    # levels=np.array([0.002, 0.006, 0.02]),
    levels=np.array([0.006]),
    cmap='winter',
    vmax=3
)

plt.clabel(contour, inline=1)
plt.ylabel("Magnitude difference")
plt.xlabel("Angular separation (arcsec)")
Out[32]:
Text(0.5, 0, 'Angular separation (arcsec)')
No description has been provided for this image
In [33]:
d2d_probability = xm.d2d_probability(agasc_direct['d2d'])

plt.hist(
    d2d_probability,
    bins=np.linspace(0, 1, 101),
    density=False,
    histtype='step',
    label='direct'
)

plt.legend()
plt.axvline(0.02, color='r')
# plt.yscale('log')
Out[33]:
<matplotlib.lines.Line2D at 0x136c4b850>
No description has been provided for this image
In [34]:
#
sns.histplot(
    # x=agasc_direct['d2d'][agasc_direct["p_value"] > 0.02],
    # y=agasc_direct['d_mag'][agasc_direct["p_value"] > 0.02],
    x=agasc_direct['d2d'],
    y=agasc_direct['d_mag'],
    bins=[
        np.linspace(0, 5, 1000),
        np.linspace(-6, 6, 1000)
    ]
)

d_mag = np.linspace(-6, 6, 101)
d_angle = np.linspace(0, 3.5, 101)

d_mag, d_angle = np.meshgrid(d_mag, d_angle)
p_match = xm.agasc_gaia_match_probability(0, d_angle)
p_value = get_p_value_1p8(p_match)

contour = plt.contour(
    d_angle,
    d_mag,
    p_match,
    levels=np.array([0.002, 0.006, 0.02]),
    cmap='winter',
    vmax=3
)

plt.clabel(contour, inline=1)
Out[34]:
<a list of 3 text.Text objects>
No description has been provided for this image

Looking at "Difficult" Stars¶

Difficult stars are stars that had the same Gaia match.

In [35]:
# agasc_difficult = utils.Cache.load('data/agasc-gaia-x-match-difficult.h5')
agasc_difficult = xm.get_agasc_gaia_x_match_difficult()
In [36]:
plt.hist(
    agasc_direct['p_value'],
    bins=np.linspace(0, 1, 101),
    # density=True,
    histtype='step',
    label='all'
)
plt.hist(
    agasc_difficult['p_value'][agasc_difficult['best_match']],
    bins=np.linspace(0, 1, 101),
    # density=True,
    histtype='step',
    label='difficult'
)
plt.legend()
# plt.axvline(0.02, color='r')
plt.yscale('log')
No description has been provided for this image
In [37]:
# some stars have NaN p-value but are nonetheless best matches.
# This is because d_mag is NaN. Might be good to look into it.
agasc_difficult[np.isnan(agasc_difficult['p_value']) & agasc_difficult['best_match']][["agasc_id", "gaia_id", "group", "d2d", "d_mag"]]
Out[37]:
Table length=29
agasc_idgaia_idgroupd2dd_mag
int64int64int64float16float16
86509569372725894833080325161.329nan
12006259311799376434683947526670.1276nan
17381485733753579024952020489150.2021nan
190317945394211917553485350410090.259nan
193987737123744203288152012810181.021nan
226755120284924509329810931211854.734nan
23083809710559998724545113612093.352nan
25834317773957756289641062413330.2234nan
263982249128376383928393638413530.182nan
30172880931814496510235737615070.1121nan
...............
807803705628117361636818547237610.8857nan
853674521289798455397198873639700.00406nan
881722433617624499263894374440900.11334nan
928386985288423450823110361643220.363nan
1008360913543317678957115916847580.3652nan
1041766425669368547259197017650100.1804nan
1053300329492950813942827046450360.1406nan
1171147665529649752562330521659490.05103nan
1196039553649109890898671116862920.0652nan
1196432633639381666278074444862940.146nan

Better p-match¶

The matching probability shown above is not very good. It works because it is not completely off, but I would like to have a better function, if only for the record.

The issue is in the radial part, which is a function of d2d, the angular separation between an AGASC star and its Gaia counterpart, which I will call r in this section. When one makes a histogram of r, the number of counts is

$p(r, \theta) r dr d\theta$

We want an estimate of $p(r, \theta)$, and we want to make sure this function is monotonically decreasing.

On option is to take a fit to the d2d distribution from the indirect method (the histogram above). What follows estimates just that. It is currently not used, and I believe it will not improve much, but here it is.

In [38]:
# the contents of this section are implemented in the following function of the cross_match module
# This is a smooth interpolation of the d2d distribution in the direct AGASC-Gaia matches:
_, direct_d2d = xm.smooth_d2d_probability(agasc_direct['d2d'])

# and this is the same for the indirect matches
indirect_d2d = xm.d2d_probability_interpolated
In [39]:
# If one uses `density=True` in the call to `plt.hist`, one ends up plotting `r * p(dr, theta)`
# whereas `xm.d2d_probability` is `p(dr, theta)`. The biggest difference is around zero.
# everything looked ok above because of a convenient choice of binning.

bins = np.linspace(0, 5, 801)
# dx = 2 * np.pi * (bins[1:]**2 - bins[:-1]**2)  # dx in cylindrical coordinates integrated over angle
# x = bins[:-1] + dx/2

plt.hist(
    agasc_indirect['d2d'],
    bins=bins,
    density=True,
    histtype='step',
)
x2 = np.linspace(0, 4, 1001)
dx2 = np.diff(x2)
x2 = x2[:-1] + dx2 / 2
p_d2d = xm.d2d_probability(x2)
p_d2d /= np.sum(dx2 * p_d2d)
plt.plot(x2, p_d2d, label="Current implementation")

v = x2 * xm.d2d_probability_interpolated(x2)
v = v / np.sum(v * dx2)
plt.plot(x2, v, label="Smooth interpolation")

plt.xlabel("r (arcsec)")
plt.ylabel("r p(r)")

plt.legend()
plt.xlim((0, 1))
# plt.yscale('log')
# plt.ylim((0.01, 5))

A simple idea is to just estimate the probability density function $p(d2d)$ using a KDE algorithm. However, the KDE algorithm displays biases close to zero arising from the asymmetry of the observed distribution:

In [ ]:
bins = np.logspace(-5, 1, 501)
vals, _ = np.histogram(
    agasc_indirect['d2d'],
    bins=bins,
)
vals = vals / np.diff(bins) / len(agasc_indirect)

plt.step(
    bins,
    np.concatenate([[0], vals]),
    label='indirect'
)

kde = scipy.stats.gaussian_kde(agasc_indirect['d2d'])

b = np.logspace(-4, 0, 40)
plt.plot(b, kde(b), label='kde')

plt.xlabel("r (arcsec)")
plt.ylabel("r p(r)")

plt.xscale('log')
No description has been provided for this image

One can estimate the probability density function as a function of log(d2d) and transform back into d2d-space:

In [ ]:
bins = np.logspace(-5, 1, 501)
vals, _ = np.histogram(
    agasc_indirect['d2d'],
    bins=bins,
)
vals = vals / np.diff(bins) / len(agasc_indirect)

plt.step(
    bins,
    np.concatenate([[0], vals]),
)

kde2 = scipy.stats.gaussian_kde(np.log(agasc_indirect['d2d']))
kde = scipy.stats.gaussian_kde(agasc_indirect['d2d'])
b = bins[:]

y2 = kde2(np.log(b)) / (b)
y = kde(b)

plt.plot(b, y, label='kde')
plt.plot(b, y2, label='kde (log scale)')

plt.xlabel("r (arcsec)")
plt.ylabel("r p(r)")

plt.xscale('log')
plt.legend()
Out[ ]:
<matplotlib.legend.Legend at 0x1419a3250>
No description has been provided for this image

The following is a simple fit to the d2d probability distribution funtion, using the KDE method above, and then fitting a function to smooth out fluctuations and ensure the $p(d2d)$ function is monotonically decreasing.

In [ ]:
def multi_sigmoid(x, grid, scales, widths):
    grid = np.asanyarray(grid)
    scales = np.asanyarray(scales)
    widths = np.asanyarray(widths)
    x, grid, scales, widths = np.broadcast_arrays(x[None,:], grid[:,None], scales[:,None], widths[:,None])
    return np.sum(
        scales / (1 + np.exp((x - grid) / widths)),
        axis=0
    )


def fit_multi_sigmoid(grid, x, y):
    grid = np.asarray(grid, dtype=np.float64)
    widths = np.ones(len(grid)) * np.diff(grid)[0] / 2
    scales = np.ones(len(grid), dtype=np.float64) * np.mean(y)

    def func(x, *params):
        scales = params[:]
        return multi_sigmoid(x, grid, scales, widths)

    bounds = [(0, np.max(y))] * len(grid)
    bounds = list(zip(*bounds))
    params, covariance = scipy.optimize.curve_fit(
        func,
        x,
        y,
        p0=np.concatenate([scales]),
        bounds=bounds,
    )
    scales = params[:]
    return grid, scales, widths
In [ ]:
kde2 = scipy.stats.gaussian_kde(np.log(agasc_indirect['d2d']))

x = np.logspace(-4, 1, 251)
# there are two factors of x in the denominator
# - one from the jacobian of the transformation (d log(r)/dr = 1/r)
# - one from the volume element, because the density from the KDE is `p(r) r` but we want p(r)
y = kde2(np.log(x)) / x**2

log_y = np.log(y)
offset = np.min(log_y)
grid, scales, widths = fit_multi_sigmoid(np.log(np.logspace(-3, 1, 20)), np.log(x), log_y - offset)

y2 = np.exp(multi_sigmoid(np.log(x), grid, scales, widths) + offset)

kde = scipy.interpolate.interp1d(
    x,
    y,
    fill_value=(y[0], 0),
    bounds_error=False
)
prob = scipy.interpolate.interp1d(
    x,
    y2,
    fill_value=(y[0], 0),
    bounds_error=False
)

This is the PDF $p(r)$. Remember that the histogram above is actually $p(r) r dr$

In [ ]:
plt.plot(kde.x, kde.y)
# plt.plot(x, y2)
plt.plot(prob.x, prob.y)
plt.xlabel("r (arcsec)")
plt.ylabel("p(r)")
plt.xscale('log')
# plt.yscale("log")
No description has been provided for this image

and this is how the function compares to the d2d histogram:

In [ ]:
bins = np.logspace(-5, 1, 501)
vals, _ = np.histogram(
    agasc_indirect['d2d'],
    bins=bins,
)

vals = vals / np.diff(bins) / len(agasc_indirect)

plt.step(
    bins,
    np.concatenate([[0], vals]),
    label='indirect'
)

plt.plot(bins, bins * prob(bins), label='fit')
plt.yscale('log')
plt.xscale('log')
plt.ylim((1e-7, 1e1))
plt.xlabel("r (arcsec)")
plt.ylabel("r p(r)")
Out[ ]:
Text(0, 0.5, 'r p(r)')
No description has been provided for this image

All the sigmoids that make the fit

In [ ]:
plt.plot(x, log_y - offset)

yy = multi_sigmoid(np.log(x), grid, scales, widths)
plt.plot(x, yy)

for i in range(len(grid)):
    yy = multi_sigmoid(np.log(x), grid[i:i+1], scales[i:i+1], widths[i:i+1])
    plt.plot(x, yy, label=f"{grid[i]:.2f}")

plt.xscale('log')
# plt.yscale("log")
plt.ylim((1e-6, 1e2))
plt.xlabel("r (arcsec)")
plt.ylabel("multi-sigmoid(r)")
Out[ ]:
Text(0, 0.5, 'multi-sigmoid(r)')
No description has been provided for this image

an example of a sigmoid:

In [ ]:
def sigmoid(x, width, grid):
    return 1 / (1 + np.exp((x - grid) / width))

i = 0
width = widths[i]
grid = grid[i]
scale = scales[i]
plt.plot(np.log(x), scale * sigmoid(np.log(x), width, grid))

plt.axvline(grid, linestyle="--", linewidth=0.5, color='b')
plt.axhline(0.5 * scale, linestyle="--", linewidth=0.5, color='b')

plt.axvline(grid + width, linestyle=":", linewidth=1, color='orange')
plt.axhline(0.2689414213699951 * scale, linestyle=":", linewidth=1, color='orange')
plt.axvline(grid - width, linestyle=":", linewidth=1, color='orange')
plt.axhline((1 - 0.2689414213699951) * scale, linestyle=":", linewidth=1, color='orange')
plt.axhline(scale, linewidth=0.5, color='k')
plt.axhline(0, linewidth=0.5, color='k')
plt.title("Sigmoid")
Out[ ]:
Text(0.5, 1.0, 'Sigmoid')
No description has been provided for this image

The function developed in this section, as implemented in the cross_match module, applied to both the direct and indirect matches.

In [ ]:
bins = np.linspace(0, 5, 801)
dx = 2 * np.pi * (bins[1:]**2 - bins[:-1]**2)
x = bins[:-1] + dx/2

vals, _, _ = plt.hist(
    agasc_direct['d2d'],
    bins=bins,
    density=True,
    histtype='step',
    label='direct',
    color="tab:blue",
    linewidth=1,
)

plt.hist(
    agasc_indirect['d2d'],
    bins=bins,
    density=True,
    histtype='step',
    label='indirect',
    color="tab:orange",
    linewidth=1,
)
x2 = np.linspace(0, 4, 1001)
x2 = x2[:-1] + np.diff(x2) / 2

v = x2 * indirect_d2d(x2)
v = v / np.sum(v * dx2)
plt.plot(
    x2, v,
    color="tab:orange",
    linewidth=1.5,
)

v = x2 * direct_d2d(x2)
v = v / np.sum(v * dx2)
plt.plot(
    x2, v,
    color="tab:blue",
    linewidth=1.5,
)

plt.xlabel("r (arcsec)")
plt.ylabel("r p(r)")
plt.legend()
plt.xlim((0, 1))
# plt.yscale('log')
# plt.ylim((0.01, 5))
Out[ ]:
(0.0, 1.0)
No description has been provided for this image