In the development of estats we made an effort to keep the module user-friendly while also keeping it as general as possible. To do so, we decided to implement map-based statistics as plugins. Users can define their own map-based statistics by writing such plugins, and the statistics then become automatically available in the code. This avoids having to modify the source code every time a new statistic is introduced.
To add a new statistic to estats one needs to create a new file in the estats.stats folder. The file must be named <statistic>.py, where <statistic> is the user-defined name of the new map-based statistic (for example CLs).
Note that there are some naming conventions for the statistics. If the name of a statistic starts with 'Cross', it is assumed to be a cross-tomographic statistic: a shear statistic is then passed 2x2 shear maps, a convergence statistic 2x1 convergence maps, and a multiscale statistic a convolved convergence map. If, on the other hand, the name starts with 'Full', the statistic is passed a dictionary containing the shear or convergence maps of all tomographic bins (this is meant, for example, for machine-learning applications).
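As a purely illustrative sketch, the top-level functions of such statistics might be declared along the following lines. The function names, the weight arguments and the exact way the maps are grouped into arguments are assumptions made for this illustration only; the actual argument layout is determined by estats.map.

import numpy as np

def CrossPeaks(shear_maps, weight_maps, ctx):
    """
    Hypothetical cross-tomographic shear statistic (name starts with
    'Cross'). Following the convention above it receives 2x2 shear maps,
    i.e. both shear components for each of the two tomographic bins; how
    they are grouped into arguments here is an assumption.
    """
    return np.zeros(10)

def FullExample(map_dict, weight_dict, ctx):
    """
    Hypothetical 'Full' statistic. It receives a dictionary containing the
    maps of all tomographic bins at once; the weight_dict argument is an
    assumption.
    """
    return np.zeros(10)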
The plugin may contain arbitrary functions, but estats searches only for a few specific ones. We describe these functions in the following.
The context function must always be defined in each statistic plugin. It declares the type of the statistic (shear, convergence or multi) as well as the accepted configuration parameters and their default values.
There must also be a function with the same name as the statistic plugin itself. This function calculates the summary statistic from the passed maps.
The process function is applied in the estats.summary module to perform a first preprocessing of the raw data vectors.
The decide_binning_scheme function is used by the estats.summary module to decide, based on all the data, how to downbin the data vectors from their original length into wider bins in the postprocessing.
The binning scheme can then be applied using the slice function; how exactly is again up to the user.
Lastly, the filter function can be used to apply custom cuts to the postprocessed data vectors. This is done in the estats.likelihood module.
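Schematically, a plugin therefore has the following structure (a bare skeleton only; the signatures are the ones used in the full example below, where stat_example stands for the name of the plugin file):

def context():
    # declares stat_type as well as required, defaults and types
    ...

def stat_example(map, weights, ctx):
    # must carry the same name as the plugin file; computes the statistic
    ...

def process(data, ctx, scale_to_unity=False):
    # first processing of the raw data vectors (used by estats.summary)
    ...

def decide_binning_scheme(data, meta, bin, ctx):
    # defines the downbinning of the data vectors (used by estats.summary)
    ...

def slice(ctx):
    # defines how the binning scheme is applied (used by estats.summary)
    ...

def filter(ctx):
    # custom cuts on the final data vectors (used by estats.likelihood)
    ...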
We give an example below which illustrates how these functions could look and what they are used for.
# Copyright (C) 2019 ETH Zurich
# Institute for Particle Physics and Astrophysics
# Author: Dominik Zuercher
import numpy as np
import healpy as hp

def context():
    """
    The context function always needs to be defined.
    estats uses a dictionary to pass parameters between modules and
    functions. In order to make a parameter available on initialization of
    an instance of map, summary or likelihood it needs to be defined here.
    The parameter par1 can be passed to the map module using for example:

    from estats import map
    map_instance = map(par1=10)

    Each parameter has an entry in required, defaults and types (the order
    is relevant). In this example three parameters are made available:
    par1, var and num_final_bins.
    par1 and num_final_bins need to be integers and are set to 5 and 40 by
    default.
    var needs to be a list and by default is set to [0, 1, ..., 99].
    Additionally, the stat_type must be defined. It defines what kind of
    maps are passed to the stat_example function below. There are three
    different types:

    - shear: stat_example receives two shear maps (first and second
      component).
    - convergence: stat_example receives one convergence map.
    - multi: stat_example receives one smoothed convergence map at a time.
      Additionally, estats.map.calc_summary_stats will extract the
      statistic for a range of scales defined through the scales
      parameter of the map module (defines the FWHM of the Gaussian
      smoothing kernels that are applied).
    """
    stat_type = 'multi'

    required = ['par1', 'var', 'num_final_bins']
    defaults = [5, [ii for ii in range(100)], 40]
    types = ['int', 'list', 'int']
    return required, defaults, types, stat_type

def stat_example(map, weights, ctx):
    """
    This function is required by estats.map.calc_summary_stats to calculate
    the summary statistic.
    Depending on the stat_type the function receives different kinds of
    maps. In addition, the function also receives a weight map (indicating
    the number of galaxies in each pixel) and a context instance
    (a dictionary containing the parameters).
    The user is free to do whatever with the maps; in the end the function
    just needs to return a numpy array.
    Since in this example the stat_type is multi, this function receives a
    smoothed version of a convergence map.
    Note that if copy_obj=False is passed to map.calc_summary_stats,
    modifying the maps and weights will modify the objects in the
    estats.map instance!

    :param map: A smoothed Healpix convergence map
    :param weights: A Healpix map with pixel weights
    :param ctx: Context instance
    :return: numpy array
    """
    # calculate a (weighted) average and standard deviation of the map,
    # for example
    w_map = get_weighted_map(map, weights)
    w_mean = np.mean(w_map)
    w_std = np.std(w_map)
    mean = np.mean(map)
    std = np.std(map)

    # use the parameters
    mean *= ctx['var']  # creates 100 entries
    mean += ctx['par1']
    w_mean *= ctx['var']
    w_mean += ctx['par1']
    w_std *= ctx['var']
    std *= ctx['var']

    # concatenate the means in the first row and the stds in the second;
    # the output has the shape (2, 200)
    return np.asarray([np.append(mean, w_mean), np.append(std, w_std)])

def get_weighted_map(map, weights):
    """
    One can also define additional functions that are not required by
    estats and use them.
    """
    w_map = map * weights / np.sum(weights)
    return w_map

def process(data, ctx, scale_to_unity=False):
    """
    This function is required if the summary module should be used.
    It is applied to a set of realizations of data vectors that are read in
    by summary (in readin_stat_data and readin_stat_files) to do a first
    processing of the data.
    In this case the first argument is of the kind:

    data = [[means realization 1],
            [stds realization 1],
            [means realization 2],
            [stds realization 2],
            ...]

    The second argument is again a context instance and the third argument
    can be used, for example, to trigger a rescaling of the data.
    In this example we use the process function to combine the means and
    stds of each realization into one row, resulting in data vectors of
    length 400.
    """
    new_data = np.zeros(
        (int(data.shape[0] // 2), data.shape[1] * 2))
    for jj in range(int(data.shape[0] // 2)):
        new_data[jj, :] = data[jj * 2:(jj + 1) * 2, :].ravel()
    return new_data

def decide_binning_scheme(data, meta, bin, ctx):
    """
    This function is required if the summary module should be used.
    It defines the binning scheme that is used to downbin the data vector
    into larger data bins. Here we simply fix the binning scheme, but one
    could also use all the realizations that are stored for the statistic
    in the tomographic bin (bin) to decide on it. In addition, the meta
    data table is passed to the function (meta) in case one wishes to use
    it as well, and the context is available via ctx.
    The function is required to return two arrays, bin_edge_indices and
    bin_centers. bin_edge_indices is used in the downbin_data function,
    while bin_centers is not used in the summary module.
    bin_edge_indices indicates the edges of the new, larger bins, given as
    bin numbers of the original data vector, for each scale.
    bin_centers should contain the central values of the new, larger data
    bins. It is meant for users who wish to plot their data more easily
    afterwards, for example. Here we decided to use the average of the var
    values in each new bin for that.
    """
    num_of_scales = len(ctx['scales'])

    # fixed scheme: downbin the 400 entries into 40 wide bins
    # (10 original bins per new bin), identically for each scale
    bin_edge_indices_ = np.arange(0, 410, 10)
    bin_edge_indices = np.zeros((0, len(bin_edge_indices_)))
    for scale in range(num_of_scales):
        bin_edge_indices = np.vstack((bin_edge_indices, bin_edge_indices_))

    # Decide on central bin values
    bin_centers = np.zeros((0, len(bin_edge_indices_) - 1))
    bin_centers_init = np.asarray(ctx['var'] * 4)
    for scale in range(num_of_scales):
        bin_centers_ = np.zeros(0)
        for jj in range(len(bin_edge_indices_) - 1):
            bin_centers_ = np.append(
                bin_centers_,
                np.nanmean(bin_centers_init[bin_edge_indices_[jj]:
                                            bin_edge_indices_[jj + 1]]))
        bin_centers = np.vstack((bin_centers, bin_centers_))
    return bin_edge_indices, bin_centers

def slice(ctx):
    """
    This function is required if the summary module should be used.
    The slice function defines how exactly the binning scheme is used to
    downbin the original data vectors. It should return four values:

    1.) num_of_scales: For a multi statistic usually the number of
        smoothing scales; for convergence and shear statistics 1.
    2.) n_bins_sliced: The number of bins after the binning scheme is
        applied (usually len(bin_edge_indices) - 1).
    3.) operation: Whether to sum up or average the values in each new bin
        ('sum' or 'mean').
    4.) mult: How many times the binning scheme is applied per data vector
        (one could have defined the scheme in decide_binning_scheme for
        0 - 100 only and set this to 4, for example). Usually 1.

    NOTE: In a tomographic analysis (with or without cross-correlations)
    the downbinning is performed for each individual part of the data
    vector.
    """
    mult = 1
    num_of_scales = len(ctx['scales'])
    operation = 'mean'
    n_bins_sliced = ctx['num_final_bins']
    return num_of_scales, n_bins_sliced, operation, mult

def filter(ctx):
    """
    This function is required if the likelihood module should be used.
    It can be used to do a final trimming of the data vectors that enter
    the inference process, by providing a boolean array that masks certain
    bins of the data vectors. This is meant, for example, to explore
    different combinations of scales for multi statistics.
    In this example we remove all data bins that were recorded on scales
    not included in the selected_scales entry of the likelihood module.

    NOTE-1: In case of a tomographic analysis the filter is applied to the
    individual data vector of each bin.
    NOTE-2: In case of a tomographic analysis including cross-correlations
    the filter is applied to each individual data vector.
    """
    filter = np.zeros(0, dtype=bool)
    for scale in ctx['scales']:
        if scale in ctx['selected_scales']:
            f = np.asarray([True] * ctx['num_final_bins'])
        else:
            f = np.asarray([False] * ctx['num_final_bins'])
        filter = np.append(filter, f)
    return filter
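
To illustrate how the individual pieces fit together, the short sketch below calls the plugin functions directly on a toy Healpix map, outside of estats. The map, the weights and the values of the scales and selected_scales entries are made up for this demonstration; inside estats these calls are orchestrated by the map, summary and likelihood modules.

import numpy as np
import healpy as hp

# assuming the plugin functions defined above are available in the
# namespace (e.g. the plugin module has been imported)

# toy convergence and weight maps (purely illustrative values)
nside = 64
npix = hp.nside2npix(nside)
rng = np.random.default_rng(42)
kappa = rng.normal(0.0, 0.02, npix)
weights = rng.integers(1, 10, npix).astype(float)

# parameters as they would come from the context() defaults; the scales
# and selected_scales values are assumptions for this demonstration
ctx = {'par1': 5, 'var': [ii for ii in range(100)], 'num_final_bins': 40,
       'scales': [31.6, 21.1], 'selected_scales': [21.1]}

# one realization of the statistic, shape (2, 200)
datavec = stat_example(kappa, weights, ctx)

# stack two (here identical) realizations and combine means and stds into
# rows of length 400
data = np.vstack([datavec, datavec])
processed = process(data, ctx)

# binning scheme, slicing information and scale cuts
edges, centers = decide_binning_scheme(processed, None, 0, ctx)
num_scales, n_bins, operation, mult = slice(ctx)
mask = filter(ctx)

print(processed.shape, edges.shape, centers.shape, mask.shape)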