AbacusHOD#

The AbacusHOD module loads halo catalogs from the AbacusSummit simulations and outputs multi-tracer mock galaxy catalogs. The code is highly efficient and contains a large set of HOD extensions such as secondary biases (assembly biases), velocity biases, and satellite profile flexibilities. The baseline HODs are based on those from Zheng et al. 2007 and Alam et al. 2020. The HOD extensions are first explained in Yuan et al. 2018, and more recently summarized in Yuan et al. 2020b . This HOD code also supports RSD and incompleteness. The code is fast, completeling a \((2Gpc/h)^3\) volume in 80ms per tracer on a 32 core desktop system, and the performance should be scalable. The module also provides efficient correlation function and power spectrum calculators. This module is particularly suited for efficiently sampling HOD parameter space. We provide examples of docking it onto emcee and dynesty samplers.

The module defines one class, AbacusHOD, whose constructor takes the path to the simulation volume, and a set of HOD parameters, and runs the staging function to compile the simulation halo catalog as a set of arrays that are saved on memory. The run_hod function can then be called to generate galaxy catalogs.

The output takes the format of a dictionary of dictionaries, where each subdictionary corresponds to a different tracer. Currently, we have enabled tracer types: LRG, ELG, and QSO. Each subdictionary contains all the mock galaxies of that tracer type, recording their properties with keys x, y , z, vx, vy, vz, mass, id, Ncent. The coordinates are in Mpc/h, and the velocities are in km/s. The mass refers to host halo mass and is in units of Msun/h. The id refers to halo id, and the Ncent key refers to number of central galaxies for that tracer. The first Ncent galaxies in the catalog are always centrals and the rest are satellites.

The galaxies can be written to disk by setting the write_to_disk flag to True in the argument of run_hod. However, the I/O is slow and the write_to_disk flag defaults to False.

The core of the AbacusHOD code is a two-pass memory-in-place algorithm. The first pass of the halo+particle subsample computes the number of galaxies generated in total. Then an empty array for these galaxies is allocated in memory, which is then filled on the second pass of the halos+particles. Each pass is accelerated with numba parallel. The default threading is set to 16.

Theory#

The baseline HOD for LRGs comes from Zheng et al. 2007:

\[\bar{n}_{\mathrm{cent}}(M) = \frac{1}{2}\mathrm{erfc} \left[\frac{\ln(M_{\mathrm{cut}}/M)}{\sqrt{2}\sigma}\right],\]

\[\bar{n}_{\textrm{sat}}(M) = \left[\frac{M-\kappa M_{\textrm{cut}}}{M_1}\right]^{\alpha}\bar{n}_{\mathrm{cent}}(M),\]

The baseline HOD for ELGs and QSOs comes from Alam et al. 2020. The actual calculation is complex and we refer the readers to section 3 of said paper for details.

In the baseline implementation, the central galaxy is assigned to the center of mass of the halo, with the velocity vector also set to that of the center of mass of the halo. Satellite galaxies are assigned to particles of the halo with equal weights. When multiple tracers are enabled, each halo/particle can only host a single tracer type. However, we have not yet implemented any prescription of conformity.

The secondary bias (assembly bias) extensions follow the recipes described in Xu et al. 2020 , where the secondary halo property (concentration or local overdensity) is directly tied to the mass parameters in the baseline HOD (\(M_{\mathrm{cut}}\) and \(M_1\)):

\[\log_{10} M_{\mathrm{cut}}^{\mathrm{mod}} = \log_{10} M_{\mathrm{cut}} + A_c(c^{\mathrm{rank}} - 0.5) + B_c(\delta^{\mathrm{rank}} - 0.5)\]

\[\log_{10} M_{1}^{\mathrm{mod}} = \log_{10} M_{1} + A_s(c^{\mathrm{rank}} - 0.5) + B_s(\delta^{\mathrm{rank}} - 0.5)\]

where \(c\) and \(\delta\) represent the halo concentration and local overdensity, respectively. These secondary properties are ranked within narrow halo mass bins, and the rank are normalized to range from 0 to 1, as noted by the \(\mathrm{rank}\) superscript. \((A_c, B_c, A_s, B_s)\) form the four parameters describing secondary biases in the HOD model. The default for these parameters are 0.

The velocity bias extension follows the common prescription as described in Guo et al. 2015 .

\[\sigma_c = \alpha_c \sigma_h\]

\[v_s - v_h = \alpha_s (v_p - v_h)\]

where the central velocity bias parameter \(\alpha_c\) sets the ratio of central velocity dispersion vs. halo velocity dispersion. The satellite velocity bias parameter \(\alpha_c\) sets the ratio between the satellite peculiar velocity to the particle peculiar velocity. The default for these two parameters are 0 and 1, respectively.

We additionaly introduce a set of satellite profile parameters \((s, s_v, s_p, s_r)\) that allow for flexibilities in how satellite galaxies are distributed within a halo. They respecctively allow the galaxy weight per particle to depend on radial position (\(s\)), peculair velocity (\(s_v\)), perihelion distance of the particle orbit (\(s_p\)), and the radial velocity (\(s_v\)). The default values for these parameters are 1. A detailed description of these parameters are available in Yuan et al. 2018, and more recently in Yuan et al. 2020b .

Some brief examples and technical details about the module layout are presented below, followed by the full module API.

Short Example#

The first step is to create the configuration file such as config/abacus_hod.yaml, which provides the full customizability of the HOD code. By default, it lives in your current work directory under a subdirectory ./config. A template with default settings are provided under abacusutils/scripts/hod/config.

With the first use, you should define which simulation box, which redshift, the path to simulation data, the path to output datasets, the various HOD flags and an initial set of HOD parameters. Other decisions that need to be made initially (you can always re-do this but it would take some time) include: do you only want LRGs or do you want other tracers as well? Do you want to enable satellite profile flexibilities (the \(s, s_v, s_p, s_r\) parameters)? If so, you need to turn on want_ranks flag in the config file. If you want to enable secondary bias, you need to set want_AB flag to true in the config file. The local environment is defined by total mass within 5 Mpc/h but beyond r98.

Tip

Running this code is a two-part process. First, you need to run the prepare_sim code, which generates the necessary data files for that simulation. Then you can run the actual HOD code. The first step only needs to be done once for a simulation box, but it can be slow, depending on the downsampling and the features you choose to enable.

So first, you need to run the prepare_sim script, this extracts the simulation outputs and organizes them into formats that are suited for the HOD code. This code can take approximately an hour depending on your configuration settings and system capabilities. We recommend setting the Nthread_load parameter to min(sys_core_count, memoryGB_divided_by_30). You can run load_sims on command line with

python -m abacusnbody.hod.prepare_sim --path2config PATH2CONFIG

Within Python, you can run the same script with from abacusnbody.hod import prepare_sim and then prepare_sim.main("/path/to/config.yaml").

If your config file lives in the default location, i.e. ./config, then you can ignore the -path2config flag. Once that is finished, you can construct the AbacusHOD object and run fast HOD chains. A code template is given in abacusutils/scripts/run_hod.py for running a few example HODs and abacusutils/scripts/run_emcee.py for integrating with the emcee sampler.

To use the given run_hod.py script to run a custom configuration file, you can simply run the given script in bash

python run_hod.py --path2config PATH2CONFIG

You can also consruct the AbacusHOD object yourself within Python and run HODs from there. Here we show the scripts within run_hod.py for reference.:

import os
import glob
import time
import yaml
import numpy as np
import argparse

from abacusnbody.hod.abacus_hod import AbacusHOD

path2config = 'config/abacus_hod.yaml' # path to config file

# load the config file and parse in relevant parameters
config = yaml.safe_load(open(path2config))
sim_params = config['sim_params']
HOD_params = config['HOD_params']
clustering_params = config['clustering_params']

# additional parameter choices
want_rsd = HOD_params['want_rsd']
write_to_disk = HOD_params['write_to_disk']

# create a new AbacusHOD object
newBall = AbacusHOD(sim_params, HOD_params, clustering_params)

# first hod run, slow due to compiling jit, write to disk
mock_dict = newBall.run_hod(newBall.tracers, want_rsd, write_to_disk = write_to_disk, Nthread = 16)

# run the 10 different HODs for timing
for i in range(10):
    newBall.tracers['LRG']['alpha'] += 0.01
    print("alpha = ",newBall.tracers['LRG']['alpha'])
    start = time.time()
    mock_dict = newBall.run_hod(newBall.tracers, want_rsd, write_to_disk = False, Nthread = 64)
    print("Done iteration ", i, "took time ", time.time() - start)

The class also provides fast 2PCF calculators. For example to compute the redshift-space 2PCF (\(\xi(r_p, \pi)\)):

# load the rp pi binning from the config file
bin_params = clustering_params['bin_params']
rpbins = np.logspace(bin_params['logmin'], bin_params['logmax'], bin_params['nbins'])
pimax = clustering_params['pimax']
pi_bin_size = clustering_params['pi_bin_size']    # the pi binning is configrured by pi_max and bin size

mock_dict = newBall.run_hod(newBall.tracers, want_rsd, write_to_disk=write_to_disk)
xirppi = newBall.compute_xirppi(mock_dict, rpbins, pimax, pi_bin_size)

Light Cones#

AbacusHOD supports generating HOD mock catalogs from halo light cone catalogs (PR #28). Details on the usage will be provided here soon.

Notes#

Currently, when RSD effects are enabled in the HOD code for the halo light cones, the factor velz2kms, determining the size of the RSD correction to the position along the line of sight, is the same for all galaxies at a given redshift catalog.

Parallelism#

Some of the HOD routines accept an Nthread argument to enable parallelism. This parallelism is backed by Numba, for the most part.

This has implications if the HOD module is used as part of a larger code. Numba supports multiple threading backends, some of which will not work if the larger code uses other flavors of parallelism. In such cases, you may need to instruct Numba to use a fork-safe and/or thread-safe backend (called forksafe and threadsafe), depending on whether the larger code is forking subprocesses or spawning threads. This can be accomplished by setting the environment variable NUMBA_THREADING_LAYER=forksafe or assigning numba.config.THREADING_LAYER='forksafe' in Python before importing the HOD code (likewise for threadsafe). There is also a safe backend that is both fork- and thread-safe, but it is only available if Intel TBB is available. See the Numba threading layer docs for more info.

API#

class abacusnbody.hod.abacus_hod.AbacusHOD(sim_params, HOD_params, clustering_params=None, chunk=-1, n_chunks=1, skip_staging=False)[source]#

Bases: object

A highly efficient multi-tracer HOD code for the AbacusSummmit simulations.

__init__(sim_params, HOD_params, clustering_params=None, chunk=-1, n_chunks=1, skip_staging=False)[source]#

Loads simulation. The sim_params dictionary specifies which simulation volume to load. The HOD_params specifies the HOD parameters and tracer configurations. The clustering_params specifies the summary statistics configurations. The HOD_params and clustering_params can be set to their default values in the config/abacus_hod.yaml file and changed later. The sim_params cannot be changed once the AbacusHOD object is created.

Parameters:

sim_params (dict) –
Dictionary of simulation parameters. Load from config/abacus_hod.yaml. The dictionary should contain the following keys:
- sim_name: str, name of the simulation volume, e.g. ‘AbacusSummit_base_c000_ph006’.
- sim_dir: str, the directory that the simulation lives in, e.g. ‘/path/to/AbacusSummit/’.
- output_dir: str, the diretory to save galaxy to, e.g. ‘/my/output/galalxies’.
- subsample_dir: str, where to save halo+particle subsample, e.g. ‘/my/output/subsamples/’.
- z_mock: float, which redshift slice, e.g. 0.5.
HOD_params (dict) –
HOD parameters and tracer configurations. Load from config/abacus_hod.yaml. It contains the following keys:
- tracer_flags: dict, which tracers is enabled:
  
  LRG: bool, default True.
  
  ELG: bool, default False.
  
  QSO: bool, default False.
- want_ranks: bool, enable satellite profile flexibilities. If False, satellite profile follows the DM, default True.
- want_rsd: bool, enable RSD? default True. # want RSD?
- Ndim: int, grid density for computing local environment, default 1024.
- density_sigma: float, scale radius in Mpc / h for local density definition, default 3.
- write_to_disk: bool, output to disk? default False. Setting to True decreases performance.
- LRG_params: dict, HOD parameter values for LRGs. Default values are given in config file.
- ELG_params: dict, HOD parameter values for ELGs. Default values are given in config file.
- QSO_params: dict, HOD parameter values for QSOs. Default values are given in config file.
clustering_params (dict, optional) –
Summary statistics configuration parameters. Load from config/abacus_hod.yaml. It contains the following keys:
- clustering_type: str, which summary statistic to compute. Options: wp, xirppi, default: xirppi.
- bin_params: dict, transverse scale binning.
  
  logmin: float, \(\log_{10}r_{\mathrm{min}}\) in Mpc/h.
  
  logmax: float, \(\log_{10}r_{\mathrm{max}}\) in Mpc/h.
  
  nbins: int, number of bins.
- pimax: int, \(\pi_{\mathrm{max}}\).
- pi_bin_size: int, size of bins along of the line of sight. Need to be divisor of pimax.
chunk (int, optional) – Index of current chunk. Must be between 0 and n_chunks-1. Files associated with this chunk are written out as {tracer}s_{chunk}.dat. Default is -1 (no chunking).
n_chunks (int, optional) – Number of chunks to split the input from the halo+particle subsample and number of output files in which to write out the galaxy catalogs following the format {tracer}s_{chunk}.dat.

staging()[source]#: Constructor call this function to load the halo+particle subsamples onto memory.

run_hod(tracers=None, want_rsd=True, want_nfw=False, NFW_draw=None, reseed=None, write_to_disk=False, Nthread=16, verbose=False, fn_ext=None)[source]#

Runs a custom HOD.

Parameters:

tracers (dict) – dictionary of multi-tracer HOD. tracers['LRG'] is the dictionary of LRG HOD parameters, overwrites the LRG_params argument in the constructor. Same for keys 'ELG' and 'QSO'.
want_rsd (bool) – enable RSD? default True.
want_nfw (bool) – Distribute satellites on NFW instead of particles? default False. Needs to feed in a long array of random numbers drawn from an NFW profile. !!! NFW profile is unoptimized. It has different velocity bias. It does not support lightcone. !!!
NFW_draw (np.array) – A long array of random numbers drawn from an NFW profile. P(x) = 1./(x*(1+x)**2)*x**2. default None. Only needed if want_nfw == True.
reseed (int) – re-generate random numbers? supply random number seed. This overwrites the pre-generated random numbers, at a performance cost. Default None.
write_to_disk (bool) – output to disk? default False. Setting to True decreases performance.
Nthread (int) – number of threads in the HOD run. Default 16.
verbose (bool,) – detailed stdout? default False.
fn_ext (str) – filename extension for saved files. Only relevant when write_to_disk = True.

Returns:

mock_dict – dictionary of galaxy outputs. Contains keys 'LRG', 'ELG', and 'QSO'. Each tracer key corresponds to a sub-dictionary that contains the galaxy properties with keys 'x', 'y', 'z', 'vx', 'vy', 'vz', 'mass', 'id', Ncent'. The coordinates are in Mpc/h, and the velocities are in km/s. The 'mass' refers to host halo mass and is in units of Msun/h. The 'id' refers to halo id, and the 'Ncent' key refers to number of central galaxies for that tracer. The first 'Ncent' galaxies in the catalog are always centrals and the rest are satellites.

Return type:

dict

compute_ngal(tracers=None, Nthread=16)[source]#

Computes the number of each tracer generated by the HOD

Parameters:

tracers (dict) – dictionary of multi-tracer HOD. tracers['LRG'] is the dictionary of LRG HOD parameters, overwrites the LRG_params argument in the constructor. Same for keys 'ELG' and 'QSO'.
Nthread (int) – Number of threads in the HOD run. Default 16.

Returns:

ngal_dict (dict)
dictionary of number of each tracer.
fsat_dict (dict)
dictionary of satellite fraction of each tracer.

compute_clustering(mock_dict, *args, **kwargs)[source]#

Computes summary statistics, currently enabling wp and xirppi.

Parameters:

mock_dict (dict) – dictionary of tracer positions. Output of run_hod.
Ntread (int) – number of threads in the HOD run. Default 16.
rpbins (np.array) – array of transverse bins in Mpc/h.
pimax (int) – maximum bin edge along the line of sight direction, in Mpc/h.
pi_bin_size (int) – size of bin along the line of sight. Currently, we only support linear binning along the line of sight.

Returns:

clustering – dictionary of summary statistics. Auto-correlations/spectra can be accessed with keys such as 'LRG_LRG'. Cross-correlations/spectra can be accessed with keys such as 'LRG_ELG'.

Return type:

dict

compute_xirppi(mock_dict, rpbins, pimax, pi_bin_size, Nthread=8)[source]#

Computes \(\xi(r_p, \pi)\).

Parameters:

mock_dict (dict) – dictionary of tracer positions. Output of run_hod.
Ntread (int) – number of threads in the HOD run. Default 16.
rpbins (np.array) – array of transverse bins in Mpc/h.
pimax (int) – maximum bin edge along the line of sight direction, in Mpc/h.
pi_bin_size (int) – size of bin along the line of sight. Currently, we only support linear binning along the line of sight.

Returns:

clustering – dictionary of summary statistics. Auto-correlations/spectra can be accessed with keys such as 'LRG_LRG'. Cross-correlations/spectra can be accessed with keys such as 'LRG_ELG'.

Return type:

dict

compute_multipole(mock_dict, rpbins, pimax, sbins, nbins_mu, orders=[0, 2], Nthread=8)[source]#

compute_power(mock_dict, nbins_k, nbins_mu, k_hMpc_max, logk, poles=[], paste='TSC', num_cells=550, compensated=False, interlaced=False)[source]#

Computes \(P(k, \mu)\) and/or \(P_\ell(k)\).

TODO: parallelize, document, include deconvolution and aliasing, cross-correlations

Parameters:

mock_dict (dict) – dictionary of tracer positions. Output of run_hod.
nbins_k (int) – number of k bin centers (same convention as other correlation functions).
nbins_mu (int) – number of mu bin centers (same convention as other correlation functions).
k_hMpc_max (float) – maximum wavemode k in units of [Mpc/h]^-1. Note that the minimum k mode is currently set as the fundamental mode of the box
logk (bool) – flag determining whether the k bins are defined in log space or in normal space.
poles (list) – list of integers determining which multipoles to compute. Default is [], i.e. none.
paste (str) – scheme for painting particles on a mesh. Can be one of TSC or CIC.
num_cells (int) – number of cells per dimension adopted for the particle gridding.
compensated (bool) – flag determining whether to apply the TSC/CIC grid deconvolution.
interlaced (bool) – flag determining whether to apply interlacing (i.e., aliasing).

Returns:

clustering – dictionary of summary statistics. Auto-correlations/spectra can be accessed with keys such as 'LRG_LRG' and 'LRG_LRG_ell' for the multipoles with number of modes per bin, 'LRG_LRG[_ell]_modes'. Cross-correlations/spectra can be accessed with keys such as 'LRG_ELG' and 'LRG_ELG_ell' for the multipoles with number of modes per bin, 'LRG_LRG[_ell]_modes'. Keys k_binc and mu_binc contain the bin centers of k and mu, respectively. The power spectrum P(k, mu) has a shape (nbins_k, nbins_mu), whereas the multipole power spectrum has shape (len(poles), nbins_k). Cubic box only.

Return type:

dict

apply_zcv(mock_dict, config, load_presaved=False)[source]#: Apply control variates reduction of the variance to a power spectrum observable.

apply_zcv_xi(mock_dict, config, load_presaved=False)[source]#: Apply control variates reduction of the variance to a power spectrum observable.

compute_wp(mock_dict, rpbins, pimax, pi_bin_size, Nthread=8)[source]#

Computes \(w_p\).

Parameters:

mock_dict (dict) – dictionary of tracer positions. Output of run_hod.
Ntread (int) – number of threads in the HOD run. Default 16.
rpbins (np.array) – array of transverse bins in Mpc/h.
pimax (int) – maximum bin edge along the line of sight direction, in Mpc/h.
pi_bin_size (int) – size of bin along the line of sight. Currently, we only support linear binning along the line of sight.

Returns:

clustering – dictionary of summary statistics. Auto-correlations/spectra can be accessed with keys such as 'LRG_LRG'. Cross-correlations/spectra can be accessed with keys such as 'LRG_ELG'.

Return type:

dict

gal_reader(output_dir=None, simname=None, sim_dir=None, z_mock=None, want_rsd=None, tracers=None)[source]#

Loads galaxy data given directory and return a mock_dict dictionary.

Parameters:

sim_name (str) – name of the simulation volume, e.g. ‘AbacusSummit_base_c000_ph006’.
sim_dir (str) – the directory that the simulation lives in, e.g. ‘/path/to/AbacusSummit/’.
output_dir (str) – the diretory to save galaxy to, e.g. ‘/my/output/galalxies’.
z_mock (floa) – which redshift slice, e.g. 0.5.
want_rsd (bool) – RSD?
tracers (dict) – dictionary of tracer types to load, e.g. {‘LRG’, ‘ELG’}.

Returns:

``mock_dict`` – dictionary of tracer positions. Output of run_hod.

Return type:

dict

AbacusHOD

Contents

AbacusHOD#

Theory#

Short Example#

Light Cones#

Notes#

Parallelism#

API#