The compaso_halo_catalog module loads halo catalogs from CompaSO, Abacus's
on-the-fly halo finder. The module defines one class, CompaSOHaloCatalog,
whose constructor takes the path to a halo catalog as an argument.
Users should use this class as the primary interface to load
and manipulate halo catalogs.
The halo catalogs and particle subsamples are stored on disk in ASDF files and are loaded into memory as Astropy tables. Each column of an Astropy table is essentially a Numpy array and can be accessed with familiar Numpy-like syntax. More on Astropy tables here: http://docs.astropy.org/en/stable/table/
Beyond just loading the halo catalog files into memory, this module performs a few other manipulations. Many of the halo catalog columns are stored in bit-packed formats (e.g. floats are scaled to a ratio from 0 to 1 then packed in 16-bit ints), so these columns are unpacked as they are loaded.
Furthermore, the halo catalogs for big simulations are divided
across a few dozen files. These files are transparently loaded
into one monolithic Astropy table if one passes a directory
to CompaSOHaloCatalog; to save memory by loading only one file,
pass just that file as the argument to CompaSOHaloCatalog.
Importantly, because ASDF and Astropy tables are both column-
oriented, it can be much faster to load only the subset of
halo catalog columns that one needs, rather than all 60-odd
columns. Use the fields argument to the CompaSOHaloCatalog
constructor to specify a subset of fields to load. Similarly, the
particles can be quite large, and one can use the subsamples
argument to restrict the particles to the subset one needs.
Some brief examples and technical details about the halo catalog layout are presented below, followed by the full module API. Examples of using this module to work with AbacusSummit data can be found on the AbacusSummit website here: https://abacussummit.readthedocs.io
>>> from abacusnbody.data.compaso_halo_catalog import CompaSOHaloCatalog
>>> # Load the halo particle positions for subsample A
>>> cat = CompaSOHaloCatalog('/storage/AbacusSummit/AbacusSummit_base_c000_ph000/halos/z0.100', subsamples=dict(A=True, pos=True))
>>> print(cat.halos[:5])  # cat.halos is an Astropy Table, print the first 5 rows
   id    npstartA npstartB ... sigmavrad_L2com sigmavtan_L2com rvcirc_max_L2com
-------- -------- -------- ... --------------- --------------- ----------------
25000000        0        2 ...       0.9473971      0.96568024      0.042019103
25000001       11       12 ...      0.86480814       0.8435805      0.046611086
48000000       18       15 ...      0.66734606      0.68342227      0.033434115
58000000       31       18 ...      0.52170926       0.5387341      0.042292822
58001000       38       23 ...       0.4689916      0.40759262      0.034498636
>>> print(cat.halos['N', 'x_com'][:5])  # print the mass and position of the first 5 halos
  N           x_com
--- ------------------------
278 -998.88525 .. -972.95404
 45  -998.9751 .. -972.88416
101  -999.7485 ..  -947.8377
 82   -998.904 ..  -937.6313
 43  -999.3252 ..  -937.5813
>>> # Examine the particle subsamples associated with the 5th halo
>>> h5 = cat.halos[4]
>>> print(cat.subsamples['pos'][h5['npstartA']:h5['npstartA'] + h5['npoutA']])
          pos
------------------------
 -999.3019 ..  -937.5229
-999.33435 ..  -937.5515
-999.38965 .. -937.58777
>>> # At a glance, the pos fields match those of the 5th halo above, so it appears we have indexed correctly!
The catalogs are stored in a directory structure that looks like:
- SimulationName/
  - halos/
    - z0.100/
      - halo_info/
        halo_info_000.asdf
        halo_info_001.asdf
        ...
      - halo_rv_A/
        halo_rv_A_000.asdf
        halo_rv_A_001.asdf
        ...
      - <field & halo, rv & PID, subsample A & B directories>
    - <other redshift directories, some with particle subsamples, others without>
The file numbering roughly corresponds to a planar chunk of the simulation (all y and z for some range of x). The matching of the halo_info file numbering to the particle file numbering is important; halo_info files index into the corresponding particle files.
The halo catalogs are stored on disk in ASDF files (https://asdf.readthedocs.io/). The ASDF files start with a human-readable header that describes the data present in the file and metadata about the simulation from which it came (redshift, cosmology, etc). The rest of the file is binary blobs, each representing a column (i.e. a halo property).
Internally, the ASDF binary portions are usually compressed. This should be transparent to users, although you may be prompted to install the blosc package if it is not present. Decompression should be fast, at least 500 MB/s per core.
We define two disjoint sets of “subsample” particles, called “subsample A” and “subsample B”. Subsample A is a few percent of all particles, with subsample B a few times larger than that. Particle membership in each group is a function of PID and is thus consistent across redshift.
At most redshifts, we only output halo catalogs and halo subsample particle PIDs. This aids with construction of merger trees. At a few redshifts, we provide subsample particle positions as well as PIDs, for both halo particles and non-halo particles, called “field” particles. Only halo particles (specifically, L1 particles) may be loaded through this module; field particles and L0 halo particles can be loaded by reading the particle files directly with the read_abacus module.
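For instance, field particle positions can be read directly from a field file, along the lines of the following sketch (assuming the read_asdf function of the read_abacus module; the file name follows the field_rv_A/ naming convention referenced below and is illustrative):

from abacusnbody.data.read_abacus import read_asdf

# returns a table containing the requested columns for one field file
field = read_asdf('AbacusSummit_base_c000_ph000/halos/z0.100/field_rv_A/field_rv_A_000.asdf', load=['pos'])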
Use the subsamples argument to the CompaSOHaloCatalog constructor to specify loading
subsample A and/or B, and which fields (pos, vel, pid) to load. Note that
if only one of pos & vel is requested, the IO amount is the same, because
the pos & vel are packed together in RVint format;
however, the memory usage and time to unpack will be lower.
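For example, to load only the subsample A particle IDs (no positions or velocities), using the same catalog path as in the example above:

cat = CompaSOHaloCatalog('/storage/AbacusSummit/AbacusSummit_base_c000_ph000/halos/z0.100', subsamples=dict(A=True, pid=True))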
Halo File Types¶
Each file type (for halos, particles, etc) is grouped into a subdirectory. These subdirectories are:
halo_info/: The primary halo catalog files. Contains stats like CoM positions and velocities and moments of the particles. Also indicates the index and count of subsampled particles in the halo_pid_A/B files.
halo_pid_A/ & halo_pid_B/: The 64-bit particle IDs of particle subsamples A and B. The PIDs contain information about the Lagrangian position of the particles, whether they are tagged, and their local density.
The following subdirectories are only present for the redshifts for which we output particle subsamples and not just halo catalogs:
halo_rv_A/ & halo_rv_B/: The positions and velocities of the halo subsample particles, in “RVint” format. The halo associations are recoverable with the indices in the halo_info files.
field_rv_A/ & field_rv_B/: Same as halo_rv_<A|B>/, but only for the field (non-halo) particles.
field_pid_A/ & field_pid_B/: Same as halo_pid_<A|B>/, but only for the field (non-halo) particles.
The “RVint” format packs six fields (x,y,z, and vx,vy,vz) into three ints (12 bytes). Positions are stored to 20 bits (global), and velocities 12 bits (max 6000 km/s).
The PIDs are 8 bytes and encode a local density estimate, tag bits for merger trees, and a unique particle id, the last of which encodes the Lagrangian particle coordinate.
These are described in more detail on the AbacusSummit Data Model page.
Use the unpack_bits argument of the
CompaSOHaloCatalog constructor to specify
which PID bit fields you want unpacked. Be aware that some of them might use a lot of
memory; e.g. the Lagrangian positions are three 4-byte floats per subsample particle.
Also be aware that some of the returned arrays use narrow int dtypes to save memory,
such as the lagr_idx field using int16. It is easy to silently overflow such
narrow int types; make sure your operations stay within the type width and cast if necessary.
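For example, the following sketch unpacks only the density estimate and Lagrangian index for subsample A (field names as listed in the constructor documentation below), and widens the narrow lagr_idx dtype before doing arithmetic; it assumes the unpacked fields appear as columns of cat.subsamples:

import numpy as np
from abacusnbody.data.compaso_halo_catalog import CompaSOHaloCatalog

cat = CompaSOHaloCatalog('/storage/AbacusSummit/AbacusSummit_base_c000_ph000/halos/z0.100',
                         subsamples=dict(A=True, pid=True),
                         unpack_bits=['density', 'lagr_idx'])
# cast the int16 lagr_idx to a wider type before arithmetic to avoid silent overflow
lagr_idx = np.asarray(cat.subsamples['lagr_idx'], dtype=np.int64)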
Field Subset Loading¶
Because the ASDF files are column-oriented, it is possible to load just one or a few
columns (halo catalog fields) rather than the whole file. This can save huge amounts
of IO, memory, and CPU time (the latter due to the decompression). Use the
fields argument to the CompaSOHaloCatalog constructor to specify the list of columns you want.
In detail, some columns are stored as ratios to other columns. For example,
several of the radius columns are stored as ratios relative to
r100, so to properly unpack such a column, the
r100 column must also be read.
CompaSOHaloCatalog knows about
these dependencies and will load the minimum set necessary to return the requested
columns to the user. However, this may result in more IO than expected. The
verbose constructor flag or the
dependency_info field of the CompaSOHaloCatalog
object may be useful for diagnosing exactly what data is being loaded.
Despite the potential extra IO and CPU time, the extra memory usage is granular at the level of individual files. In other words, when loading multiple files, the concatenated array will never be constructed for columns that only exist for dependency purposes.
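For example, to load just the halo particle counts and center-of-mass positions (column names as in the example output above), with verbose=True to print what is being loaded:

cat = CompaSOHaloCatalog('/storage/AbacusSummit/AbacusSummit_base_c000_ph000/halos/z0.100',
                         fields=['N', 'x_com'], verbose=True)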
Superslab (Chunk) Processing¶
The halo catalogs are divided across multiple files, called “superslabs”, which are
typically planar chunks of the simulation volume (all y,z for some range of x, with
a bit of overlap at the boundaries). Applications that can process the volume
superslab-by-superslab can save a substantial amount of memory compared to loading
the full volume. To load a single superslab, pass the corresponding
halo_info file as the path argument:
cat = CompaSOHaloCatalog('AbacusSummit_base_c000_ph000/halos/z0.100/halo_info/halo_info_000.asdf')
If your application needs one slab of padding, you can pass a list of files and proceed in a rolling fashion:
cat = CompaSOHaloCatalog(['AbacusSummit_base_c000_ph000/halos/z0.100/halo_info/halo_info_033.asdf',
                          'AbacusSummit_base_c000_ph000/halos/z0.100/halo_info/halo_info_000.asdf',
                          'AbacusSummit_base_c000_ph000/halos/z0.100/halo_info/halo_info_001.asdf'])
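A minimal sketch of such a rolling loop over all superslabs (the glob pattern follows the directory layout above; process() is a placeholder for the user's per-superslab analysis):

from pathlib import Path
from abacusnbody.data.compaso_halo_catalog import CompaSOHaloCatalog

halo_info_fns = sorted(Path('AbacusSummit_base_c000_ph000/halos/z0.100/halo_info').glob('halo_info_*.asdf'))
nfiles = len(halo_info_fns)
for i in range(nfiles):
    # central superslab i plus one superslab of padding on each side (periodic wrap)
    window = [halo_info_fns[(i - 1) % nfiles], halo_info_fns[i], halo_info_fns[(i + 1) % nfiles]]
    cat = CompaSOHaloCatalog(window, fields=['N', 'x_com'])
    process(cat.halos)  # placeholder for per-superslab analysis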
Another way to save memory is to use the
filter_func argument. This function
will be called for each superslab, and must return a mask representing the rows
to keep. For example, to drop all halos with less than 100 particles, use:
cat = CompaSOHaloCatalog(..., filter_func=lambda h: h['N'] >= 100)
Because this mask is applied on each superslab immediately after loading, the full, unfiltered table is never constructed, thus saving memory.
The filtering adds some CPU time, but in many cases loading catalogs is IO limited, so this won’t add much overhead.
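Filters can also combine several per-row conditions. For example, a sketch selecting halos with at least 100 particles that lie in the negative-x half of the box (using the x_com column shown earlier, which appears to hold box-centered positions):

cat = CompaSOHaloCatalog(..., filter_func=lambda h: (h['N'] >= 100) & (h['x_com'][:, 0] < 0))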
The Blosc compression we use inside the ASDF files supports multi-threaded decompression. We have packed AbacusSummit files with 4 Blosc blocks (each ~few MB) per ASDF block, so 4 Blosc threads is probably the optimal value. This is the default value, unless fewer cores are available (as determined by the process affinity mask).
Loading a CompaSOHaloCatalog will use 4 decompression threads by default.
You can control the number of decompression threads with:
import abacusnbody.data.asdf
abacusnbody.data.asdf.set_nthreads(N)
- class abacusnbody.data.compaso_halo_catalog.CompaSOHaloCatalog(path, cleaned=True, subsamples=False, convert_units=True, unpack_bits=False, fields='DEFAULT_FIELDS', verbose=False, cleandir=None, filter_func=None, halo_lc=None, **kwargs)¶
A halo catalog from Abacus’s on-the-fly group finder.
- __init__(path, cleaned=True, subsamples=False, convert_units=True, unpack_bits=False, fields='DEFAULT_FIELDS', verbose=False, cleandir=None, filter_func=None, halo_lc=None, **kwargs)¶
Loads halos. The
halos field of this object will contain the halo records; the
subsamples field will contain the corresponding halo/field subsample positions and velocities and their ids (if requested via subsamples). The
header field contains metadata about the simulation.
Whether a particle is tagged or not is returned when loading the halo and field pids, as it is encoded for each in the 64-bit PID. The local density of the particle is also encoded in the PIDs and returned upon loading those.
path (path-like or list of path-like) –
The halo catalog directory, like
MySimulation/halos/z1.000/. Or a single halo info file, or a list of halo info files. Will accept
halo_info dirs or “redshift” dirs (e.g. z1.000/halo_info/ or z1.000/).
To load cleaned catalogs, you do not need to pass a different argument to the constructor; pass
cleaned=True instead and the path to the cleaning info will be detected automatically (or see cleandir).
cleaned (bool, optional) – Loads the “cleaned” version of the halo catalogues. Always recommended. Assumes there is a directory called
cleaning/ at the same level as the top-level simulation directory (or see
cleandir). Default: True. False returns the out-of-the-box CompaSO halos, which may be useful for specific applications.
subsamples (bool or dict, optional) –
Load halo particle subsamples. True or False may be specified to load all particles or none, or a dict to specify whether to load subsample A and/or B, with pos, vel, and/or pid fields:
subsamples=dict(A=True, B=True, pos=True, vel=True, pid=True)
The rv key may be used as shorthand to set both pos and
vel. False (the default) loads nothing.
convert_units (bool, optional) – Convert positions from unit-box units to BoxSize-box units; velocities already come in km/s. Default: True.
unpack_bits (bool, or list of str, optional) – Extract information from the PID field of each subsample particle info about its Lagrangian position, whether it is tagged, and its current local density. If False, only the particle ID part will be extracted. Note that this per-particle information can be large. Can be a list of str, in which case only those fields will be unpacked. Field names are: (‘pid’, ‘lagr_pos’, ‘tagged’, ‘density’, ‘lagr_idx’). Default: False.
fields (str or list of str, optional) – A list of field names/halo properties to load. Selecting a small subset of fields can be substantially faster than loading all fields because the file IO will be limited to the desired fields. See
compaso_halo_catalog.user_dt or the AbacusSummit Data Model page for a list of available fields. See
compaso_halo_catalog.clean_dt for the list of cleaned halo fields that will be loaded. ‘all’ will also load main progenitor information, which could be slow. Default: ‘DEFAULT_FIELDS’
verbose (bool, optional) – Print informational messages. Default: False
cleandir (str, optional) – Where the halo catalog cleaning files are located (usually called
cleaning/). Default of None will try to detect it automatically. Only has any effect if using cleaned=True.
filter_func (function, optional) –
A mask function to be applied to each superslab as it is loaded. The function must take one argument (a halo table) and return a boolean array or similar mask on the rows. Simple lambda expressions are particularly useful here; for example, to load all halos with 100 particles or more, use:
filter_func=lambda h: h['N'] >= 100
halo_lc (bool or None, optional) – Whether the catalog is a halo light cone catalog, i.e. an output of the CompaSO halo light cone pipeline. Default of None means to detect based on the catalog path.
- nbytes(halos=True, subsamples=True)¶
Return the memory usage of the big arrays: the halo catalog and the particle subsamples
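A brief usage sketch, assuming the returned value is a byte count as the method name suggests:

cat = CompaSOHaloCatalog('/storage/AbacusSummit/AbacusSummit_base_c000_ph000/halos/z0.100')
print(f'{cat.nbytes() / 1e9:.2f} GB in memory')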
- static is_path_halo_lc(path)¶
- abacusnbody.data.compaso_halo_catalog.join_arrays(offset, array_original, array_merge, array_joined, npstart_original, npout_original, npstart_merge, npout_merge, np_total)¶