Unix Pipes: pipe_asdf

pipe_asdf is a Python script that unpacks Abacus ASDF files (such as halo catalogs or particle data) and writes them out via a Unix pipe (stdout). The intention is to provide a simple way for codes written in C, C++, Fortran, etc. to read ASDF files while letting Python handle the details of the file formats, compression, and other things Python does well.

Usage

pipe_asdf [-h] [-f FIELD] [--nthread NTHREAD] asdf-file [asdf-file ...] | ./client

positional arguments

asdf-file

An ASDF file. Multiple files may be specified.

optional arguments

-h, --help

show this help message and exit

-f FIELD, --field FIELD

A field/column to pipe. Multiple -f flags are allowed, in which case fields will be piped in the order they are specified. (default: None)

--nthread NTHREAD

Number of blosc decompression threads (when applicable). For AbacusSummit, use 1 to 4. (default: 4)

Binary Format of Piped Data

The binary format of the piped data is simple:

  1. an 8-byte int indicating the number of data values

  2. a 4-byte int indicating the width of the primitive data type that composes the data (e.g. 4 for float, 8 for double). Largely provided as a sanity check.

  3. the data, consisting of a number of bytes equal to the product of the two preceding ints

  4. Repeat from (1) for all fields requested

So the expected pattern for the client code is to read the int64 and int32, take the product, allocate that many bytes, then read the data into that allocation.
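
The read pattern described above can be sketched in Python; a C or Fortran client would mirror it with fread or an equivalent. This is a minimal sketch, not the project's client code: the names read_exact and read_field are hypothetical, and native byte order and signed header integers are assumptions, since the format description does not state them.

```python
import io
import struct


def read_exact(stream, nbytes):
    """Read exactly nbytes from a binary stream, tolerating short reads."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError(f"stream ended with {remaining} bytes still expected")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_field(stream):
    """Read one field record: int64 count, int32 width, then count*width bytes.

    Returns (count, width, payload), or None at a clean end of stream.
    Assumption: native byte order and signed header integers.
    """
    first = stream.read(8)
    if not first:
        return None  # no more fields on the pipe
    if len(first) < 8:
        first += read_exact(stream, 8 - len(first))
    (count,) = struct.unpack("=q", first)
    (width,) = struct.unpack("=i", read_exact(stream, 4))
    payload = read_exact(stream, count * width)
    return count, width, payload
```

A client would call read_field(sys.stdin.buffer) once per -f flag, in the order the flags were given, and could compare width against the size of the type it expects before interpreting the payload.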

When passing multiple files, a single column will be read from all files before moving to the next column. In other words, the client sees the concatenated data.
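
To illustrate what the client sees with several input files, the sketch below mimics the sender side using plain Python lists in place of ASDF columns. It is not the pipe_asdf implementation: write_field and the sample values are hypothetical, and it assumes that each concatenated column arrives as a single record whose count is the total over all files, in native byte order.

```python
import io
import struct


def write_field(stream, per_file_values, fmt_char):
    """Write one column, gathered from every file, as a single record.

    per_file_values: one list of values per input file (stand-ins for ASDF columns).
    fmt_char: struct format character of the primitive type, e.g. "f" or "i".
    """
    width = struct.calcsize("=" + fmt_char)
    total = sum(len(values) for values in per_file_values)
    stream.write(struct.pack("=q", total))  # 8-byte count over all files
    stream.write(struct.pack("=i", width))  # 4-byte width of the primitive type
    for values in per_file_values:          # file order is preserved
        stream.write(struct.pack(f"={len(values)}{fmt_char}", *values))


# Two hypothetical files and two requested fields:
# the whole first field is sent before any of the second.
first_field_per_file = [[10, 20], [30]]
second_field_per_file = [[1.5, 2.5], [3.5]]

pipe = io.BytesIO()
write_field(pipe, first_field_per_file, "i")
write_field(pipe, second_field_per_file, "f")
stream_bytes = pipe.getvalue()
```

A byte string built this way can also be useful for exercising a client without real data files.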

From a performance perspective, the pipe operation probably amounts to a memcpy. This is a small performance hit, but likely a vanishingly small one compared to the actual IO and analysis.

Ultimately, this pipe scheme is not a replacement for direct access to the files, but it may be helpful for applications with simple data access patterns.

Entry Points

Technically, pipe_asdf is a “console script” alias provided by setuptools to invoke the abacusnbody.data.pipe_asdf module as a script. This alias is usually installed in a directory on the user’s PATH when installing abacusutils via pip, but if not, one can equivalently invoke the script with:

$ python3 -m abacusnbody.data.pipe_asdf

The abacusutils/pipe_asdf directory also contains a symlink to this module, so from that directory one can also run

$ ./pipe_asdf.py
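
When the console alias is not on the PATH and the reader is itself driven from a Python script, the module form of the command can be launched as a subprocess. The following is a sketch under assumptions: build_pipe_asdf_command and start_pipe_asdf are hypothetical helper names, and the argument order simply follows the usage line (options first, then files).

```python
import subprocess
import sys


def build_pipe_asdf_command(asdf_files, fields, nthread=4):
    """Build the argv list for the module form of the pipe_asdf command."""
    # sys.executable selects the interpreter that is running this script,
    # so the same environment's abacusutils installation is used.
    cmd = [sys.executable, "-m", "abacusnbody.data.pipe_asdf"]
    for field in fields:
        cmd += ["-f", field]  # fields are piped in the order given
    cmd += ["--nthread", str(nthread)]
    cmd += list(asdf_files)
    return cmd


def start_pipe_asdf(asdf_files, fields, nthread=4):
    """Start pipe_asdf with its stdout attached to a pipe the caller can read."""
    cmd = build_pipe_asdf_command(asdf_files, fields, nthread)
    return subprocess.Popen(cmd, stdout=subprocess.PIPE)
```

The returned process exposes the binary stream as its stdout attribute, which can be read with the same count, width, data pattern described in the binary format section.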

To-do

  • Add a “-k/--key” flag to read header fields. Decide on a wire protocol.

  • Add CompaSOHaloCatalog hooks to pipe the unpacked data (?)

Python API

abacusnbody.data.pipe_asdf.unpack_to_pipe(asdf_fns, fields, data_key='data', header_key='header', pipe=sys.stdout.buffer, nthread=4, verbose=True)
abacusnbody.data.pipe_asdf.main()

Invoke the command-line interface

Example C Client Code

An example C program called client.c that receives data over a pipe is given in the abacusutils/pipe_asdf directory.

From this directory, one can build the client program by running

$ make

and run it with:

$ ./pipe_asdf.py halo_info_000.asdf -f N -f x_com | ./client

You can use the example halo_info_000.asdf file symlinked in the pipe_asdf directory to test this.

This program is a stand-in for an analysis code. In this case, it just reads the raw binary data for two columns, N and x_com, and prints the values.