What it is
LOFS (short for "Lack Of File System" or "Leigh Orf File System") is (a) a methodology for organizing and writing numerical model output on distributed memory supercomputers (b) a set of tools for reading back the data for post-processing, visualization and analysis. LOFS is not a traditional file system like ext4, xfs, btrfs, NTFS, etc; LOFS sits atop a "real" file system and is entirely file based, hence the "lack of" a file system." LOFS was designed specifically for distributed memory supercomputer simulations where thousands to millions of MPI ranks must participate in I/O, and where I/O occurs frequently. The format of the files that comprise LOFS files is HDF5. Each HDF5 files that makes up LOFS can contain multiple time levels per file and be organized across many directories.
LOFS source code can be found on Github at https://github.com/leighorf.
On the write side, LOFS is comprised of Fortran 95 code that supplements the CM1 cloud model, but could be applied to other models utilizing MPI. LOFS utilizes the core HDF5 driver where HDF5 files are buffered to memory before being flushed to disk. For typical mappings of large runs to machines like the Blue Waters supercomputer, the CM1 model only comprises 4-5% of available memory on each node, which allows a lot of headroom for buffering data to memory. 50 to 100 time levels are typically buffered to memory before being flushed to disk.
On the read side, a set of tools written in C and C++ provides an interface into the LOFS format that does not require any manual "stitching together" of files or any foreknowledge of the specific file layout. This is accomplished using low-level I/O calls that traverse the directory tree until any HDF5 file is found, after which metadata is extracted that reveals the full structure of the saved data, enabling data retrieval. The user specifies a variable name, model time, and array indices, relative to the full model domain, in each direction (i, j, and k), and a floating point buffer is returned. LOFS operations are serial - no "parallel I/O" options are used with HDF5. LOFS performs very well when one executes multiple serial LOFS reads on a multicore machine with a parallel file system such as Lustre. A single LOFS process may read steadily at, say, 20 MB/s, but on a modern parallel file system, a second LOFS read will also achieve 20 MB/s, such that reads scale linearly and one can easily saturate the I/O infrastructure of a modern computer. Hence, operations that require the same analysis applied concurrently to many model times perform quite efficiently.
I developed LOFS for the specific purpose of being able to save large amounts of 3D floating data frequently for post-hoc analysis/visualization in thunderstorm simulations. LOFS has been used to save, analyze, and visualize extremely high resolution thunderstorm data as frequently as every model time step (see for example https://youtu.be/jcBzk3NJkGw). An istotropic supercell simulation spanning 1/4 trillion grid zones (ten meter domain-wide isotropic grid spacing) was being saved with LOFS, with I/O only taking up between a third and a half of total wallclock time with data being saved every 1/5 second. One of the ways this performance is achieved is through the use of lossy floating point compression, in this case, ZFP. A recent publication that goes into more detail about the usage of LOFS in a quarter trillion grid zone supercell thunderstorm simulation can be found here: https://doi.org/10.3390/atmos10100578. A link to a recent AGU poster presentation focusing on the use of ZFP compression in LOFS can be found here: http://orf.media/wp-content/uploads/2017/12/agu2017-orf.pdf.
Numerical models periodically output data to disk where it can be visualized and analyzed. Since the turn of the century, supercomputing hardware has transitioned from large shared memory machines to even larger distributed memory architectures with globally accessible file systems (such as Lustre or GPFS). Each compute node, itself made up of dozens to hundreds of compute cores, has direct write access to the underlying file system. It is no longer feasible (nor desirable) to save output data to a single file in very large simulations: Writing this way is inefficient, and extremely large output files are difficult to manage and archive. However, having each MPI rank save its own file during each save cycle will results in millions of files, each of which contains only a fragment of the full model domain. LOFS is a way of writing, organizing, and post-processing model data that lies between these two extremes. One file per compute node is written, and each file can contain dozens of time levels. This dramatically reduces the number of files written to disk. LOFS exploits the HDF5 core driver which allows HDF file to be grown in memory (buffered) before being written to disk. The use of the core driver reduces the number of times the underlying file system is accessed, reducing latency associated with many frequent writes to disk. Files are spread among many directories avoiding performance issues associated with holding tens of thousands or more files in a single directory.
LOFS is essentially a form of "poor man's parallel IO", or using more modern terminology, it utilizes Multiple Independent Files (MIF), as opposed to Single Shared Files (SSF) and includes tools that allow access to the underlying model data. In summary, LOFS utilizes the MIF file structure but has a API that enables SSF operations. For more on the advantages and disadvantages of MIF as opposed to SSF, see this very informative blog posting by Mark Miller at Lawrence Livermore.
At the time of this writing, the CM1 model offers several output formats including netCDF, a widely utilized data format in the atmospheric sciences. However, the netCDF I/O options for CM1 either do not scale well to many (tens of thousands or more) MPI ranks on supercomputing hardware, or scale well but produce too many files, with each file containing a small fragment of the full model domain. The latter situation makes post-processing and analysis very challenging.
At this time, LOFS has been used only with CM1, but it could be applied to any model that utilizes a 2D (or with modifications, a 3D) domain decomposition strategy. One of the features of LOFS is the use of redundant metadata, with each HDF5 file containing enough information to reconstruct the layout of the entire file system.
LOFS is designed specifically for large supercomputer simulations. Goals of the approach are to:
- reduce the number of files written to a reasonable number
- minimize the number of times you are doing actual I/O to the underlying file system
- write big files (but not too big)
- make it easy for users to read in data for analysis and visualization after it is written
For simulations where under a few hundred MPI ranks are used, no significant benefits are found from using LOFS, and users are encouraged to use one of the existing CM1 output options.
Components of LOFS: Writing
LOFS consists of modifications and additions to the CM1 model for writing data as well as tools written in C and C++ for reading data back and converting it to other formats such as netCDF. In CM1, LOFS completely replaces the I/O driver options of CM1. It consists of Fortran95 code that:
- Creates a new MPI communicator containing reordered MPI ranks where each compute node contains a continuous chunk of the physical model domain
- Creates directories and subdirectories on the fly to contain the HDF5 data
- Applies a strict file and directory naming system wherein metadata can be extracted from the file and directory names themselves
- Collects data from individual compute cores on a node and assembles them into 3D arrays that are stored in each HDF5 file
- Optionally applies data compression utilizing existing HDF5 options, including lossless gzip and lossy ZFP
Prior to model execution, the user must specify
- the number of files to contain per directory
- the number of times to buffer to an individual HDF5 file
- the number of cores spanning each shared memory node in both dimensions (x and y) of the 2D decomposition (for example, on the Blue Waters supercomputer, which contains 16 floating point cores per node, corex=4 and corey=4)
LOFS contains subroutines that handle the writing of both metadata and data, as well as the logic associated with assembling and flushing the data to disk. LOFS utilizes serial HDF, not parallel HDF, in its operations.
Components of LOFS: Reading
One of the problems of "poor man's parallel I/O" is that model data is spread across many files, fragmenting the physical model domain. While it is relatively straightforward to write code that assembles all the files into a single file, this is slow and results in 2x the data you started with.
LOFS was designed from the start to not just spread data out into multiple files, but to provide an application programmer interface (API) into the data after it is written that requires no knowledge of the underlying filesystem. The API interface requires the user to specify a variable name, model time, and the beginning and end array index values in all three dimensions (x, y, and z). These indices are relative to the full model domain. For instance, consider a simulation which spans 2200 points in x, 2200 points in y, and 380 points in z, spread over 10,000 MPI ranks with 16 MPI ranks per node. The user wishes to explore a 3D chunk of the domain in the the vicinity of the center of the model domain where the modeled storm is centered. This region spans (x0,y0,z0) to (x1,y1,z1) where x0=1000, x1=1200, y0=1000, y1=1200, z0=0, and z1=200. The user specifies these indexes along with the variable, and the LOFS code goes out and gets the 3D data from the proper files, assembles the data into an array, and sends back a pointer to the buffer that contains the requested chunk of data that can now be analyzed, visualized, etc. For retrieving 2D slices of data, the user may specify the start and end indices to be the same; for instance if one wished to view surface data, one would specify z0=z1=0.
The primary tool for converting LOFS data is called hdf2nc. CF compliant netCDF files are output from the hdf2nc command. Such files can be read directly by popular visualization programs such as VisIt, Vapor, and Paraview.