daf.storage.memory_mapping

Functions to implement memory-mapped files of 1D/2D data.

The format used here was chosen for simplicity, making it easy for “any” (even non-Python) systems to access the data. It is so trivial it can hardly be called a “format” at all, and is explicitly not the format used by numpy.memmap, which is terribly complicated and can only be accessed using numpy in Python (in order to support many use cases we don’t care about in daf).

Note

The code here assumes all the machines accessing memory-mapped data use the same (little-endian) byte order and IEEE floating point formats, which is true for all modern CPUs.

Functions:

exists_memory_mapped_array(path)

Returns whether the disk files for a memory-mapped numpy.ndarray exist.

remove_memory_mapped_array(path)

Remove the disk files (which must exist) for a memory-mapped numpy.ndarray.

create_memory_mapped_array(path, shape, dtype)

Create new disk files for a memory-mapped numpy.ndarray of some shape and dtype in some path.

open_memory_mapped_array(path, mode)

Open memory-mapped numpy.ndarray disk files.

exists_memory_mapped_sparse(path)

Returns whether the disk files for a memory-mapped SparseInRows matrix exist.

remove_memory_mapped_sparse(path)

Remove the disk files for a memory-mapped SparseInRows matrix, if they exist.

write_memory_mapped_sparse(path, sparse)

Write the disk files for a memory-mapped SparseInRows matrix.

open_memory_mapped_sparse(path, mode)

Open memory-mapped SparseInRows matrix disk files.

allow_maximal_open_files()

Increase the maximal number of open files as much as possible, and return the updated limit.

mmap_file(*, path, mode, fd[, offset, size])

Memory-map a file at some path, opened as some fd, using some mode (r or r+).

bytes_as_ndarray(memory, *, name, shape, dtype)

View the bytes in memory as a numpy.ndarray with some shape and dtype.

Data:

MMAP_CACHE

Cache memory-mapped files so they are not mapped twice.

daf.storage.memory_mapping.exists_memory_mapped_array(path: str) → bool

Returns whether the disk files for a memory-mapped numpy.ndarray exist.

daf.storage.memory_mapping.remove_memory_mapped_array(path: str) → None

Remove the disk files (which must exist) for a memory-mapped numpy.ndarray.

daf.storage.memory_mapping.create_memory_mapped_array(path: str, shape: Union[int, Tuple[int, int]], dtype: Union[str, dtype]) → None

Create new disk files for a memory-mapped numpy.ndarray of some shape and dtype in some path.

This will silently overwrite existing files. In particular, it will delete <path>.data, <path>.indices and/or <path>.indptr files if they exist.

The array element type must be one of FIXED_DTYPES, that is, one can’t create a memory-mapped file of strings or objects of strange types.

This creates two disk files:

  • <path>.array which contains just the data elements. For 2D data, this is always in ROW_MAJOR layout.

  • <path>.yaml which contains a mapping with three keys:

    • version is a list of two integers, the major and minor format version numbers, to protect against future extensions of the format. This version of the library will generate [1, 0] files and will accept any files with a major version of 1.

    • dtype specifies the array element data type.

    • shape specifies the number of elements (as a sequence of one or two integers).

This simple representation makes it easy for other systems to directly access the data. However, it basically makes it impossible to automatically report the type of the files (e.g., using the Linux file command).

Note

This just creates the files, filled with zeros. To access the data, you’ll need to call open_memory_mapped_array. Also, actual disk space for the data file is not allocated yet; the whole file is one big “hole”. Actual disk pages are only created when the data is actually written for the first time. This makes creating large data files very quick, and even filling them with data is quick, as long as the operating system doesn’t need to actually flush the pages to the disk, which might even be deferred until after the program exits. In fact, if the file is deleted before the program exits, the data need not touch the disk at all.
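For example, here is a hedged sketch that creates an array and then reads it back directly, without daf, using only the format described above (the path name “cells” and the direct numpy/yaml access are illustrative, not part of the daf API):

    import numpy as np
    import yaml

    from daf.storage.memory_mapping import create_memory_mapped_array

    create_memory_mapped_array("cells", shape=(100, 200), dtype="float32")

    # Any system can now access the data using just the two files:
    with open("cells.yaml") as file:
        metadata = yaml.safe_load(file)
    assert metadata["version"][0] == 1  # Accept any file with major version 1.

    array = np.fromfile("cells.array", dtype=metadata["dtype"])
    array = array.reshape(metadata["shape"])  # 2D data is in ROW_MAJOR layout.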

daf.storage.memory_mapping.open_memory_mapped_array(path: str, mode: str) → Union[Vector, DenseInRows]

Open memory-mapped numpy.ndarray disk files.

The mode must be one of r+ for read-write access, or r for read-only access (which returns is_frozen data).

Note

This only maps the data to memory; it does not actually read it from the disk. This makes opening large data files very quick, and even accessing the data may be fast, as long as the operating system doesn’t need to actually fetch the accessed pages from the disk (e.g., if the pages were previously read, or were just written, they are already in RAM). Therefore mapping a large array and accessing only small parts of it is much faster than reading the entire array into memory in advance.

This does consume virtual address space to cover the whole data, but because the data is memory-mapped to a disk file, it allows accessing data that is larger than the physical RAM; the operating system brings in disk pages as necessary when data is accessed, and is free to flush/forget them to release space, so only a small subset of them must exist in RAM at any given time.
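A minimal sketch of both access modes, continuing the “cells” example above:

    from daf.storage.memory_mapping import open_memory_mapped_array

    data = open_memory_mapped_array("cells", mode="r+")
    data[0, 0] = 1.0  # Written through to cells.array (when pages are flushed).

    frozen = open_memory_mapped_array("cells", mode="r")
    assert frozen[0, 0] == 1.0  # Read-only (frozen) view of the same data.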

daf.storage.memory_mapping.exists_memory_mapped_sparse(path: str) → bool

Returns whether the disk files for a memory-mapped SparseInRows matrix exist.

daf.storage.memory_mapping.remove_memory_mapped_sparse(path: str) → None

Remove the disk files for a memory-mapped SparseInRows matrix, if they exist.

daf.storage.memory_mapping.write_memory_mapped_sparse(path: str, sparse: SparseInRows) → None

Write the disk files for a memory-mapped SparseInRows matrix.

This will silently overwrite existing files. In particular, it will delete a <path>.array file if one exists.

This creates four disk files:

  • <path>.data which contains just the non-zero data elements.

  • <path>.indices which contains the column indices of the non-zero data elements.

  • <path>.indptr which contains, for each row, the range of its non-zero entries in the data and indices files.

  • <path>.yaml which contains a mapping with six keys:

    • version is a list of two integers, the major and minor version numbers, to protect against future extensions of the format. This version of the library will generate [1, 0] files and will accept any files with a major version of 1.

    • data_dtype specifies the non-zero data array element data type.

    • indices_dtype specifies the non-zero column indices array element data type.

    • indptr_dtype specifies the indptr (row ranges) array element data type.

    • shape specifies the number of rows and columns (as a sequence of two integers).

    • nnz specifies the number of non-zero data elements.

This simple representation makes it easy for other systems to directly access the data. However, it basically makes it impossible to automatically report the type of the files (e.g., using the Linux file command).
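A hedged sketch of writing sparse data, assuming SparseInRows is compatible with a scipy.sparse CSR matrix (the path name “umis” is hypothetical):

    import numpy as np
    import scipy.sparse as sp

    from daf.storage.memory_mapping import write_memory_mapped_sparse

    sparse = sp.csr_matrix(np.array([[0, 1, 0], [2, 0, 3]], dtype="float32"))
    write_memory_mapped_sparse("umis", sparse)

    # Four files now exist: umis.data, umis.indices, umis.indptr and umis.yaml;
    # the YAML records the dtypes, the shape [2, 3] and nnz == 3.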

daf.storage.memory_mapping.open_memory_mapped_sparse(path: str, mode: str) → SparseInRows

Open memory-mapped SparseInRows matrix disk files.

The mode must be one of r+ for read-write access, or r for read-only access.
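A minimal sketch of reopening the sparse files written above (assuming, as above, a CSR-compatible result):

    from daf.storage.memory_mapping import open_memory_mapped_sparse

    reopened = open_memory_mapped_sparse("umis", mode="r")
    assert reopened.nnz == 3  # Matches the nnz recorded in umis.yaml.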

daf.storage.memory_mapping.allow_maximal_open_files() → int

Increase the maximal number of open files as much as possible, and return the updated limit.

Every time you open_memory_mapped_array or open_memory_mapped_sparse, the relevant file(s) are memory-mapped which counts as “open files”. The operating system restricts the maximal number of such open files per process. When you reach this limit you will see an error complaining about “too many open files” or “running out of file descriptors”.

Luckily, modern operating systems allow for a large number of open files, so this isn’t a problem for common usage.

If you do reach this limit, call this function, which will use resource.setrlimit(resource.RLIMIT_NOFILE, ...) to increase the maximal number of open files to the maximal (“hard”) limit allowed by the operating system, as opposed to the lower “soft” limit used by default. This hard limit (the return value) is much higher on modern operating systems, and should be enough for most “uncommon” usage.

If even that isn’t enough, you should probably reflect on whether what you are trying to do makes sense in the first place. If you are certain it does, then most operating systems provide a way to raise the hard limit of open files to “any” value. This requires administrator privileges and is beyond the scope of this package.
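A minimal sketch of the soft/hard limit distinction described above, using only the standard resource module:

    import resource

    from daf.storage.memory_mapping import allow_maximal_open_files

    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(soft, hard)  # The default soft limit is typically much lower.

    new_limit = allow_maximal_open_files()  # Raise the soft limit to the hard limit.
    print(new_limit)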

daf.storage.memory_mapping.MMAP_CACHE: WeakValueDictionary[Tuple[str, str, int, int], mmap] = <WeakValueDictionary>

Cache memory-mapped files so they are not mapped twice.

The key is the path, the mode, the offset and the size (a size of 0 means the whole file); the value is the mmap object.

daf.storage.memory_mapping.mmap_file(*, path: str, mode: str, fd: int, offset: int = 0, size: int = 0) → Union[bytes, bytearray]

Memory-map a file at some path, opened as some fd, using some mode (r or r+).

If both the offset and the size are zero, maps the whole file.

If there already exists a compatible mapping, return it instead of creating a new one. That is:

  • A mapping of the whole file will be used (sliced as needed) to return any mapping for the same file.

  • A mapping for “r+” data will be used to return any mapping for both “r” and “r+” data.

If the mode is r returns an immutable bytes object, if it is r+ returns a mutable bytearray object.

Note

Since we are caching the mmap, if the file was changed in the file system, this need not be reflected in the result of a subsequent call to mmap_file. E.g., if the file was deleted, the mmap may survive (the OS will actually delete the data only once the final mmap is garbage collected, or the Python process exits). However, writing into the file (updating its content) will immediately update the content of the mmap (and any numpy.ndarray built from it), which may cause subtle issues if the code assumes the data is immutable.
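A hedged sketch of this low-level helper, mapping a file created earlier; os.open provides the file descriptor the function expects:

    import os

    from daf.storage.memory_mapping import mmap_file

    fd = os.open("cells.array", os.O_RDONLY)
    memory = mmap_file(path="cells.array", mode="r", fd=fd)
    # With mode "r" this is an immutable bytes view of the whole file.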

daf.storage.memory_mapping.bytes_as_ndarray(memory: Union[bytes, bytearray], *, name: str, shape: Collection[int], dtype: Union[str, dtype]) → ndarray

View the bytes in memory as a numpy.ndarray with some shape and dtype.

If the bytes array has the wrong size, complain using the name.
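A minimal sketch of viewing raw bytes as an array (the name only appears in the error message if the size is wrong):

    from daf.storage.memory_mapping import bytes_as_ndarray

    memory = bytes(2 * 3 * 8)  # 48 zero bytes, that is, six float64 zeros.
    array = bytes_as_ndarray(memory, name="example", shape=(2, 3), dtype="float64")
    assert array.shape == (2, 3)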