daf.storage.memory_mapping
Functions to implement memory-mapped files of 1D/2D data.
The format used here was chosen for simplicity, making it easy for “any” (even non-Python) systems to access the data. It is so trivial it can hardly be called a “format” at all, and is explicitly not the format used by numpy.memmap, which is terribly complicated and can only be accessed using numpy in Python (in order to support many use cases we don’t care about in daf).
Note
The code here assumes all the machines accessing memory-mapped data use the same (little-endian) byte order and IEEE floating point formats, which is true for all modern CPUs.
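For example, a program can verify this assumption at startup; the guard below is hypothetical and not part of daf:

```python
import struct
import sys

# A hypothetical startup guard (not part of daf) verifying the
# assumption above before touching shared memory-mapped files.
assert sys.byteorder == "little", "daf's memory-mapped format assumes little-endian data"
# IEEE-754 float32 little-endian encoding of 1.0:
print(struct.pack("<f", 1.0))  # b'\x00\x00\x80?'
```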
Functions:
- exists_memory_mapped_array: Returns whether the disk files for a memory-mapped numpy.ndarray exist.
- remove_memory_mapped_array: Remove the disk files (which must exist) for a memory-mapped numpy.ndarray.
- create_memory_mapped_array: Create new disk files for a memory-mapped numpy.ndarray.
- open_memory_mapped_array: Open memory-mapped numpy.ndarray disk files.
- exists_memory_mapped_sparse: Returns whether the disk files for a memory-mapped SparseInRows matrix exist.
- remove_memory_mapped_sparse: Remove the disk files for a memory-mapped SparseInRows matrix, if they exist.
- write_memory_mapped_sparse: Write the disk files for a memory-mapped SparseInRows matrix.
- open_memory_mapped_sparse: Open memory-mapped SparseInRows matrix disk files.
- allow_maximal_open_files: Increase the maximal number of open files as much as possible, and return the updated limit.
- mmap_file: Memory-map a whole file at some path.
- View the bytes in …
Data:
- MMAP_CACHE: Cache memory-mapped files so they are not mapped twice.
- daf.storage.memory_mapping.exists_memory_mapped_array(path: str) → bool [source]
  Returns whether the disk files for a memory-mapped numpy.ndarray exist.
- daf.storage.memory_mapping.remove_memory_mapped_array(path: str) → None [source]
  Remove the disk files (which must exist) for a memory-mapped numpy.ndarray.
- daf.storage.memory_mapping.create_memory_mapped_array(path: str, shape: Union[int, Tuple[int, int]], dtype: Union[str, dtype]) → None [source]
  Create new disk files for a memory-mapped numpy.ndarray of some shape and dtype in some path.
  This will silently overwrite existing files. In particular, it will delete <path>.data, <path>.indices and/or <path>.indptr files if they exist.
  The array element type must be one of FIXED_DTYPES; that is, one can’t create a memory-mapped file of strings or of objects of strange types.
  This creates two disk files:
  - <path>.array, which contains just the data elements. For 2D data, this is always in ROW_MAJOR layout.
  - <path>.yaml, which contains a mapping with three keys:
    - version is a list of two integers, the major and minor format version numbers, to protect against future extensions of the format. This version of the library generates [1, 0] files and accepts any file with a major version of 1.
    - dtype specifies the array element data type.
    - shape specifies the number of elements (as a sequence of one or two integers).
  This simple representation makes it easy for other systems to directly access the data. However, it basically makes it impossible to automatically report the type of the files (e.g., using the Linux file command).
  Note
  This just creates the files, filled with zeros. To access the data, you’ll need to call open_memory_mapped_array. Also, actual disk space for the data file is not allocated yet; the entire file is a “hole”. Actual disk pages are only created when the data is first written. This makes creating large data files very quick, and even filling them with data is quick as long as the operating system doesn’t need to actually flush the pages to the disk, which might be deferred until after the program exits. In fact, if the file is deleted before the program exits, the data need never touch the disk at all.
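The two-file layout above is simple enough to produce with the standard library alone. The sketch below is not daf’s implementation (the helper name create_array_files is hypothetical); it shows the zero-filled “hole” and the three-key YAML mapping:

```python
import os
import tempfile

# A minimal sketch of the layout described above. The helper name
# create_array_files is hypothetical; this is not daf's implementation.
def create_array_files(path: str, shape, dtype: str, itemsize: int) -> None:
    count = shape[0] * shape[1] if isinstance(shape, tuple) else shape
    # The data file is created as a "hole": truncate() sets the size
    # without allocating disk pages, so this is quick even for huge files.
    with open(path + ".array", "wb") as f:
        f.truncate(count * itemsize)
    # The YAML file is a simple three-key mapping, writable without PyYAML.
    with open(path + ".yaml", "w") as f:
        f.write("version: [1, 0]\n")
        f.write(f"dtype: {dtype}\n")
        f.write(f"shape: {list(shape) if isinstance(shape, tuple) else [shape]}\n")

tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "example")
create_array_files(path, (100, 200), "float32", 4)
print(os.path.getsize(path + ".array"))  # 100 * 200 * 4 = 80000
```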
- daf.storage.memory_mapping.open_memory_mapped_array(path: str, mode: str) → Union[Vector, DenseInRows] [source]
  Open memory-mapped numpy.ndarray disk files.
  The mode must be one of r+ for read-write access, or r for read-only access (which returns is_frozen data).
  Note
This only maps the data to memory, it does not actually read it from the disk. This makes opening large data files very quick, and even accessing the data may be fast as long as the operating system doesn’t need to actually get the specific used pages from the disk (e.g. the pages were previously read, or were just written, so they are already in RAM). Therefore mapping a large array and only accessing small parts of it would be much faster than reading all the array to memory in advance.
This does consume virtual address space to cover the whole data, but because the data is memory-mapped to a disk file, it allows accessing data that is larger than the physical RAM; the operating system brings in disk pages as necessary when data is accessed, and is free to flush/forget them to release space, so only a small subset of them must exist in RAM at any given time.
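The lazy behaviour described above can be demonstrated with Python’s built-in mmap module, the same mechanism daf builds on (the file name and sizes below are arbitrary):

```python
import mmap
import os
import tempfile

# Demonstrates lazy mapping with the standard mmap module: mapping a
# large file is cheap, and only the touched pages ever reach RAM.
tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "big.array")
with open(path, "wb") as f:
    f.truncate(1 << 26)  # a 64 MiB "hole"; creating it is nearly instant

with open(path, "r+b") as f:
    mapped = mmap.mmap(f.fileno(), 0)  # maps the whole file; reads no data yet
    # Touching one page brings in only that page, not the whole file:
    mapped[12345678:12345682] = b"\x01\x02\x03\x04"
    mapped.flush()
    mapped.close()
```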
- daf.storage.memory_mapping.exists_memory_mapped_sparse(path: str) → bool [source]
  Returns whether the disk files for a memory-mapped SparseInRows matrix exist.
- daf.storage.memory_mapping.remove_memory_mapped_sparse(path: str) → None [source]
  Remove the disk files for a memory-mapped SparseInRows matrix, if they exist.
- daf.storage.memory_mapping.write_memory_mapped_sparse(path: str, sparse: SparseInRows) → None [source]
  Write the disk files for a memory-mapped SparseInRows matrix.
  This will silently overwrite existing files. In particular, it will delete a <path>.array file if one exists.
  This creates four disk files:
  - <path>.data, which contains just the non-zero data elements.
  - <path>.indices, which contains the column indices of the non-zero data elements.
  - <path>.indptr, which contains the ranges of the entries of the non-zero data elements of the rows.
  - <path>.yaml, which contains a mapping with six keys:
    - version is a list of two integers, the major and minor version numbers, to protect against future extensions of the format. This version of the library generates [1, 0] files and accepts any file with a major version of 1.
    - data_dtype specifies the element data type of the non-zero data array.
    - indices_dtype specifies the element data type of the non-zero column indices array.
    - indptr_dtype specifies the element data type of the rows’ entries ranges (indptr) array.
    - shape specifies the number of elements (as a sequence of one or two integers).
    - nnz specifies the number of non-zero data elements.
  This simple representation makes it easy for other systems to directly access the data. However, it basically makes it impossible to automatically report the type of the files (e.g., using the Linux file command).
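The four-file layout can be sketched with the standard library’s array module; the file contents below are hand-built for illustration, not produced by write_memory_mapped_sparse:

```python
import os
import tempfile
from array import array

# A sketch of the four-file layout described above. The matrix is CSR
# ("sparse in rows"): row i's non-zero values live in
# data[indptr[i]:indptr[i + 1]], at columns indices[indptr[i]:indptr[i + 1]].
tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "sparse")

data = array("f", [1.0, 2.0, 3.0])  # the non-zero values
indices = array("i", [0, 2, 1])     # their column indices
indptr = array("i", [0, 2, 3])      # row ranges: row 0 is [0:2], row 1 is [2:3]

for suffix, arr in ((".data", data), (".indices", indices), (".indptr", indptr)):
    with open(path + suffix, "wb") as f:
        arr.tofile(f)
with open(path + ".yaml", "w") as f:
    f.write("version: [1, 0]\n")
    f.write("data_dtype: float32\nindices_dtype: int32\nindptr_dtype: int32\n")
    f.write("shape: [2, 3]\nnnz: 3\n")

# Any system can now reconstruct a row from the raw files:
back = array("f")
with open(path + ".data", "rb") as f:
    back.fromfile(f, 3)
print(list(back[indptr[0]:indptr[1]]))  # [1.0, 2.0]
```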
- daf.storage.memory_mapping.open_memory_mapped_sparse(path: str, mode: str) → SparseInRows [source]
  Open memory-mapped SparseInRows matrix disk files.
  The mode must be one of r+ for read-write access, or r for read-only access.
- daf.storage.memory_mapping.allow_maximal_open_files() → int [source]
  Increase the maximal number of open files as much as possible, and return the updated limit.
  Every time you open_memory_mapped_array or open_memory_mapped_sparse, the relevant file(s) are memory-mapped, which counts as “open files”. The operating system restricts the maximal number of such open files per process. When you reach this limit you will see an error complaining about “too many open files” or “running out of file descriptors”. Luckily, modern operating systems allow a large number of open files, so this isn’t a problem for common usage.
  If you do reach this limit, call this function, which uses resource.setrlimit(resource.RLIMIT_NOFILE, ...) to increase the maximal number of open files to the maximum (“hard”) limit allowed by the operating system, as opposed to the lower “soft” limit used by default. This hard limit (the return value) is much higher on modern operating systems, and should be enough for most “uncommon” usage. If even that isn’t enough, you should probably reflect on whether what you are trying to do makes sense in the first place. If you are certain it does, then most operating systems provide a way to raise the hard limit of open files to “any” value. This requires administrator privileges and is beyond the scope of this package.
- daf.storage.memory_mapping.MMAP_CACHE: WeakValueDictionary[Tuple[str, str, int, int], mmap] = <WeakValueDictionary>
  Cache memory-mapped files so they are not mapped twice.
  The key is the (path, mode, offset, size) tuple (size=0 means the whole file); the value is the mmap object.
- daf.storage.memory_mapping.mmap_file(*, path: str, mode: str, fd: int, offset: int = 0, size: int = 0) → Union[bytes, bytearray] [source]
  Memory-map a whole file at some path opened using some fd, using some mode (r or r+).
  If both the offset and the size are zero, maps the whole file.
  If there already exists a compatible mapping, return it instead of creating a new one. That is:
  - A mapping of the whole file will be used (sliced as needed) to return any mapping for the same file.
  - A mapping for “r+” data will be used to return any mapping for both “r” and “r+” data.
  If the mode is r, returns an immutable bytes object; if it is r+, returns a mutable bytearray object.
  Note
  Since we are caching the mmap, if the file was changed in the file system, this need not be reflected in the result of a following call to mmap_file. E.g., if the file was deleted, the mmap may survive (the OS will actually delete the data only once the final mmap is garbage-collected, or the Python process exits). However, writing into the file (updating its content) will immediately update the content of the mmap (and any numpy.ndarray built from it), which may cause subtle problems if the code assumes the data is immutable.
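The caching idea behind MMAP_CACHE can be sketched as follows; cached_mmap is a hypothetical helper, and a WeakValueDictionary returns a live mapping when one exists without keeping unreferenced mappings alive:

```python
import mmap
import os
import tempfile
import weakref

# A sketch of the caching idea: reuse a live mapping for the same
# (path, mode, offset, size) key instead of mapping the file twice.
CACHE: weakref.WeakValueDictionary = weakref.WeakValueDictionary()

def cached_mmap(path: str, fd: int, size: int) -> mmap.mmap:
    key = (path, "r", 0, size)
    mapped = CACHE.get(key)
    if mapped is None:  # not cached (or already garbage-collected)
        mapped = mmap.mmap(fd, size, access=mmap.ACCESS_READ)
        CACHE[key] = mapped
    return mapped

tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "file.bin")
with open(path, "wb") as f:
    f.write(b"hello")
with open(path, "rb") as f:
    m1 = cached_mmap(path, f.fileno(), 5)
    m2 = cached_mmap(path, f.fileno(), 5)
    assert m1 is m2  # the second call reused the cached mapping
```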