The API¶
HARDS defines its API both in code and text (docstrings).
The code defines the objects and interfaces required to implement the tree-like structure seen in the introduction. The types of the arguments and return values provide some information about the API and the kinds of data it can handle. The docstrings, however, provide the implementation details, caveats, and constraints on the functionality and valid data types.
Models¶
HARDS provides three classes that model how data is stored and related.
The database class handles access to the top-level datasets and, by extension, manages the physical location of all data (e.g. a filesystem or web server).
The dataset class handles access to related datapoints and datasets (called sub-datasets). A dataset can manage many datapoints and many sub-datasets but has only one parent: either another dataset or the database.
Finally, the datapoint class models a single unit of data. A datapoint has no children and is the child of a single dataset, although it is available to all children of that dataset.
An example of this model is found in the prologue of Hierarchical Arbitrary Data Storage (HARDS): numerical optimisation of magnetic coils. The top-level datasets here are the initial Monte Carlo samples of a given version of the simulator. The sub-datasets are the results of intelligent sampling algorithms that use the available data to train decision-making models.
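As a minimal sketch of this hierarchy (assuming a hypothetical concrete implementation called MyDatabase; HARDS itself only defines the abstract classes):

```python
# Minimal sketch of the HARDS model hierarchy.
# `MyDatabase` is a hypothetical concrete implementation of
# hards.api.AbstractDatabase; HARDS only defines the abstract interface.
from my_backend import MyDatabase  # hypothetical backend

db = MyDatabase()                                    # root node (the database)
samples = db.create_dataset("monte_carlo_v1")        # top-level dataset
refined = samples.create_dataset("adaptive_run_1")   # sub-dataset (child of samples)
point = refined.create_datapoint("sample_0001")      # leaf datapoint

print(refined.parent.name)   # "monte_carlo_v1": a dataset has exactly one parent
print(point.fullname())      # e.g. "monte_carlo_v1/adaptive_run_1/sample_0001"
```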
Data and Files¶
Datasets and datapoints can have data and files attached to them.
Data is JSON serialisable and will often be a key-value store (dictionary). Data attached to a dataset should be considered metadata relevant to all of its datapoints, though this is merely a convention. Implementations of HARDS may place additional constraints on the type of data that is legal. For example, they may enforce a key-value store with no nested structures.
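Continuing the sketch from the Models section, attaching data might look like the following (the objects are assumed to come from a concrete implementation):

```python
# Attach JSON-serialisable, key-value data to a dataset and a datapoint.
samples.add_data({"simulator_version": "2.1", "coil_count": 12})  # dataset metadata
point.add_data({"field_strength": 0.82, "converged": False})      # per-datapoint data

# Re-adding an existing key overwrites its old value; other keys are kept.
point.add_data({"converged": True})
```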
Files can be of any format and contain any data. They are copied into the database and can only be modified by re-adding a file with the same name. Files are made accessible back to the user via a pathlib.Path object (again, this is read-only).
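Continuing the same sketch, files might be attached and retrieved like this (the paths and names are illustrative):

```python
from pathlib import Path

# Attach a file; it is copied into the database.
point.add_file(Path("results/field_map.csv"), name="field_map.csv")

# Retrieve the stored copy as a read-only pathlib.Path.
stored = point.get_file("field_map.csv")
print(stored.name, stored.exists())

# "Modify" the file by re-adding an updated copy under the same name.
point.add_file(Path("results/field_map_v2.csv"), name="field_map.csv")
```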
Parallel Workflows¶
The HARDS API makes no provisions for handling concurrent or parallel workflows safely. In general, a parallel or concurrent workflow should not rely on changes made in another. Some implementations may not immediately commit data changes or new files to the database, which could lead to one thread using outdated data or files. Similarly, the creation of a dataset or datapoint may not be actioned immediately within the database, especially in implementations that are non-local.
Therefore, it is good practice to create the dataset in a single thread before spawning parallel processes. Each of these parallel processes can be responsible for creating its own datapoints and updating their data and files. When the parallel processes are finished, analysis of the data can then occur on a single thread because all of the data should have been safely stored.
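A sketch of this pattern, assuming a hypothetical concrete implementation called MyDatabase and illustrative names throughout:

```python
from concurrent.futures import ProcessPoolExecutor

from my_backend import MyDatabase  # hypothetical concrete implementation


def run_sample(dataset_name: str, index: int) -> None:
    # Each worker opens its own database handle and touches only its own datapoint.
    db = MyDatabase()
    dataset = db.recursively_get_dataset(dataset_name)
    dp = dataset.create_datapoint(f"sample_{index:04d}")
    dp.add_data({"index": index})  # plus results, files, etc.


if __name__ == "__main__":
    # 1. Create the dataset once, in the main process, before spawning workers.
    db = MyDatabase()
    dataset = db.create_dataset("monte_carlo_v1")
    name = dataset.fullname()

    # 2. Workers create and populate their own datapoints independently.
    with ProcessPoolExecutor() as pool:
        list(pool.map(run_sample, [name] * 100, range(100)))

    # 3. Analyse on a single thread once everything has been stored.
    points = db.recursively_get_dataset(name).recursively_get_datapoints()
```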
API Documentation¶
The abstract API for managing hierarchical datasets in Python.
- class hards.api.AbstractDatabase¶
Bases: _TreeNode, _AbstractHasChildrenDatasetsMixin
Abstract base class for a Database.
A database defines the root node of the hierarchical data management tree.
- abstract property children: Sequence[str]¶
The names of the object’s current children (datasets).
- abstract create_dataset(name: str) AbstractDataset¶
Create a dataset as a child of this object.
- Parameters:
name (str) – The name of the new dataset.
- Returns:
The new dataset object.
- Return type:
AbstractDataset
- Raises:
AlreadyExistsError – If a dataset with the same name already exists.
InvalidNameError – If the dataset name contains invalid characters.
- database() AbstractDatabase¶
Return the database (this object).
- fullname() str¶
Return the full name of the object.
This is the name that, when passed to a recursive get method on the database, would return a new instantiation of this object.
- abstract get_dataset(name: str) AbstractDataset¶
Return a dataset that is a child of this object.
- Parameters:
name (str) – The name of the dataset to get.
- Returns:
The dataset with the given name.
- Return type:
AbstractDataset
- Raises:
DoesNotExistError – If a dataset with the given name does not exist.
- abstract has_dataset(name: str) bool¶
Indicate if the object has a child dataset with the given name.
- Parameters:
name (str) – The name of the dataset to check exists.
- Returns:
True if dataset exists, else False.
- Return type:
bool
- property is_database: bool¶
Returns True because this object represents a database.
- abstract property name: str¶
The object's name.
- property parent: _TreeNode | None¶
The object’s parent.
- path_to_database() list[str]¶
Return the names of this object and the intermediates to the database.
- recursively_get_datapoint(name: str) AbstractDatapoint¶
Recursively follow a tree of datasets and return a datapoint.
- Parameters:
name (str) – The name of the datasets to follow and the datapoint to return of the form <intermediate dataset>/<intermediate dataset>/<…>/<datapoint>.
- Returns:
The datapoint object.
- Return type:
AbstractDatapoint
- Raises:
DoesNotExistError – If any of the intermediate datasets or the datapoint does not exist.
- recursively_get_dataset(name: str) AbstractDataset¶
Recursively follow a tree of datasets and return the final dataset.
- Parameters:
name (str) – The name of the datasets to follow and the dataset to return of the form <intermediate dataset>/<intermediate dataset>/<…>/<dataset of interest>.
- Returns:
The dataset object.
- Return type:
AbstractDataset
- Raises:
DoesNotExistError – If any of the intermediate datasets do not exist.
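A brief usage sketch of this interface, where db stands for an instance of a hypothetical concrete implementation:

```python
# `db` is assumed to be an instance of a concrete AbstractDatabase subclass.
if db.has_dataset("monte_carlo_v1"):
    samples = db.get_dataset("monte_carlo_v1")
else:
    samples = db.create_dataset("monte_carlo_v1")

# Recursive getters take slash-separated names relative to the database.
run = db.recursively_get_dataset("monte_carlo_v1/adaptive_run_1")
point = db.recursively_get_datapoint("monte_carlo_v1/adaptive_run_1/sample_0001")

print(db.is_database)  # True: the database is the root node
print(run.fullname())  # e.g. "monte_carlo_v1/adaptive_run_1"
```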
- class hards.api.AbstractDatapoint¶
Bases: _TreeNode, _AbstractHasDataAndFilesMixin
Abstract base class for a Datapoint.
- abstract add_data(new_data: dict[str, Any]) None¶
Add new data to the object.
Adding new data does not remove old data unless a key already exists, in which case the old data of that key is overwritten by the newer data.
- Parameters:
new_data (dict[str, Any]) – New data, in key-value form, to be added to this object's data store.
- abstract add_file(file: Path, *, name: str | None = None) None¶
Add a file to be managed by this object.
Files are copied into the database and should be treated therein as read-only. They can be ‘modified’ by re-adding an updated file with the same name.
- Parameters:
file (Path) – A path to the file to add to this object.
name (str | None, optional) – Give the file a new name (include the extension).
- Raises:
DoesNotExistError – If the file does not exist or is not a file (e.g. it is a directory).
InvalidNameError – If the filename contains invalid characters. The filename may not be checked if a name is not explicitly provided.
- database() _TreeNode¶
Recursively find the database.
- abstract property files: Sequence[str]¶
The list of file names (including extensions).
- fullname() str¶
Return the full name of the object.
This is the name that, when passed to a recursive get method on the database, would return a new instantiation of this object.
- abstract get_file(name: str) Path¶
Return the path to the file with the given name.
- Parameters:
name (str) – The name of the file (including its extension).
- Returns:
The path to the file (read-only).
- Return type:
Path
- Raises:
DoesNotExistError – If a file with the given name does not exist.
- abstract has_file(name: str) bool¶
Return a bool indicating whether the file with the given name exists.
The name must include the file extension.
- property is_database: bool¶
True if the object is the database (root node).
False by default.
- abstract property name: str¶
The object's name.
- abstract property parent: _TreeNode | None¶
The object’s parent.
- path_to_database() list[str]¶
Return the names of this object and the intermediates to the database.
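A brief usage sketch of this interface, where point stands for a datapoint obtained from a hypothetical concrete implementation (e.g. via create_datapoint on a dataset):

```python
from pathlib import Path

# `point` is assumed to be an instance of a concrete AbstractDatapoint subclass.
point.add_data({"field_strength": 0.82})
point.add_file(Path("results/field_map.csv"), name="field_map.csv")

if point.has_file("field_map.csv"):
    path = point.get_file("field_map.csv")  # read-only pathlib.Path
    print(path.suffix)                      # ".csv"

print(point.files)               # e.g. ["field_map.csv"]
print(point.path_to_database())  # names from this datapoint up to the database
```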
- class hards.api.AbstractDataset¶
Bases: _TreeNode, _AbstractHasChildrenDatasetsMixin, _AbstractHasDataAndFilesMixin
Abstract base class for a Dataset.
A dataset contains (meta)data and several datapoints that form the dataset.
Datasets have a parent from which they inherit additional datapoints (except when the parent is the Database). It follows that datasets can have many children, which share their data.
- abstract add_data(new_data: dict[str, Any]) None¶
Add new data to the object.
Adding new data does not remove old data unless a key already exists, in which case the old data of that key is overwritten by the newer data.
- Parameters:
new_data (dict[str, Any]) – New data, in key-value form, to be added to this object's data store.
- abstract add_file(file: Path, *, name: str | None = None) None¶
Add a file to be managed by this object.
Files are copied into the database and should be treated therein as read-only. They can be ‘modified’ by re-adding an updated file with the same name.
- Parameters:
file (Path) – A path to the file to add to this object.
name (str | None, optional) – Give the file a new name (include the extension).
- Raises:
DoesNotExistError – If the file does not exist or is not a file (e.g. it is a directory).
InvalidNameError – If the filename contains invalid characters. The filename may not be checked if a name is not explicitly provided.
- abstract property children: Sequence[str]¶
The names of the object’s current children (datasets).
- abstract create_datapoint(name: str) AbstractDatapoint¶
Create and return a datapoint with a given name.
- Parameters:
name (str) – The name of the new datapoint.
- Returns:
The new object.
- Return type:
AbstractDatapoint
- Raises:
AlreadyExistsError – If a datapoint with the same name exists.
InvalidNameError – If the datapoint name contains invalid characters.
- abstract create_dataset(name: str) AbstractDataset¶
Create a dataset as a child of this object.
- Parameters:
name (str) – The name of the new dataset.
- Returns:
The new dataset object.
- Return type:
AbstractDataset
- Raises:
AlreadyExistsError – If a dataset with the same name already exists.
InvalidNameError – If the dataset name contains invalid characters.
- database() _TreeNode¶
Recursively find the database.
- abstract property datapoints: Sequence[str]¶
The names of the Dataset’s current datapoints.
- abstract property files: Sequence[str]¶
The list of file names (including extensions).
- fullname() str¶
Return the full name of the object.
This is the name that, when passed to a recursive get method on the database, would return a new instantiation of this object.
- abstract get_datapoint(name: str) AbstractDatapoint¶
Get the datapoint with a given name.
- Parameters:
name (str) – The name of the datapoint.
- Returns:
The datapoint object.
- Return type:
AbstractDatapoint
- Raises:
DoesNotExistError – If the datapoint does not exist.
- abstract get_dataset(name: str) AbstractDataset¶
Return a dataset that is a child of this object.
- Parameters:
name (str) – The name of the dataset to get.
- Returns:
The dataset with the given name.
- Return type:
AbstractDataset
- Raises:
DoesNotExistError – If a dataset with the given name does not exist.
- abstract get_file(name: str) Path¶
Return the path to the file with the given name.
- Parameters:
name (str) – The name of the file (including its extension).
- Returns:
The path to the file (read-only).
- Return type:
Path
- Raises:
DoesNotExistError – If a file with the given name does not exist.
- abstract has_datapoint(name: str) bool¶
Indicate if the object has a datapoint with the given name.
- abstract has_dataset(name: str) bool¶
Indicate if the object has a child dataset with the given name.
- Parameters:
name (str) – The name of the dataset to check exists.
- Returns:
True if dataset exists, else False.
- Return type:
bool
- abstract has_file(name: str) bool¶
Return a bool indicating whether the file with the given name exists.
The name must include the file extension.
- property is_database: bool¶
True if the object is the database (root node).
False by default.
- abstract property name: str¶
The object's name.
- abstract property parent: _TreeNode | None¶
The object’s parent.
- path_to_database() list[str]¶
Return the names of this object and the intermediates to the database.
- recursively_get_datapoint(name: str) AbstractDatapoint¶
Recursively follow a tree of datasets and return a datapoint.
- Parameters:
name (str) – The name of the datasets to follow and the datapoint to return of the form <intermediate dataset>/<intermediate dataset>/<…>/<datapoint>.
- Returns:
The datapoint object.
- Return type:
AbstractDatapoint
- Raises:
DoesNotExistError – If any of the intermediate datasets or the datapoint does not exist.
- recursively_get_datapoints(*, reconstruct: bool = True, parents: bool = True) list[AbstractDatapoint]¶
Get all datapoints of this dataset and its parents (iff parents is True).
Follows the parents until the database, collecting their datapoints.
- Parameters:
reconstruct (bool) – Reconstruct the entire tree above this object and re-call this method on a new instance of this dataset. This mitigates the situation where this dataset's parent has been modified in another instance. It makes the method safe in single-threaded synchronous applications but makes no guarantees about parallel or asynchronous applications.
parents (bool) – If False, only the datapoints of this dataset are returned; that is, this method acts as a safe way to get the instantiated datapoints of this dataset only.
Notes
Reconstruction does not guarantee the safety of this method. See the relevant documentation sections for considerations.
- recursively_get_dataset(name: str) AbstractDataset¶
Recursively follow a tree of datasets and return the final dataset.
- Parameters:
name (str) – The name of the datasets to follow and the dataset to return of the form <intermediate dataset>/<intermediate dataset>/<…>/<dataset of interest>.
- Returns:
The dataset object.
- Return type:
AbstractDataset
- Raises:
DoesNotExistError – If any of the intermediate datasets do not exist.
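A brief sketch of datapoint inheritance between a dataset and a sub-dataset, assuming a hypothetical concrete implementation and illustrative names:

```python
# `db` is assumed to be an instance of a concrete AbstractDatabase subclass.
parent = db.create_dataset("monte_carlo_v1")
child = parent.create_dataset("adaptive_run_1")

parent.create_datapoint("sample_0001")
child.create_datapoint("sample_1001")

# `datapoints` lists only this dataset's own datapoints ...
print(child.datapoints)  # e.g. ["sample_1001"]

# ... while recursively_get_datapoints() also collects those inherited from parents.
inherited = child.recursively_get_datapoints()               # own + parent's datapoints
own_only = child.recursively_get_datapoints(parents=False)   # just this dataset's
print(len(inherited), len(own_only))                         # e.g. 2 1
```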