The API

HARDS defines its API both in code and text (docstrings).

The code defines the objects and interfaces required to implement the tree-like structure seen in the introduction. The types of arguments and return values provide some information about the API and the types of data it can handle. However, the docstrings provide the implementation details, caveats, and constraints on functionality and valid data types.

Models

HARDS provides three classes that model how data is stored and related.

The database class handles access to top-level datasets and, by extension, manages the physical location of all data (e.g. filesystem, web server).

The dataset class handles access to related datapoints and datasets (called sub-datasets). A dataset can manage many datapoints and many datasets but has only one parent: either another dataset or the database.

Finally, a datapoint is a class that models a single unit of data. It has no children and is the child of a single dataset, although it will be available to all children of its parent.

An example of this model is found in the prologue of Hierarchical Arbitrary Data Storage (HARDS): numerical optimisation of magnetic coils. The top-level datasets here are the initial Monte Carlo samples of a given version of the simulator. The sub-datasets are the results of intelligent sampling algorithms that use the available data to train decision-making models.
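The relationships described above can be sketched with three toy classes. This is a minimal in-memory illustration, not the HARDS implementation: `ToyDatabase`, `ToyDataset`, `ToyDatapoint`, the `lineage` helper, and the dataset and datapoint names are all hypothetical and chosen only to mirror the Monte Carlo example.

```python
from __future__ import annotations


class ToyDatabase:
    """Root node: holds top-level datasets and has no parent."""

    def __init__(self) -> None:
        self.parent = None
        self.datasets: dict[str, ToyDataset] = {}


class ToyDataset:
    """Holds datapoints and sub-datasets; has exactly one parent."""

    def __init__(self, name: str, parent: ToyDatabase | ToyDataset) -> None:
        self.name = name
        self.parent = parent
        self.datasets: dict[str, ToyDataset] = {}
        self.datapoints: dict[str, ToyDatapoint] = {}
        parent.datasets[name] = self  # register with the single parent


class ToyDatapoint:
    """Leaf node: a single unit of data owned by one dataset."""

    def __init__(self, name: str, parent: ToyDataset) -> None:
        self.name = name
        self.parent = parent
        parent.datapoints[name] = self


def lineage(node: ToyDataset | ToyDatapoint) -> list[str]:
    """Names from the given node up to (but excluding) the database."""
    names = []
    while not isinstance(node, ToyDatabase):
        names.append(node.name)
        node = node.parent
    return names


database = ToyDatabase()
monte_carlo = ToyDataset("monte-carlo", parent=database)
sampler = ToyDataset("intelligent-sampling", parent=monte_carlo)
point = ToyDatapoint("sample-0", parent=sampler)

print(lineage(point))  # ['sample-0', 'intelligent-sampling', 'monte-carlo']
```

Every node stores a single `parent` reference, while only the database and datasets keep dictionaries of children, which is what gives the model its tree shape.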

Data and Files

Datasets and datapoints can have data and files attached to them.

Data is JSON serialisable and will often be a key-value store (dictionary). Data attached to a dataset should be considered metadata that is relevant to all of its datapoints; however, this is merely a suggestion. Implementations of HARDS may place additional constraints on the type of data that is legal. For example, they may enforce a key-value store with no nested structures.
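As an illustration of these two levels of constraint, the sketch below checks data with the standard `json` module. Both helper functions are hypothetical and not part of HARDS; the second shows one possible way an implementation might enforce a flat key-value store.

```python
import json
from typing import Any


def is_json_serialisable(data: Any) -> bool:
    """Return True if the standard json module can encode the data."""
    try:
        json.dumps(data)
    except (TypeError, ValueError):
        return False
    return True


def is_flat_key_value_store(data: Any) -> bool:
    """Stricter, hypothetical rule: a dict with string keys and no nested containers."""
    if not isinstance(data, dict):
        return False
    return all(
        isinstance(key, str) and not isinstance(value, (dict, list, tuple))
        for key, value in data.items()
    )


nested = {"coil": {"radius": 1.0}, "tags": ["baseline"]}
flat = {"radius": 1.0, "material": "copper"}

print(is_json_serialisable(nested), is_flat_key_value_store(nested))  # True False
print(is_json_serialisable(flat), is_flat_key_value_store(flat))  # True True
print(is_json_serialisable({"ids": {1, 2}}))  # False: sets are not JSON types
```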

Files can be of any format and contain any data. They are copied into the database and can only be modified by re-adding a file with the same name. Files are made accessible back to the user via a pathlib.Path object (again, this is read-only).
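The copy-then-replace behaviour can be illustrated with the standard library alone. This is a sketch under assumptions: the `add_file` function below is a hypothetical stand-in that copies into a plain directory, and a real HARDS implementation may store files quite differently.

```python
from __future__ import annotations

import shutil
import tempfile
from pathlib import Path


def add_file(store: Path, file: Path, name: str | None = None) -> Path:
    """Copy `file` into `store`, replacing any stored file of the same name."""
    if not file.is_file():
        raise FileNotFoundError(file)
    destination = store / (name or file.name)
    shutil.copy2(file, destination)
    return destination


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    store = workdir / "store"
    store.mkdir()

    source = workdir / "coil.txt"
    source.write_text("radius=1.0")
    stored = add_file(store, source)

    # Editing the original does not touch the stored copy...
    source.write_text("radius=2.0")
    before = stored.read_text()

    # ...until the file is re-added under the same name.
    add_file(store, source)
    after = stored.read_text()

print(before, after)  # radius=1.0 radius=2.0
```

Because the stored copy is independent of the original, the only way to change it in this sketch is to add the file again, which matches the read-only treatment described above.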

Parallel Workflows

The HARDS API makes no prescriptions for handling concurrent or parallel workflows safely. Generally, a parallel or concurrent workflow should not rely on changes made in another. Some implementations may not immediately commit changes to data, or new files, to the database, which could lead to one thread using outdated data or files. Similarly, the creation of a dataset or datapoint may not be immediately actioned within the database, especially in non-local implementations.

Therefore, it is good practice to create the dataset in a single thread before spawning parallel processes. Each of these parallel processes can be responsible for creating its own datapoints and updating their data and files. When the parallel processes are finished, analysis of the data can then occur on a single thread because all of the data should have been safely stored.
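The pattern recommended above might look like the following sketch. It uses a temporary directory as a stand-in for a database and threads as a stand-in for parallel processes; `run_sample`, the directory layout, and the `data.json` file name are assumptions made for illustration and are not part of HARDS.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_sample(dataset_dir: Path, index: int) -> None:
    """Worker: create one datapoint and write only to that datapoint."""
    datapoint_dir = dataset_dir / f"sample-{index}"
    datapoint_dir.mkdir()
    data = {"index": index, "value": index**2}
    (datapoint_dir / "data.json").write_text(json.dumps(data))


with tempfile.TemporaryDirectory() as tmp:
    # 1. Create the dataset once, before any workers start.
    dataset_dir = Path(tmp) / "monte-carlo"
    dataset_dir.mkdir()

    # 2. Each worker creates and fills its own datapoint; none reads another's.
    with ThreadPoolExecutor(max_workers=4) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(lambda i: run_sample(dataset_dir, i), range(8)))

    # 3. Analyse on a single thread once all workers have finished.
    values = sorted(
        json.loads((d / "data.json").read_text())["value"]
        for d in dataset_dir.iterdir()
    )

print(values)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

No worker depends on state written by another, so the result does not depend on the order in which the workers run.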

API Documentation

The abstract API for managing hierarchical datasets in Python.

class hards.api.AbstractDatabase

Bases: _TreeNode, _AbstractHasChildrenDatasetsMixin

Abstract base class for a Database.

A database defines the root node of the hierarchical data management tree.

abstract property children: Sequence[str]

The names of the object’s current children (datasets).

abstract create_dataset(name: str) AbstractDataset

Create a dataset as a child of this object.

Parameters:

name (str) – The name of the new dataset.

Returns:

The new dataset object.

Return type:

AbstractDataset

Raises:
  • AlreadyExistsError – If a dataset with the same name already exists.

  • InvalidNameError – If the dataset name contains invalid characters.

database() AbstractDatabase

Return the database (this object).

fullname() str

Return full name of the object.

This is the name that, when passed to a recursive get method on the database, returns a new instantiation of this object.

abstract get_dataset(name: str) AbstractDataset

Return a dataset that is a child of this object.

Parameters:

name (str) – The name of the dataset to get

Returns:

The dataset with the given name.

Return type:

AbstractDataset

Raises:

DoesNotExistError – If a dataset with the given name does not exist.

abstract has_dataset(name: str) bool

Indicate if the object has a child dataset with the given name.

Parameters:

name (str) – The name of the dataset to check exists.

Returns:

True if dataset exists, else False.

Return type:

bool

property is_database: bool

Returns true because this object does represent a database.

abstract property name: str

The object’s name.

property parent: _TreeNode | None

The object’s parent.

path_to_database() list[str]

Return the names of this object and the intermediates to the database.

recursively_get_datapoint(name: str) AbstractDatapoint

Recursively follow a tree of datasets and return a datapoint.

Parameters:

name (str) – The names of the datasets to follow and the datapoint to return, in the form <intermediate dataset>/<intermediate dataset>/<…>/<datapoint>.

Returns:

The datapoint object.

Return type:

AbstractDatapoint

Raises:

DoesNotExistError – If any of the intermediate datasets or the datapoint does not exist.

recursively_get_dataset(name: str) AbstractDataset

Recursively follow a tree of datasets and return the final dataset.

Parameters:

name (str) – The names of the datasets to follow and the dataset to return, in the form <intermediate dataset>/<intermediate dataset>/<…>/<dataset of interest>.

Returns:

The dataset object.

Return type:

AbstractDataset

Raises:

DoesNotExistError – If any of the intermediate datasets do not exist.
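The slash-separated lookup used by the recursive get methods can be sketched over a plain nested dictionary. This is an illustration only: the `recursively_get` function and the nested-dict tree are hypothetical, and the `DoesNotExistError` defined here is a local stand-in for the error of the same name in the documentation.

```python
class DoesNotExistError(Exception):
    """Local stand-in for the error named in the API documentation."""


def recursively_get(tree: dict, name: str) -> dict:
    """Follow a '<dataset>/<dataset>/<...>' name through a nested dict."""
    node = tree
    for part in name.split("/"):
        if part not in node:
            raise DoesNotExistError(f"{part!r} not found while resolving {name!r}")
        node = node[part]
    return node


# Hypothetical tree: each key is a dataset name, each value holds its children.
tree = {"monte-carlo": {"intelligent-sampling": {"round-2": {}}}}

found = recursively_get(tree, "monte-carlo/intelligent-sampling")
print(found)  # {'round-2': {}}

try:
    recursively_get(tree, "monte-carlo/missing/round-2")
except DoesNotExistError as error:
    print(error)  # 'missing' not found while resolving 'monte-carlo/missing/round-2'
```

The lookup fails at the first missing intermediate rather than at the end of the path, which is consistent with the documented behaviour of raising when any intermediate dataset does not exist.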

class hards.api.AbstractDatapoint

Bases: _TreeNode, _AbstractHasDataAndFilesMixin

Abstract base class for a Datapoint.

abstract add_data(new_data: dict[str, Any]) None

Add new data to the object.

Adding new data does not remove old data unless a key already exists, in which case the old data of that key is overwritten by the newer data.

Parameters:

new_data (dict[str, Any]) – New data, in key-value form, to be added to this object’s data store.
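The overwrite rule described for add_data behaves like a dictionary update. The sketch below shows those semantics on a plain dict; `merge_data` is a hypothetical helper, and how an implementation actually persists the data is not specified here.

```python
from typing import Any


def merge_data(stored: dict[str, Any], new_data: dict[str, Any]) -> dict[str, Any]:
    """Return stored data with new keys added and existing keys overwritten."""
    merged = dict(stored)  # copy so the caller's dict is left untouched
    merged.update(new_data)
    return merged


stored = {"current": 1.5, "turns": 40}
merged = merge_data(stored, {"turns": 42, "material": "copper"})

print(merged)  # {'current': 1.5, 'turns': 42, 'material': 'copper'}
```

The key `current` survives because it is absent from the new data, while `turns` takes the newer value.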

abstract add_file(file: Path, *, name: str | None = None) None

Add a file to be managed by this object.

Files are copied into the database and should be treated therein as read-only. They can be ‘modified’ by re-adding an updated file with the same name.

Parameters:
  • file (Path) – A path to the file to add to this object.

  • name (str | None, optional) – Give the file a new name (include the extension).

Raises:
  • DoesNotExistError – If the file does not exist or is not a file (e.g. it is a directory).

  • InvalidNameError – If the filename contains invalid characters. The filename may not be checked if a name is not explicitly provided.

database() _TreeNode

Recursively find the database.

abstract property files: Sequence[str]

The list of file names (including extensions).

fullname() str

Return full name of the object.

This is the name that, when passed to a recursive get method on the database, returns a new instantiation of this object.

abstract get_file(name: str) Path

Return the path to the file with the given name.

Parameters:

name (str) – The name of the file (including its extension)

Returns:

The path to the file (read-only).

Return type:

Path

Raises:

DoesNotExistError – If a file with the given name does not exist.

abstract has_file(name: str) bool

Return a bool indicating whether the file with the given name exists.

The name must include the file extension.

property is_database: bool

True if the object is the database (root node).

False by default.

abstract property name: str

The object’s name.

abstract property parent: _TreeNode | None

The object’s parent.

path_to_database() list[str]

Return the names of this object and the intermediates to the database.

class hards.api.AbstractDataset

Bases: _TreeNode, _AbstractHasChildrenDatasetsMixin, _AbstractHasDataAndFilesMixin

Abstract base class for a Dataset.

A dataset contains (meta)data and several datapoints that form the dataset.

Datasets have a parent from which they inherit additional datapoints (except when the parent is the Database). It follows that Datasets can have many children which share their data.

abstract add_data(new_data: dict[str, Any]) None

Add new data to the object.

Adding new data does not remove old data unless a key already exists, in which case the old data of that key is overwritten by the newer data.

Parameters:

new_data (dict[str, Any]) – New data, in key-value form, to be added to this object’s data store.

abstract add_file(file: Path, *, name: str | None = None) None

Add a file to be managed by this object.

Files are copied into the database and should be treated therein as read-only. They can be ‘modified’ by re-adding an updated file with the same name.

Parameters:
  • file (Path) – A path to the file to add to this object.

  • name (str | None, optional) – Give the file a new name (include the extension).

Raises:
  • DoesNotExistError – If the file does not exist or is not a file (e.g. it is a directory).

  • InvalidNameError – If the filename contains invalid characters. The filename may not be checked if a name is not explicitly provided.

abstract property children: Sequence[str]

The names of the object’s current children (datasets).

abstract create_datapoint(name: str) AbstractDatapoint

Create and return a datapoint with a given name.

Parameters:

name (str) – The name of the new datapoint.

Returns:

The new object.

Return type:

AbstractDatapoint

Raises:
  • AlreadyExistsError – If a datapoint with the same name exists.

  • InvalidNameError – If the datapoint name contains invalid characters.

abstract create_dataset(name: str) AbstractDataset

Create a dataset as a child of this object.

Parameters:

name (str) – The name of the new dataset.

Returns:

The new dataset object.

Return type:

AbstractDataset

Raises:
  • AlreadyExistsError – If a dataset with the same name already exists.

  • InvalidNameError – If the dataset name contains invalid characters.

database() _TreeNode

Recursively find the database.

abstract property datapoints: Sequence[str]

The names of the Dataset’s current datapoints.

abstract property files: Sequence[str]

The list of file names (including extensions).

fullname() str

Return full name of the object.

This is the name that, when passed to a recursive get method on the database, returns a new instantiation of this object.

abstract get_datapoint(name: str) AbstractDatapoint

Get the datapoint with a given name.

Parameters:

name (str) – The name of the datapoint.

Returns:

The datapoint object.

Return type:

AbstractDatapoint

Raises:

DoesNotExistError – If the datapoint does not exist.

abstract get_dataset(name: str) AbstractDataset

Return a dataset that is a child of this object.

Parameters:

name (str) – The name of the dataset to get

Returns:

The dataset with the given name.

Return type:

AbstractDataset

Raises:

DoesNotExistError – If a dataset with the given name does not exist.

abstract get_file(name: str) Path

Return the path to the file with the given name.

Parameters:

name (str) – The name of the file (including its extension)

Returns:

The path to the file (read-only).

Return type:

Path

Raises:

DoesNotExistError – If a file with the given name does not exist.

abstract has_datapoint(name: str) bool

Indicate if the object has a datapoint with the given name.

abstract has_dataset(name: str) bool

Indicate if the object has a child dataset with the given name.

Parameters:

name (str) – The name of the dataset to check exists.

Returns:

True if dataset exists, else False.

Return type:

bool

abstract has_file(name: str) bool

Return a bool indicating whether the file with the given name exists.

The name must include the file extension.

property is_database: bool

True if the object is the database (root node).

False by default.

abstract property name: str

The object’s name.

abstract property parent: _TreeNode | None

The object’s parent.

path_to_database() list[str]

Return the names of this object and the intermediates to the database.

recursively_get_datapoint(name: str) AbstractDatapoint

Recursively follow a tree of datasets and return a datapoint.

Parameters:

name (str) – The names of the datasets to follow and the datapoint to return, in the form <intermediate dataset>/<intermediate dataset>/<…>/<datapoint>.

Returns:

The datapoint object.

Return type:

AbstractDatapoint

Raises:

DoesNotExistError – If any of the intermediate datasets or the datapoint does not exist.

recursively_get_datapoints(*, reconstruct: bool = True, parents: bool = True) list[AbstractDatapoint]

Get all datapoints of this dataset and its parents (iff parents is True).

Follows the parents until the database, collecting their datapoints.

Parameters:
  • reconstruct (bool) – If True, reconstruct the entire tree above this object and re-call this method on a new instance of this dataset. This mitigates the situation where this dataset’s parent has been modified in another instance. It makes this method safe in single-threaded synchronous applications but makes no guarantees about parallel or asynchronous applications.

  • parents (bool) – If False, only the datapoints for this dataset are returned, i.e. this method acts as a safe way to get the instantiated datapoints of this dataset only.

Notes

Reconstruction does not guarantee the safety of this method. See the relevant documentation sections for considerations.
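The effect of the parents flag can be sketched with a toy dataset class that walks up its parent chain. This is an assumption-laden illustration: `ToyDataset` and `collect_datapoints` are hypothetical, the ordering of the returned datapoints (own first, then ancestors) is a choice made for the sketch, and the reconstruct behaviour is not modelled at all.

```python
from __future__ import annotations


class ToyDataset:
    """Toy dataset holding datapoint names and a link to its parent."""

    def __init__(self, name: str, parent: ToyDataset | None = None) -> None:
        self.name = name
        self.parent = parent  # None stands in for "the parent is the database"
        self.datapoints: list[str] = []

    def collect_datapoints(self, *, parents: bool = True) -> list[str]:
        """Own datapoints, followed by those of each ancestor if requested."""
        collected = list(self.datapoints)
        node = self.parent
        while parents and node is not None:
            collected.extend(node.datapoints)
            node = node.parent
        return collected


monte_carlo = ToyDataset("monte-carlo")
monte_carlo.datapoints = ["mc-0", "mc-1"]

sampler = ToyDataset("intelligent-sampling", parent=monte_carlo)
sampler.datapoints = ["is-0"]

print(sampler.collect_datapoints())  # ['is-0', 'mc-0', 'mc-1']
print(sampler.collect_datapoints(parents=False))  # ['is-0']
```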

recursively_get_dataset(name: str) AbstractDataset

Recursively follow a tree of datasets and return the final dataset.

Parameters:

name (str) – The names of the datasets to follow and the dataset to return, in the form <intermediate dataset>/<intermediate dataset>/<…>/<dataset of interest>.

Returns:

The dataset object.

Return type:

AbstractDataset

Raises:

DoesNotExistError – If any of the intermediate datasets do not exist.