Sub-Datasets in the Filesystem Implementation

A sub-dataset is a dataset that is created under another dataset rather than directly under the database. This allows data to be organised and shared in a hierarchical structure.

[1]:
from pathlib import Path
from tempfile import TemporaryDirectory

from hards.filesystem import FilesystemDatabase

temp_dir = TemporaryDirectory()

database = FilesystemDatabase.create_database(Path(temp_dir.name) / "database")
dataset = database.create_dataset("dataset")

One can create sub-datasets of sub-datasets ad infinitum (subject to the limits of the filesystem).

For example, we can create the following structure:

.
└── database
    ├── dataset
    │   └── sub_dataset
    │       └── sub_dataset
    │           └── sub_dataset
    └── <other datasets>

NOTE: there is no functional difference between a dataset and a sub-dataset; the distinction is made only for clarity in this example.
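
The tree above suggests that each dataset corresponds to a directory nested inside its parent's directory. As a rough illustration of why repeated names do not clash, the sketch below builds the same layout with plain `pathlib` directories. It does not use `hards` at all: the one-directory-per-dataset mapping is an assumption drawn from the diagram, and `nest` is a hypothetical helper.

```python
from pathlib import Path
from tempfile import TemporaryDirectory


def nest(root: Path, names: list[str]) -> Path:
    """Create one directory per name, each inside the previous one (hypothetical helper)."""
    current = root
    for name in names:
        current = current / name
        current.mkdir()
    return current


with TemporaryDirectory() as tmp:
    database_dir = Path(tmp) / "database"
    database_dir.mkdir()

    deepest = nest(
        database_dir, ["dataset", "sub_dataset", "sub_dataset", "sub_dataset"]
    )

    # The same name is used three times, but each directory has a different
    # parent, so all three paths are distinct.
    relative_path = deepest.relative_to(database_dir).as_posix()
    print(relative_path)
```

Creating a second `sub_dataset` directory at the same level would raise `FileExistsError` in this sketch, which is the filesystem analogue of a name clash between siblings.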

[2]:
sub_dataset = dataset.create_dataset("sub_dataset")

# NOTE the names can be the same because they are not on the same level!
sub_sub_dataset = sub_dataset.create_dataset("sub_dataset")
sub_sub_sub_dataset = sub_sub_dataset.create_dataset("sub_dataset")

The rules of data sharing are very simple: a dataset has access to its own direct data and to the direct data of each of its ancestors (its parent, its parent's parent, and so on).

For example, adding a datapoint to the dataset will make it available to all of its sub-datasets.

[3]:
dataset.create_datapoint("my_datapoint")

print(f"Dataset number of datapoints: {len(dataset.recursively_get_datapoints())}")
Dataset number of datapoints: 1
[4]:
print(
    f"Sub-dataset number of datapoints: {len(sub_dataset.recursively_get_datapoints())}"
)
print(
    "Sub-sub-dataset number of datapoints: "
    f"{len(sub_sub_dataset.recursively_get_datapoints())}"
)
print(
    "Sub-sub-sub-dataset number of datapoints: "
    f"{len(sub_sub_sub_dataset.recursively_get_datapoints())}"
)
Sub-dataset number of datapoints: 1
Sub-sub-dataset number of datapoints: 1
Sub-sub-sub-dataset number of datapoints: 1
[5]:
# NOTE that the .datapoints property only provides access to direct datapoints
# of a dataset. E.g.
print(f"Dataset direct datapoints: {dataset.datapoints}")
print(f"Sub-dataset direct datapoints: {sub_dataset.datapoints}")
Dataset direct datapoints: ['my_datapoint']
Sub-dataset direct datapoints: []

Conversely, if we add a datapoint to the sub-dataset, it will not be available to its parent dataset, but it will be available to the sub-datasets nested beneath it.

[6]:
sub_dataset.create_datapoint("sub_datapoint")

print(f"Dataset number of datapoints: {len(dataset.recursively_get_datapoints())}")
print(
    f"Sub-dataset number of datapoints: {len(sub_dataset.recursively_get_datapoints())}"
)
print(
    "Sub-sub-dataset number of datapoints: "
    f"{len(sub_sub_dataset.recursively_get_datapoints())}"
)
Dataset number of datapoints: 1
Sub-dataset number of datapoints: 2
Sub-sub-dataset number of datapoints: 2
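
The sharing rule can be modelled in a few lines of plain Python. The sketch below is a toy, in-memory stand-in: `ToyDataset` is a hypothetical class whose method names merely mirror the ones used in this notebook, and it makes no claim about how `hards` stores data, resolves parents, or orders the results.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyDataset:
    """Hypothetical in-memory model of a dataset (not the hards class)."""

    name: str
    parent: Optional["ToyDataset"] = None
    datapoints: list = field(default_factory=list)  # direct datapoints only

    def create_dataset(self, name: str) -> "ToyDataset":
        # A child keeps a reference to the dataset that created it.
        return ToyDataset(name, parent=self)

    def create_datapoint(self, name: str) -> None:
        self.datapoints.append(name)

    def recursively_get_datapoints(self) -> list:
        # Start with this dataset's own datapoints, then walk up the chain
        # of ancestors. Children are never visited.
        collected = list(self.datapoints)
        ancestor = self.parent
        while ancestor is not None:
            collected.extend(ancestor.datapoints)
            ancestor = ancestor.parent
        return collected


dataset = ToyDataset("dataset")
sub_dataset = dataset.create_dataset("sub_dataset")
sub_sub_dataset = sub_dataset.create_dataset("sub_dataset")

dataset.create_datapoint("my_datapoint")
sub_dataset.create_datapoint("sub_datapoint")

counts = [
    len(d.recursively_get_datapoints())
    for d in (dataset, sub_dataset, sub_sub_dataset)
]
print(counts)  # [1, 2, 2]
```

Because the lookup only ever moves towards the root, data flows down the hierarchy and never up, which matches the counts printed by the cells above.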