Sub-Datasets in the Filesystem Implementation
A sub-dataset is a dataset that is created under another dataset rather than directly under the database. This allows data to be organised and shared in a hierarchical structure.
[1]:
from pathlib import Path
from tempfile import TemporaryDirectory
from hards.filesystem import FilesystemDatabase
temp_dir = TemporaryDirectory()
database = FilesystemDatabase.create_database(Path(temp_dir.name) / "database")
dataset = database.create_dataset("dataset")
One can create sub-datasets of sub-datasets ad infinitum (subject to the limits of the filesystem).
For example, we can create the following structure:
.
└── database
    ├── dataset
    │   └── sub_dataset
    │       └── sub_dataset
    │           └── sub_dataset
    └── <other datasets>
NOTE: there is no functional difference between a dataset and a sub-dataset; the distinction is made only for the clarity of this example.
[2]:
sub_dataset = dataset.create_dataset("sub_dataset")
# NOTE the names can be the same because they are not on the same level!
sub_sub_dataset = sub_dataset.create_dataset("sub_dataset")
sub_sub_sub_dataset = sub_sub_dataset.create_dataset("sub_dataset")
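The point made in the comment above, that a name can be reused at every level, can be checked in isolation with plain directories. The sketch below does not use hards at all; it assumes, purely for illustration, that each dataset corresponds to one directory, and the names `database_dir` and `deepest` are hypothetical.

```python
from pathlib import Path
from tempfile import TemporaryDirectory

# Assumption for illustration only: one directory per dataset.
# This is plain pathlib, not the hards API.
with TemporaryDirectory() as tmp:
    database_dir = Path(tmp) / "database"
    deepest = database_dir / "dataset" / "sub_dataset" / "sub_dataset" / "sub_dataset"

    # Each level has a distinct full path, so reusing the name is not a clash.
    deepest.mkdir(parents=True)

    levels = deepest.relative_to(database_dir).parts
    created = deepest.is_dir()

print(levels)
print(created)
```

Because each directory is identified by its full path, the three `sub_dataset` directories never collide; only two siblings under the same parent would.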
The rule for data sharing is simple: a dataset has access to the direct data of every dataset above it in the hierarchy (its parent, its parent's parent, and so on).
For example, adding a datapoint to the dataset will make it available to all of the sub-datasets.
[3]:
dataset.create_datapoint("my_datapoint")
print(f"Dataset number of datapoints: {len(dataset.recursively_get_datapoints())}")
Dataset number of datapoints: 1
[4]:
print(
    f"Sub-dataset number of datapoints: {len(sub_dataset.recursively_get_datapoints())}"
)
print(
    "Sub-sub-dataset number of datapoints: "
    f"{len(sub_sub_dataset.recursively_get_datapoints())}"
)
print(
    "Sub-sub-sub-dataset number of datapoints: "
    f"{len(sub_sub_sub_dataset.recursively_get_datapoints())}"
)
Sub-dataset number of datapoints: 1
Sub-sub-dataset number of datapoints: 1
Sub-sub-sub-dataset number of datapoints: 1
[5]:
# NOTE that the .datapoints property only provides access to direct datapoints
# of a dataset. E.g.
print(f"Dataset direct datapoints: {dataset.datapoints}")
print(f"Sub-dataset direct datapoints: {sub_dataset.datapoints}")
Dataset direct datapoints: ['my_datapoint']
Sub-dataset direct datapoints: []
Therefore, if we add a datapoint to the sub-dataset, it will not be available to the parent dataset, but it will be available to the sub-datasets nested beneath it.
[6]:
sub_dataset.create_datapoint("sub_datapoint")
print(f"Dataset number of datapoints: {len(dataset.recursively_get_datapoints())}")
print(
    f"Sub-dataset number of datapoints: {len(sub_dataset.recursively_get_datapoints())}"
)
print(
    "Sub-sub-dataset number of datapoints: "
    f"{len(sub_sub_dataset.recursively_get_datapoints())}"
)
Dataset number of datapoints: 1
Sub-dataset number of datapoints: 2
Sub-sub-dataset number of datapoints: 2
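The sharing behaviour seen in the outputs above can be modelled with a small toy class. This is only a sketch of the rule, not how hards implements it: `ToyDataset` is a hypothetical stand-in that keeps datapoints in a list, and the return type of its `recursively_get_datapoints` is an assumption. It shows data flowing down the hierarchy (child sees parent) but never up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyDataset:
    """Hypothetical in-memory stand-in for a dataset; not part of hards."""

    name: str
    parent: Optional["ToyDataset"] = None
    datapoints: List[str] = field(default_factory=list)

    def create_dataset(self, name: str) -> "ToyDataset":
        # A sub-dataset is an ordinary dataset that remembers its parent.
        return ToyDataset(name, parent=self)

    def create_datapoint(self, name: str) -> None:
        # Datapoints are stored only on the dataset they were created on.
        self.datapoints.append(name)

    def recursively_get_datapoints(self) -> List[str]:
        # Collect the ancestors' datapoints first, then add the direct ones.
        inherited = self.parent.recursively_get_datapoints() if self.parent else []
        return inherited + self.datapoints


top = ToyDataset("dataset")
sub = top.create_dataset("sub_dataset")
sub_sub = sub.create_dataset("sub_dataset")

top.create_datapoint("my_datapoint")
sub.create_datapoint("sub_datapoint")

counts = [len(d.recursively_get_datapoints()) for d in (top, sub, sub_sub)]
print(counts)  # [1, 2, 2]
```

The lookup only ever follows the `parent` link upwards, which is why a datapoint added to a sub-dataset is never visible from the dataset above it.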