{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Sub-Datasets in the Filesystem Implementation\n", "\n", "A sub-dataset is a dataset that exists under another dataset, not the database. This allows data to be organised and shared in a _hierarchical_ structure." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "from tempfile import TemporaryDirectory\n", "\n", "from hards.filesystem import FilesystemDatabase\n", "\n", "temp_dir = TemporaryDirectory()\n", "\n", "database = FilesystemDatabase.create_database(Path(temp_dir.name) / \"database\")\n", "dataset = database.create_dataset(\"dataset\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One can create sub-datasets of sub-datasets ad infinitum (subject to the limits of the filesystem).\n", "\n", "For example, we can create the following structure:\n", "```\n", ".\n", "└── database\n", " ├── dataset\n", " │ └── sub_dataset\n", " │ └── sub_dataset\n", " │ └── sub_dataset\n", " └── \n", "```\n", "\n", "_**NOTE:**_ there is no functional difference between a dataset and sub-dataset, the distinction is only made for clarity of this example." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sub_dataset = dataset.create_dataset(\"sub_dataset\")\n", "\n", "# NOTE the names can be the same because they are not on the same level!\n", "sub_sub_dataset = sub_dataset.create_dataset(\"sub_dataset\")\n", "sub_sub_sub_dataset = sub_sub_dataset.create_dataset(\"sub_dataset\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The rules of data sharing are very simple: a dataset has access to its parent(s) direct data.\n", "\n", "For example, adding a datapoint to the dataset will make it available to all of the sub-datasets." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset.create_datapoint(\"my_datapoint\")\n", "\n", "print(f\"Dataset number of datapoints: {len(dataset.recursively_get_datapoints())}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\n", " f\"Sub-dataset number of datapoints: {len(sub_dataset.recursively_get_datapoints())}\"\n", ")\n", "print(\n", " \"Sub-sub-dataset number of datapoints: \"\n", " f\"{len(sub_sub_dataset.recursively_get_datapoints())}\"\n", ")\n", "print(\n", " \"Sub-sub-sub-dataset number of datapoints: \"\n", " f\"{len(sub_sub_sub_dataset.recursively_get_datapoints())}\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# NOTE that the .datapoints property only provides access to direct datapoints\n", "# of a dataset. E.g.\n", "print(f\"Dataset direct datapoints: {dataset.datapoints}\")\n", "print(f\"Sub-dataset direct datapoints: {sub_dataset.datapoints}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Therefore, if we add a datapoint onto the sub-dataset, it will not be available to the dataset but will be to the other sub-datasets." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sub_dataset.create_datapoint(\"sub_datapoint\")\n", "\n", "print(f\"Dataset number of datapoints: {len(dataset.recursively_get_datapoints())}\")\n", "print(\n", " f\"Sub-dataset number of datapoints: {len(sub_dataset.recursively_get_datapoints())}\"\n", ")\n", "print(\n", " \"Sub-sub-dataset number of datapoints: \"\n", " f\"{len(sub_sub_dataset.recursively_get_datapoints())}\"\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 2 }