{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Filesystem Implementation Introduction\n", "\n", "This notebook provides an overview of the functionality of `hards`. \n", "\n", "Throughout the notebook, we will use two temporary directories called `project_dir` and `database_dir`. Both can be thought of as arbitrary directories on the filesystem. The former might be where some software runs a simulation that produces data files; the latter is where we wish to store that data so that we can perform aggregate analysis at a later date. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "from tempfile import TemporaryDirectory\n", "\n", "# Both example directories live inside a single temporary directory\n", "temp_dir = TemporaryDirectory()\n", "\n", "project_dir = Path(temp_dir.name) / \"project_dir\"\n", "database_dir = Path(temp_dir.name) / \"database_dir\"\n", "\n", "project_dir.mkdir()\n", "database_dir.mkdir()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we need to create a database since one does not already exist." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from hards.filesystem import FilesystemDatabase\n", "\n", "database = FilesystemDatabase.create_database(database_dir / \"database\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Datasets\n", "\n", "A dataset has a name that must be unique at this level (i.e. the `database` cannot contain two datasets named `dataset_1`). The dataset name (and the names of all `hards` objects) should only contain ASCII letters and digits, full stops (periods), hyphens, and underscores.
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset_1 = database.create_dataset(\"dataset_1\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can assign data to the dataset using a dictionary with a string key and JSON-serializable values. Note that data on a dataset should be thought of as _metadata_, the datapoints should contain your actual data (e.g. one datapoint per run of a simulation)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset_1.add_data({\n", " \"version\": 1,\n", " \"cost\": 123.78,\n", "})\n", "\n", "# Data can be added multiple times\n", "dataset_1.add_data({\n", " \"name\": \"database1\",\n", " \"owners\": [\"you\", 1234],\n", "})\n", "\n", "# The data attribute reflects the data from both calls\n", "print(dataset_1.data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you attempt to add data with an existing key, the old data is overwritten." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dataset_1.add_data({\"version\": 2})\n", "\n", "print(dataset_1.data[\"version\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also attach files to a dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_file = project_dir / \"my_file.txt\"\n", "with my_file.open(\"w\") as f:\n", " f.write(\"my important data!\")\n", "\n", "dataset_1.add_file(my_file)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Files are copied, so the original is preserved and a new file exists within the datasets structure." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"Does my_file.txt exist? 
{my_file.exists()}\")\n", "\n", "print(f\"{dataset_1.name} has the following files: {dataset_1.files}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can access this file from the dataset and read it like a regular file. Modifying files within the dataset structure directly is not advised. Instead, modify the original file and re-add it under the same name (the file will be overwritten with the new version)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_file_in_dataset = dataset_1.get_file(\"my_file.txt\")\n", "\n", "with my_file_in_dataset.open() as f:\n", "    print(f.read())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Datapoints\n", "\n", "A dataset can contain many datapoints. Like the dataset above, a datapoint has its own data and can manage files. \n", "\n", "_**NOTE:**_ a file can be stored under a different name by passing the `name` keyword to `add_file`. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "datapoint = dataset_1.create_datapoint(\"dataset\")\n", "\n", "datapoint.add_data({\"input_1\": 1.0, \"input_2\": 12.2})\n", "\n", "my_new_file = project_dir / \"data_point_file.txt\"\n", "with my_new_file.open(\"w\") as f:\n", "    f.write(\"data on the datapoint!\")\n", "\n", "datapoint.add_file(my_new_file, name=\"alternative_name.txt\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(datapoint.data)\n", "\n", "with datapoint.get_file(\"alternative_name.txt\").open() as f:\n", "    print(f.read())" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 2 }