Filesystem Implementation Introduction

This notebook aims to provide an overview of the functionality of hards.

Throughout the notebook, we will use two temporary directories called project_dir and database_dir. Both can be thought of as arbitrary directories on the filesystem. The former might be where some software runs a simulation that produces data files; we wish to store that data in the latter so we can perform aggregate analysis at a later date.

[1]:
from pathlib import Path
from tempfile import TemporaryDirectory

temp_dir = TemporaryDirectory()

project_dir = Path(temp_dir.name) / "project_dir"
database_dir = Path(temp_dir.name) / "database_dir"

project_dir.mkdir()
database_dir.mkdir()

First, we need to create a database since one does not already exist.

[2]:
from hards.filesystem import FilesystemDatabase

database = FilesystemDatabase.create_database(database_dir / "database")

Datasets

A dataset has a name which must be unique at this level (i.e. the database cannot have another dataset named dataset_1). The dataset name (and the name of every hards object) should only contain ASCII letters and digits, full stops (periods), hyphens, and underscores.

[3]:
dataset_1 = database.create_dataset("dataset_1")
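The naming rule above can be expressed as a simple regular-expression check. The helper below is a hypothetical illustration of the rule, not part of hards:

```python
import re

# Hypothetical validator mirroring the documented naming rule:
# ASCII letters, digits, full stops, hyphens, and underscores only.
NAME_PATTERN = re.compile(r"[A-Za-z0-9._-]+")

def is_valid_name(name: str) -> bool:
    """Return True if name satisfies the naming rule described above."""
    return NAME_PATTERN.fullmatch(name) is not None

print(is_valid_name("dataset_1"))  # True
print(is_valid_name("data set"))   # False (contains a space)
```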

We can assign data to the dataset using a dictionary with string keys and JSON-serializable values. Note that data on a dataset should be thought of as metadata; the datapoints should contain your actual data (e.g. one datapoint per run of a simulation).

[4]:
dataset_1.add_data({
    "version": 1,
    "cost": 123.78,
})

# Data can be added multiple times
dataset_1.add_data({
    "name": "database1",
    "owners": ["you", 1234],
})

# The data attribute reflects the data from both calls
print(dataset_1.data)
{'version': 1, 'cost': 123.78, 'name': 'database1', 'owners': ['you', 1234]}
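Since values must be JSON-serializable, a payload can be checked ahead of time with the standard json module. This generic helper is an illustration and not part of hards:

```python
import json

def is_json_serializable(value) -> bool:
    """Return True if value can be encoded by json.dumps."""
    try:
        json.dumps(value)
        return True
    except TypeError:
        return False

print(is_json_serializable({"version": 1, "owners": ["you", 1234]}))  # True
print(is_json_serializable({"created": object()}))                    # False
```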

If you attempt to add data with an existing key, the old data is overwritten.

[5]:
dataset_1.add_data({"version": 2})

print(dataset_1.data["version"])
2
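Both the merge and the overwrite behaviour shown above mirror Python's built-in dict.update. A minimal stand-in without hards:

```python
# Plain dict mimicking the add_data merge/overwrite semantics.
metadata: dict = {}

metadata.update({"version": 1, "cost": 123.78})  # first call: new keys added
metadata.update({"name": "database1"})           # second call: merged in
metadata.update({"version": 2})                  # existing key overwritten

print(metadata["version"])  # 2
```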

You can also attach files to a dataset.

[6]:
my_file = project_dir / "my_file.txt"
with my_file.open("w") as f:
    f.write("my important data!")

dataset_1.add_file(my_file)

Files are copied, so the original is preserved and a new file exists within the dataset's structure.

[7]:
print(f"Does my_file.txt exist? {my_file.exists()}")

print(f"{dataset_1.name} has the following files: {dataset_1.files}")
Does my_file.txt exist? True
dataset_1 has the following files: ['my_file.txt']
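The copy-on-add behaviour can be reproduced with the standard library. The following is a sketch of the same semantics using shutil, not a view of hards internals:

```python
import shutil
from pathlib import Path
from tempfile import TemporaryDirectory

with TemporaryDirectory() as tmp:
    original = Path(tmp) / "my_file.txt"
    original.write_text("my important data!")

    storage = Path(tmp) / "dataset_storage"
    storage.mkdir()

    # shutil.copy2 copies the file (and its metadata); the original stays put.
    copied = Path(shutil.copy2(original, storage / original.name))

    original_still_exists = original.exists()
    copied_contents = copied.read_text()

print(original_still_exists)  # True
print(copied_contents)        # my important data!
```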

We can access this file from the dataset and read it like a regular file. It is not advised to modify the files inside the dataset structure. Instead, modify the original file and re-add it under the same name (the stored copy will be overwritten with the new version).

[8]:
my_file_in_dataset = dataset_1.get_file("my_file.txt")

with my_file_in_dataset.open() as f:
    print(f.read())
my important data!

Datapoints

A dataset can contain many datapoints. Similar to the dataset above, a datapoint has its own data and can manage files.

NOTE: a file can be renamed using the name keyword.

[9]:
datapoint = dataset_1.create_datapoint("datapoint_1")

datapoint.add_data({"input_1": 1.0, "input_2": 12.2})

my_new_file = project_dir / "data_point_file.txt"
with my_new_file.open("w") as f:
    f.write("data on the datapoint!")

datapoint.add_file(my_new_file, name="alternative_name.txt")
[10]:
print(datapoint.data)

with datapoint.get_file("alternative_name.txt").open() as f:
    print(f.read())
{'input_1': 1.0, 'input_2': 12.2}
data on the datapoint!