Filesystem Implementation Introduction¶
This notebook provides an overview of the functionality of hards, using the filesystem implementation.
Throughout the notebook, we will use two temporary directories called project_dir and database_dir. Both can be thought of as arbitrary directories on the filesystem: the former might be where some software runs a simulation that produces data files, and the latter is where we store that data so we can perform aggregate analysis at a later date.
[1]:
from pathlib import Path
from tempfile import TemporaryDirectory
temp_dir = TemporaryDirectory()
project_dir = Path(temp_dir.name) / "project_dir"
database_dir = Path(temp_dir.name) / "database_dir"
project_dir.mkdir()
database_dir.mkdir()
First, we need to create a database since one does not already exist.
[2]:
from hards.filesystem import FilesystemDatabase
database = FilesystemDatabase.create_database(database_dir / "database")
Datasets¶
A dataset has a name which must be unique at this level (i.e. the database cannot contain another dataset named dataset_1). The dataset name (and the names of all hards objects) should only contain ASCII letters and digits, full stops (periods), hyphens, and underscores.
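Before creating anything, it can be handy to check a proposed name against these rules. The following is a minimal sketch using a regular expression that mirrors the character set described above; this check is not part of hards itself.

import re

# Mirrors the allowed characters listed above: ASCII letters, digits, ".", "-" and "_"
VALID_NAME = re.compile(r"^[A-Za-z0-9._-]+$")
print(bool(VALID_NAME.match("dataset_1")))  # True
print(bool(VALID_NAME.match("bad name!")))  # False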
[3]:
dataset_1 = database.create_dataset("dataset_1")
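As a quick illustration of the uniqueness rule above, attempting to create a second dataset with the same name should be rejected. The exact exception type raised by hards is not shown in this notebook, so catching a broad Exception below is an assumption made for this sketch.

# Sketch: a duplicate dataset name is assumed to be rejected (exact exception type unknown)
try:
    database.create_dataset("dataset_1")
except Exception as err:
    print(f"Could not create a duplicate dataset: {err}")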
We can assign data to the dataset using a dictionary with string keys and JSON-serializable values. Note that data on a dataset should be thought of as metadata; the datapoints should contain your actual data (e.g. one datapoint per run of a simulation).
[4]:
dataset_1.add_data({
    "version": 1,
    "cost": 123.78,
})
# Data can be added multiple times
dataset_1.add_data({
    "name": "database1",
    "owners": ["you", 1234],
})
# The data attribute reflects the data from both calls
print(dataset_1.data)
{'version': 1, 'cost': 123.78, 'name': 'database1', 'owners': ['you', 1234]}
If you attempt to add data with an existing key, the old data is overwritten.
[5]:
dataset_1.add_data({"version": 2})
print(dataset_1.data["version"])
2
You can also attach files to a dataset.
[6]:
my_file = project_dir / "my_file.txt"
with my_file.open("w") as f:
    f.write("my important data!")
dataset_1.add_file(my_file)
Files are copied, so the original is preserved and a new copy exists within the dataset's structure.
[7]:
print(f"Does my_file.txt exist? {my_file.exists()}")
print(f"{dataset_1.name} has the following files: {dataset_1.files}")
Does my_file.txt exist? True
dataset_1 has the following files: ['my_file.txt']
We can access this file from the dataset and read it like a regular file. It is not advised to modify files inside the dataset structure directly. Instead, modify the original file and re-add it under the same name (the stored file will be overwritten with the new version).
[8]:
my_file_in_dataset = dataset_1.get_file("my_file.txt")
with my_file_in_dataset.open() as f:
    print(f.read())
my important data!
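Following that advice, updating a stored file is just a matter of modifying the original and re-adding it under the same name. A small sketch using only the calls shown above:

# Modify the original file, then re-add it to overwrite the stored copy
with my_file.open("w") as f:
    f.write("my important data, revised!")
dataset_1.add_file(my_file)
with dataset_1.get_file("my_file.txt").open() as f:
    print(f.read())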
Datapoints¶
A dataset can contain many datapoints. Similar to the dataset above, a datapoint has its own data and can manage files.
NOTE: a file can be stored under a different name by passing the name keyword to add_file.
[9]:
datapoint = dataset_1.create_datapoint("datapoint_1")
datapoint.add_data({"input_1": 1.0, "input_2": 12.2})
my_new_file = project_dir / "data_point_file.txt"
with my_new_file.open("w") as f:
    f.write("data on the datapoint!")
datapoint.add_file(my_new_file, name="alternative_name.txt")
[10]:
print(datapoint.data)
with datapoint.get_file("alternative_name.txt").open() as f:
    print(f.read())
{'input_1': 1.0, 'input_2': 12.2}
data on the datapoint!
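To tie this back to the workflow from the introduction, a dataset would typically hold one datapoint per run of the simulation. The following sketch uses only the calls shown above; the run names, inputs, and results are made-up values for illustration.

# Sketch: one datapoint per simulated run (values are illustrative only)
simulated_runs = [
    {"input_1": 0.5, "result": 3.2},
    {"input_1": 1.5, "result": 9.7},
]
for index, run_data in enumerate(simulated_runs):
    run_datapoint = dataset_1.create_datapoint(f"run_{index}")
    run_datapoint.add_data(run_data)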