{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Example Application of the Filesystem Implementation\n", "This notebook outlines how `hards` can be used to store and access data for scientific experiments.\n", "\n", "`hards` itself has no external dependencies and is written using Python standard libraries such as `pathlib` and `json`. However, this example uses a number of a visualisation and scientific computing libraries to provide a more coherent and useful example. It is therefore required that this notebook is run in an environment with the development dependencies installed as detailed in the README.\n", "\n", "\n", "The approximation of $\\pi$ using [Monte Carlo methods](https://en.wikipedia.org/wiki/Monte_Carlo_method#Overview) is a well-know introductory problem to scientific computing and Monte Carlo methods. In this notebook, we will compare Monte Carlo sampling versus quasi-Monte Carlo sampling and conclude which is better at approximating $\\pi$. The basic principle is to generate $N$ points on the unit square $[0, 1]^2$ and check how many, $p$, are within the unit circle (centre $(0, 0)$ and radius $1$). \n", "\n", "$$\n", "\\pi\\approx\\frac{4p}{N}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup Helper Functions\n", "\n", "First, we will setup a number of helper functions to sample data on the unit square and check whether the sample is within the unit circle. The data will be a list of dictionaries, one for each sample point." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import random\n", "from typing import Literal\n", "\n", "from scipy.spatial.distance import euclidean\n", "from scipy.stats import qmc\n", "\n", "\n", "def sample_unit_square(num: int, method: Literal[\"sobol\", \"random\"]) -> list:\n", " \"\"\"Sample `num` points from the unit square.\"\"\"\n", " # Generate required number of 2D samples\n", "\n", " if method == \"sobol\":\n", " sampler = qmc.Sobol(2)\n", " samples = sampler.random(num).tolist()\n", " else:\n", " samples = []\n", " for _ in range(num):\n", " samples.append([random.uniform(0.0, 1.0), random.uniform(0.0, 1.0)])\n", "\n", " data = []\n", " for x, y in samples:\n", " # Calculate the Euclidean distance of the sample from the cirlce centre\n", " radius = euclidean([x, y], [0.0, 0.0])\n", "\n", " # Check whether the point are within the unit circle\n", " in_circle = radius <= 1.0\n", "\n", " data.append({\"x\": x, \"y\": y, \"r\": radius, \"in_circle\": in_circle})\n", "\n", " return data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next helper functions will visualise the samples and calculate an approximate of $\\pi$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import seaborn as sns\n", "\n", "\n", "def plot_samples(samples: list) -> None:\n", " \"\"\"Plot samples on the unit circle.\"\"\"\n", " samples_df = pd.DataFrame(samples)\n", "\n", " ax = sns.scatterplot(\n", " samples_df,\n", " x=\"x\",\n", " y=\"y\",\n", " hue=\"in_circle\",\n", " style=\"in_circle\",\n", " )\n", "\n", " ax.add_patch(plt.Circle((0, 0), 1.0, edgecolor=\"black\", facecolor=\"None\"))\n", " ax.set_xlim(0, 1)\n", " ax.set_ylim(0, 1)\n", " sns.move_legend(ax, loc=\"upper left\")\n", "\n", " return ax.figure\n", "\n", "\n", "def approximate_pi(samples: list) -> float:\n", " \"\"\"Approximate pi from the samples.\"\"\"\n", " samples_df = pd.DataFrame(samples)\n", "\n", " return (4.0 * len(samples_df[samples_df[\"in_circle\"]])) / len(samples_df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, this function converts the list of samples into a dictionary indexed by the string `'sample'`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def samples_to_dict(samples: list) -> dict:\n", " \"\"\"Convert the samples into a dictionary.\"\"\"\n", " return {f\"sample{i}\": v for i, v in enumerate(samples)}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create the HARDS Databse\n", "The `FilesystemDatabase` will be created inside of a temporary directory meaning all data from the example will be cleaned up upon exiting the notebook (the OS should handle this, the end of the notebook will also do this explicitly)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "from tempfile import TemporaryDirectory\n", "\n", "from hards.filesystem import FilesystemDatabase\n", "\n", "temp_directory = TemporaryDirectory()\n", "database = FilesystemDatabase.create_database(Path(temp_directory.name) / \"database\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Initial Random Sampling\n", "Some initial sampling will help verify the helper functions work and illustrate basic `hards` dataset management. \n", "\n", "First, lets create the dataset where we will store these initial random samples." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "initial_random_dataset = database.create_dataset(\"initial_random\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We might want to include some metadata on this dataset, such as what method was used to create the samples." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "initial_random_dataset.add_data({\"method\": \"random\"})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we can create `100` samples using Monte Carlo (random) sampling, visualise how they are distributed in the unit square, and calculate an approximation of $\\pi$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "initial_random_samples = sample_unit_square(100, \"random\")\n", "\n", "print(f\"pi approximation: {approximate_pi(initial_random_samples)}\")\n", "\n", "fig = plot_samples(initial_random_samples)\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can add each of the samples as a 'datapoint' which contains some data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for i, sample in enumerate(initial_random_samples):\n", " datapoint = initial_random_dataset.create_datapoint(f\"{i}\")\n", " datapoint.add_data(sample)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then reload the data from `hards` and verify it produces an identical visualisation and $\\pi$ approximation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "initial_random_samples = [\n", " dp.data for dp in initial_random_dataset.recursively_get_datapoints()\n", "]\n", "\n", "print(f\"pi approximation: {approximate_pi(initial_random_samples)}\")\n", "\n", "fig = plot_samples(initial_random_samples)\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Full Random Sampling\n", "Now that the basics of creating a `hards` database and dataset has been demonstrated, it is time to create a full dataset of $8092$ samples to get a more accurate estimation of $\\pi$.\n", "\n", "We have an existing dataset of $100$ samples, and it would be a shame to waste them. Therefore, we can extend the existing dataset and add just $8092$ new samples." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# create a new dataset 'under' the full dataset\n", "full_random_dataset = initial_random_dataset.create_dataset(\"full_random\")\n", "\n", "# create more samples and add it to this new dataset\n", "for i, sample in enumerate(sample_unit_square(8092, \"random\")):\n", " datapoint = full_random_dataset.create_datapoint(f\"{i}\")\n", " datapoint.add_data(sample)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This new dataset (`full_random_dataset`) is a child of the `initial_random` dataset. In `hards`, children provide access to their own data _and_ their parents data. Therefore, we would expect `full_random_dataset` has $8192$ datapoints despite us only adding $8092$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(len(full_random_dataset.recursively_get_datapoints()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Full Sobol Sampling\n", "\n", "We can repeat this with Sobol sampling so we have two datasets of the same size to compare against. We will do all of the samples into one dataset rather than messing around with sub-datasets." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "full_sobol_dataset = database.create_dataset(\"full_sobol\")\n", "\n", "for i, sample in enumerate(sample_unit_square(8192, \"sobol\")):\n", " datapoint = full_sobol_dataset.create_datapoint(f\"{i}\")\n", " datapoint.add_data(sample)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "\n", "Lets (for examples sake) load everything from scratch before performing some analysis on our results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "database = FilesystemDatabase(Path(temp_directory.name) / \"database\")\n", "\n", "random_dataset = database.recursively_get_dataset(\"initial_random/full_random\")\n", "sobol_dataset = database.get_dataset(\"full_sobol\")\n", "\n", "random_datapoints = [dp.data for dp in random_dataset.recursively_get_datapoints()]\n", "sobol_datapoints = [dp.data for dp in sobol_dataset.recursively_get_datapoints()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, lets see which method performs the best by checking their errors.\n", "\n", "Normally (but not always) the Sobol sampling will perform the best." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from math import pi\n", "\n", "print(f\"Random error: {abs(pi - approximate_pi(random_datapoints))}\")\n", "print(f\"Sobol error : {abs(pi - approximate_pi(sobol_datapoints))}\")\n", "\n", "temp_directory.cleanup()" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.17" } }, "nbformat": 4, "nbformat_minor": 2 }