Custom Data Loaders

Data loaders are the bridge between your data sources and TokTagger's annotation interface. They define how to retrieve and format data for visualization and annotation. TokTagger comes with built-in loaders for common formats (images, Parquet files, UDA), but you can easily create custom loaders for your specific data sources.

Overview

A data loader is responsible for:

  1. Defining the expected sample data format (e.g., shot IDs with signal names, file paths)
  2. Retrieving the actual data from your data source
  3. Converting the data into TokTagger's standardized format for visualization

Creating a Custom Data Loader

Step 1: Import Required Components

from toktagger.api.core.data_loaders import DataLoader, LoaderRegistry
from toktagger.api.schemas.data import Data, MultiVariateTimeSeriesData, TimeSeriesData
from toktagger.api.schemas.samples import ShotData, FileData, TimeSeriesFileData
from typing import Type
import pydantic

Step 2: Define Your Data Loader Class

Create a class that inherits from DataLoader and implement the required methods:

@LoaderRegistry.register("my_custom_loader")
class MyCustomLoader(DataLoader):
    """DataLoader for retrieving data from my custom data source"""

    def __init__(self, params):
        # Initialize your data source connection here
        # For example: database connection, API client, etc.
        super().__init__(params)

    @classmethod
    def sample_data_type(cls) -> Type[ShotData | FileData | TimeSeriesFileData]:
        """Define what type of sample data this loader expects"""
        return ShotData  # or FileData, or TimeSeriesFileData

    @pydantic.validate_call
    def get_sample(self, shot_id: int, sample_data: ShotData) -> Data:
        """
        Retrieve and return data for a specific sample.

        Args:
            shot_id: Unique identifier for the sample
            sample_data: Sample-specific configuration (signal names, file paths, etc.)

        Returns:
            Data object in TokTagger format
        """
        # Your implementation here
        raise NotImplementedError

Step 3: Register the Loader

The @LoaderRegistry.register("my_custom_loader") decorator automatically registers your loader with TokTagger. The name you provide becomes the identifier used in project configurations.
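To illustrate the pattern, here is a minimal sketch of how a decorator-based registry of this kind typically works. This is not TokTagger's actual implementation; the `SimpleRegistry` class and its `get` method are stand-ins for illustration only.

```python
class SimpleRegistry:
    """Illustrative stand-in for a decorator-based loader registry."""

    _loaders: dict[str, type] = {}

    @classmethod
    def register(cls, name: str):
        # Return a decorator that records the class under `name`
        # and hands the class back unchanged.
        def decorator(loader_cls: type) -> type:
            cls._loaders[name] = loader_cls
            return loader_cls
        return decorator

    @classmethod
    def get(cls, name: str) -> type:
        # Resolve a loader class from its string identifier
        return cls._loaders[name]


@SimpleRegistry.register("my_custom_loader")
class MyCustomLoader:
    pass
```

Because registration happens when the class definition is executed, the module defining your loader must be imported before the registry is consulted, which is why the server must be started from the same script (see Step 4).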

Step 4: Run Server with Custom Loader

If you have added your own data loaders, you must make sure they are imported (and therefore registered) before the server starts. You can run the server from within a Python script by instantiating the Server class and calling server.run(). For example, a script that adds a data loader and runs the server might look like this:

from toktagger.api.core import DataLoader, LoaderRegistry, Server
from toktagger.api.schemas.samples import Sample
from toktagger.api.schemas.data import TimeSeriesData

# Say we have a python module which can load data from our Tokamak's data backend:
import my_tokamak_data_backend

# Decorate your class using the Loader Registry,
# providing the name you want to refer to your data loader by:
@LoaderRegistry.register("my_tokamak")
class MyDataLoader(DataLoader):

    def get_sample(self, sample: Sample) -> TimeSeriesData:
        # Here you add the logic for how to obtain the data for your given sample
        loaded: dict = my_tokamak_data_backend.get(
            shot_id = sample.shot_id, 
            signal_name = sample.signal_names[0]
            )

        # Format it into the Data schema which you want to use
        # For example, if we are loading time series data:
        return TimeSeriesData(
            time=loaded["time"], 
            values=loaded["values"]
            )

# Create and run a server in the same file, 
# so that your data loader is correctly added to the registry:
server = Server()
server.run()

Data Types and Schemas

Sample Data Types

Choose the appropriate sample data type for your loader:

ShotData

For shot-based data sources (like UDA, databases):

class ShotData(BaseModel):
    protocol: Literal["uda", "sal"]
    signal_names: list[str]  # List of signals to retrieve

Use when: Data is identified by shot number and signal names.

FileData

For file-based data sources (images, videos):

class FileData(BaseModel):
    file_name: str  # Path to file or directory
    type: FileType  # e.g., "png", "jpeg", "mp4"
    protocol: Literal["file", "s3"]  # Local file system or S3 storage

Use when: Data is stored in individual files with different types.

TimeSeriesFileData

For time series files (Parquet, CSV, NetCDF):

class TimeSeriesFileData(FileData):
    signal_names: Optional[list[str]] = None  # Optional signal filter

Use when: Files contain time series data with multiple signals.
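To make the distinction between the three sample types concrete, here is a sketch using plain dataclasses as simplified stand-ins for the pydantic schemas above (the real models add validation; the field values are made up for illustration):

```python
from dataclasses import dataclass
from typing import Optional

# Simplified stand-ins for the pydantic schemas above.

@dataclass
class ShotData:
    protocol: str            # "uda" or "sal" in the real schema
    signal_names: list[str]

@dataclass
class FileData:
    file_name: str
    type: str                # e.g. "png", "jpeg", "mp4"
    protocol: str            # "file" or "s3"

@dataclass
class TimeSeriesFileData(FileData):
    signal_names: Optional[list[str]] = None


# A shot-based sample: identified by shot number plus signal names
shot = ShotData(protocol="uda", signal_names=["ip", "density"])

# A file-based sample: a single image on local disk
image = FileData(file_name="data/frame_001.png", type="png", protocol="file")

# A time series file, restricted to two of its signals
ts_file = TimeSeriesFileData(
    file_name="data/1.parquet", type="parquet",
    protocol="file", signal_names=["ip", "density"],
)
```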

Return Data Types

Your get_sample() method should return one of these data types:

ImageData

For image-based visualization:

class ImageData(Data):
    frame: int  # Frame number or identifier
    values: str  # Base64-encoded image string
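The base64 string for values can be produced with the standard library. A minimal sketch, where the raw bytes are placeholder data standing in for a real image read:

```python
import base64

# Placeholder raw image bytes; in practice these would come from a
# file read or a rendered frame, e.g. open(path, "rb").read()
raw_bytes = b"\x89PNG\r\n\x1a\n"

# Base64-encode and decode to str, the format ImageData.values expects
encoded = base64.b64encode(raw_bytes).decode("ascii")

# ImageData(frame=0, values=encoded) would then carry this payload
```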

TimeSeriesData

For single signal time series:

class TimeSeriesData(Data):
    time: list[float]  # Time values
    values: list[float]  # Signal values

MultiVariateTimeSeriesData

For multiple signals (most common for fusion diagnostics):

class MultiVariateTimeSeriesData(Data):
    values: dict[str, TimeSeriesData | None]  # Signal name -> data

SpectrogramData

For frequency-time representations:

class SpectrogramData(Data):
    time: list[float]
    frequency: list[float]
    amplitude: list[list[float]]  # 2D array
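As an illustration of how the time series shapes nest, here is a plain-Python sketch of a multivariate payload built from two signals, with dicts standing in for the pydantic models above:

```python
# Plain dicts standing in for TimeSeriesData / MultiVariateTimeSeriesData,
# purely to illustrate the nesting; the real schemas are pydantic models.
time = [0.0, 0.1, 0.2]

multivariate = {
    "values": {
        "ip":      {"time": time, "values": [1.0, 1.5, 2.0]},
        "density": {"time": time, "values": [2.0, 2.5, 3.0]},
        # A signal that could not be retrieved maps to None
        "missing": None,
    }
}
```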

Complete Example: CSV Time Series Loader

Here's a complete example of a custom loader for CSV time series files:

loader.py
import pandas as pd
import pathlib
from typing import Type
import pydantic

from toktagger.api.core.data_loaders import DataLoader, LoaderRegistry
from toktagger.api.schemas.data import MultiVariateTimeSeriesData, TimeSeriesData
from toktagger.api.schemas.samples import TimeSeriesFileData


@LoaderRegistry.register("csv_timeseries")
class CSVTimeSeriesLoader(DataLoader):
    """DataLoader for retrieving time series data from CSV files"""

    @classmethod
    def sample_data_type(cls) -> Type[TimeSeriesFileData]:
        # This loader expects file paths with optional signal names
        return TimeSeriesFileData

    @pydantic.validate_call
    def get_sample(
        self, 
        shot_id: int, 
        sample_data: TimeSeriesFileData
    ) -> MultiVariateTimeSeriesData:
        """
        Load time series data from a CSV file.

        Expected CSV format:
        - First column: time values
        - Remaining columns: signal values with column headers
        """
        # Verify file exists
        file_path = pathlib.Path(sample_data.file_name)
        if not file_path.exists():
            raise FileNotFoundError(
                f"Could not find CSV file at '{file_path}'"
            )

        # Read CSV file
        df = pd.read_csv(file_path, index_col=0)  # First column is time

        # Filter to requested signals if specified
        if sample_data.signal_names:
            df = df[sample_data.signal_names]

        # Handle missing values
        df = df.fillna(0)

        # Extract time values from index
        time = df.index.values.tolist()

        # Convert each column to TimeSeriesData format
        results = {}
        for column_name in df.columns:
            results[column_name] = TimeSeriesData(
                time=time,
                values=df[column_name].values.tolist()
            )

        return MultiVariateTimeSeriesData(values=results)

This example assumes the CSV file has the following format:

  • First column: time values
  • Remaining columns: signal values with column headers

The file should be called {shot_id}.csv.

For example:

1.csv
time,signal1,signal2
0.0,1.0,2.0
0.1,1.5,2.5
0.2,2.0,3.0
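Such a file can be generated, and sanity-checked the way the loader will read it, with the standard library alone (an in-memory buffer is used here; in practice you would write to data/1.csv):

```python
import csv
import io

# Write the example CSV shown above
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["time", "signal1", "signal2"])
for row in [(0.0, 1.0, 2.0), (0.1, 1.5, 2.5), (0.2, 2.0, 3.0)]:
    writer.writerow(row)

# Read it back the way the loader sees it: the first column is time,
# the remaining columns are signals keyed by their headers
buffer.seek(0)
rows = list(csv.DictReader(buffer))
time = [float(r["time"]) for r in rows]
signal1 = [float(r["signal1"]) for r in rows]
signal2 = [float(r["signal2"]) for r in rows]
```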

Running the Example

To run the example above and create a custom CSV dataloader:

  • Copy the code above into a file called loader.py
  • Make a file to run the server called run.py, which does:
run.py
from loader import CSVTimeSeriesLoader
from toktagger.api.main import Server

server = Server()
server.run()
  • Create a directory which will contain your data called data, and then copy the CSV above into a file named 1.csv within the new directory
  • Run python run.py
  • Open the UI in your browser at localhost:8002/ui/projects
  • Click the Create button on the UI, and click the Data Loader dropdown. You should see your new data loader listed there, called csv_timeseries.
  • After selecting the custom data loader, set the file path as data - the UI should tell you that a file has been found within that directory. Fill in the other information, including project name and task as shown below, and press Create:

[Screenshot: the Create project form with the csv_timeseries data loader selected]

  • You should now be able to open the sample which we have created, corresponding to our CSV file. You should see two time series plots containing the data from our CSV file:

[Screenshot: two time series plots showing the data from the CSV file]

Example: Database Loader

Here's an example of loading data from a SQL database:

import sqlalchemy as sa
from typing import Type
import pydantic

from toktagger.api.core.data_loaders import DataLoader, LoaderRegistry
from toktagger.api.schemas.data import MultiVariateTimeSeriesData, TimeSeriesData
from toktagger.api.schemas.samples import ShotData


@LoaderRegistry.register("sql_database")
class SQLDatabaseLoader(DataLoader):
    """DataLoader for retrieving data from a SQL database"""

    def __init__(self, params):
        # Initialize database connection
        # Connection string should be in environment variable
        import os
        connection_string = os.environ.get("DATABASE_URL")
        self.engine = sa.create_engine(connection_string)
        super().__init__(params)

    @classmethod
    def sample_data_type(cls) -> Type[ShotData]:
        return ShotData

    @pydantic.validate_call
    def get_sample(
        self, 
        shot_id: int, 
        sample_data: ShotData
    ) -> MultiVariateTimeSeriesData:
        """Load time series data from database"""
        results = {}

        with self.engine.connect() as conn:
            for signal_name in sample_data.signal_names:
                # Query time series data for this shot and signal
                query = sa.text("""
                    SELECT time, value 
                    FROM timeseries_data 
                    WHERE shot_id = :shot_id AND signal_name = :signal_name
                    ORDER BY time
                """)

                result = conn.execute(
                    query, 
                    {"shot_id": shot_id, "signal_name": signal_name}
                )
                rows = result.fetchall()

                if rows:
                    time = [row[0] for row in rows]
                    values = [row[1] for row in rows]
                    results[signal_name] = TimeSeriesData(time=time, values=values)
                else:
                    results[signal_name] = None

        return MultiVariateTimeSeriesData(values=results)
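The query pattern above can be exercised end to end against an in-memory SQLite database. This sketch uses plain sqlite3 rather than SQLAlchemy, with made-up rows, just to show the shape of the result the loader builds from:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE timeseries_data "
    "(shot_id INTEGER, signal_name TEXT, time REAL, value REAL)"
)
conn.executemany(
    "INSERT INTO timeseries_data VALUES (?, ?, ?, ?)",
    [(1, "ip", 0.0, 1.0), (1, "ip", 0.1, 1.5), (2, "ip", 0.0, 9.0)],
)

# Same query shape as in the loader: filter by shot and signal, order by time
rows = conn.execute(
    "SELECT time, value FROM timeseries_data "
    "WHERE shot_id = ? AND signal_name = ? ORDER BY time",
    (1, "ip"),
).fetchall()

time = [row[0] for row in rows]
values = [row[1] for row in rows]
```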

Using Docker

If you are using the docker compose option to run the server, you can provide a custom script similar to the one above to add your own data loaders. To do this, create a file like the one above, but make sure to pass the following arguments to server.run():

server.run(
    host="0.0.0.0",
    port=8002
)

You can then provide the path to your script when running docker compose. For example, if the above script is in a file called custom_toktagger.py, simply add CUSTOM_SCRIPT=./custom_toktagger.py before the docker compose command:

CUSTOM_SCRIPT=./custom_toktagger.py docker compose --env-file .env.dev -f docker-compose.dev.yml up --build

Tip

Make sure you provide a relative path to your script, not just the bare filename.

e.g., provide CUSTOM_SCRIPT=./custom_toktagger.py, and not CUSTOM_SCRIPT=custom_toktagger.py.