Advanced Containerization Using DataLad

Reproducible neuroimaging principles: 2a: Use standard data formats, 2b: Use data version control, 2c: Annotate data, 3: Software management.

Actions: Standards, Annotation and provenance, Version control, Containers.

Standards: BIDS.

Tools: DataLad, Singularity/Apptainer.

Challenge

Using version control and automation to execute procedures can produce re-executable and provenance-rich results, but the task can appear daunting. Following best practices for file layout (DataLad plus the YODA principles) provides clear connections, via subdatasets, between the source data and the derivative data produced from it. Additionally, using datalad run with ReproNim/containers preserves the provenance of exactly which software versions were used and how, leaving a detailed trail for future work.

Exercise

Let’s assume that our goal is to do quality control (QC) of an MRI dataset, which is available as the DataLad dataset ds000003. We will create a new dataset that holds the QC results (as produced by the mriqc BIDS-App). To do so, we will:

  • create a new dataset which would contain results and everything needed to obtain them
  • install/add subdatasets (code, other datasets, containers)
  • perform the analysis using only materials available within the reach of this dataset.

This would help to guarantee reproducibility in the future because all the materials would be reachable within that dataset.

Note: This exercise is based on the ReproNim/containers README, which should be referenced for more information.

Before you start

Required knowledge:

  • Basics of operating in a terminal environment

Though it is not strictly necessary to be familiar with all of the tools used to complete the tutorial, knowledge of the tools and standards named above (BIDS, DataLad, and Singularity/Apptainer) will be helpful for adapting this tutorial to your use case.

Step by step guide

Step 1: Install the necessary tools

The following tools should be installed: DataLad and Singularity/Apptainer.

Additionally, the datalad-container extension should also be installed.

pip install datalad-container
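
Before continuing, it can help to confirm that the tools are actually on the PATH. The following is a minimal sketch, not part of the original tutorial: the helper name check_tool is hypothetical, git and git-annex are included because DataLad depends on them, and having either singularity or apptainer is enough.

```shell
# Report whether a command is available on PATH.
# check_tool is a hypothetical helper name used only for this sketch.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Either singularity or apptainer is sufficient; both are listed
# so that whichever one is present gets reported.
for tool in datalad git git-annex singularity apptainer; do
    check_tool "$tool" || true
done
```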

Step 2: Start a DataLad dataset

Following YODA, our dataset for the results is the dataset that will contain everything needed to produce those results.

mkdir ~/my-experiments
cd ~/my-experiments
datalad create -d ds000003-qc -c text2git
cd ds000003-qc

Step 3: Install source data

Next we install our source data as a subdataset.

datalad install -d . -s https://github.com/ReproNim/ds000003-demo sourcedata

Step 4: Install ReproNim/containers

Next we install the ReproNim/containers collection.

datalad install -d . -s ///repronim/containers code/containers

Now let’s take a look at what we have.

/ds000003-qc # The root dataset contains everything
 |--/sourcedata # we call it source, but it is actually ds000003-demo
 |--/code/containers # repronim/containers, this is where our non-custom code lives
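
Because DataLad subdatasets are registered as git submodules, this layout is also recorded in the .gitmodules file of the top-level dataset. A minimal sketch for listing the registered paths follows; the helper name list_subdataset_paths is hypothetical, and the datalad subdatasets command gives a richer report of the same information.

```shell
# Print the "path = ..." entries of a .gitmodules file, one per line.
# list_subdataset_paths is a hypothetical helper name for this sketch.
list_subdataset_paths() {
    sed -n 's/^[[:space:]]*path[[:space:]]*=[[:space:]]*//p' "${1:-.gitmodules}"
}

# Run from the top-level dataset (ds000003-qc in this tutorial).
if [ -f .gitmodules ]; then
    list_subdataset_paths .gitmodules
fi
```

In the dataset built so far, the two paths expected here are sourcedata and code/containers.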

Freeze container image versions (optional)

Running the freeze_versions script is an optional step that records and “freezes” the version of the container used. Even if the ///repronim/containers dataset is later upgraded with a newer version of our container, we stay “pinned” to the version we explicitly chose. Note: To switch the container version (e.g., to upgrade to a newer one), rerun the freeze_versions script with that version specified.

The container version can be “frozen” into the clone of the ///repronim/containers dataset, or the top-level dataset.

Option 1: Top level dataset (recommended)

# Run from ~/my-experiments/ds000003-qc
datalad run -m "Downgrade/Freeze mriqc container version" \
  code/containers/scripts/freeze_versions --save-dataset=. bids-mriqc=0.16.0

Option 2: ///repronim/containers

# Run from ~/my-experiments/ds000003-qc/
datalad run -m "Downgrade/Freeze mriqc container version" \
    code/containers/scripts/freeze_versions bids-mriqc=0.16.0

Note: It is recommended to freeze a container image version into the top-level dataset to simplify reuse. If ///repronim/containers is modified in any way, the author must ensure that their altered fork of ///repronim/containers is publicly available and that its URL is specified in .gitmodules. By freezing into the top-level dataset instead, authors do not need to host a modified version of ///repronim/containers.
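
To double-check which version ended up pinned, one option is to read it off the image file name, which follows the pattern seen in this tutorial (bids-mriqc--0.16.0.sing). The sketch below rests on assumptions: image_version is a hypothetical helper, and it presumes the frozen image path is recorded in .datalad/config of the dataset that was frozen into.

```shell
# Extract the version from an image file name such as
# code/containers/images/bids/bids-mriqc--0.16.0.sing.
# image_version is a hypothetical helper name for this sketch.
image_version() {
    name=$(basename "$1" .sing)    # e.g. bids-mriqc--0.16.0
    echo "${name##*--}"            # keep what follows the last "--"
}

image_version code/containers/images/bids/bids-mriqc--0.16.0.sing   # prints 0.16.0

# Assumption: the container entry lives in .datalad/config.
if [ -f .datalad/config ]; then
    grep -n 'bids-mriqc' .datalad/config || echo "bids-mriqc not configured here"
fi
```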

Fixup datalad config

The mriqc container configuration in .datalad/config hardcodes the option --no-datalad-get, which the older mriqc version we froze does not support, so we should remove it.

datalad run -m "Remove ad-hoc option for mriqc for older frozen version" sed -i -e 's, --no-datalad-get,,g' .datalad/config
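
To see what this substitution does before applying it to the real config, the same sed expression can be tried on a sample line. This is only a sketch: the sample line is made up for illustration and is not the actual content of .datalad/config, and strip_no_datalad_get is a hypothetical helper name.

```shell
# Apply the same substitution as above to text read from stdin.
# strip_no_datalad_get is a hypothetical helper name for this sketch.
strip_no_datalad_get() {
    sed -e 's, --no-datalad-get,,g'
}

# Made-up sample line, for illustration only.
echo 'singularity_cmd run {img} {cmd} --no-datalad-get' | strip_no_datalad_get

# After the fix-up, count how many lines of the config still carry the option.
if [ -f .datalad/config ]; then
    grep -c -e '--no-datalad-get' .datalad/config || true
fi
```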

Step 5: Run the containers

When we run the bids-mriqc container, it will need a working directory for intermediate files. These are not helpful to commit, so we will tell git (and datalad) to ignore the whole directory.

echo "workdir/" > .gitignore && datalad save -m "Ignore workdir" .gitignore
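
The redirect above replaces the contents of .gitignore, which is fine in a freshly created dataset. When adapting the tutorial to a dataset that already has a .gitignore, an append that skips entries already present may be safer. A minimal sketch, with add_ignore as a hypothetical helper name:

```shell
# Append a pattern to an ignore file unless that exact line already exists.
# add_ignore is a hypothetical helper name for this sketch.
add_ignore() {
    pattern=$1
    file=${2:-.gitignore}
    grep -qxF -- "$pattern" "$file" 2>/dev/null || printf '%s\n' "$pattern" >> "$file"
}

# Usage (run from the top-level dataset), followed by the same save as above:
#   add_ignore "workdir/"
#   datalad save -m "Ignore workdir" .gitignore
```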

Now we use datalad containers-run to perform the analysis.

datalad containers-run \
        -n bids-mriqc \
        --input sourcedata \
        --output . \
        '{inputs}' '{outputs}' participant group -w workdir

If everything worked as expected, we will now see our new analysis results, along with a commit message recording how they were obtained! All of this is contained within a single (nested) dataset with a complete record of how all the data was obtained.

(git) .../ds000003-qc[master] $ git show --quiet
Author: Austin <austin@dartmouth.edu>
Date:   Wed Jun 5 15:41:59 2024 -0400

    [DATALAD RUNCMD] ./code/containers/scripts/singularity_cm...

    === Do not change lines below ===
    {
     "chain": [],
     "cmd": "./code/containers/scripts/singularity_cmd run code/containers/images/bids/bids-mriqc--0.16.0.sing '{inputs}' '{outputs}' participant group -w workdir",
     "dsid": "c9c96ab9-f803-43ba-83e2-2eaec7ab4725",
     "exit": 0,
     "extra_inputs": [
      "code/containers/images/bids/bids-mriqc--0.16.0.sing"
     ],
     "inputs": [
      "sourcedata"
     ],
     "outputs": [
      "."
     ],
     "pwd": "."
    }
    ^^^ Do not change lines above ^^^
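
The JSON block between the two marker lines is the machine-readable run record. A minimal sketch for pulling it out of a commit message follows; extract_run_record is a hypothetical helper name, and the approach simply relies on the marker lines shown above.

```shell
# Print the lines between the "Do not change lines below/above" markers
# of a commit message read from stdin.
# extract_run_record is a hypothetical helper name for this sketch.
extract_run_record() {
    awk '/=== Do not change lines below ===/ { inside = 1; next }
         /Do not change lines above/         { inside = 0 }
         inside                              { print }'
}

# Usage (run inside the dataset):
#   git log -1 --format=%B | extract_run_record
```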

This record can later be reused (by anyone) with datalad rerun to repeat the computation using exactly the same version(s) of the input data and the Singularity container. You can even datalad uninstall the sourcedata and containers subdatasets now to save space; they will be retrievable at those exact versions later if you need to extend or redo your analysis.
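
The sequence described here can be previewed without touching the dataset. The sketch below is an illustration under assumptions: run and free_space_and_rerun are hypothetical helper names, and the dry-run switch DRY_RUN is introduced only for this example. With no revision given, datalad rerun acts on the most recent commit, and it should fetch the recorded inputs again as needed.

```shell
# Either print a command (DRY_RUN=1) or execute it.
# run and free_space_and_rerun are hypothetical helper names for this sketch.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

free_space_and_rerun() {
    run datalad uninstall sourcedata   # drop the source data to save space
    run datalad rerun                  # re-execute the recorded command
}

# Preview only; run from the top-level dataset:
#   DRY_RUN=1 free_space_and_rerun
```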

Notes

  • the example above requires DataLad >= 0.11.5 and datalad-container >= 0.4.0;
  • for a more elaborate example that uses reproman to parallelize execution on remote resources, see ReproNim/reproman PR#438;
  • a copy of the dataset is available from ///repronim/ds000003-qc and from https://github.com/ReproNim/ds000003-qc.