Reproducibility of experiments is the key to validating scientific claims.
I find an interesting paper that fits my research needs.
The authors’ methods seem perfect for my work, so I want to dive deeper into their experiment.
I look for the companion artifact… What can go wrong?
The link to the research artifacts (code, data, etc.) is broken.
Example: How would you open and execute a duffy-duck.qcow2 file that weighs 20 GB? Yes, it can be a valid artifact.
Which ones?
We can demonstrate that the claims in our paper hold by providing a reproducible companion artifact.
However, reproducibility is a process that starts from software development:
pip install
poetry install: to install dependencies defined in pyproject.toml
poetry add <package>: to add a new dependency
pyproject.toml and poetry.lock
poetry update: to update dependencies
eval $(poetry env activate): to activate the virtual environment
GBs)
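A lock file pins what should be installed; it can also help to record what was actually installed when an experiment ran. The sketch below is one possible way to do that with Python's standard importlib.metadata module. The function name snapshot_environment is illustrative and is not part of pip or Poetry.

```python
import json
import sys
from importlib import metadata


def snapshot_environment() -> dict:
    """Return the interpreter version and the installed distributions."""
    packages = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if name:  # skip broken installs that have no usable metadata
            packages[name] = meta.get("Version", "unknown")
    return {
        "python": sys.version.split()[0],
        # Sort case-insensitively so two snapshots are easy to diff.
        "packages": dict(sorted(packages.items(), key=lambda kv: kv[0].lower())),
    }


if __name__ == "__main__":
    # Print the snapshot as JSON; redirect it to a file next to the results.
    print(json.dumps(snapshot_environment(), indent=2))
```

Storing this output alongside the results gives readers something concrete to compare against when a rerun behaves differently.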
Source: Research Gate. Available here.
In most cases the answer is no
Containers share the host system’s kernel but are isolated in their own user space.
Container Image: A lightweight, standalone, executable package that includes everything needed to run the software.
Container: A running instance of an image.
Docker is an open-source platform for distributing and executing containers.
It’s not the only solution, but we use it as a reference since it’s the most widely used.
Docker Image: Template for creating containers, built from a Dockerfile.
Docker Hub: A cloud-based registry for storing and sharing Docker images.
Dockerfile: A text file containing a set of instructions to build a Docker image.
docker build: Build an image from a Dockerfile.
docker run: Run a container from an image.
docker ps: List running containers on the machine.
docker images: List all locally available Docker images.
docker stop: Stop a running container.
A Dockerfile is a script with instructions on how a Docker image is built.
FROM python:3.9
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
CMD ["python", "experiment.py"]
Explanation:
FROM: Specifies the base image (here, Python 3.9).
WORKDIR: Sets the working directory inside the image.
COPY: Copies files from the host into the image.
RUN: Executes shell commands while the image is being built.
CMD: Defines the default command to run when a container starts.
Complete list of Dockerfile instructions here.
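The Dockerfile's CMD runs experiment.py, which is not shown. As a minimal sketch of what such a script could look like, the toy experiment below fixes its random seed so that every container started from the image produces the same result. The function run_experiment and its parameters are hypothetical and stand in for the real experiment.

```python
import json
import random


def run_experiment(seed: int = 42, n_samples: int = 1000) -> dict:
    """Toy experiment: estimate the mean of uniform random samples."""
    # A local generator means the seed alone determines the result,
    # independent of any other code that uses the global random state.
    rng = random.Random(seed)
    samples = [rng.random() for _ in range(n_samples)]
    return {
        "seed": seed,
        "n_samples": n_samples,
        "mean": sum(samples) / n_samples,
    }


if __name__ == "__main__":
    # Results go to stdout, so `docker run` shows them directly.
    print(json.dumps(run_experiment()))
```

Reporting the seed together with the result makes it possible to rerun exactly the same configuration later.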
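To build the Docker image from the Dockerfile:
docker build -t myapp:latest .
To run a container from that image in the background, mapping port 5000 of the host to port 5000 of the container:
docker run -d -p 5000:5000 myapp:latest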
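When the build and run steps are part of a larger experiment script, the two commands above can be wrapped with Python's subprocess module. This is only a sketch: the helper names build_command, run_command and build_and_run are made up for illustration, and build_and_run assumes that the docker CLI is installed and on the PATH.

```python
import subprocess


def build_command(tag: str, context: str = ".") -> list:
    """Argument list equivalent to: docker build -t TAG CONTEXT"""
    return ["docker", "build", "-t", tag, context]


def run_command(tag: str, host_port: int, container_port: int) -> list:
    """Argument list equivalent to: docker run -d -p HOST:CONTAINER TAG"""
    return ["docker", "run", "-d", "-p", f"{host_port}:{container_port}", tag]


def build_and_run(tag: str = "myapp:latest", port: int = 5000) -> str:
    """Build the image, start a detached container, return its ID."""
    # check=True raises CalledProcessError if docker exits with an error.
    subprocess.run(build_command(tag), check=True)
    result = subprocess.run(
        run_command(tag, port, port),
        check=True,
        capture_output=True,
        text=True,
    )
    # In detached mode, docker run prints the ID of the new container.
    return result.stdout.strip()
```

Passing the arguments as a list instead of a single shell string avoids quoting problems when a tag or path contains spaces.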
How many containers do you need for a web server?
At least two: one for the front end and another for the back end, plus a third if there is also a database.
Do I need to configure and run each one independently?
Obviously no.
Docker Compose is a tool for defining and running multi-container Docker applications using YAML syntax.
version: '3'
services:
  web:
    image: myapp
    ports:
      - "5000:5000"
  db:
    image: postgres
    environment:
      POSTGRES_PASSWORD: example
The whole application is then started with:
docker-compose up
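Inside the Compose network, each service can reach the others by service name, so the web container can connect to the database at the host name db. The sketch below shows one way the web application might assemble its connection URL from environment variables. It assumes that POSTGRES_PASSWORD is also passed to the web service, which the file above does not do. POSTGRES_USER and POSTGRES_DB are settings of the official postgres image, while POSTGRES_HOST and POSTGRES_PORT are hypothetical names chosen for this example.

```python
import os
from urllib.parse import quote


def database_url(env=None) -> str:
    """Build a PostgreSQL URL from environment-style settings."""
    env = os.environ if env is None else env
    password = env["POSTGRES_PASSWORD"]  # required, no default on purpose
    user = env.get("POSTGRES_USER", "postgres")  # image default user
    name = env.get("POSTGRES_DB", user)  # image defaults the DB to the user name
    host = env.get("POSTGRES_HOST", "db")  # hypothetical; "db" is the service name
    port = env.get("POSTGRES_PORT", "5432")  # hypothetical; 5432 is the default port
    # Percent-encode the credentials so characters like "@" or ":" stay valid.
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{name}"
    )
```

Reading the password from the environment keeps it out of the source code and out of the image.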
Portability: Docker containers can run on any machine that has a compatible container runtime installed.
Isolation: Each container is isolated, which means applications and dependencies do not conflict.
Efficiency: Containers are lightweight and start quickly, allowing faster deployment cycles.
A highly opinionated example: https://github.com/raphaelschwinger/lightning-hydra-template
Forked from: https://github.com/ashleve/lightning-hydra-template
Automation:
GitHub and GitLab
README.md that describes how to execute the experiment
.gitignore
Public or private repository?
Open-sourcing allows the community to work on your problem, build on it and improve it!
Are you afraid that someone can “steal” your work?
If you don’t pick a license, your repository is considered proprietary even if it is public.
If you want to allow contributions, pick an appropriate license: http://choosealicense.com/licenses/
For each change on the source code:
Most code hosting platforms provide free limited plans for setting up CI/CD.
Usually, free plans are offered only for public repositories
GitHub: GitHub Actions
Jobs (like build, test, deploy) are executed as part of a workflow.
Each job runs in its own virtual machine or container.
Each job consists of multiple steps, which can include running shell commands or using prebuilt actions.
GitHub provides hosted runners (Linux, macOS, Windows), or you can use self-hosted runners.
Workflows are defined as YAML files inside the .github/workflows/ folder of the project.
If the folder exists, GitHub tries to execute the workflows inside it.
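To check locally which files GitHub would treat as workflow definitions, a small script can list the YAML files in that folder. This is a rough sketch rather than a reimplementation of GitHub's behavior: the function name find_workflows is illustrative, and it only looks at file extensions without validating the workflow syntax.

```python
from pathlib import Path


def find_workflows(repo_root) -> list:
    """Return the workflow files under .github/workflows/, sorted by path."""
    folder = Path(repo_root) / ".github" / "workflows"
    if not folder.is_dir():
        return []  # no folder means no workflows to run
    # Workflow files use the .yml or .yaml extension.
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix in {".yml", ".yaml"}
    )


if __name__ == "__main__":
    for workflow in find_workflows("."):
        print(workflow)
```

Run from the root of a repository, it prints one path per workflow file, or nothing if the folder is missing.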
Trivial example: this repository
Better example: submitted experiment