Main Conclusions

The main takeaway from the survey is that all partner facilities either require or may require:

  • Access to distributed filesystems
  • Ability of users to submit jobs to multiple nodes (or container instances)
  • Access to GPU resources

When asked how they plan to deal with security concerns around containers accessing shared filesystems, most partners (8) said they will use Singularity instead of Docker; of the remaining partners, one will use Kubernetes Pod Security Policies, one Active Directory, and one has not yet picked an approach.

The majority (10) of partners said they plan on using a 'mixed model' for deploying the portal, where Kubernetes hosts the web portion of the portal and the compute instances are run with a different provider (e.g. SLURM).

Given the responses, two clear goals emerge from this survey:

  • Research into using Singularity with Kubernetes, as the majority of respondents said that they plan on using Singularity
    • It is likely that most respondents picked a mixed-model approach because they assumed Kubernetes meant using Docker (some additional questions for clarification would have been useful); the option to run the Kubernetes Cloud Provider with Singularity might be enough to convince facilities that they do not need to use another cloud provider
    • Research into using Singularity with GPUs in both the standalone and Kubernetes-integrated scenarios - is a virtualisation license required if Singularity is used, or can it be avoided? (see the sketch after this list)
  • SLURM is the most popular scheduler, so the plan to make a SLURM Provider remains sound
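As context for the Singularity goals above, the sketch below is a hypothetical illustration (image name, mount path and script are placeholders, not taken from any survey response) of launching an analysis inside a Singularity container with GPU access and a bind-mounted shared filesystem. Because Singularity runs the container as the invoking user, the normal permissions on the shared filesystem still apply, which is why respondents see it as easing the security concerns raised around Docker.

    # Hypothetical example: run an analysis inside a Singularity container with
    # GPU support (--nv) and the shared filesystem bind-mounted (--bind).
    # Image name, mount path and script are placeholders.
    import subprocess

    cmd = [
        "singularity", "exec",
        "--nv",                      # expose the host's NVIDIA GPUs in the container
        "--bind", "/gpfs/exp/data",  # bind-mount the distributed filesystem (placeholder path)
        "analysis.sif",              # container image (placeholder)
        "python", "analyse.py",      # analysis entry point (placeholder)
    ]

    # Singularity runs the container as the invoking user, so the usual
    # filesystem permissions on the bind-mounted path still apply.
    subprocess.run(cmd, check=True)

Whether this route also avoids the virtualisation licensing question raised above is exactly what the proposed research would need to establish.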

Short Survey Summary

11 Responses from the following facilities:

  • DESY
  • ELI-ALPS
  • ESRF
  • Max IV Lab, Lund
  • STFC ISIS
  • EuXFEL
  • ESS
  • CERIC-ERIC
  • ELETTRA
  • DLS

The majority of partners who responded use SLURM (10/11), with one facility using SGE.

The filesystems reported were: GPFS (4), Ceph (3), GlusterFS (2), dCache (1) and Lustre (1).

All facilities replied with yes/maybe/testing to the questions:

  • Will users require access to DFS
  • Will users need to submit jobs to multiple nodes
  • Will users need access to GPU resources
  • Does your facility use containers already
  • Are/will containers be used for data analysis

Facilities plan to/already have the following container systems available: Docker (9), Singularity (8), Kubernetes (7), OpenStack (2)

Facilities responded to the question of how to handle safe access to a distributed filesystem by saying that they use/plan to use: Singularity (8), K8s Pod Security Policy (1), Active Directory (1), Unknown (1)

Survey Responses

Cluster Infrastructure

Workload manager:

  • SLURM (10)
  • SGE (1)

Distributed File System:

  • GPFS (4)
  • Ceph (3)
  • GlusterFS (2)
  • dCache (1)
  • Lustre (1)

User Requirements

Will users require access to DFS:

  • Yes (8)
  • Maybe (3)

Will users need to submit jobs to multiple nodes:

  • Yes (6)
  • Maybe (5)

Will users need access to GPU resources:

  • Yes (6)
  • Maybe (5)

JupyterHub

Does your facility have JupyterHub:

  • Yes (10) - one in test mode, one internal only
  • No (1)

Which spawner(s) does it use:

  • Docker (5)
  • Batch (4)
  • Kubernetes (2)
  • SingleUser (2)
  • sudo (1)

Setup description:

  • JupyterHub runs on a shared node; users can select which partition to start their instance on, and instances then start via the SLURM batch spawner so that each user has their own dedicated bare-metal node (see the configuration sketch after this list)
  • Test instance running on K8s. Uses the CAS authenticator, patched to also look up against our LDAP for uid/gid/sgids, which get set for the notebook pods. Because uid/gid is enforced, we allow /home and /dls as hostPath mounts. This gives notebooks access to data. GPUs are also available to the K8s cluster and hence to notebooks.
  • Two setups:
    • batch: using the SLURM batchspawner; users can select either a predefined configuration (1 core / 8 GB RAM, 1 full node, or half a node with 1 GPU) or their own configuration, and then run on a bare-metal node
    • sudo: different JupyterHub servers running separately, sharing a common installation
  • Single shared server; users cannot select which partition to start on
  • Shared node; users cannot select which partition to start on
  • We have a couple of different JupyterHubs: one for workshops with a different authentication configuration, and one that has access to our test beamline data and can query SciCat. Both have fairly typical K8s setups with 1 head node and 2-3 worker nodes.
  • Shared node; users cannot select which partition to start on
  • JupyterHub is running in a VM and spawns Docker containers on dedicated node(s). The purpose of this instance is to provide a PC-like remote data analysis environment connected to the facility data storage. An extension is under development that will allow spawning to a SLURM-managed HPC cluster, in order to provide HPC Jupyter instances for memory- and compute-demanding applications while utilising existing resources. The current underlying infrastructure is bare metal. A Kubernetes infrastructure is simultaneously under development, and the JupyterHub VM and the PC-like data analysis environment may be moved to it in future.
  • We are running a test in a single-node installation
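Several of the responses above describe the same pattern: the SLURM batchspawner launching each user's notebook as a batch job. A minimal, hypothetical jupyterhub_config.py along those lines is sketched below; the partition name and resource values are placeholders rather than any facility's actual settings.

    # jupyterhub_config.py - minimal sketch of a SLURM batchspawner setup.
    # Partition name and resource values are placeholders.
    c = get_config()  # provided by JupyterHub when it loads this file

    c.JupyterHub.spawner_class = "batchspawner.SlurmSpawner"

    # Resources requested for each single-user notebook job
    c.SlurmSpawner.req_partition = "jupyter"   # placeholder partition name
    c.SlurmSpawner.req_nprocs = "1"
    c.SlurmSpawner.req_memory = "8G"           # e.g. the "1 core / 8 GB RAM" profile
    c.SlurmSpawner.req_runtime = "8:00:00"

    # Letting users pick between predefined configurations (full node, half a
    # node with a GPU, ...) is usually layered on top of this with an options
    # form or a profile spawner; that wiring is omitted from the sketch.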

Containers

Does your facility use containers already:

  • Yes (8) - one of these is still testing
  • Will be used (3)

Which container software is/will be available:

  • Docker (9)
  • Singularity (8)
  • Kubernetes (7)
  • OpenStack (2)

Are/will containers be used for data analysis:

  • Yes (8)
  • Maybe (3)

Do containers require access to DFS:

  • Yes (6) - access already possible
  • Yes (5) - required, but not currently possible

If containers require DFS access, how do you (plan to) handle security concerns:

  • Singularity (8)
  • K8s Pod Security Policy (1)
  • Active Directory (1)
  • Unknown (1)

Short summary of what containers are/will be used for at your facility:

  • Docker used for self-contained microservices, Singularity for data analysis environments
  • JupyterHub; Podman on desktops for testing; monitoring; inventory management; cron jobs (e.g. to create users, back up switch config). A test project has been completed that ran EPICS Soft IOCs as K8s pods. Contact me for more details on this.
  • Containers are used for micro-services, and were used for one of the JupyterHub servers, which is based on Kubernetes.
  • self contained microservices, jupyterhub, edge computing
  • self contained microservices, jupyter hub, edge computing
  • Likely for Jupyter.
  • Containers are currently used in production for deploying applications/microservices such as scicat and with jupyter for workshops. They are being tested as part of jupyterhub for data analysis and to be used in CI with gitlab.
  • Not only centrally provided services, we hope users will also be able to use containers for their own software
  • self contained microservices, Jupyterhub, edge computing
  • self-contained microservices (CI, GitLab, web servers, DAQ), singularity on HPC, Jupyter uses Docker, singularity can be used in beamlines and control
  • Jupyter, Control System Services, other research infrastructure services (e.g. Grafana)

Portal Deployment

How do you imagine the PaNOSC portal will be deployed at your facility? E.g. fully virtualised with everything running in Kubernetes; a mixed model with the web portions in Kubernetes and the SLURM Provider creating compute instances; or something else (a hypothetical sketch of the mixed model follows the responses):

  • Kubernetes for web portion (site, SSO, metadata search, etc...); SLURM Provider for compute
  • K8s
  • To me, mixed-model: micro-services for the web service + SLURM spawner
  • Mixed Model
  • a mixed-model
  • Probably a mixed model with both Kubernetes and our existing DAaaS VMs.
  • We haven't fully tested JupyterHub with our SLURM instance yet, so it is hard to give a definite answer, but it would be ideal if we were able to use both Kubernetes and SLURM spawners at our facility to use our resources most effectively. Initially we are happy to just deploy and test with Kubernetes, though.
  • Use of the portal may be limited if everything has to run in Kubernetes (e.g. only Jupyter available); if we can access SLURM on the backend then we have many more options for deploying useful data analysis pipelines.
  • mixed model
  • Mixed mode; it will likely use heterogeneous infrastructure
  • In a mixed model as indicated above
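Most responses point at the mixed model described in the question. One way such a split is sometimes wired together in JupyterHub is with the wrapspawner package, letting users choose between a Kubernetes-spawned notebook and a SLURM batch job. The sketch below is a hypothetical illustration only; the profile names, resource values, and the choice of wrapspawner are assumptions, not an agreed PaNOSC design.

    # jupyterhub_config.py - hypothetical sketch of a mixed model: the hub runs
    # in Kubernetes, and each user chooses a Kubernetes pod or a SLURM job.
    # Profile names and resource values are illustrative only.
    c = get_config()  # provided by JupyterHub when it loads this file

    c.JupyterHub.spawner_class = "wrapspawner.ProfilesSpawner"

    c.ProfilesSpawner.profiles = [
        # (display name, key, spawner class, spawner configuration)
        ("Kubernetes pod (2 cores, 8 GB)", "k8s", "kubespawner.KubeSpawner",
         dict(cpu_limit=2.0, mem_limit="8G")),
        ("SLURM batch job (full node)", "slurm", "batchspawner.SlurmSpawner",
         dict(req_partition="compute", req_runtime="8:00:00")),
    ]

The web portions of the portal (site, SSO, metadata search) would sit alongside the hub in Kubernetes, while the SLURM profile reuses the facility's existing batch resources.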

Comments

We require legacy software, so purely Jupyter-based software or similar is not an option for us.


Avoid fitting a square peg into a round hole. Containers for deployment are great but need access to the file system (existing and trusted workflows rely on hopping data to and from disk; rewriting is not a practical option in most cases). Kubernetes is interesting and under investigation, but only really useful if it's as 'easy' to use as existing and well-known batch systems. Maybe it doesn't suit, maybe it does - we have to find out!


