Overview

About ICDS

The Institute for Computational and Data Sciences (ICDS) is one of seven interdisciplinary research institutes within Penn State's Office of the Senior Vice President for Research. The mission of ICDS is to build capacity to solve problems of scientific and societal importance through cyber-enabled research. ICDS enables and supports the diverse computational and data science research taking place throughout Penn State.

ICDS provides university faculty, staff, students, and collaborators access to Roar, which consists of the Roar Collab (RC) and Roar Restricted (RR) research computing clusters. Roar Collab is the flagship computing cluster for Penn State researchers. Access to Roar Restricted is provided on an as-needed basis to research groups specifically handling restricted data. Most of the material within this ICDS User Guide is common to both Roar Collab and Roar Restricted, but some sections may specifically refer to Roar Collab. The Roar Restricted Addendum specifically addresses items unique to Roar Restricted.

Shared Computing Clusters

A computing cluster is a group of interconnected computers that work together to perform computational tasks. Each of these computers, referred to as a node, typically consists of its own processor(s), memory, and network interface. Often these nodes are additionally connected to a shared filesystem so data is available to all of the nodes. Nodes can be used individually but can also be utilized collectively to perform demanding computational processes more efficiently. Integrating the separate nodes into a single system requires a suite of cluster management software that is configured and implemented by system administration personnel.

The research computing clusters offered by ICDS are shared computational resources, so computational processes must be managed to allow for many simultaneous users. To perform computationally intensive tasks, users must request compute resources and be granted access to them. A computational process or workflow run via this request and provision of computational resources is typically referred to as a computational job. The request and provision process allows the tasks of many users to be scheduled and carried out efficiently, avoiding resource contention.

The software tasked with managing the scheduling of computational resources is often referred to as the resource manager or the job scheduler. Shared computing clusters typically mediate access to their pool of computational resources not only through the resource manager but also through the cluster architecture. In addition to the set of compute nodes that act as the pool of computational resources, a cluster often includes a set of nodes that specifically handle user logins, along with other auxiliary nodes used for specific system administration functions. A user typically connects to a login-type node and then requests computational resources via the resource management software. A node configured to handle user logins and the submission of resource requests is referred to as a submit node, while a node used for computationally intensive tasks is referred to as a compute node.

Slurm

Slurm (Simple Linux Utility for Resource Management) is the software that acts as the job scheduler and resource manager. Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters. Its primary functions are to

  • Allocate access to compute resources to users for some duration of time
  • Provide a framework for starting, executing, and monitoring work on the set of allocated compute resources
  • Arbitrate contention for resources by managing a queue of pending work

Slurm is widely adopted and is used by many other computing clusters as well. The Submitting Jobs section provides further detail on the use of Slurm on Roar, and Slurm's documentation is a great resource for in-depth details on its usage.
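
As a brief illustration, a minimal Slurm batch script might look like the following sketch. The job name, resource values, and commands shown here are placeholders, and depending on the allocation, additional directives such as an account or partition may be required; the Submitting Jobs section covers the directives available on Roar in detail.

#!/bin/bash
#SBATCH --job-name=example_job     # placeholder job name
#SBATCH --nodes=1                  # request one node
#SBATCH --ntasks=1                 # run a single task
#SBATCH --cpus-per-task=1          # one core for the task
#SBATCH --mem=4G                   # placeholder memory request
#SBATCH --time=00:10:00            # placeholder wall-time limit (HH:MM:SS)

# Commands below run on the allocated compute resources
echo "Running on $(hostname)"

Saving the script to a file (for example, example_job.sh) and running sbatch example_job.sh submits the job to the queue, and squeue -u $USER shows its status while it is pending or running.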

Slurm's sinfo Command

The ICDS research computing clusters are heterogeneous and somewhat dynamic. To see the different node configurations available, use the following command:

sinfo --Format=features:40,nodelist:20,cpus:10,memory:10

This sinfo command displays not only the core and memory configuration of the nodes but also the processor generation associated with each node. The first column of the output lists the features associated with each block of nodes. Additionally, while connected to a specific node, the lscpu command provides more detailed information on the processor type available on that node.
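
For example, after connecting to a compute node, the processor model can be checked by filtering the lscpu output:

lscpu | grep "Model name"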

To add a column to the sinfo command output that indicates the number of GPU(s) associated with each of the node blocks, simply add the gres option to the sinfo format string:

sinfo --Format=features:40,nodelist:20,cpus:10,memory:10,gres:10

On a GPU node, running the nvidia-smi command displays more detailed information on the GPU(s) available on that node.
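
As an illustrative sketch, a single GPU can be requested through Slurm and inspected with nvidia-smi. The time limit below is a placeholder, and depending on the allocation, additional options such as an account or partition may be required (see the Submitting Jobs section):

srun --gres=gpu:1 --time=00:05:00 nvidia-smi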

Slurm's sinfo documentation page provides a detailed description of the function and options of the sinfo command.

System Specs

Roar comprises two distinct research computing clusters. The Roar Collab (RC) and Roar Restricted (RR) system specifications are described below.

Roar Collab

Roar Collab (RC) is the flagship computing cluster for Penn State researchers. Designed with collaboration in mind, the RC environment allows for more frequent software updates and hardware upgrades to keep pace with researchers’ changing needs. RC utilizes the Red Hat Enterprise Linux (RHEL) 8 operating system to provide users with access to compute resources, file storage, and software. RC is a heterogeneous computing cluster composed of different types of compute nodes, each of which can be categorized as a Basic, Standard, High-Memory, GPU, or Interactive node.

Basic (bc)
  Core/Memory Configurations: 24 cores, 128 GB; 64 cores, 256 GB
  Description: Connected via Ethernet. Configured to offer about 4 GB of memory per core. Best used for single-node tasks.

Standard (sc)
  Core/Memory Configurations: 24 cores, 256 GB; 48 cores, 380 GB; 48 cores, 512 GB
  Description: Connected via Infiniband and Ethernet; Infiniband connections provide higher bandwidth inter-node communication. Configured to offer about 10 GB of memory per core. Good for single-node tasks and also multi-node tasks.

High-Memory (hc)
  Core/Memory Configurations: 48 cores, 1 TB; 56 cores, 1 TB
  Description: Connected via Ethernet. Configured to offer about 25 GB of memory per core. Best for memory-intensive tasks.

GPU (gc)
  Core/Memory Configurations: 28 cores, 256 GB; 28 cores, 512 GB; 48 cores, 380 GB
  Description: Feature GPUs that can be accessed either individually or collectively. Both A100 and P100 GPUs are available.

Interactive (ic)
  Core/Memory Configurations: 36 cores, 500 GB
  Description: Feature GPUs that are specifically configured for GPU-accelerated graphics. Best for running graphical software that requires GPU-accelerated graphics.

Roar Collab User Flow Diagram

Roar Restricted

Roar Restricted (RR) is designed for research groups that handle restricted data. RR is only accessible when connecting either via the Penn State network or via Penn State GlobalProtect VPN. Additionally, transferring data to/from RR is performed using RR's Secure Data Transfer Management Model. RR utilizes the Red Hat Enterprise Linux (RHEL) 8 operating system to provide users with access to compute resources, file storage, and software. RR is a heterogeneous computing cluster composed of different types of compute nodes, each of which can be categorized as a Standard, GPU, or Interactive node.

Standard (sc)
  Core/Memory Configurations: 20 cores, 256 GB; 48 cores, 380 GB
  Description: General purpose compute nodes on RR. Connected via Infiniband and Ethernet; Infiniband connections provide higher bandwidth inter-node communication. Good for single-node tasks and also multi-node tasks.

GPU (gc)
  Core/Memory Configurations: 28 cores, 256 GB
  Description: Feature GPUs that can be accessed either individually or collectively. P100 GPUs are available.

Interactive (ic)
  Core/Memory Configurations: 28 cores, 256 GB
  Description: Feature GPUs that are specifically configured for GPU-accelerated graphics. Best for running graphical software that requires GPU-accelerated graphics.

Roar Restricted User Flow Diagram

Best Practices

Adhering to the recommended best practices on Roar ultimately improves system functionality and stability. Roar is shared by many users, and a user's operating behavior can inadvertently impact system functionality for other users. Exercise good citizenship to mitigate the risk of adversely impacting the system and the ICDS research community.

Don't! Do Not Perform Computationally Intensive Tasks On Submit Nodes

The submit nodes (with a submit* hostname) are not configured to handle intensive computational tasks. Dozens, and sometimes hundreds, of users may be logged on at any one time. Think of the submit nodes as a prep area, where users may edit and manage files, initiate file transfers, submit new jobs, and track existing jobs. The submit nodes serve as an interface to the system and to the computational resources.

Perform intensive computations, like utilizing research software and conducting development and debugging sessions, on compute nodes. To access compute nodes, either submit a batch job or request an interactive session. The Submitting Jobs section provides further details on requesting computational resources.
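
For example, an interactive session can be requested with salloc. The resource values below are placeholders, and additional options such as an account or partition may be required depending on the allocation:

salloc --nodes=1 --ntasks=4 --mem=8G --time=02:00:00

Once the allocation is granted, the requested resources are available for interactive use until the session is exited or the time limit is reached.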

Since the submit nodes are not configured for intensive computations, computational performance on them is poor. Additionally, running computationally intensive or disk-intensive tasks on a submit node degrades performance for other users. Habitually running jobs on the submit nodes can lead to account suspension.

Do! Remain Cognizant of Storage Quotas

All available storage locations have associated quotas. If usage of a storage location approaches its quota, software may not function properly and may produce cryptic error messages. The Handling Data section provides further details on checking storage usage relative to the quotas.
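
As a quick command-line check, standard Linux tools can report how much space a directory is using; the second path below is a placeholder for any specific project directory, and the Handling Data section describes the quota-reporting utilities available on Roar:

du -sh ~                  # total size of the home directory
du -sh ~/project_dir      # placeholder path; size of a specific directory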

Don't! Do Not Use Scratch as a Primary Storage Location

Scratch serves as a temporary repository for compute output and is explicitly designed for short-term usage. Unlike other storage locations, scratch is not backed up. Files are subject to automatic removal after 30 days. Only utilize scratch for files that are non-critical and/or can be easily regenerated. The Handling Data section provides further details on storage options.
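
To see which scratch files are approaching the 30-day window, a standard find command can list files that have not been modified recently. The path below is a placeholder for the actual scratch directory described in the Handling Data section:

find /path/to/scratch -type f -mtime +25    # placeholder path; lists files not modified in the last 25 days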

Do! Try To Minimize Resource Requests

The amount of time jobs are queued grows as the amount of requested resources increases. To minimize the amount of time a job is queued, minimize the amount of resources requested. It is best to run small test cases to verify that the computational workflow runs successfully before scaling up the process to a large dataset. The Submitting Jobs section provides further details on requesting computational resources.
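
After a test job completes, Slurm's accounting tools can help right-size future requests by showing how much of the requested resources were actually used. The job ID below is a placeholder, and the seff utility may or may not be available on the cluster:

sacct -j 1234567 --format=JobID,Elapsed,AllocCPUS,MaxRSS    # placeholder job ID
seff 1234567                                                # reports CPU and memory efficiency, if installed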

Policies

The policies regarding the use of Roar can be found on the ICDS Policies page.