HPC SCHEDULERS
Without a scheduler, an HPC cluster would just be a collection of servers running jobs that interfere with each other. On a large cluster with multiple users, no individual user knows which compute nodes and CPU cores to use, or how many resources are available on each node. Cluster batch control systems solve this by managing jobs through HPC schedulers, which are essential for queueing jobs sequentially, assigning priorities, and distributing, parallelizing, suspending, killing, or otherwise controlling jobs cluster-wide. Below are some of the HPC schedulers commonly requested by Aspen Systems’ customers.
SLURM
The Simple Linux Utility for Resource Management (SLURM) is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
- No single point of failure, fault-tolerant options, backup daemons
- Highly scalable
- Up to 1000 job submissions/second (600 executions/second)
- Heterogeneous resources supported
- Each job can have custom operating systems booted
- Automatic job re-queue (policy configured based on exit value)
- Highly configurable (over 100 plugins)
- GNU General Public License
- Reserve or limit resources for specific users
- Real-time accounting down to the task level (identify tasks based on CPU or memory usage)
- Account for power consumption per job
- Report API use by user, and time consumed
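The three functions described above (allocating nodes, starting work on them, and managing a queue of pending work) can be illustrated with a toy model. This is only a minimal sketch and not how Slurm is implemented: the `ToyScheduler` class, its method names, and the strict priority-order policy are all assumptions made for illustration.

```python
import heapq
from itertools import count


class ToyScheduler:
    """Toy workload manager (hypothetical; not Slurm's actual design)."""

    def __init__(self, total_nodes):
        self.free_nodes = total_nodes
        self.pending = []    # heap of (-priority, seq, name, nodes)
        self.running = {}    # job name -> number of nodes it holds
        self._seq = count()  # tie-breaker: earlier submissions go first

    def submit(self, name, nodes, priority=0):
        """Queue a job; a higher priority value is served first."""
        heapq.heappush(self.pending, (-priority, next(self._seq), name, nodes))

    def schedule(self):
        """Start pending jobs in priority order while the head job fits."""
        started = []
        while self.pending and self.pending[0][3] <= self.free_nodes:
            _, _, name, nodes = heapq.heappop(self.pending)
            self.free_nodes -= nodes      # allocate nodes to the job
            self.running[name] = nodes    # "start" the job
            started.append(name)
        return started

    def finish(self, name):
        """Release a finished job's nodes, then try to start more work."""
        self.free_nodes += self.running.pop(name)
        return self.schedule()


sched = ToyScheduler(total_nodes=4)
sched.submit("sim-a", nodes=3, priority=1)
sched.submit("sim-b", nodes=2, priority=5)
sched.submit("sim-c", nodes=2, priority=2)
print(sched.schedule())       # ['sim-b', 'sim-c']
print(sched.finish("sim-b"))  # [] because sim-a needs 3 nodes and only 2 are free
print(sched.finish("sim-c"))  # ['sim-a']
```

The sketch stops at the first pending job that does not fit, so a large job at the head of the queue can hold back smaller ones. Production schedulers typically offer policies such as backfill to reduce that effect.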
Moab HPC Suite
Moab HPC Suite is a workload and resource orchestration platform that automates the scheduling, managing, monitoring and reporting of HPC workloads at massive scale. The patented Moab intelligence engine uses multi-dimensional policies and advanced future modeling to optimize workload start and run times on diverse resources.
Maui Cluster Scheduler
Maui is a highly optimized and configurable advanced job scheduler for use on clusters. It is capable of supporting a large array of scheduling policies, including dynamic priorities, extensive reservations, and fair-share, and also interfaces with numerous resource management systems. Maui improves the manageability and efficiency of machines ranging from servers of a few processors to multi-teraflop clusters.
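To make the idea of dynamic, fair-share priorities concrete, here is a minimal sketch of one way such a priority could be computed. It is not Maui's actual formula or configuration: the `fairshare_priority` function, its weights, and the sample usage numbers are hypothetical. The assumption is that a job gains priority the longer it waits, and gains or loses priority depending on whether its owner has used less or more than a target share of the cluster.

```python
def fairshare_priority(wait_minutes, user_usage, total_usage, target_share,
                       wait_weight=1.0, fs_weight=100.0):
    """Combine queue wait time with a fair-share adjustment (illustrative only).

    user_usage / total_usage is the fraction of recent cluster usage consumed
    by the job's owner; target_share is the fraction the owner is entitled to.
    """
    actual_share = user_usage / total_usage if total_usage else 0.0
    # Users below their target get a boost; users above it get a penalty.
    return wait_weight * wait_minutes + fs_weight * (target_share - actual_share)


# Hypothetical workload: alice has used most of the recent core-hours.
jobs = [
    {"user": "alice", "wait_minutes": 10, "usage": 800},
    {"user": "bob", "wait_minutes": 5, "usage": 200},
]
total = sum(j["usage"] for j in jobs)
ranked = sorted(
    jobs,
    key=lambda j: fairshare_priority(j["wait_minutes"], j["usage"], total,
                                     target_share=0.5),
    reverse=True,
)
print([j["user"] for j in ranked])  # ['bob', 'alice']
```

Although alice's job has waited longer, bob's job ranks first because bob is well under the assumed 50% target share.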
TORQUE Resource Manager
TORQUE (Tera-scale Open-source Resource and QUEue manager) is a resource manager providing control over batch jobs and distributed compute nodes. It is a community effort based on the original PBS project and has incorporated significant advancements in the areas of scalability, fault tolerance, and feature extensions contributed by NCSA, OSC, USC, the U.S. Dept of Energy, Sandia, PNNL, University of Buffalo, TeraGrid, and many other leading-edge HPC organizations. TORQUE is fully supported by Moab Workload Manager and Maui Scheduler.
VIEWPOINT
Viewpoint is a rich, easy-to-use portal for end-users and administrators, designed to increase productivity through its visual web-based interface, powerful job management features and other workload functions. The portal provides greater self-sufficiency for end-users while reducing administrator overhead in High Performance Computing (HPC).
PBS
The Portable Batch System (PBS) is the leading workload management solution for HPC systems and Linux clusters. PBS was originally designed for NASA because the resource management systems of the day were inadequate for modern parallel and distributed computers and clusters. From the initial design forward, PBS has included innovative approaches to resource management and job scheduling, such as the extraction of scheduling policy into a single separable, completely customizable module. PBS today exists as OpenPBS, the open-source version, and PBS Professional. PBS Works operates in networked multi-platform UNIX environments and supports heterogeneous clusters of workstations, supercomputers, and massively parallel systems. PBS Professional is the trusted solution for workload management.
GRID ENGINE
When you move from network computing to grid computing, you can expect reduced costs, shorter time to market, and increased quality and innovation, along with the ability to develop products you couldn’t before. Grid computing solutions are ideal for compute-intensive industries such as scientific research, EDA, life sciences, MCAE, geosciences, financial services and others.
THE PATH TO SCALED-UP DATA CENTERS
Univa Grid Engine is commercially supported, licensed software and the leading distributed resource management system. It optimizes resources in thousands of data centers by transparently selecting the resources best suited to each segment of work. Grid Engine software manages workloads automatically, maximizes shared resources and accelerates deployment of any container, application or service in any technology environment, on-premises or in the cloud.
OPEN GRID SCHEDULER
Open Grid Scheduler/Grid Engine is a commercially supported open-source batch-queuing system for distributed resource management. OGS/GE is based on Sun Grid Engine and is maintained by the same group of external (i.e. non-Sun) developers who have been contributing code since 2001.
SON OF GRID ENGINE
Son of Grid Engine is a community project that continues Sun’s old Grid Engine free software project, which used to live at http://gridengine.sunsource.net. The project began after Oracle shut down the site and stopped contributing code (Univa now owns the copyright). It will maintain copies of as much useful material from the old site as possible.