Without a scheduler, an HPC cluster would just be a collection of servers whose jobs interfere with one another. On a large cluster with multiple users, no individual user knows which compute nodes and CPU cores to use, or how many resources are free on each node. To solve this, cluster batch systems use HPC schedulers to manage jobs across the whole system. Schedulers are essential for queueing jobs, assigning priorities, and distributing, parallelizing, suspending, killing, or otherwise controlling jobs cluster-wide. Below are some of the HPC schedulers commonly requested by Aspen Systems’ customers.
Moab HPC Suite is a workload and resource orchestration platform that automates the scheduling, management, monitoring, and reporting of HPC workloads at massive scale. The patented Moab intelligence engine uses multi-dimensional policies and advanced future modeling to optimize workload start and run times on diverse resources.
Maui is a highly optimized and configurable advanced job scheduler for use on clusters. It supports a large array of scheduling policies, including dynamic priorities, extensive reservations, and fair-share, and interfaces with numerous resource management systems. Maui improves the manageability and efficiency of machines ranging from servers with a few processors to multi-teraflop clusters.
TORQUE (Tera-scale Open-source Resource and QUEue manager) is a resource manager providing control over batch jobs and distributed compute nodes. It is a community effort based on the original PBS project and has incorporated significant advancements in scalability, fault tolerance, and feature extensions contributed by NCSA, OSC, USC, the U.S. Department of Energy, Sandia, PNNL, the University at Buffalo, TeraGrid, and many other leading-edge HPC organizations. TORQUE is fully supported by Moab Workload Manager and the Maui Scheduler.
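As a concrete illustration, a minimal TORQUE batch script might look like the following sketch; the job name, queue, and resource values here are placeholders and vary by site:

```shell
#!/bin/bash
# Sketch of a TORQUE batch script; directive values below are assumptions.
#PBS -N example_job            # job name (placeholder)
#PBS -l nodes=2:ppn=8          # request 2 nodes, 8 processors per node
#PBS -l walltime=01:00:00      # 1-hour wall-clock limit
#PBS -q batch                  # queue name (site-specific)

# TORQUE sets PBS_O_WORKDIR to the directory qsub was run from;
# falling back to "." lets the script also run outside the scheduler.
cd "${PBS_O_WORKDIR:-.}"
msg="Running on $(hostname)"
echo "$msg"
```

Such a script would be submitted with `qsub job.sh`; `qstat` and `qdel` then monitor or remove the queued job.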
Viewpoint is a rich, easy-to-use portal for end-users and administrators, designed to increase productivity through its visual web-based interface, powerful job management features and other workload functions. The portal provides greater self-sufficiency for end-users while reducing administrator overhead in High Performance Computing (HPC).
The Portable Batch System (PBS) is the leading workload management solution for HPC systems and Linux clusters. PBS was originally designed for NASA because the existing resource management systems of the day were inadequate for modern parallel/distributed computers and clusters. From the initial design forward, PBS has included innovative approaches to resource management and job scheduling, such as the extraction of scheduling policy into a single separable, completely customizable module. PBS today exists as OpenPBS, the open-source version, and the commercial PBS Professional, which is marketed as the trusted solution for workload management. PBS Works operates in networked, multi-platform UNIX environments and supports heterogeneous clusters of workstations, supercomputers, and massively parallel systems.
Watch this interview to see how Imperial College London benefits from using PBS Works for cluster workload management.
Univa Grid Engine is commercially supported and licensed software, and the leading distributed resource management system: it optimizes resources in thousands of data centers by transparently selecting the resources best suited to each segment of work. Grid Engine manages workloads automatically, maximizes shared resources, and accelerates deployment of any container, application, or service in any technology environment, on-premises or in the cloud.
Open Grid Scheduler/Grid Engine is a commercially supported open-source batch-queuing system for distributed resource management. OGS/GE is based on Sun Grid Engine and is maintained by the same group of external (i.e., non-Sun) developers who have been contributing code since 2001.
The Son of Grid Engine is a community project to continue Sun’s free-software Grid Engine project, which lived at http://gridengine.sunsource.net until Oracle shut down the site and stopped contributing code (Univa now owns the copyright). The project maintains copies of as much as possible of the useful material from the old site.
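All of these Grid Engine lineages share the same qsub interface, with directives prefixed by `#$`. A minimal sketch of a Grid Engine job script, with placeholder names and site-specific resource values:

```shell
#!/bin/bash
# Sketch of a Grid Engine job script; values below are assumptions.
#$ -N example_job          # job name (placeholder)
#$ -pe smp 8               # parallel environment and slot count (site-specific)
#$ -l h_rt=01:00:00        # hard run-time limit of 1 hour
#$ -cwd                    # run the job in the submission directory

ge_msg="Grid Engine job on $(hostname)"
echo "$ge_msg"
```

As with TORQUE, `qsub job.sh` submits the script, and `qstat`/`qdel` manage it once queued.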
The Simple Linux Utility for Resource Management (SLURM) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
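Those three functions map directly onto a batch script: `#SBATCH` directives request the allocation, the script body is the work launched on the allocated nodes, and the pending-work queue handles contention. A minimal sketch, with assumed partition and resource values:

```shell
#!/bin/bash
# Sketch of a Slurm batch script; partition and resource values are assumptions.
#SBATCH --job-name=example      # job name (placeholder)
#SBATCH --nodes=2               # allocate 2 compute nodes
#SBATCH --ntasks-per-node=8     # 8 tasks per node
#SBATCH --time=01:00:00         # 1-hour wall-clock limit
#SBATCH --partition=batch       # partition/queue name (site-specific)

# On a real cluster, srun would launch the parallel job across the
# allocation; a plain echo keeps this sketch runnable anywhere.
slurm_msg="Slurm job step on $(hostname)"
echo "$slurm_msg"
```

`sbatch job.sh` queues the script, `squeue` shows its place in the pending-work queue, and `scancel` removes it.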