Training 2 of 2
Aspen Cluster Administrator Training is an administrator's introduction to an Aspen cluster. It teaches your administration staff how best to maintain and repair your delivered configuration. The training is tailored to your delivered system, so topics that are not germane to your specific implementation may be omitted. Some Linux administrative knowledge is a prerequisite for this course, and that knowledge is used as a base to introduce and expand upon more advanced topics. Physical access to the delivered cluster is mandatory, so this training must be done either on-site with the cluster or, prior to delivery, at an Aspen manufacturing facility.
-
Cluster Hardware Familiarization: This module familiarizes the cluster administration staff with the delivered system. Specific node types are identified and, if desired, removed for in-depth examination and on-site hardware maintenance procedural training. System peripherals, network cabling, color coding, and safety and ESD requirements are identified to ensure participant familiarity. External connectivity requirements are also identified, and node port cabling is reviewed.
-
Cluster Distribution and Image Type: This module trains the administrators on the specific distribution used for their delivered cluster. Distribution update procedures are shown or reviewed, along with guidance on when to patch, which nodes to patch, and what effect patching will have on the operational cluster. If a commercial distribution is used, the cluster administrators are trained on how to access distribution update resources and patches. Any distribution-specific limitations or operational constraints are identified, and distribution-specific update tools are identified and used.
-
Network, NAT, Firewall, IP Addressing: In this module, administrators are trained on their site-specific configuration and requirements. Internal and external IP addressing is reviewed, and external node NAT gateways are explained. Firewall rules are examined in detail, and additional rules are implemented to allow specific nodes, networks, or protocols.
-
RAIDS, Storage, and File Systems: In this module, your administrators are trained on your specific file system layout on each node type, and all data storage systems and configurations are examined in detail. One of the most critical items for your administration staff to learn is how to monitor and manage your RAID and storage systems, and successful completion of this module will ensure that they have the requisite knowledge to replace RAID disks or troubleshoot and repair storage system faults as necessary.
-
Cluster Services: This module presents training on Network Time Protocol (NTP), additional monitoring utilities, e-mail configuration, and any other services that may be configured on your cluster.
-
Interconnects, MPI Implementations, Compilers, and Environment Modules: In this module, administrators are familiarized with the installed interconnect technology, any installed commercial compilers, and the available MPI implementations. MPI building and installation are reviewed and, if possible, an additional MPI implementation is built, installed, and tested as part of the training. If environment modules are installed and used on the cluster, an additional environment module is written and installed so that your cluster administration staff sees the entire process of adding a new MPI implementation.
-
Utilities: In this module, your HPC administrators are trained to use cluster-wide command-line tools such as IPMI utilities, "dsh", and other parallel administrative tools to gather data about and manage their cluster. Additional packages are installed across the cluster via command-line utilities, and cluster-wide data is gathered via a student-written script.
-
Scheduling: In this module, the cluster administration staff is trained on the scheduler installed on the cluster. Routine tasks such as marking nodes offline are explored, as are changes to site policy and job routing.
-
Redundant Node Fail-over: If redundant fail-over systems are configured in the cluster, node fail-over is tested and explained in detail. If possible, actual fail-over scenarios are detailed and emulated on the delivered systems.
-
Node Failure: In this module, node failure and fault correction are emulated. For instance, if your cluster nodes are disked rather than single image, a node disk failure is generated and the node is re-imaged via the command line or ABC utilities.
-
ABC: If ABC is installed on your cluster, this training module provides an in-depth walk-through of ABC's capabilities and of how best to use ABC to monitor and manage your cluster.
-
Support: In this module, administrators are trained to utilize Aspen support tools and given contact and procedural information for accessing Aspen technical support resources quickly and efficiently.
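As an illustration of the kind of student-written script mentioned in the Utilities module, the following is a minimal sketch that runs one command on several nodes in parallel and collects the output per node. It is not the script used in the course. It assumes password-less ssh access to the nodes, and the function names (run_on_node, gather) and any node names passed to them are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_on_node(node, command):
    """Run `command` on `node` over ssh and return its trimmed stdout.

    Assumes key-based ssh access; BatchMode makes ssh fail instead of
    prompting for a password.
    """
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", node, command],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.stdout.strip()


def gather(nodes, command, runner=run_on_node, workers=8):
    """Run `command` on every node in parallel and return {node: output}.

    `nodes` is a list of host names; `runner` can be swapped out, for
    example to use a different remote shell or to test without a cluster.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so outputs line up with nodes.
        outputs = pool.map(lambda node: runner(node, command), nodes)
        return dict(zip(nodes, outputs))
```

For example, calling gather(["node01", "node02"], "uname -r") with node names from your own cluster would return a dictionary that maps each node to the kernel release it reports. Tools such as "dsh" provide the same fan-out from the command line, and a script like this is mainly useful when the collected data needs further processing.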
Aspen Custom Training Sessions
Custom training sessions can be contracted as well. If, for example, you require parallel code development instruction, debugger training, performance tuning, or other subject training not covered by our standard training packages, your sales engineer can work with you either to extend one of our packages or to outline an entirely custom curriculum. In many cases, a custom training session will require that your students complete one of our standard training packages before the additional training session begins. Additional time will be needed to develop your custom curriculum. Our training engineers will need to work with you extensively to understand your exact training goals. They will present curriculum outlines, which you must approve before your course is developed.
For more information about our training packages or to discuss custom training, contact your sales engineer or Aspen Systems sales at 1-800-992-9242.




