L2 HPC Engineer with Application Expertise
Role Overview:
An L2 HPC (High-Performance Computing) Engineer with an application skillset is responsible for supporting, troubleshooting, and maintaining HPC infrastructure and assisting users with scientific and engineering applications. They operate between infrastructure and application layers, ensuring optimal performance and availability of both.
Core Responsibilities:
- HPC Cluster Support: Manage day-to-day operations of HPC clusters (Slurm, PBS, LSF), monitor jobs and node health, and resolve user issues at the L2 level.
- Application Support & Optimization: Support scientific/engineering applications (ANSYS, Gaussian, GROMACS, OpenFOAM, etc.), including installation, configuration, tuning, and optimization of parallel execution (MPI/OpenMP).
- User & Job Management: Handle user access and environment setup (modules, environment variables), and resolve job scheduling issues.
- Performance Monitoring: Use tools such as Ganglia, Prometheus, or Nagios to monitor cluster and job performance.
- OS & Middleware Maintenance: Update and patch the OS (Linux/RHEL/CentOS), compilers (Intel, GNU), and libraries (MPI, BLAS, CUDA).
- Collaboration: Work with L3/engineering teams on complex issues and contribute to environment upgrades or migrations.
Key Skill Areas:
- HPC Environment: Slurm, PBS Pro, LSF, Bright Cluster Manager
- OS & Shell: Linux (RHEL/CentOS), Bash scripting
- Compilers & Libraries: Intel, GCC, OpenMPI, MPICH, CUDA
- Applications: ANSYS, Abaqus, Gaussian, GROMACS, MATLAB, OpenFOAM
- Monitoring Tools: Ganglia, Nagios, Prometheus
- Job Scheduler Debugging & Log Analysis
- Basic Networking & Storage Concepts
- Software Module Systems (Lmod/Environment Modules)
Preferred Experience:
- 4–6 years in HPC environments
- Exposure to GPU workloads
- Understanding of parallel computing fundamentals
- Ability to interact with application end-users and researchers