The Technology Specialist – HPC ensures that critical IT services and high-performance computing (HPC) infrastructure are available, efficient, and secure. The person in this role manages the daily operations of mission-critical systems in multiple clients' data centres, working closely with both facilities engineering teams (power, cooling, physical infrastructure) and IT infrastructure and operations teams to support clients around the clock. This role combines technical leadership, operations oversight, incident and problem management, and strategic planning.
- Experience architecting and maintaining HPC/AI systems.
- Linux system administration
- Cluster management
- System and software configuration management
- High-speed networking
- Resource managers and schedulers
- High-speed parallel storage
- Monitoring and alerting
- Strong understanding of HPC/AI architectures and concepts.
- Experience supporting and managing a group of HPC/AI clusters.
- Excellent knowledge of prototyping and deploying HPC/AI clusters.
- Extensive experience in troubleshooting the Linux OS, filesystems, and cluster hardware.
- Good command of Linux scripting languages such as Bash, Perl, and Python.
- Experience implementing, maintaining, and verifying defined security policies.
- Willingness to maintain a flexible work schedule.
- A positive attitude and willingness to help lab users succeed.
- Excellent guidance and teamwork skills.