- Company Name: Sustainable Talent
- Job Title: Senior System Engineer
- Job Description:
**Role Summary**
Design, deploy, and maintain a high‑availability compute cluster supporting NVIDIA’s next‑generation GPU, AI/ML, and accelerated computing hardware. Lead system recovery, root‑cause analysis, and continuous improvement of infrastructure reliability, performance, and operational efficiency in a large‑scale on‑prem data center and lab environment.
**Expectations**
- Manage and expand a dense GPU‑clustered compute farm, ensuring uptime, performance, and safety.
- Deliver rapid remediation of hardware, network, storage, and thermal incidents.
- Scale new systems and perform qualification, benchmarking, and lifecycle management.
- Meet internal SLAs (PUE, MTTR, test throughput) and collaborate on capacity planning.
**Key Responsibilities**
- Partner with system architects, hardware, firmware, QA, and platform teams to develop and release products.
- Maintain racks, GPU nodes, interconnects, storage arrays, and supporting infrastructure (power, cooling, UPS).
- Monitor availability, conduct root‑cause analysis, and drive remediation initiatives.
- Deploy, qualify, and scale high‑density GPU clusters, rack‑scale systems, and liquid‑cooling environments.
- Coordinate inventory, asset lifecycle, configuration management, decommissioning, and refresh.
- Ensure lab and data‑center hygiene (cable management, ESD compliance, tool control).
- Troubleshoot cross‑platform issues (Windows, Linux, macOS) spanning firmware, OS, and platform infrastructure.
- Represent the infrastructure team in reviews and global NVIDIA coordination meetings.
**Required Skills**
- Experience in large‑scale datacenter or compute‑lab environments (compute‑dense, hyperscale).
- Proficient with DCIM tools (e.g., Nautobot), version control (Git, Perforce), and automation (shell, Python, Ansible).
- Strong networking fundamentals (TCP/IP, DNS, NFS, SSL/TLS, IPv6) and high‑bandwidth interconnects.
- Multi‑OS support (Windows, macOS, Linux), including BIOS/firmware updates, driver deployments, and system imaging.
- Physical hardware expertise: PCBs, GPUs, server/node deployments, rack integration, cooling/power, cable/fibre management.
- Excellent written and verbal communication; analytical problem‑solving; ownership mindset.
**Required Education & Certifications**
- Associate’s or Bachelor’s degree in Engineering, Computer Science, or related technical field (or equivalent experience).
- Certifications preferred: CCNA/CCNP or similar networking/infrastructure credentials.
- Experience with HPC or GPU clusters (Slurm, Kubernetes, BCM) and private cloud stacks (OpenStack, VMware, Nutanix) is advantageous.