Site Reliability Engineer
Company: Cisco Systems, Inc.
Location: Raleigh
Posted on: January 23, 2025
Job Description:
Application Deadline: 1/20/25

Who We Are
At Cisco, we are a global leader in networking and IT, driving
innovation and redefining how people connect, communicate, and
collaborate. Our mission is to shape the future of the internet by
creating unprecedented value and opportunity for our customers,
employees, investors, and ecosystem partners. We are committed to
fostering a diverse and collaborative environment where everyone can
thrive and contribute to our collective success.

Who You Are
We are seeking a highly skilled and experienced Senior Engineer to
join our team, focusing on the design and development of AI services
and capabilities tailored to observability for IT's GPU-based AI
clusters. This role involves reshaping how we manage alerts, metrics,
and logs by introducing deep learning and GenAI to enhance
reliability services. The ideal candidate will have a strong
background in artificial intelligence, machine learning, and
GPU-based AI infrastructure, with a consistent record of delivering
innovative solutions that enhance system monitoring, performance, and
reliability.

Key Responsibilities:
- Design, build, and maintain observability systems
for managing NVIDIA DGX clusters, ensuring flawless monitoring of AI
workloads, hardware utilization (GPUs), and system health.
- Develop monitoring tools and dashboards that track key metrics such
as GPU utilization, memory, temperature, latency, network bandwidth,
model performance, and system availability.
- Build custom alerting systems for AI/ML workflows, enabling
proactive issue detection (e.g., GPU failures, hardware bottlenecks,
system crashes).
- Collaborate with IT and MLOps teams to design efficient, scalable
solutions for deploying, monitoring, and managing machine learning
models on DGX systems.
- Optimize DGX infrastructure by implementing standard processes for
observability, ensuring high performance and reducing operational
costs.
- Monitor system-level metrics such as hardware temperature, power
consumption, and GPU/CPU health to prevent hardware degradation or
failure.
- Develop solutions for monitoring AI/ML model performance across DGX
clusters, integrating logging and monitoring for model training,
inference, and deployment processes.
- Integrate observability tools (e.g., Prometheus, Grafana, Splunk)
with NVIDIA-specific tools (e.g., DCGM, NVIDIA GPU Cloud) for
real-time monitoring and alerting.
- Work closely with data scientists and machine learning engineers to
ensure effective resource utilization and model observability,
including the identification of performance bottlenecks and tuning
for optimal GPU usage.
- Drive troubleshooting and root cause analysis for failures and
anomalies in both the DGX hardware and the AI/ML models running on
the infrastructure.
- Ensure compliance with ethical AI standards by monitoring fairness,
model drift, and performance consistency.
- Document standard methodologies and processes for managing,
deploying, and monitoring AI workloads on DGX clusters.

Minimum Qualifications:
- Bachelor's degree in Computer Science, Software Engineering, Data
Science, or a related field.
- 7+ years of experience in software engineering, systems
engineering, or DevOps roles.
- 3+ years of experience in high-performance computing (HPC) or AI/ML
environments.

Preferred Qualifications:
- Strong
experience managing NVIDIA DGX systems or similar GPU-based
computing clusters.
- Proficiency in GPU monitoring tools such as NVIDIA Data Center GPU
Manager (DCGM) and related NVIDIA libraries/APIs.
- Experience with AI/ML model deployment and monitoring on
large-scale infrastructure, including model performance metrics
(latency, throughput, accuracy).
- Hands-on experience with observability tools such as Prometheus,
Grafana, Splunk, or similar, especially in high-performance computing
environments.
- Proficiency in scripting/programming languages (e.g., Python, Bash,
Go) for automating cluster management and monitoring tasks.
- Experience with container orchestration technologies (e.g., Docker,
Kubernetes), including NVIDIA's GPU Operator for Kubernetes.
- Familiarity with AI/ML lifecycle management tools such as MLflow,
Kubeflow, or similar.
- Strong understanding of HPC environments, including distributed
computing, storage, and networking for AI/ML workloads.
- Experience with infrastructure monitoring and troubleshooting at
both hardware (GPU, CPU, memory) and software (AI/ML models,
applications) levels.
- Strong analytical and problem-solving skills, with the ability to
interpret complex data and develop actionable insights.
- Excellent verbal and written communication skills, with the ability
to convey technical concepts to non-technical partners.
- Ability to work effectively in a collaborative team environment and
manage multiple projects simultaneously.
- Experience with NVIDIA NGC (NVIDIA GPU Cloud) and the DGX OS
software stack for large-scale AI workloads.
- Understanding of AI workload orchestration with frameworks such as
Slurm or Kubernetes in GPU-based clusters.
- Knowledge of deep learning frameworks (TensorFlow, PyTorch) and
their performance optimization on DGX infrastructure.
- Experience with AIOps tools for automated anomaly detection and
troubleshooting of large-scale AI infrastructure.
- Certification or experience with cloud platforms that offer GPU
instances (AWS, GCP, Azure).
- Familiarity with network performance tuning in HPC environments and
large-scale AI workloads.
- Familiarity with DevOps practices and tools, including CI/CD
pipelines and infrastructure as code.
- Knowledge of graphs, graph databases, and graph theory.
- Familiarity with Terraform, Helm charts, Ansible, or similar tools.

Why Cisco
#WeAreCisco, where each
person is unique, but we bring our talents to work as a team and
make a difference powering an expansive future for all.

We adopt digital, and help our customers implement change in their
digital businesses. Some may think we're "old" (36 years strong) and
only about hardware, but we're also a software company. And a
security company. We even invented an intuitive network that adapts,
predicts, learns and protects. No other company can do what we do -
you can't put us in a box!

But "Digital Transformation" is an empty buzz phrase without a
culture that allows for innovation, creativity, and yes, even failure
(if you learn from it.)

Day to day, we focus on the give and take. We give our best, give our
egos a break, and give of ourselves (because giving back is built
into our DNA.) We take accountability, bold steps, and take
difference to heart. Because without diversity of thought and a
dedication to equality for all, there is no moving forward.

So, you have colorful hair? Don't care. Tattoos? Show off your ink.
Like polka dots? That's cool. Pop culture geek? Many of us are.
Passion for technology and world changing? Be you, with us!