- Company Name: Grafana Labs
- Job Title: Staff AI Engineer - Grafana Ops, AI/ML | USA | Remote
- Job Description:
**Job Title:** Staff AI Engineer – Observability AI/ML
**Role Summary:**
Design, build, and ship AI-driven features that analyze, triage, and resolve incidents across a large-scale observability stack. Own the end-to-end development of high‑performance, scalable AI solutions (LLMs, agentic workflows, prompt‑engineered pipelines) and rapidly iterate from prototype to production while collaborating with data analysts, product managers, and designers.
**Expectations:**
* Deliver production‑grade AI tools that provide tangible value to users.
* Prototype and ship features quickly, validate with real users, and iterate aggressively.
* Own the full lifecycle of AI solutions—concept, design, implementation, testing, deployment, and scaling.
**Key Responsibilities:**
1. Build AI features for incident detection, triage, and resolution using observability data.
2. Develop, prototype, and ship LLM‑ and agent‑powered workflows for incident lifecycle management and automated analysis.
3. Integrate AI components with alerting systems, runbooks, internal developer tools, and observability dashboards.
4. Collaborate cross‑functionally with data analysts, product managers, and designers to shape product requirements and user flows.
5. Run iterative experimentation cycles: prototype → test → validate with user feedback → refine.
6. Ensure scalability, maintainability, and performance of AI services in production environments.
7. Communicate progress, risks, and solutions clearly to technical and non‑technical stakeholders.
**Required Skills:**
* Proven software engineering experience (backend and/or full‑stack) in production systems.
* Strong proficiency in Python, Go, or similar languages; expertise in integrating LLMs and AI frameworks.
* Hands‑on experience with large language models, prompt engineering, and building agentic applications.
* Knowledge of observability data modalities (metrics, logs, traces) and related tooling (e.g., Loki, Tempo).
* Rapid‑prototyping mindset: able to ship minimum viable features and iterate on user feedback.
* Ownership and initiative in ambiguous, high‑velocity environments.
* Excellent communication and collaboration skills.
* Familiarity with CI/CD pipelines, containerization (Docker), orchestrators (Kubernetes), and basic cloud services.
* Nice to have (not required): experience with vector stores, retrieval‑augmented generation, or LLM fine‑tuning.
**Required Education & Certifications:**
* Bachelor’s or higher degree in Computer Science, Software Engineering, or related technical field.
* Relevant certifications (e.g., AWS Certified DevOps Engineer, GCP Professional Data Engineer) are a plus but not mandatory.