Job Specifications
Join FlexAI:
FlexAI is at the forefront of revolutionizing AI computing by reengineering infrastructure at the system level. Our groundbreaking architecture, combined with sophisticated software intelligence, abstraction, and an orchestration layer, allows developers to leverage a diverse array of compute, resulting in more efficient, more reliable computing at a fraction of the cost. We are seeking a skilled and experienced Lead DevOps/SRE Engineer.
Founded by Brijesh Tripathi, who brings experience from Nvidia, Apple, Tesla, Intel and Zoox, FlexAI is not just building a product - we're shaping the future of AI. Our teams are strategically distributed across Paris, Silicon Valley, and Bangalore, united by a shared mission: to deliver more compute with less complexity.
If you're passionate about shaping the future of artificial intelligence, driving innovation, and contributing to a sustainable and inclusive AI ecosystem, FlexAI is the place for you!
Position Overview:
FlexAI is seeking a skilled and motivated Lead DevOps/SRE Engineer to join our PaaS Team. As part of this innovative team, you will play a pivotal role in building and maintaining the infrastructure that powers FlexAI's cutting-edge PaaS (Platform as a Service) system. Our PaaS Cloud Service is designed to enable customers to run workloads seamlessly across various architectures, providing unparalleled reliability and efficiency. Our PaaS product is currently in Beta testing with select clients, offering a unique opportunity to contribute to a cutting-edge platform that is poised to redefine industry standards. Join us in this critical phase as we refine and perfect our solution for broader release.
What you'll do:
Drive operational excellence by coordinating priorities across DevOps, SRE, and Product teams, ensuring scalability and reliability objectives are consistently met during and after Beta.
Mentor and guide engineers within the Ops team, setting best practices, fostering autonomy, and contributing to long-term platform strategy in collaboration with leadership.
Design, implement, and maintain CI/CD pipelines to support the efficient delivery and deployment of our Beta product, ensuring seamless customer experience.
Develop and manage infrastructure as code (IaC) using tools like Terraform, enabling scalable and repeatable infrastructure that supports our PaaS goals.
Implement and manage containerization and orchestration tools (e.g., Docker, Kubernetes) to ensure scalable deployment across various architectures.
Monitor and optimize system performance, proactively identifying and resolving bottlenecks to maintain reliability and efficiency during Beta testing and beyond.
Collaborate with software developers and backend engineers to ensure the seamless integration and performance of backend services within our PaaS infrastructure.
Ensure system reliability and availability by implementing best practices in monitoring, alerting, and incident response, particularly as we scale our Beta product.
Troubleshoot and resolve infrastructure issues promptly to minimize downtime and maintain customer trust.
Collaborate with security teams to ensure infrastructure meets security best practices and compliance requirements, especially in a multi-architecture environment.
Automate routine tasks to improve efficiency and reduce manual intervention, focusing on maintaining the flexibility and reliability of our PaaS offerings.
What you'll need:
Bachelor's or higher degree in Computer Science, Software Engineering, or a related field.
Proven experience as a Lead DevOps or SRE Engineer, with a strong focus on Ops, automation, scalability, and reliability within PaaS environments.
Familiarity with cloud-native technologies, including container runtimes such as Docker and cluster schedulers such as Kubernetes, is a must.
Strong proficiency in scripting languages (e.g., Python, Bash) and familiarity with programming languages such as Go or Rust.
Experience with cloud platforms (AWS, Azure, GCP) and infrastructure services, especially in supporting PaaS solutions.
Proficiency in containerization and orchestration tools (e.g., Docker, Kubernetes) with experience in managing multi-architecture deployments.
Hands-on experience with infrastructure as code (IaC) tools like Terraform, supporting scalable and reliable infrastructure.
Strong understanding of CI/CD pipelines and automated testing methodologies.
Excellent problem-solving and troubleshooting skills, especially in the context of Beta testing and production environments.
Excellent collaboration and communication skills to work effectively with cross-functional teams.
Entrepreneurial & start-up mindset!
Note: Familiarity with AI model training is a significant advantage!
What we offer:
Competitive salary and benefits package, tailored to recognize your dedication and contributions.
Opportunity to collaborate with leading experts in AI and cloud computing, learning from the best and brightest, fostering c
About the Company
FlexAI delivers Workload as a Service (WaaS), ensuring AI workloads run optimally—anytime, anywhere. Our intelligent orchestration eliminates inefficiencies by dynamically placing workloads on the best compute without vendor lock-in. With FlexAI, businesses accelerate AI training, fine-tuning, and inference while reducing costs and complexity.