Objectives
Design, implement, and maintain the critical infrastructure and automation pipelines that sustain the software and AI model lifecycle. The central focus is ensuring the scalability, resilience, and security of hybrid and multi-cloud ecosystems.
Responsibilities
- Architect and manage Infrastructure as Code (IaC) for production environments, ensuring consistency across development, staging, and production
- Implement and optimize CI/CD pipelines for traditional applications and MLOps workflows, automating deployment and monitoring of Large Language Models (LLMs)
- Orchestrate large-scale container clusters, ensuring high availability of critical AI and data services
- Monitor infrastructure and model performance in real time, implementing auto-scaling strategies and cloud cost management (FinOps)
- Collaborate with development and security teams to integrate DevSecOps practices from the earliest stages of architecture design
Requirements and Profile
- Minimum of 10 years of experience in DevOps or Site Reliability Engineering (SRE) roles
- Mastery of IBM Cloud (VPC, OpenShift, Kubernetes Service), with solid experience in other providers such as AWS or Azure
- Proficiency in infrastructure and orchestration tools: Terraform, Ansible, Docker, and Kubernetes (K8s/OpenShift)
- Practical experience with CI/CD tools (GitHub Actions, GitLab CI, or Jenkins) and monitoring (Prometheus, Grafana, Instana, or ELK Stack)
- Knowledge of data pipelines and integration with AI services (e.g., Watsonx, SageMaker, or Azure AI)
- Fluency in English, clear technical communication skills, and an orientation toward mentoring and collaboration in multidisciplinary teams