Site Reliability Engineer OpenStack M/F

OVH offers a wide range of IT services to companies, and to individuals who are passionate about tech. Whether you're looking at our Private Cloud, Public Cloud or Hybrid Cloud services, web hosting plans, virtual datacentres, dedicated servers, storage solutions or even xDSL and VoIP connections, our services are constantly being improved with the very latest innovations, and are regularly developed with new features.

In the OVH Public Cloud team, we aim to deliver a best-in-class service for customers of every scale, from start-ups running a single VM, through DevOps development playgrounds, up to hybrid-cloud clusters of hundreds of VMs. In the OVH Public Cloud OpenStack team, you will be challenged by huge-scale deployments and the issues that come with them, by cooperation with upstream OpenStack developers from all over the world, and by delivering the latest cutting-edge technologies as a service.

Your role?

The Site Reliability Engineer (SRE) is responsible for the availability, performance, monitoring and incident response, among other things, of the platforms and services that Product Unit Public Cloud Instances runs and owns. Thanks to your software and systems engineering skills, you are able to build and run large-scale, massively distributed, fault-tolerant systems. The SRE ensures that systems have reliability and uptime appropriate to users' needs and a fast rate of improvement, while keeping an ever-watchful eye on capacity and performance.

Much of SRE software development focuses on optimizing existing systems, building infrastructure and eliminating manual work through automation. Practices such as limiting time spent on operational work, blameless postmortems and proactive identification of potential outages feed into the iterative improvement that is key to both product quality and interesting, dynamic day-to-day work.

Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, and we strive to create an environment that provides the support and mentorship needed to learn and grow.

Engage in and improve the whole lifecycle of services, from inception and design, through deployment, operation and refinement.

Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.

Maintain services once they are live by measuring and monitoring availability, latency and overall system health.

Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.

Practice sustainable incident response and blameless postmortems.

Your skills?

English

Experience managing distributed, highly available, high-traffic infrastructure based on Linux

Experience with the use, maintenance and configuration of monitoring, metrics and logging infrastructure (Icinga/Nagios, Prometheus, Grafana, Graphite, Logstash/Kibana, etc.)

Extensive experience with performance analysis and tuning

Comfortable with shell and scripting languages used in an SRE/Operations engineering context (Python, Go, Bash, Perl, etc.)

Experience developing tools