Manager, Site Reliability Engineering

Discovery Inc.
Sterling, VA 20164
  • Job Code
    118669277

US-VA-Sterling

Job ID: 26168
Type: Company Employee Full-Time
# of Openings: 1
Category: IT & Technical Operations

Overview

Our Team
As Discovery Inc.’s portfolio continues to grow – around the world and across platforms – the Global Technology & Operations team is building media technology and IT systems that meet the world-class standard for which Discovery is known. GT&O builds, implements and maintains the business systems and technology that are critical for delivering Discovery’s products, while articulating the long-term technology strategy that will enable Discovery’s growing pay-TV, digital terrestrial, free-to-air and online services to reach more audiences on more platforms.

From Amsterdam to Singapore and from satellite and broadcast operations to SAP, we are driving Discovery forward on the leading edge of technology.

The Role
Reporting to the Senior Director of Site Reliability Engineering, this position is critical to the mission of the Site Reliability Engineering team as part of the Technology Operations Group. The core purpose of the role is to ensure that our complex technologies and systems are managed and operated for high efficiency and low risk, adding value and functionality to key business initiatives, projects, and core infrastructure capabilities and lifecycle. This includes managing projects and staff engagement in planning, designing, and implementing infrastructure, monitoring solutions, metrics collection and reporting, post-mortem analysis, and automation systems.


The postholder has a background as an operations generalist and can work closely with our engineering and product development teams, and with internal architecture and design initiatives, from the early stages of design all the way through to identifying and resolving production issues. The ideal candidate will be passionate about a data- and metrics-driven operations role that involves work across multiple IT and broadcast technologies, and will also believe that automation is a key component of operating large-scale systems.

Highlights of the role
Based in either Sterling, VA or Knoxville, TN, the postholder is part of a team supporting our two global Technology Operations Centers in Sterling, VA and London, UK. Together, these centers serve as the 24/7 Command & Control Hub for our media distribution and IT support services. The position is key to driving organizational improvements, consistently improving and maintaining our availability and uptime, establishing effective automation and monitoring solutions that highlight successes and areas of opportunity, and overseeing retrospective post-mortem and correction-of-error reporting.


The Site Reliability Engineering team partners with engineering and workforce technology teams to advocate for sensible, scalable systems design and to build the best tools to diagnose, resolve and prevent issues. The postholder is an ambassador for Site Reliability Engineering and good design within GT&O, and so should be a great communicator and an enthusiastic champion of Technology Operations.


In advocating and participating in good design practices and data-driven process and systems improvement, the postholder serves the internal needs of the Technology Operations team, the broader GT&O organization, and internal customers across myriad technology and business teams.


The postholder is expected to work regular office hours but, during large events, should expect to occasionally work outside them, including nights and weekends.



Responsibilities

1. Oversee the TechOps monitoring tool and capabilities portfolio. Guide and lead efforts in the design and implementation of monitoring tools and processes across platform, network, distribution, and digital focus areas
2. Manage project-based and BAU monitoring initiatives and posture
3. Oversee post-mortem and correction of error collection and reporting as an output of the major incident management process
4. Collaborate with TechOps leadership coordinating SRE involvement needs in service design efforts in response to recurring major incidents
5. Partner with TechOps leadership in managing the TechOps SRE automation center of excellence
6. Manage key metrics and key performance indicators document repository
7. Serve as a point of contact for the Site Reliability Engineering team for project initiation and design consultation
8. Serve as a technical lead for monitoring-focused team staff
9. Gain deep knowledge of our complex applications
10. Participate in TechOps-led objective no-blame post-incident analysis and review process
11. Support the creation of end-to-end availability and performance of mission-critical services
12. Function well in a fast-paced, rapidly changing environment
13. Be on-call when required to support our operations centers



Qualifications

* Bachelor’s degree in Computer Science, Information Technology, Mathematics, Software or Broadcast Engineering, or other technical discipline, or related practical experience.
* 3+ years’ experience troubleshooting in Unix/Linux
* Good programming skills in one or more of C/C++, Java, JavaScript, Python, Perl, PowerShell, and an ability to pick up new ones.
* Experience in the Linux environment and a good understanding of its fundamentals and internals: filesystems and modern memory management, threads and processes, the user/kernel-space divide, etc.
* Background in configuration and management of large-scale platforms (virtualization, cloud, Unix, Linux, Java, SQL, Oracle).
* A good understanding of large-scale distributed systems in practice, including multi-tier architectures, application security, monitoring and storage systems.
* Working knowledge of the TCP/IP stack, internet routing and load balancing
* Working exposure to linear and digital broadcasting and platforms preferred
* Knowledge of most of these: data structures, relational and non-relational databases, networking, Linux internals, filesystems, web architecture, and related topics
* Previous experience working with geographically distributed coworkers.
* Strong verbal, written, interpersonal communication and customer service skills, and ability to work well in a global, diverse, team-focused environment
* Good organizational and conceptual skills combined with proven critical thinking, analytic, problem solving, and decision-making abilities
* Ability to multitask within related functions
* Positive attitude and can-do mentality
* Experience working for a media company or broadcaster is desirable but not essential
* Must have the legal right to work in the US

Discovery Inc. is an equal opportunity employer. Discovery is committed to being an employer of choice: not just a good place to work, but a great and inclusive place to work. To that end, we strive to recruit and maintain a workforce that meaningfully represents the diverse and culturally rich communities that we serve. Qualified applicants will receive consideration for employment without regard to their race, color, religion, national origin, sex, sexual orientation, gender identity, protected veteran status, disabled status, or genetic information.

EEO is the Law

Pay Transparency Policy Statement

If you are an individual with a disability and need an accommodation during the application process, please send an email request to HR@discovery.com.

PI118669277


Posted: 2020-03-03 Expires: 2020-04-03
