As a Senior Systems Reliability Engineer, you will apply your skills as a senior application product expert to help build processes that manage and improve OIT’s response posture to system events impacting end users and Veterans. This includes working with business partners to improve communication and responsiveness to application failures, minimizing performance degradation and loss of availability, and working toward a significant reduction in application downtime and user impact. You will work with a team of junior and senior site reliability engineers, supporting an engineering team lead in producing the required deliverables.
- Utilize technical area expertise to assess, select, manage and implement enterprise application components, and to ensure that the technical solution solves the business problem as an organic part of the organization’s operational and functional baseline.
- Support triage for Major Incident Management (MIM) and Problem Management (PM) incidents by deconstructing application performance, interoperability, instrumentation, and human factors to facilitate resolution and the development of resilient solutions.
- Support Problem Management’s enterprise root cause analysis (RCA) processes in collaboration with the appropriate OIT organizations.
- Capture technical information from the relevant stakeholders and synthesize it into useful information in various formats for OIT senior management and other VA components.
- Support the collection, development, and editing of content for white papers and other communication devices, and assess the effectiveness of executive communication to effect process improvement.
- Demonstrate proficiency with DevOps tools, JIRA, ServiceNow, and MS Project, and perform tasks using those tools.
- Master's degree preferred in Business Administration, Business Management, Computer Science, Information Systems, Information Resource Management, Industrial Engineering, Operations Research, or a related field
- 5+ years of relevant experience
- Certifications in relevant software development or analytics plus 3-5 years of relevant experience
- 8 to 10 years of relevant experience may be substituted for education (13-15 years total)
- Experience troubleshooting large live production environments
- Experience with modern performance monitoring and diagnostics tools (examples: Splunk, ITSI, AppD, Dynatrace, Wireshark)
- Be a technical expert across multiple technology areas, able to diagnose complex issues spanning many technologies, including cybersecurity, IAM or single sign-on, network engineering (LAN/WAN/WLAN, deep packet analysis), databases (Windows, Oracle), Windows and Linux infrastructure, and cloud engineering (AWS, Azure, other)
- Prefer candidates with software development experience or deep knowledge of the software development lifecycle.
- Must be able to identify and mitigate risks to the product
- Must be able to provide oral and written discussion of analytical findings using narrative and graphic forms.
- Must be able to use qualitative and quantitative analytical skills to assess the effectiveness of the operations.
- Ability to identify symptoms that point to opportunities for process improvement.
- Analytical, investigative, and organizational skills.
- Communication skills, including the ability to craft content for executive-level presentations.
- IT background and ability to understand technical content.