Site Reliability Engineer (Managed Service Team)

VSOL provides top-notch services while strictly adhering to international standards. We remain in the public eye as experts in “the next big technologies”. VNG Solutions will provide you with a creative environment with an emphasis on B2B services, where you will have the opportunity to foster your abilities and learn about various technologies to advance your career.

The mission of the Site Reliability Engineer (SRE) is to ensure the stability, availability, and security of production environments and critical applications. Operating in a 24×7 shift model, this role serves as the sole point of access to production environments, maintaining compliance with data governance standards while resolving incidents and driving continuous service improvements. As a customer-facing representative, the SRE Level Leader collaborates closely with Level 3 teams and ensures efficient communication, rapid incident resolution, and adherence to service level agreements (SLAs).

Goals:

Maintain 100% compliance with security and access protocols for production environments and sensitive data.
Ensure the reliability and availability of applications and infrastructure within SLA commitments, minimizing downtime.
Act as the primary customer-facing escalation point for all critical issues during the shift.
Collaborate with Level 3 teams to identify root causes, implement permanent fixes, and optimize system performance.

Deliverables

Incident Resolution Logs: Comprehensive records of incidents resolved, actions taken, and compliance with production access protocols.
Shift Handover Reports: Detailed updates on ongoing issues, tasks, and system status for seamless shift transitions.
Root Cause Analysis (RCA): Documented analyses of critical incidents, including resolution timelines and preventive measures.
Monitoring Insights: Proactive identification of performance trends and recommendations for system improvements.
Performance Metrics Reports: Insights on SLA compliance, MTTR (Mean Time to Recovery), and system uptime.
Audit-Ready Logs: Complete and accurate documentation of production environment access for compliance purposes.

Key Responsibilities (70%)

Production Environment Management

Serve as the only authorized individual to access production environments during the shift, adhering strictly to security and compliance protocols.
Perform critical actions in production, such as resolving incidents, implementing approved workarounds, and conducting urgent maintenance.
Ensure all access to sensitive data and production systems is logged and audit-ready.

Incident Management and Resolution

Monitor production environments using tools (e.g., Prometheus, Grafana, Splunk, New Relic), identifying and addressing anomalies proactively.
Resolve incidents in real time, ensuring minimal impact on application performance and availability.
Escalate unresolved issues to Level 3 teams, providing detailed documentation and analysis.

Customer Interaction and Escalation Management

Act as the primary customer-facing contact during critical incidents, providing timely updates and maintaining professionalism.
Lead incident debriefs with clients to explain resolutions, impacts, and next steps.
Foster customer trust by ensuring rapid and transparent communication during service disruptions.

System Optimization and Automation

Identify and implement automation opportunities for repetitive tasks to enhance system reliability.
Work with development teams to implement proactive monitoring, alerting, and self-healing capabilities.

ITILv4 Process Alignment

Ensure compliance with ITILv4 processes for incident, problem, and change management.
Track and report on key performance indicators (KPIs) related to system reliability and support operations.

Other Responsibilities, Including but Not Limited To (30%)

Comply with and contribute to IT policies, standards, Standard Operating Procedures (SOPs), and guidelines.
Self-study and propose valuable training courses (soft and hard skills), frameworks, and best practices to the management team.
Volunteer to conduct internal training, work in a spirit of knowledge sharing, and mentor junior support engineers.
Follow the Conversations, Feedback, and Recognition (CFR) method to build the company’s culture.

Knowledge:

SRE Practices: Strong understanding of SRE principles, including monitoring, automation, and performance optimization.
ITIL Framework: Knowledge of Incident, Problem, and Change Management processes.
Awareness of SLAs and the importance of maintaining compliance in production environments.
Security Protocols: Strong understanding of data governance, production access controls, and compliance standards.

Application and Infrastructure:

Familiarity with application architecture, APIs, databases, and multi-tiered systems.
In-depth understanding of high-availability, distributed systems, and their components (e.g., servers, databases, networks, and applications).
Deep knowledge of Linux/Unix or Windows operating systems.
Monitoring Tools: Expertise in tools (e.g., Prometheus, Grafana, Splunk, New Relic) for real-time monitoring and analysis.

DevOps and Automation:

Familiarity with CI/CD pipelines and configuration management tools (e.g., Ansible, Terraform).
Understanding of scripting languages (Python, Bash, PowerShell).
Understanding of containers and container orchestration (e.g., Docker, Kubernetes) and cloud platforms (e.g., AWS, Azure, GCP).

Skills:

Troubleshooting: Advanced diagnostic skills to resolve complex application and infrastructure issues.
Production Access Management: Proficiency in managing secure access to production environments and sensitive data.
Customer Communication: Exceptional ability to communicate technical concepts and updates clearly to clients and stakeholders.
Incident Management: Expertise in managing high-priority incidents efficiently and under pressure.
Documentation: Skilled in creating detailed logs, reports, and technical guides to ensure audit readiness and team consistency.

Tool Proficiency:

Expertise in monitoring, alerting, and incident tracking tools (e.g., PagerDuty, Jira, ServiceNow).
Competency in scripting for automating repetitive tasks and improving incident resolution efficiency.

Abilities:

Proactive Mindset: Ability to anticipate potential failures and implement preventive measures to ensure system reliability.
Decision-Making: Ability to make quick, sound decisions under pressure in production environments.
Adaptability: Capacity to adjust to changing priorities and evolving technical challenges in a 24×7 environment.
Customer Focus: Strong commitment to delivering high-quality service and maintaining customer satisfaction.
Collaboration: Capability to work seamlessly with cross-functional teams, including remote Level 3 resources.
Leadership: Ability to lead a shift team effectively, fostering a culture of accountability and excellence.
Resilience Under Pressure: Capacity to remain calm and focused during high-stakes production incidents.
Operational Excellence: Ability to balance immediate problem-solving with long-term improvements to production environments.

Educational Qualifications and Experience

Must have:

Internationally certified, or at least 4 years of hands-on experience in a customer-facing technical support role or a similar role, preferably in managed services or IT environments.
Proven expertise in managing incidents, troubleshooting production environments, and adhering to security protocols.
ITIL Foundation knowledge.
Experience with ticket system tools and technologies (e.g., ServiceNow, Jira Service Desk, BMC).
IELTS >= 450 or equivalent.

Nice to have/Preferred:

Certifications:

SRE, DevOps, or related practices (e.g., Google SRE, AWS DevOps Engineer).
Monitoring tools (e.g., Splunk Certified User, Dynatrace Associate) or cloud platforms (e.g., AWS, Azure).
