VSOL provides top-notch services while strictly adhering to international standards. We remain in the public eye as experts in “the next big technologies”. VNG Solutions offers a creative environment with an emphasis on B2B services, where you will have the opportunity to develop your abilities and learn about various technologies to advance your career.
The mission of the Site Reliability Engineer (SRE) is to ensure the stability, availability, and security of production environments and critical applications. Operating in a 24×7 shift model, this role serves as the sole point of access to production environments, maintaining compliance with data governance standards while resolving incidents and driving continuous service improvements. As a customer-facing representative, the SRE Level Leader collaborates closely with Level 3 teams and ensures efficient communication, rapid incident resolution, and adherence to service level agreements (SLAs).
Goals
- Maintain 100% compliance with security and access protocols for production environments and sensitive data.
- Ensure the reliability and availability of applications and infrastructure within SLA commitments, minimizing downtime.
- Act as the primary customer-facing escalation point for all critical issues during the shift.
- Collaborate with Level 3 teams to identify root causes, implement permanent fixes, and optimize system performance.
Deliverables
- Incident Resolution Logs: Comprehensive records of incidents resolved, actions taken, and compliance with production access protocols.
- Shift Handover Reports: Detailed updates on ongoing issues, tasks, and system status for seamless shift transitions.
- Root Cause Analysis (RCA): Documented analyses of critical incidents, including resolution timelines and preventive measures.
- Monitoring Insights: Proactive identification of performance trends and recommendations for system improvements.
- Performance Metrics Reports: Insights on SLA compliance, MTTR (Mean Time to Recovery), and system uptime.
- Audit-Ready Logs: Complete and accurate documentation of production environment access for compliance purposes.
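For illustration, the MTTR and uptime figures in the Performance Metrics Reports could be derived from incident records roughly as follows. This is a minimal sketch, not a prescribed method: the `Incident` record, its field names, and the sample data are hypothetical, and it assumes incidents do not overlap in time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record; the field names are assumptions for illustration.
    started: datetime
    resolved: datetime


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean Time to Recovery: the average of (resolved - started)."""
    if not incidents:
        return timedelta(0)
    total = sum((i.resolved - i.started for i in incidents), timedelta(0))
    return total / len(incidents)


def uptime_percent(incidents: list[Incident],
                   period_start: datetime,
                   period_end: datetime) -> float:
    """Share of the reporting period not spent in an incident.

    Assumes incidents do not overlap and all fall inside the period.
    """
    period = period_end - period_start
    downtime = sum((i.resolved - i.started for i in incidents), timedelta(0))
    return 100.0 * (1 - downtime / period)


# Made-up sample data: two incidents in a 30-day reporting period.
period_start = datetime(2024, 1, 1)
period_end = period_start + timedelta(days=30)
sample = [
    Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),
    Incident(datetime(2024, 1, 20, 2, 0), datetime(2024, 1, 20, 3, 30)),
]

print("MTTR:", mttr(sample))
print("Uptime %:", round(uptime_percent(sample, period_start, period_end), 2))
```

The resulting uptime percentage can then be compared against the availability target written into the SLA.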
Key Responsibilities (70%)
Production Environment Management
- Serve as the only authorized individual to access production environments during the shift, adhering strictly to security and compliance protocols.
- Perform critical actions in production, such as resolving incidents, implementing approved workarounds, and conducting urgent maintenance.
- Ensure all access to sensitive data and production systems is logged and audit-ready.
Incident Management and Resolution
- Monitor production environments using tools such as Prometheus, Grafana, Splunk, and New Relic, identifying and addressing anomalies proactively.
- Resolve incidents in real time, ensuring minimal impact on application performance and availability.
- Escalate unresolved issues to Level 3 teams, providing detailed documentation and analysis.
Customer Interaction and Escalation Management
- Act as the primary customer-facing contact during critical incidents, providing timely updates and maintaining professionalism.
- Lead incident debriefs with clients to explain resolutions, impacts, and next steps.
- Foster customer trust by ensuring rapid and transparent communication during service disruptions.
System Optimization and Automation
- Identify and implement automation opportunities for repetitive tasks to enhance system reliability.
- Work with development teams to implement proactive monitoring, alerting, and self-healing capabilities.
ITIL 4 Process Alignment
- Ensure compliance with ITIL 4 processes for incident, problem, and change management.
- Track and report on key performance indicators (KPIs) related to system reliability and support operations.
Other Responsibilities, Including but Not Limited To (30%)
- Complying with and contributing to IT policies, standards, Standard Operating Procedures (SOPs), and guidelines.
- Self-studying and proposing valuable training courses (soft and hard skills), frameworks, and best practices to the management team.
- Volunteering to conduct internal training, working in a knowledge-sharing spirit, and mentoring junior support engineers.
- Following the Conversation, Feedback and Recognition (CFR) method to build up the company’s culture.