Job Functions:
- Identify, troubleshoot, resolve, and escalate incidents quickly and effectively.
- Be responsible for the operational monitoring of the platform health and performance.
- Be responsible for triaging problems reported by end-users.
- Utilize application logs and stack traces in debugging reported issues.
- Work with other engineers in sustaining and improving application health and performance.
- Develop tools, operational enhancements, and automated solutions.
- Participate in incident response and post-incident reviews.
- Write clear and consumable documentation of the environment and operational procedures.
Requirements:
- Bachelor's degree or higher education.
- Strong sense of ownership, customer service support, and integrity demonstrated through excellent written/verbal communication.
- Ability to work through complex engineering obstacles using debugging and problem-solving skills.
- Experience operating and troubleshooting Linux/Unix systems in a production environment.
- Experience working with shell and scripting languages.
- Experience with containerization (Docker) and container orchestration platforms (Kubernetes).
- Experience implementing logging, telemetry, and monitoring tools such as Splunk, Prometheus, and Grafana.
- Experience using version control such as Git.
- Strong debugging and troubleshooting skills that span applications, systems, and networking (TCP/IP).
Job Type: Full-time
Education:
Experience:
- Systems Engineer: 3 years (Required)
- Linux: 3 years (Required)