ROLE SUMMARY
Our client is looking for a Site Reliability Engineer to join its rapidly growing company in support of multiple SaaS applications. You will be responsible for the cloud infrastructure, availability, reliability, performance, and security of production applications and systems.
- SCHEDULE: 9:00 AM to 6:00 PM Pacific Daylight Time (12:00 AM to 9:00 AM Philippine Standard Time); follows Philippine holidays
- Create, deploy, and maintain production infrastructure within the AWS accounts, using IaC/Terraform
- Utilize various AWS services, including EC2, EKS, RDS, Redshift, S3, and IAM
- Create, implement, and maintain automated application releases using Bitbucket Pipelines
- Create, implement, and maintain application and infrastructure performance monitoring using Datadog or Prometheus/Loki/Grafana
- Create, implement, and maintain application and infrastructure availability monitoring using Datadog or Prometheus/Loki/Grafana
- Apply security practices and policies to identify and remediate security vulnerabilities
- Oversee incident response procedures, including analysis and documentation of incidents to prevent future occurrences
- A 4-year college degree (technical or quantitative science) is preferred, or equivalent work experience with evidence of proficiency and achievement in virtual infrastructure management
- 3+ years of experience in cloud computing and Infrastructure as Code (IaC) (e.g., Terraform) or a related field
- Experience with cloud-native tooling (Helm charts, Argo CD, HashiCorp Vault, Harbor, Reloader, Grafana, Prometheus, and Loki) is a plus
- Experience with cloud-native analytics tools (Elasticsearch, MongoDB, Redshift/Snowflake, and Looker)
- Any AWS certification is a big plus
- Proficient in Linux system administration and security
- Proficient with containerization technologies, especially Kubernetes
- Proficient with code versioning tools (e.g., Git, Bitbucket)
- Proficient with CI/CD tools (e.g., Bitbucket Pipelines)
- Proficient in scripting languages such as Bash and Python
- Exposure to OpenTelemetry and distributed tracing
- Awareness of recent industry trends related to observability and monitoring
- Strong troubleshooting and problem-solving skills, with the ability to quickly diagnose and resolve complex issues
- Excellent oral and written communication skills
Job Type: Full-time
Benefits:
Supplemental Pay:
Experience:
- Terraform: 2 years (Preferred)
- Kubernetes: 2 years (Preferred)
- Site Reliability Engineer: 3 years (Preferred)