Location: Richmond Hill, Ontario
About the Team
Our client’s platform engineering group operates with a Site Reliability Engineering (SRE) mindset, committed to delivering highly reliable, scalable, and performant systems on public cloud infrastructure. The team specializes in enhancing system transparency, enabling deep diagnostics, and ensuring seamless collaboration between development and operations. Shared ownership, proactive problem-solving, and continuous improvement are at the core of everything they do.
The Opportunity
Our client is looking for a Senior Software Engineer with a strong background in application development and a passion for observability and system reliability. This hybrid role blends hands-on development with reliability engineering. You’ll work closely with existing microservices in Node.js and/or Java to enhance instrumentation and build out scalable observability frameworks that support modern containerized workloads on Kubernetes.
What You’ll Be Doing
• Create Observability Frameworks: Design and implement tools that make it easier to embed metrics, logs, and traces into applications.
• Enhance Application Monitoring: Analyze and improve the instrumentation of Node.js and Java services using Elastic APM to capture performance data and operational context.
• Define and Evangelize SRE Best Practices: Collaborate with engineers to define meaningful SLIs, SLOs, and KPIs, integrating them into ongoing development workflows.
• Monitoring Systems Architecture: Build and maintain scalable observability platforms using Elastic APM, InfluxDB, and Prometheus.
• Performance Analysis: Use system metrics, performance test data, and application code insights to diagnose bottlenecks and suggest optimizations.
• Incident Response & Resolution: Serve as a go-to expert during incidents, leveraging observability tools to identify root causes and propose fixes.
• Postmortems & Continuous Improvement: Lead structured reviews after incidents, recommending and implementing system improvements to avoid recurrence.
• Mentorship & Cultural Impact: Promote observability-first thinking across engineering teams by mentoring peers and embedding SRE practices into the development culture.
What You’ll Need to Succeed
Must-Haves
• Bachelor’s degree in Computer Science, Software Engineering, or related discipline
• 5+ years of hands-on software development experience in Node.js and/or Java
• Professional experience with Docker and Kubernetes
• Proficiency in object-oriented programming and a solid understanding of HTTP and RESTful APIs
• Familiarity with both SQL and NoSQL databases
• Experience working in Linux/Unix environments and writing scripts
• Strong debugging, analytical, and collaboration skills
• Exposure to modern JavaScript frameworks (React, Angular, Vue, ExtJS, etc.)
• Solid grasp of software architecture, testing strategies, and performance monitoring principles
Nice-to-Haves
• Practical experience with Elastic APM, OpenTelemetry, or similar observability tools
• Experience building REST APIs using Spring Boot or Node.js
• Understanding of performance tuning and system capacity planning
• Exposure to testing tools such as Selenium, JUnit, Mockito, Mocha
• Familiarity with Oracle databases, PL/SQL, or servlet-based Java frameworks (Spring MVC, Struts, etc.)
• Web server experience with Apache, Tomcat, or Nginx
• Experience in SRE-focused roles, including development and maintenance of monitoring platforms
• Background with Infrastructure as Code (Terraform, etc.)
• Experience working in public cloud environments (AWS, Azure, or GCP)