Location: REMOTE / Montreal, Quebec
This job allows you to work remotely.
Our client raised a $45 million USD Series C, bringing total financing to over $100 million, and is now rapidly scaling. The company spun out of the UC Berkeley AI Research Lab and develops artificial intelligence to support care for those with Alzheimer’s disease, dementia, and other cognitive impairments.
Alzheimer’s disease is the single most expensive disease in the US, costing an estimated $600 billion per year in direct and indirect costs. It affects 1 in 3 people over 85 and 1 in 9 people over 65. Their first product focuses on reducing the frequency and impact of falls, the leading cause of hospitalization for those living with dementia. They have peer-reviewed results showing an average of 40% fewer falls (up to 80% in some cases) and 80% fewer ER visits from falls.
Your Role:
Guide and expand the technology platform. The MLOps role covers the upkeep and further development of the experimentation, training, and evaluation systems, as well as the management and control of the data corpus.
The right candidate will combine operational skills with an architectural mindset to keep the products stable and accountable while enabling rapid growth of new machine learning models and services and fast resolution of issues.
Responsibilities:
Security -- Managing sensitive data requires expertise in secure architecture. You’ll need to understand data segmentation between storage, training, and inference, and manage just-in-time access for developers. Familiarity with VPC networking, AWS IAM roles, zero-trust principles, and encryption (in transit and at rest) is essential.
R&D Enablement -- ML researchers need fast, reliable environments, built on Weights & Biases and NVIDIA/CUDA hardware. You should know how to manage ML workflows—data collection, processing with Spark or Ray, and visualization—while maintaining low cycle times.
Model Deployment -- Models are deployed to process live sensor data. You’ll need experience with Docker, Python, PyTorch, and GStreamer. Familiarity with monitoring tools (Prometheus, Loki, Grafana, OpsGenie) is important for tracking performance and uptime.
Infrastructure & Stability -- Manage infrastructure on AWS using Terraform. ML systems run on Kubernetes (Helm) and EC2. Supporting tools include Voxel51 (cataloging), MongoDB (metadata), Airflow (orchestration), and vector databases like Qdrant. Keeping these systems stable and observable is key.
Data Management -- Large volumes of sensor and cloud data must be securely stored, cataloged, and used for ML and analytics—then deleted as needed. A strong approach to data lifecycle and governance is required.
Must Have Skills:
• 5–7 years of experience implementing MLOps at scale.
• Deep expertise in AWS with practical opinions on when to use managed services vs. DIY with EC2.
• Skilled at balancing infrastructure cost-efficiency with developer productivity.
• Strong understanding of the full machine learning model lifecycle—from creation to deprecation.
• Experience maintaining stability and productivity across both edge devices and server fleets.
• Broad knowledge of ML concepts, including LLM security, vector databases, and AI hardware optimization.
• Self-motivated and proactive; prefers autonomy over prescribed task lists.
• Familiar with Agile methodologies, including Scrum and Kanban.
• Obsessed with automation; actively eliminates toil through tooling and process design.
• Understands the dual nature of MLOps: experimentation for ML, speed and reliability for Ops.
Benefits & Perks:
• A mission-driven company culture
• Fully remote
• Competitive salary & benefits, including fully paid employee premiums for Medical, Dental, and Vision
• Monthly Education, Well-being & WFH stipends
• Non-accrual PTO
• Growth Potential
• Company Retreats
• Medical & Family/Parental Leave