We are looking for a Lead Data Engineer to take charge of designing and managing our data infrastructure. You will lead efforts to develop scalable, high-performance data models. You'll oversee our ETL pipelines and data ingestion processes, and collaborate closely with data scientists to ensure their machine learning models are smoothly integrated into production. You will also play a key role in defining the infrastructure needed for heterogeneous data ingestion, ML training, and ML Ops, ensuring the right pipelines, monitoring, and automation are in place.
Key Responsibilities:
Lead the design and optimization of data models and infrastructure to support large-scale data processing.
Oversee and manage the data layer architecture, currently built on Cube.dev and MongoDB, with a key objective to evaluate and potentially transition to an SQL-based system (e.g., PostgreSQL) for enhanced performance.
Manage geospatial data, ensuring efficient storage, analysis, and visualization of location-based datasets.
Build and maintain robust ETL pipelines and data ingestion streams that ensure high availability, reliability, and performance of data systems.
Collaborate with the data science team to ensure the integration of machine learning models into production environments, focusing on efficient model deployment, monitoring, and iteration.
Design and implement ML Ops infrastructure to support model training, experimentation, and deployment, including tracking, versioning, and scalability of training processes.
Define and implement best practices for data governance, ensuring security, quality, and compliance.
Evaluate and adopt new tools and technologies to improve data processing, with a focus on real-time data ingestion and scalable ML infrastructure.
Provide leadership in shaping the future of our data architecture, ensuring it aligns with the company’s goals of sustainability and high-impact analytics.
Requirements:
Strong experience in data engineering, including designing and managing data architectures, ETL pipelines, and data ingestion.
Expertise in NoSQL databases (e.g., MongoDB), with demonstrated experience or knowledge of transitioning to or optimizing SQL-based systems (e.g., PostgreSQL, MySQL) for performance.
Solid understanding of geospatial data management and the ability to handle location-based datasets efficiently (e.g., PostGIS, GeoJSON, or other geospatial tools).
Deep understanding of AWS services and cloud-based infrastructure for managing large datasets and building data pipelines.
Experience with ML Ops: setting up pipelines for training machine learning models, managing infrastructure for ML experimentation, and automating model deployment and monitoring in production.
Familiarity with ML platforms (e.g., Kubeflow, SageMaker, or similar) and experience integrating ML workflows into production environments.
Proficiency with data processing and orchestration frameworks such as Apache Airflow.
Strong programming skills in Python, TypeScript, or Java.
Excellent leadership and communication skills, with the ability to collaborate with cross-functional teams.
Why Join Us?
Take a leading role in shaping the future of sustainable agriculture with cutting-edge data infrastructure.
Join a team of passionate innovators working on impactful, real-world problems.
Contribute to a mission-driven company focused on reducing pesticide use and improving ecological sustainability through data.