Senior AI Engineer
🤖 AI salary estimate: ¥45K-70K
Posted: 16 days ago
ℹ️ About This Role
As the first platform engineer at AstraZeneca's Beijing AI Center, you will design and deliver the center's distributed AI training infrastructure and engineering standards, ensuring its GPU investment efficiently supports the drug discovery science teams.
Your core work is to build and optimize multi-node, multi-GPU training frameworks, define AI engineering standards, and act as the bridge between the science teams and IT infrastructure, coordinating compute resources to accelerate the application of AI in biologics engineering, computational chemistry, and related fields.
✓ What You'll Do
Distributed Training (First 90 Days)
Design and validate multi-node multi-GPU training templates (DDP, FSDP) for NVIDIA H20 GPUs
Build operational runbooks covering common failure modes, checkpointing, recovery
Establish baseline performance benchmarks (throughput, step time, scaling efficiency)
Optimize data loading pipelines to eliminate I/O bottlenecks in distributed settings
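The baseline benchmarks above can be made concrete with a small helper. This is a minimal sketch, not part of the role description: the function names and example numbers are invented for illustration, and a real benchmark would pull step times from the training loop. It computes throughput and scaling efficiency, the standard way to judge whether a multi-node run justifies its GPU count.

```python
# Illustrative sketch of the baseline metrics a multi-node benchmark reports.
# All names and numbers here are invented examples, not measured H20 results.

def throughput(samples_per_step: int, step_time_s: float) -> float:
    """Samples processed per second at a given step time."""
    return samples_per_step / step_time_s

def scaling_efficiency(single_gpu_tput: float, n_gpus: int,
                       measured_tput: float) -> float:
    """Measured multi-GPU throughput relative to ideal linear scaling."""
    ideal = single_gpu_tput * n_gpus
    return measured_tput / ideal

# Example: 1 GPU does 512 samples in 0.25 s; 16 GPUs do 8192 samples in 0.31 s.
one_gpu = throughput(512, 0.25)            # 2048 samples/s
sixteen_gpu = throughput(8192, 0.31)
eff = scaling_efficiency(one_gpu, 16, sixteen_gpu)
print(f"16-GPU scaling efficiency: {eff:.1%}")  # prints 80.6%
```

Tracking this number per template (DDP vs. FSDP, varying node counts) is what turns "the cluster feels slow" into an actionable I/O or interconnect investigation.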
AI Engineering Standards
Define training method standards: naming conventions, experiment configuration, model registry requirements, reproducibility criteria
Create scheduling policies: GPU quota rules, priority tiers, job templates for the center's Kubernetes/Run:AI platform
Establish a compute triage process: how science teams request and receive GPU allocations
Stand up MLOps tooling: experiment tracking, model registry, CI/CD for ML, Kubernetes
Streamline non-AI pipeline and workflow dependencies
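The scheduling-policy work above can be sketched as a toy admission routine. Everything in this sketch is an invented illustration — the tier names, quota numbers, and greedy strategy are assumptions, not the center's actual Kubernetes/Run:AI configuration — but it shows the kind of rules such a policy encodes: priority tiers decide ordering, per-tier quotas cap concurrent GPUs.

```python
# Toy sketch of GPU quota rules with priority tiers. Tier names, quotas,
# and the greedy admission strategy are illustrative assumptions only.
from dataclasses import dataclass

@dataclass
class JobRequest:
    team: str
    gpus: int
    tier: str  # "interactive" | "batch" | "preemptible"

# Higher number = scheduled first; quotas cap concurrent GPUs per tier.
TIER_PRIORITY = {"interactive": 2, "batch": 1, "preemptible": 0}
TIER_QUOTA = {"interactive": 8, "batch": 32, "preemptible": 64}

def admit(queue: list[JobRequest], free_gpus: int) -> list[JobRequest]:
    """Greedily admit jobs by tier priority, respecting per-tier quotas."""
    admitted, used = [], {t: 0 for t in TIER_QUOTA}
    for job in sorted(queue, key=lambda j: -TIER_PRIORITY[j.tier]):
        fits_cluster = job.gpus <= free_gpus
        fits_quota = used[job.tier] + job.gpus <= TIER_QUOTA[job.tier]
        if fits_cluster and fits_quota:
            admitted.append(job)
            free_gpus -= job.gpus
            used[job.tier] += job.gpus
    return admitted

queue = [
    JobRequest("comp-chem", 16, "batch"),
    JobRequest("biologics", 4, "interactive"),
    JobRequest("protein-lm", 32, "preemptible"),
]
print([j.team for j in admit(queue, 40)])  # ['biologics', 'comp-chem']
```

With only 40 free GPUs, the interactive and batch jobs are admitted and the 32-GPU preemptible job waits; a real Run:AI deployment expresses the same intent through projects, quotas, and over-quota priorities rather than code.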
Fine-tuning and Optimization
Build reusable fine-tuning pipeline templates for protein language models and scientific AI workloads
Optimize training code for H20 GPUs to boost efficiency and throughput
Collaborate with NVIDIA on hardware-specific optimizations
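One building block such fine-tuning pipeline templates typically wrap is a low-rank adapter (LoRA). The following is a minimal sketch of just the arithmetic, using plain Python lists so nothing beyond the standard library is assumed (a real pipeline would use torch/peft): the frozen weight `W` is augmented with a trainable low-rank path scaled by `alpha/r`, and only the small matrices `A` and `B` are updated during fine-tuning.

```python
# Pure-Python sketch of a LoRA forward pass: y = x·W + (alpha/r)·(x·A)·B.
# Shapes and values are toy examples; only A and B would be trainable.

def matmul(x, w):
    """Multiply row-major matrices represented as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]

def lora_forward(x, W, A, B, alpha=16, r=2):
    base = matmul(x, W)              # frozen pretrained path
    delta = matmul(matmul(x, A), B)  # low-rank trainable path
    s = alpha / r
    return [[b + s * d for b, d in zip(br, dr)] for br, dr in zip(base, delta)]

# B starts at zero, so before any training the adapter is a no-op:
x = [[1.0, 2.0]]
W = [[0.5, 0.0], [0.0, 0.5]]  # frozen d_in x d_out weight
A = [[0.1], [0.1]]            # d_in x r, with r = 1 for brevity
B = [[0.0, 0.0]]              # r x d_out, zero-initialized
print(lora_forward(x, W, A, B, alpha=16, r=1))  # [[0.5, 1.0]] == x·W
```

Zero-initializing `B` is the standard trick that makes the adapted model start out identical to the pretrained one, which keeps early fine-tuning steps stable.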
Cross-Organizational Coordination
Translate Discovery team workload needs into infrastructure requirements for IT
Operate as the "business owner" for AI compute, while IT operates as the "system owner"
Participate in weekly coordination meetings across Discovery, AISI, and IT
⭐ Minimum Requirements
5+ years of experience in distributed deep learning training (DDP, FSDP, DeepSpeed, or equivalent)
Strong PyTorch expertise with production-grade model training
Experience with GPU workload optimization and multi-node cluster management
Kubernetes job scheduling experience (Kubeflow, Slurm, Run:AI, or equivalent)
Experience setting AI/ML engineering standards for teams (not just personal projects)
Hands-on experience improving Transformer-based models (e.g., FlashAttention or related optimizations)
Ability to work full-time in Beijing
👍 Preferred Qualifications
Parameter-efficient fine-tuning methods (QLoRA, LoRA, adapters)
Reinforcement learning training infrastructure
AWS China or Alibaba Cloud experience
NVIDIA H20 or H100-series GPU familiarity
Desirable
Biopharma domain knowledge (molecular simulation, protein folding, drug discovery)
Experience working across organizational boundaries (e.g., AI platform team serving multiple science groups)