Watch Jobs

Watch Jobs

We focus on tracking companies' latest job postings in real time, helping you save time in your job search and quickly find your ideal opportunity.

Explore

  • Browse Jobs
  • Statistics
  • Insight Reports
  • Data Methodology
  • Explore Companies

Subscribe

  • Free Trial
  • Pricing
  • FAQ
  • Privacy Policy

Follow Us

WeChat Official Account · Xiaohongshu · Taobao Store

© 2026 Watch Jobs. All rights reserved.

Created by jianglicat - 讲礼猫
Microsoft (微软)

Job Details

Shanghai
Expert-level experience
Full-time
Hybrid / flexible work
Bachelor's degree
Individual contributor

Tags

GPU · PyTorch · CUDA · NVIDIA · AI Infrastructure · Performance Optimization · Inference Engine
💡 Key Assessment

A top-tier AI infrastructure expert role at Microsoft: cutting-edge technology, excellent growth prospects, a strong sense of mission, and highly competitive pay, though work-life balance may be challenging.

Principal Software Engineer

🤖 AI estimate: ¥80K-150K

Posted: about 1 month ago

Apply Now

ℹ️ About This Role

This is an expert-level position for senior GPU engineers. You will join Microsoft's AI infrastructure team to design and optimize the core inference engines that power large-scale AI models. The work focuses on pushing hardware performance to its limits, reducing latency and raising throughput for generative AI and deep learning workloads. You will operate at the intersection of deep learning algorithms and low-level hardware, building high-performance training/inference execution engines from the ground up.

✓ Responsibilities

  • Custom Operator Development: Design and implement highly optimized GPU kernels (CUDA/Triton) for critical deep learning operations (e.g., FlashAttention, GEMM, LayerNorm) to outperform standard libraries.
  • Inference Engine Architecture: Contribute to the development of our high-performance inference engine, focusing on graph optimizations, operator fusion, and dynamic memory management (e.g., KV cache optimization).
  • Performance Optimization: Analyze and profile model performance in depth with tools such as Nsight Systems/Compute; identify bottlenecks in memory bandwidth, instruction throughput, and kernel launch overhead.
  • Model Acceleration: Implement advanced acceleration techniques such as quantization (INT8, FP8, AWQ), kernel fusion, and continuous batching.
  • Distributed Computing: Optimize communication primitives (NCCL) to enable efficient multi-GPU and multi-node inference (tensor parallelism, pipeline parallelism).
  • Hardware Adaptation: Ensure the software stack fully exploits modern GPU architecture features (e.g., NVIDIA Hopper/Ampere Tensor Cores, asynchronous copy).
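The kernel-fusion and memory-management work listed above is ultimately about minimizing HBM traffic. As a back-of-envelope illustration (a sketch for context, not code from this role), here is how fusing a chain of elementwise ops cuts memory traffic, under the simplifying assumptions that tensors are fp16 and every unfused kernel round-trips its data through HBM with no cache reuse:

```python
# Back-of-envelope model of why kernel fusion matters.
# Assumptions (illustrative, not from the posting): fp16 tensors, and
# each unfused elementwise kernel reads its input from and writes its
# output to HBM, with no cache reuse between kernels.

DTYPE_BYTES = 2          # fp16
N = 4096 * 4096          # elements in one activation tensor

def traffic_unfused(num_ops: int) -> int:
    """Each elementwise kernel: read 1 tensor, write 1 tensor."""
    return num_ops * 2 * N * DTYPE_BYTES

def traffic_fused(num_ops: int) -> int:
    """One fused kernel: read once, write once; intermediates stay in registers."""
    return 2 * N * DTYPE_BYTES

ops = 3  # e.g. scale -> bias-add -> ReLU
print(traffic_unfused(ops) / traffic_fused(ops))  # -> 3.0
```

For a chain of k elementwise ops the fused kernel moves k× less data, which is why fusion is a first-order win for bandwidth-bound workloads.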

⭐ Minimum Qualifications

  • Bachelor's degree in Computer Science or a related technical field AND 6+ years of technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python; OR equivalent experience.
  • Professional Depth: 5+ years of experience in systems programming, HPC, or GPU software development, including at least 5 years of hands-on CUDA/C++ kernel development.
  • Architectural Mastery: Expertise in the CUDA programming model and NVIDIA GPU architectures (specifically Ampere/Hopper). Deep understanding of the memory hierarchy (shared memory, L2 cache, registers), warp-level primitives, occupancy optimization, and bank-conflict resolution. Familiarity with advanced hardware features: Tensor Cores, TMA (Tensor Memory Accelerator), and asynchronous copy. Proven ability to navigate and modify complex, large-scale codebases (e.g., PyTorch internals, Linux kernel). Experience with build and binding ecosystems: CMake, pybind11, and CI/CD for GPU workloads.
  • Performance Engineering: Mastery of NVIDIA Nsight Systems/Compute. Ability to reason mathematically about performance using the Roofline Model, memory bandwidth utilization, and compute throughput.
  • Other Requirements: The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the Microsoft Cloud Background Check: this position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
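The Roofline reasoning named in the qualifications reduces to comparing a kernel's arithmetic intensity (FLOPs per byte moved) against the hardware's ridge point. A minimal sketch of that comparison for a GEMM, where the peak throughput and bandwidth numbers are placeholder assumptions rather than any official GPU spec:

```python
# Roofline check for a GEMM C[M,N] = A[M,K] @ B[K,N] in fp16.
# PEAK_FLOPS and PEAK_BW are illustrative placeholders (assumptions),
# not official specs for any particular GPU.

PEAK_FLOPS = 989e12        # assumed dense fp16 tensor-core peak, FLOP/s
PEAK_BW = 3.35e12          # assumed HBM bandwidth, bytes/s
RIDGE = PEAK_FLOPS / PEAK_BW   # FLOP/byte where the roofline bends

def gemm_intensity(M: int, N: int, K: int, dtype_bytes: int = 2) -> float:
    """Arithmetic intensity assuming A, B are each read once and C written once."""
    flops = 2 * M * N * K
    bytes_moved = dtype_bytes * (M * K + K * N + M * N)
    return flops / bytes_moved

# Large square GEMM vs. a skinny decode-style GEMM (tiny batch dimension):
for shape in [(8192, 8192, 8192), (8, 8192, 8192)]:
    ai = gemm_intensity(*shape)
    bound = "compute-bound" if ai > RIDGE else "memory-bound"
    print(shape, round(ai, 1), bound)
```

The contrast is the key intuition: large GEMMs sit far right of the ridge point (compute-bound), while skinny decode-time GEMMs fall left of it (memory-bound), which is why inference-engine work obsesses over bandwidth.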

👍 Preferred Qualifications

  • Master's degree in Computer Science or a related technical field AND 8+ years of technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python; OR Bachelor's degree in Computer Science or a related technical field AND 12+ years of such experience; OR equivalent experience.
  • Engine & Framework Expertise: Working knowledge of state-of-the-art inference/training stacks: SGLang, vLLM, TensorRT-LLM, DeepSpeed, or Megatron-LM. Deep understanding of optimization patterns: PagedAttention, RadixAttention (prefix caching), continuous batching, and speculative decoding.
  • Operator & GEMM Optimization: Practical experience with CUTLASS, CuTe, or OpenAI Triton. Expertise in high-performance linear algebra (GEMM) optimization, including tiling strategies, data layouts, and mixed-precision accumulation.
  • Distributed Systems: Proficiency in multi-GPU/multi-node scaling using NCCL and parallelism strategies (tensor, pipeline, and sequence parallelism).
  • Vibe Coding & AI-Native Velocity: An AI-native mindset: expert at using vibe-coding tools to bypass boilerplate and accelerate the development lifecycle, with the technical intuition to architect systems rapidly, moving from "vibe" to highly optimized production code with extreme velocity.
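The PagedAttention pattern named in the preferred qualifications manages the KV cache like virtual memory: each sequence's cache lives in fixed-size physical blocks, addressed through a per-sequence block table. A toy Python sketch of that block-table idea (an illustration of the concept, not vLLM's actual implementation; all names here are hypothetical):

```python
# Toy paged KV-cache allocator illustrating the core PagedAttention idea:
# fixed-size physical blocks plus a per-sequence block table, analogous
# to virtual-memory pages. Illustrative sketch only, not vLLM code.

BLOCK_SIZE = 16  # tokens per physical KV block (assumed)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))   # free physical block ids
        self.block_tables = {}                # seq_id -> [physical block ids]
        self.lengths = {}                     # seq_id -> tokens written so far

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one new token; returns (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        pos = self.lengths.get(seq_id, 0)
        if pos % BLOCK_SIZE == 0:             # first token, or current block full
            table.append(self.free.pop())     # allocate a new physical block
        self.lengths[seq_id] = pos + 1
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

    def release(self, seq_id: int) -> None:
        """Finished sequences return their blocks for reuse by other requests."""
        self.free.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(20):                 # 20 tokens -> ceil(20/16) = 2 blocks
    cache.append_token(seq_id=0)
print(len(cache.block_tables[0]))   # -> 2
```

Because blocks are allocated on demand and returned on completion, memory fragmentation stays low and freed blocks can immediately back new requests, which is what makes continuous batching practical.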

Other Open Roles at Microsoft

  • Software Engineer 2

    Microsoft

    Shanghai · On-site only

  • Logistics Technician

    Microsoft

    Nantong · On-site only

  • AI Business Process Architect

    Microsoft

    Shanghai · On-site only

  • Sales Excellence Manager

    Microsoft

    Beijing · On-site only

  • Senior Applied Scientist - M365 Copilot

    Microsoft

    Beijing · On-site only

Similar Job Recommendations

  • Model Algorithm Engineer

    Ping An (中国平安)

    Shenzhen · On-site only

  • Algorithm Engineer

    Ping An (中国平安)

    Shenzhen · On-site only

  • Senior Algorithm Engineer (AI Security Algorithms)

    Ping An (中国平安)

    Shenzhen · On-site only

  • CA - Senior Java Development Engineer

    Ping An (中国平安)

    Shenzhen · On-site only

  • Test Development Engineer

    Papergames (叠纸游戏)

    Shanghai · On-site only