GPU Architecture Mastery: In-depth understanding of modern GPU architectures, including streaming multiprocessors (SMs/CUs), the memory hierarchy (registers, shared memory, L1/L2 cache, HBM), and warp/wavefront execution models.
Kernel Programming Expertise: Strong proficiency in C++ and parallel computing, with extensive hands-on experience in NVIDIA CUDA or AMD HIP kernel programming.
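The kind of kernel work described above can be illustrated with a small block-level sum reduction that walks the memory hierarchy explicitly: values start in registers, warps combine them with shuffle intrinsics, partials are staged in shared memory, and one atomic per block touches global memory. This is a generic sketch, not code from any particular role; all names are illustrative.

```cuda
#include <cuda_runtime.h>

__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float warp_sums[32];            // one slot per warp in the block
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % warpSize;
    int warp = threadIdx.x / warpSize;

    float v = (tid < n) ? in[tid] : 0.0f;      // value lives in a register

    // Reduce within the warp via shuffles (no shared-memory traffic needed).
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);

    if (lane == 0) warp_sums[warp] = v;        // stage per-warp partials in shared memory
    __syncthreads();

    // First warp reduces the per-warp partials.
    if (warp == 0) {
        v = (lane < blockDim.x / warpSize) ? warp_sums[lane] : 0.0f;
        for (int offset = warpSize / 2; offset > 0; offset /= 2)
            v += __shfl_down_sync(0xffffffff, v, offset);
        if (lane == 0) atomicAdd(out, v);      // one atomic per block reaches L2/HBM
    }
}
```

The same pattern ports to AMD HIP largely by renaming the runtime calls and accounting for the 64-wide wavefront.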
Performance Engineering: Demonstrated ability to debug and profile complex GPU workloads, interpreting low-level metrics to drive architecture-aware optimizations.
Systems Knowledge: Familiarity with asynchronous execution, stream management, and host-device memory transfers.
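The systems skills above typically show up as pipelines that overlap host-device transfers with compute. Below is a hedged sketch (the `process` kernel and all sizes are placeholders) using two CUDA streams, double-buffered device memory, and pinned host memory so the async copies can actually proceed asynchronously.

```cuda
#include <cuda_runtime.h>
#include <cstring>

__global__ void process(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;                             // placeholder compute
}

void pipeline(float* h_data, int n, int chunks) {
    int chunk = n / chunks;                                // assume chunks divides n
    float *h_pinned, *d_buf[2];
    cudaMallocHost(&h_pinned, n * sizeof(float));          // pinned => true async copies
    std::memcpy(h_pinned, h_data, n * sizeof(float));
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) {
        cudaStreamCreate(&s[i]);
        cudaMalloc(&d_buf[i], chunk * sizeof(float));
    }
    for (int c = 0; c < chunks; ++c) {
        int i = c % 2;                                     // double-buffer across streams
        float* src = h_pinned + c * chunk;
        cudaMemcpyAsync(d_buf[i], src, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        process<<<(chunk + 255) / 256, 256, 0, s[i]>>>(d_buf[i], chunk);
        cudaMemcpyAsync(src, d_buf[i], chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();                               // drain both streams
    std::memcpy(h_data, h_pinned, n * sizeof(float));
    cudaFreeHost(h_pinned);
    for (int i = 0; i < 2; ++i) { cudaFree(d_buf[i]); cudaStreamDestroy(s[i]); }
}
```

Work issued to the same stream stays ordered, while the two streams let chunk `c+1`'s upload overlap chunk `c`'s kernel and download.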
Python DSLs & Triton: Experience implementing kernels using OpenAI Triton or other Python-based DSLs for agile kernel development and auto-tuning.
Inference Engine Experience: Hands-on experience integrating custom kernels into large-scale inference frameworks such as vLLM, SGLang, or TensorRT-LLM.
Deep Learning Frameworks: Familiarity with writing custom extensions or operators for PyTorch (C++/CUDA extensions).
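A minimal example of the custom-operator work mentioned above: a single `.cu` file that PyTorch's `torch.utils.cpp_extension` machinery can compile into a Python-importable module. The op name (`my_relu`) and kernel are hypothetical, chosen only to keep the sketch short.

```cuda
#include <torch/extension.h>

__global__ void relu_kernel(const float* in, float* out, int64_t n) {
    int64_t i = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] > 0.0f ? in[i] : 0.0f;
}

torch::Tensor my_relu(torch::Tensor x) {
    TORCH_CHECK(x.is_cuda() && x.dtype() == torch::kFloat32,
                "my_relu expects a float32 CUDA tensor");
    auto x_  = x.contiguous();
    auto out = torch::empty_like(x_);
    int64_t n = x_.numel();
    int threads = 256;
    int blocks  = (int)((n + threads - 1) / threads);
    relu_kernel<<<blocks, threads>>>(x_.data_ptr<float>(), out.data_ptr<float>(), n);
    return out;
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("my_relu", &my_relu, "Elementwise ReLU via a custom CUDA kernel");
}
```

From Python this would be built and called with something like `torch.utils.cpp_extension.load(name="my_relu_ext", sources=["my_relu.cu"])`, after which the operator behaves like any other function on CUDA tensors.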
Hardware Agnosticism: Experience porting kernels between NVIDIA and AMD architectures or working with cross-platform HPC libraries.