Key Responsibilities:
Build the compute platform and machine learning libraries for large-scale machine learning and simulation workloads
Focus on the stability and efficiency of the compute platform on both CPU and GPU clusters, making the platform observable and scalable
Use cluster monitoring and profiling tools to identify bottlenecks and optimize both infrastructure and software systems
Challenges You Will Tackle:
Design, build, and improve our compute platform for model training and simulation on petabyte-scale data, across a wide range of machine learning models, by leveraging our existing research infrastructure.