Meta Open-Sources HTA: A Scalable Performance Analysis Tool to Support State-of-the-Art Machine Learning Workloads

Machine learning and deep learning models perform remarkably well on a wide range of tasks thanks to recent technological advances. This outstanding performance is not without cost, however: these models often require significant computational power and resources to reach high accuracy, which makes scaling them challenging. Moreover, ML researchers and systems engineers often fail to scale their models efficiently because they are unaware of the performance bottlenecks in their workloads, and the resources allocated to a job frequently do not match what it actually needs. Understanding resource usage and bottlenecks in distributed training workloads is therefore critical to getting the most out of a model's hardware stack.

The PyTorch team has been working on this problem and recently released Holistic Trace Analysis (HTA), a performance analysis and visualization Python library. The library can be used to understand performance and identify bottlenecks in distributed training workloads. It does so by analyzing traces collected with the PyTorch profiler, also known as Kineto. Kineto traces are often complex to interpret; this is where HTA helps users leverage the performance data contained in them. The library was first used internally at Meta to better understand performance issues in GPU-intensive distributed training jobs. The team then set out to improve and extend HTA's capabilities to support evolving ML workloads.
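For context, a Kineto trace can be collected with the PyTorch profiler roughly as follows. This is a minimal single-process sketch; the model, optimizer, and output directory are stand-ins, and a real distributed job would produce one trace file per rank:

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

# Stand-in model and optimizer for a real training loop (assumes a CUDA device).
model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Profile a few steps and write a Kineto trace into ./traces for later analysis.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=tensorboard_trace_handler("./traces"),
) as prof:
    for _ in range(5):
        x = torch.randn(64, 512, device="cuda")
        loss = model(x).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        prof.step()  # advance the profiler schedule after each training step
```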


Understanding GPU performance in distributed training jobs requires considering many elements, such as how model operators interact with the GPU hardware and how those interactions can be measured. GPU operations during model execution can be classified into three main kernel classes: computation (COMP), communication (COMM), and memory (MEM). Computation kernels handle all the computational operations performed during model execution. Communication kernels, in turn, are responsible for synchronizing and transferring data between the GPU devices in a distributed training job. Memory kernels manage memory allocations on the GPU hardware as well as data transfers between host memory and the GPUs.
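As a purely illustrative example (this is not HTA's actual implementation), the class of a GPU kernel can often be guessed from its name: NCCL kernels handle collective communication, memcpy/memset events move or initialize memory, and most everything else is computation:

```python
def classify_kernel(name: str) -> str:
    """Illustrative heuristic mapping a GPU kernel name to COMP/COMM/MEM."""
    lowered = name.lower()
    if "nccl" in lowered:                            # collective communication
        return "COMM"
    if "memcpy" in lowered or "memset" in lowered:   # memory transfers/initialization
        return "MEM"
    return "COMP"                                    # everything else: computation

print(classify_kernel("ncclKernel_AllReduce_RING_LL_Sum_float"))  # COMM
print(classify_kernel("Memcpy HtoD (Pageable -> Device)"))        # MEM
print(classify_kernel("volta_sgemm_128x64_nn"))                   # COMP
```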

The performance of a distributed GPU training job depends critically on how the model implementation exercises these GPU kernels. This is where the HTA library steps in: it provides insight into how the model implementation interacts with the GPU hardware and points out areas for speed improvement. The library seeks to give users a more comprehensive understanding of the inner workings of distributed GPU training.

Understanding the performance of distributed GPU training jobs can be difficult for practitioners. This inspired the PyTorch team to create HTA, which simplifies the trace analysis process and gives users insights by examining the traces of model execution. As sketched below, all of HTA's analyses start from an analyzer object that loads the collected traces; the features that follow are built on top of it.
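Assuming the library is installed (pip install HolisticTraceAnalysis) and the Kineto trace files from all ranks sit in one folder, loading them follows the library's documented entry point (method and argument names may differ across versions):

```python
from hta.trace_analysis import TraceAnalysis

# Point the analyzer at the folder containing one Kineto trace file per rank.
analyzer = TraceAnalysis(trace_dir="./traces")
```

With the traces loaded, HTA supports the following features: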

Temporal breakdown: This feature provides a breakdown of the time the GPUs spend across all ranks in terms of computation, communication, memory events, and idle time.
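Per the documented API, this breakdown is a single call on the analyzer, returning a DataFrame of per-rank time percentages (exact columns may vary by version):

```python
# Time spent by each rank on computation, communication, memory events, and idle.
time_spent_df = analyzer.get_temporal_breakdown()
```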

Kernel breakdown: This feature separates the time spent in each of the three kernel types (COMM, COMP, and MEM) and orders the kernels by time spent, in decreasing order of duration.
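In the documented API this is also one call, returning two DataFrames: one aggregated by kernel type and one with per-kernel metrics (the kernel duration distribution described next builds on the same per-kernel data):

```python
# Aggregate time per kernel type (COMM, COMP, MEM) plus per-kernel timing details.
kernel_type_df, kernel_df = analyzer.get_gpu_kernel_breakdown()
```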

Kernel duration distribution: The distribution of the average time taken by a given kernel across all ranks can be visualized using bar graphs produced by HTA. The graphs also display the minimum and maximum time spent by a given kernel on a given rank.

Communication computation overlap: In distributed training, a significant amount of time is spent communicating and synchronizing between GPUs. To achieve high GPU efficiency, it is vital to keep a GPU from blocking while it waits for data from other GPUs. Measuring the overlap of computation and communication is one way to assess how much computation is held up by data dependencies; this feature computes the percentage of communication time that overlaps with computation.
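The documented API exposes this as a single call; the returned per-rank percentage is, roughly, the time spent computing while communicating divided by the total time spent communicating:

```python
# Percentage of communication time that overlaps with computation, per rank.
overlap_df = analyzer.get_comm_comp_overlap()
```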

Augmented counters (queue length and memory bandwidth): For debugging purposes, HTA creates augmented trace files that include statistics on the memory bandwidth used as well as the number of outstanding operations on each CUDA stream (known as the queue length).
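Per the documented API, generating these augmented traces is one call; it writes new trace files that can be opened in a trace viewer alongside the originals:

```python
# Emit trace files augmented with queue-length and memory-bandwidth counters.
analyzer.generate_trace_with_counters()
```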

These key features give users a snapshot of system performance and help them understand what is going on internally. The PyTorch team also intends to add capabilities in the near future that explain why certain things are happening, along with potential strategies for overcoming bottlenecks. HTA has been made available as an open-source library to serve a larger audience. It can be used for various purposes, including deep-learning-based recommendation systems, NLP models, and computer-vision tasks. Detailed documentation for the library can be found here.


Check out the GitHub repository and the blog post. All credit for this research goes to the researchers on this project.


Khushboo Gupta is a Consultant Trainee at MarktechPost. She is currently pursuing her Bachelor of Technology degree at the Indian Institute of Technology (IIT), Goa. She is passionate about machine learning, natural language processing, and web development, and enjoys learning more about technical fields by participating in various challenges.

