Compilers and Runtimes for AI Accelerators
Flexer: Out-of-Order Scheduling for Multi-NPUs
Recent neural accelerators often comprise multiple neural processing units (NPUs) with shared cache and memory. The regular schedules of state-of-the-art scheduling techniques miss important opportunities for memory reuse. Flexer is an out-of-order (OoO) scheduler that maximizes instruction-level parallelism and data reuse on such multi-NPU systems. It employs a list scheduling algorithm to dynamically schedule the tiled workload to all NPUs. To cope with the irregular data access patterns of OoO schedules, several heuristics help maximize data reuse by considering the availability of data tiles at different levels in the memory hierarchy.
Evaluated with several neural networks on 2- to 4-core multi-NPUs, Flexer achieves a speedup of up to 2.2x and a 1.2-fold reduction in data transfers for individual layers compared to the best static execution order.
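The core idea can be sketched in a few lines: a list scheduler repeatedly assigns, to the next free NPU, the ready tile whose input data already resides closest to that NPU. The tile/NPU abstractions and the reuse heuristic below are simplified assumptions for illustration; they do not reproduce Flexer's actual cost functions or memory model.

```python
# Hypothetical sketch of an out-of-order list scheduler for tiled workloads.
# Tile/NPU classes and the reuse score are illustrative, not Flexer's API.
from dataclasses import dataclass, field

@dataclass
class Tile:
    tid: int
    deps: set      # ids of tiles that must finish before this one
    inputs: set    # ids of the data tiles this tile reads

@dataclass
class NPU:
    nid: int
    local: set = field(default_factory=set)   # data tiles in the NPU-local buffer
    busy_until: int = 0

def reuse_score(tile, npu, shared_cache):
    """Prefer tiles whose inputs already reside close to this NPU."""
    return 2 * len(tile.inputs & npu.local) + len(tile.inputs & shared_cache)

def ooo_schedule(tiles, npus):
    shared_cache, done, order = set(), set(), []
    pending = {t.tid: t for t in tiles}
    while pending:
        ready = [t for t in pending.values() if t.deps <= done]   # dependence-free tiles
        npu = min(npus, key=lambda n: n.busy_until)               # next free NPU
        tile = max(ready, key=lambda t: reuse_score(t, npu, shared_cache))
        order.append((npu.nid, tile.tid))
        npu.local |= tile.inputs          # model the tile's inputs as now resident
        shared_cache |= tile.inputs
        npu.busy_until += 1
        done.add(tile.tid)
        del pending[tile.tid]
    return order

tiles = [Tile(0, set(), {"A0", "B0"}), Tile(1, {0}, {"A0", "B1"}), Tile(2, set(), {"A1", "B0"})]
print(ooo_schedule(tiles, [NPU(0), NPU(1)]))
```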
Publications:
-
Hyemi Min, Jungyoon Kwon, and Bernhard Egger. "Flexer: Out-of-Order Scheduling for Multi-NPUs." In Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization (CGO ’23), Montréal, Canada, February/March 2023.
[pdf][bibtex][doi]
SENNA: Unified Hardware/Software Space Exploration for Parametrizable Neural Network Accelerators
Parametrizable neural network accelerators enable the deployment of targeted hardware for specialized environments. Finding the best architecture configuration for a given specification, however, is challenging. A large number of hardware configurations have to be considered, and for each hardware instance, an efficient software execution plan needs to be found, leading to a vast search space. Prior work has tackled this problem by dividing the search into subproblems for individual layers of a network. There is no guarantee, however, that the overall best hardware configuration that delivers the desired end-to-end performance across the entire network is among the best individual layer configurations.
SENNA is a unified hardware/software space exploration framework for parametrizable neural network accelerators. To guide the exploration towards the overall best configuration, SENNA employs a multi-objective genetic algorithm with a novel design space representation that encodes the hardware and software parameters in a single chromosome. Using the Parallel Island Model (PIM), each layer is represented by one or more islands, each containing a separate population, to simultaneously search for the best configuration across the entire network. A tailored gene migration technique enables the exchange of genes between the populations of different islands. SENNA is evaluated with three parametrizable architectures and four neural networks. The evaluation demonstrates that SENNA achieves up to 1.92x EDP improvement over the state of the art. With equivalent evaluation budgets, SENNA shows a 2.5x-9.3x speedup compared to an Oracle scheme and the state of the art.
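As a rough illustration of the island-model search, the sketch below evolves one population per island with a chromosome that encodes hardware parameters and per-layer schedules, and periodically migrates individuals between islands. The gene encoding, the single-objective cost stand-in for EDP, and the migration policy are invented for illustration and are not SENNA's actual implementation.

```python
# Toy parallel-island GA; the encoding and cost function are invented stand-ins.
import random

HW_GENES = {"pes": [64, 128, 256], "buf_kb": [128, 256, 512]}   # hardware parameters
SW_GENES = {"tile": [8, 16, 32], "loop_order": [0, 1, 2]}       # per-layer software parameters

def random_chromosome(num_layers):
    # One chromosome encodes the hardware configuration and all per-layer schedules.
    return {"hw": {k: random.choice(v) for k, v in HW_GENES.items()},
            "sw": [{k: random.choice(v) for k, v in SW_GENES.items()} for _ in range(num_layers)]}

def cost(chrom):
    # Placeholder for an end-to-end EDP estimate of the whole network.
    return sum(layer["tile"] / chrom["hw"]["pes"] for layer in chrom["sw"])

def evolve(island, elite=2):
    island.sort(key=cost)
    parents, children = island[:len(island) // 2], []
    while len(children) < len(island) - elite:
        a, b = random.sample(parents, 2)                          # crossover of two parents
        child = {"hw": dict(random.choice((a, b))["hw"]),
                 "sw": [dict(random.choice((la, lb))) for la, lb in zip(a["sw"], b["sw"])]}
        if random.random() < 0.1:                                 # occasionally mutate one layer
            child["sw"][random.randrange(len(child["sw"]))] = \
                {k: random.choice(v) for k, v in SW_GENES.items()}
        children.append(child)
    return island[:elite] + children

def migrate(islands, k=1):
    # Ring migration: each island's best individuals are copied into the next island.
    for i, isl in enumerate(islands):
        islands[(i + 1) % len(islands)][-k:] = sorted(isl, key=cost)[:k]

islands = [[random_chromosome(num_layers=4) for _ in range(20)] for _ in range(3)]
for gen in range(50):
    islands = [evolve(isl) for isl in islands]
    if gen % 10 == 9:
        migrate(islands)
print(min((c for isl in islands for c in isl), key=cost))
```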
Publications:
This work is currently under review - SENNA will be open-sourced as soon as the work is published.
Heterogen: Accelerating LLMs using CPU and GPU Resources
Large language models (LLMs) have recently captured the attention of a broad audience with their text-generation capabilities. In large part, this exceptional performance was made possible by an exponential growth in the number of model parameters. This growth, however, comes at the expense of significantly higher operational costs and decreased processing speed. Recent research has focused on running LLMs on commodity hardware, for example, by exploiting the memory hierarchy to increase the batch size and thereby the throughput. These studies, however, tend to overlook or inefficiently utilize the additional computational resources provided by the CPU. This work introduces a technique that efficiently harnesses all available computational resources through a finely tuned, dynamic workload allocation. The technique applies to decoder-based models on standard general-purpose hardware and effectively minimizes idle periods for both the CPU and the GPU. We conducted experiments with several large language models representing distinct decoder-based architectures. Compared to the state of the art, the results demonstrate a throughput increase of up to 105% for the OPT-30B model.
This work is open-source. You can find the code on our Gitlab repository.
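For intuition, the gist of the dynamic allocation can be pictured as follows. The stand-in work functions and the simple rebalancing rule are illustrative assumptions, not the policy implemented in the repository.

```python
# Toy dynamic CPU/GPU split for one decoding step of a batched decoder model.
import time
from concurrent.futures import ThreadPoolExecutor

def run_gpu(batch):      # stand-in for the GPU part of a decode step
    time.sleep(0.001 * len(batch))

def run_cpu(batch):      # stand-in for the CPU part (e.g., attention on offloaded data)
    time.sleep(0.004 * len(batch))

def timed(fn, batch):
    start = time.perf_counter()
    fn(batch)
    return time.perf_counter() - start

def decode_step(requests, cpu_share):
    split = int(len(requests) * cpu_share)
    with ThreadPoolExecutor(max_workers=2) as pool:
        f_cpu = pool.submit(timed, run_cpu, requests[:split])
        f_gpu = pool.submit(timed, run_gpu, requests[split:])
        t_cpu, t_gpu = f_cpu.result(), f_gpu.result()
    # Shift work toward whichever side finished earlier so neither CPU nor GPU idles.
    return cpu_share * 0.9 if t_cpu > t_gpu else min(0.5, cpu_share * 1.1)

cpu_share = 0.1
for _ in range(20):
    cpu_share = decode_step(list(range(64)), cpu_share)
print(f"converged CPU share: {cpu_share:.2f}")
```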
Publications:
-
Daon Park and Bernhard Egger. "Improving Throughput-oriented Large Language Model Inference with CPU Computations." In Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques (PACT'24), Long Beach, CA, October 2024.
[pdf][bibtex][doi][source]
Parallel Runtimes
Several efficient parallel runtimes are available in source-code form, along with models able to predict execution time or system utilization. At the moment, we provide two runtimes, one for NUMA architectures and one for integrated CPU/GPU processors. The packages are provided as artifacts.
Dopia: Online Parallelism Management for Integrated CPU/GPU Architectures
Achieving maximal performance on modern architectures where CPU and GPU cores are co-located on the same processor die turns out to be a surprisingly difficult task. Depending on the characteristics of the parallel application and the underlying hardware, disabling a number of CPU and GPU cores can lead to significantly higher performance than employing all available cores.
The Dopia framework provides an environment to automatically determine the best degree of parallelism (DoP) and execute OpenCL parallel applications on integrated architectures. The framework statically analyzes the OpenCL kernel code, extracts performance-relevant features, and rewrites the kernel to make it malleable. A plug-in runtime then feeds the extracted performance features into a pre-trained machine-learned model to determine the best DoP and executes the OpenCL kernel on the CPU and GPU cores at that DoP.
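A minimal sketch of the final step, assuming a handful of made-up static kernel features and a generic regressor in place of Dopia's pre-trained model:

```python
# Toy DoP predictor: made-up kernel features and a generic multi-output regressor.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Static features per kernel: [memory ops, compute ops, branches, work-group size]
X_train = np.array([[120, 800, 4, 256], [400, 200, 16, 64], [60, 900, 2, 128]])
# Best observed degree of parallelism per kernel: [CPU cores, GPU work-groups]
y_train = np.array([[8, 192], [4, 64], [8, 256]])

model = RandomForestRegressor(n_estimators=50).fit(X_train, y_train)

def best_dop(features):
    cpu_cores, gpu_workgroups = model.predict([features])[0]
    return int(round(cpu_cores)), int(round(gpu_workgroups))

# The malleable kernel would then be launched at this degree of parallelism.
print(best_dop([150, 700, 3, 256]))
```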
Click the [software] link below to download Dopia. The framework includes training data and pre-trained models for Intel Skylake and AMD Kaveri integrated architectures.
This work has been presented at PPoPP 2022 as follows:
-
Younghyun Cho, Jiyeon Park, Florian Negele, Changyeon Jo, Thomas R. Gross, and Bernhard Egger. "Dopia: Online Parallelism Management for Integrated CPU/GPU Architectures." In 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’22), April 2–6, 2022, Seoul, Republic of Korea.
[pdf][bibtex][doi][software]
Maximizing system utilization for co-located applications
Modeling the underlying NUMA architecture with a sophisticated queuing model, this framework improves the turnaround time and throughput of co-located parallel applications by maximizing overall system throughput. Click on the [artifact] link to access the software.
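The idea can be illustrated with a toy core-partitioning loop; the contention model below is a placeholder for the queuing model described in the papers, and all numbers are invented.

```python
# Toy core partitioning between co-located applications; the contention model
# below stands in for the queuing model and is not taken from the papers.
def throughput(cores, contention):
    # Diminishing returns as more cores contend for shared memory resources.
    return cores / (1.0 + contention * cores)

def partition(total_cores, contention_factors):
    alloc = [1] * len(contention_factors)          # every application gets at least one core
    for _ in range(total_cores - len(alloc)):
        # Give the next core to the application whose predicted throughput gains the most.
        gains = [throughput(alloc[i] + 1, c) - throughput(alloc[i], c)
                 for i, c in enumerate(contention_factors)]
        alloc[gains.index(max(gains))] += 1
    return alloc

# Three co-located applications with different contention behavior on a 32-core machine.
print(partition(32, [0.02, 0.10, 0.30]))
```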
-
Younghyun Cho, Camilo A.C. Guzman, and Bernhard Egger. "Maximizing System Utilization via Parallelism Management for Co-Located Parallel Applications." In Proceedings of the 2018 International Conference on Parallel Architectures and Compilation (PACT'18), Limassol, Cyprus, November 2018.
[pdf][bibtex][doi][artifact]
-
Younghyun Cho, Surim Oh, and Bernhard Egger. "Online Scalability Characterization of Data-Parallel Programs on Many Cores." In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation (PACT'16), Haifa, Israel, September 2016.
[pdf][bibtex][doi][artifact]
Maximizing throughput of OpenCL applications on integrated CPU/GPU architectures
Integrated CPU/GPU architectures (APUs) enable fast and efficient workload balancing on the cores of the CPU and the GPU. Our approach is completely online and does not require any offline performance characterization or prior application profiling.
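One way to picture such online balancing is chunk-based self-scheduling, where CPU and GPU workers grab chunks of the kernel's index space as soon as they become free. The sketch below is a generic illustration under that assumption, not the partitioning scheme of the paper.

```python
# Toy chunk-based self-scheduling of a kernel's index space across CPU and GPU.
import threading
import time
from queue import Queue, Empty

def self_schedule(total_items, chunk, workers):
    chunks = Queue()
    for start in range(0, total_items, chunk):      # fill a shared queue with index-space chunks
        chunks.put((start, min(start + chunk, total_items)))
    done = {name: 0 for name, _ in workers}

    def run(name, process):
        while True:
            try:
                lo, hi = chunks.get_nowait()        # grab the next chunk as soon as we are free
            except Empty:
                return
            process(lo, hi)                         # execute this chunk on the device
            done[name] += hi - lo

    threads = [threading.Thread(target=run, args=w) for w in workers]
    for t in threads: t.start()
    for t in threads: t.join()
    return done                                     # items processed per device

# Stand-ins for launching a kernel range on each device.
cpu = ("cpu", lambda lo, hi: time.sleep(0.002))
gpu = ("gpu", lambda lo, hi: time.sleep(0.0005))
print(self_schedule(total_items=10_000, chunk=256, workers=[cpu, gpu]))
```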
-
Younghyun Cho, Florian Negele, Seohong Park, Bernhard Egger, and Thomas R. Gross. "On-The-Fly Workload Partitioning for Integrated CPU/GPU Architectures." In Proceedings of the 2018 International Conference on Parallel Architectures and Compilation (PACT'18), Limassol, Cyprus, November 2018.
[pdf][bibtex][doi][artifact]
SnuMAP: profiling parallel architectures
SnuMAP is an open-source application and system profiler for parallel architectures. SnuMAP provides detailed execution traces and easy visualization for one or multiple concurrent parallel applications executed on a multi-/many-core platform.
Virtualization
Efficient checkpointing
A patch for Xen to enable space-efficient VM checkpointing is available here. The corresponding research papers detailing the method are
-
Bernhard Egger, Eunbyung Park, Younghyun Cho, Changyeon Jo, and Jaejin Lee. "Efficient Checkpointing of Live Virtual Machines." In IEEE Transactions on Computers (TC), Volume 65, Issue 10, pp. 3041 - 3054, January 2016.
[pdf][bibtex][doi]
-
Eunbyung Park, Bernhard Egger, and Jaejin Lee. "Fast and space efficient virtual machine checkpointing." In Proceedings of the ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE'11), Newport Beach, USA, March 2011.
[pdf][bibtex][doi]
Live migration modeling
We have developed an automatic, machine-learned model to predict several key metrics of live migration. To train the model, we have gathered around 50,000 data points from live migrations running different benchmarks. The data set and the machine learning model are available here; the details of our method are described in
-
Changyeon Jo, Youngsu Cho, and Bernhard Egger. "A Machine Learning Approach to Live Migration Modeling." In Proceedings of the 2017 ACM Symposium on Cloud Computing (SoCC'17), Santa Clara, USA, September 2017.
[pdf][bibtex][doi]
-
Changyeon Jo, Changmin Ahn, and Bernhard Egger. "A Machine Learning-based Approach to Live Migration Modeling." Presented at the 4th International Workshop on Efficient Data Center Systems (EDCS'16), Seoul, Korea, June 2016.
[pdf][bibtex]
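Schematically, the modeling step boils down to fitting a regressor on the logged migrations. The features and the learner below are illustrative placeholders rather than the model described in the publications above.

```python
# Toy version of the modeling step: fit a regressor on logged migration features.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Each row: [VM memory size (GB), page dirty rate (MB/s), available bandwidth (MB/s)]
X = np.array([[4, 50, 500], [8, 200, 500], [16, 120, 1000]])
y_total_time = np.array([9.1, 25.4, 18.2])    # observed total migration time in seconds

model = GradientBoostingRegressor().fit(X, y_total_time)
print(model.predict([[8, 100, 800]]))         # predicted total migration time for a new VM
```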