Auto-Tuning

TIMS Workshop on State-of-the-Art Technologies for
High Performance Computing Software Auto-Tuning
(in conjunction with EASIAM 2012)

About the Workshop:
Software auto-tuning (AT) is a crucial and powerful technology for building high-performance numerical software on current and next-generation complex computer architectures. Maximizing the efficiency of numerical computations is essential for numerical simulations performed on supercomputers. The targets of AT span a wide spectrum of computer systems, ranging from the language and compiler levels to the numerical algorithm and numerical library levels. In this workshop, we invite researchers from Japan and Taiwan to discuss advanced technologies for high-performance computing auto-tuning.

Opening Remarks:
10:45-10:50 Takahiro Katagiri (The University of Tokyo)

Section 1: Numerical Library Algorithms
10:50-11:10 Takahiro Katagiri, Satoshi Ohshima, and Satoshi Ito (The University of Tokyo)

ppOpen-AT: An Auto-tuning Language for ppOpen-HPC ---Its New function and Impact to Application Software
Computer architectures are becoming more and more complex due to non-uniform memory access and hierarchical caches. It is very difficult for scientists and engineers to optimize their code to extract the potential performance of these architectures. To address this, automatic tuning (AT) is an important and critical technology, both for adapting to new architectures and for maintaining the overall software framework. We focus on the AT function of ppOpen-HPC, which is numerical middleware for the post-petascale era. In this presentation, several new functions for ppOpen-AT are proposed. Preliminary results on current computer architectures, including GPUs, are also shown.

11:10-11:30 Takao Sakurai and Ken Naono (Hitachi Ltd., Japan), Takahiro Katagiri, Kengo Nakajima, Satoshi Ohshima, and Shoji Itoh (The University of Tokyo), Hisayasu Kuroda (Ehime University / The University of Tokyo), and Mitsuyoshi Igai (Hitachi ULSI Systems Co., Ltd.)

OpenATLib: Automatic solver and preconditioner selection for sparse matrix libraries
Many iterative solvers and preconditioning methods have been proposed in recent years, and library users can choose among them to solve their linear systems. However, a poor choice can lead to long computation times or to failure to obtain a solution. A method that selects solvers and preconditioners automatically is therefore needed. In this presentation, we present an auto-tuning interface named OpenATLib, which automatically selects the best combination of solver and preconditioner. In addition, we show the results of a performance evaluation on one node of the T2K Open Supercomputer.
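The selection idea in this abstract can be illustrated with a minimal sketch. It is not OpenATLib's interface: the function names, the tiny test matrix, and the use of iteration counts as a stand-in for measured run time are all assumptions made for illustration. Each candidate is run on the same system, candidates that fail to converge are discarded, and the cheapest remaining one is chosen.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pcg(A, b, precond, tol=1e-10, max_iter=200):
    """Preconditioned conjugate gradient; returns (x, iterations, converged)."""
    x = [0.0] * len(b)
    r = b[:]
    z = precond(r)
    p = z[:]
    rz = dot(r, z)
    for k in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) ** 0.5 < tol:
            return x, k, True
        z = precond(r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter, False

def jacobi(A, b, tol=1e-10, max_iter=200):
    """Stationary Jacobi iteration; returns (x, iterations, converged)."""
    n = len(b)
    x = [0.0] * n
    for k in range(1, max_iter + 1):
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
        r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
        if dot(r, r) ** 0.5 < tol:
            return x, k, True
    return x, max_iter, False

def no_precond(r):
    return r[:]

def make_diag_precond(A):
    diag = [A[i][i] for i in range(len(A))]
    return lambda r: [ri / di for ri, di in zip(r, diag)]

def select_best(A, b, candidates):
    """Run every candidate and keep the converged one with the fewest iterations."""
    iterations = {}
    for name, solve in candidates.items():
        _, iters, converged = solve(A, b)
        if converged:
            iterations[name] = iters
    if not iterations:
        raise RuntimeError("no candidate converged")
    return min(iterations, key=iterations.get), iterations

# Hypothetical small symmetric positive definite test system.
A = [[100.0, 1.0, 0.0], [1.0, 10.0, 1.0], [0.0, 1.0, 1.0]]
b = [1.0, 2.0, 3.0]
candidates = {
    "CG, no preconditioner": lambda A, b: pcg(A, b, no_precond),
    "CG, diagonal preconditioner": lambda A, b: pcg(A, b, make_diag_precond(A)),
    "Jacobi iteration": jacobi,
}
best, iterations = select_best(A, b, candidates)
print(best, iterations)
```

A production selector would more likely compare wall-clock time on the target machine; iteration counts are used here only to keep the sketch deterministic.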

11:30-11:50 Yaohung M. Tsai (National Taiwan University), Ray-Bing Chen (National Cheng-Kung University), and Weichung Wang (National Taiwan University)

Optimizing the Block Size for QR Factorization on CPU-GPU Hybrid Systems
In CPU-GPU hybrid systems, the QR factorization in MAGMA leaves the CPU idle because of its fixed block size. To improve the computational efficiency of MAGMA QR factorization, we propose a variable block size auto-tuning scheme for CPU-GPU hybrid systems. First, we fit the CPU and GPU costs in MAGMA QR factorization with two independent regression models to formulate the CPU and GPU performance models. Next, we propose a block size optimization scheme that tunes the block size adaptively so as to minimize a cost objective function. The cost objective function is designed to balance the workloads between the CPU and the GPU based on the performance models. Different procedures have been implemented to deal with variation in performance. Finally, several numerical results demonstrate the performance gains of the novel QR factorization algorithm.
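As a rough illustration of the two-model approach described in this abstract, the sketch below fits one least-squares line to CPU timings and another to GPU timings, then picks the candidate block size at which the two predicted times are closest. The timing samples, the linear model form, and the imbalance objective are hypothetical choices made for this example; they are not the authors' models or MAGMA measurements.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y ~ intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def predict(model, nb):
    intercept, slope = model
    return intercept + slope * nb

def choose_block_size(cpu_model, gpu_model, candidates):
    """Pick the block size whose predicted CPU and GPU times differ least."""
    def imbalance(nb):
        return abs(predict(cpu_model, nb) - predict(gpu_model, nb))
    return min(candidates, key=imbalance)

# Hypothetical timings (ms) for one factorization step at four block sizes.
block_sizes = [32, 64, 96, 128]
cpu_times = [2.6, 4.2, 5.8, 7.4]     # CPU work grows with the block size
gpu_times = [8.72, 7.44, 6.16, 4.88]  # GPU work shrinks as blocks get larger

cpu_model = fit_line(block_sizes, cpu_times)
gpu_model = fit_line(block_sizes, gpu_times)
best_nb = choose_block_size(cpu_model, gpu_model, range(32, 129, 16))
print(best_nb)
```

In a variable block size scheme, a selection like this would be repeated as the factorization proceeds, with models that also depend on the size of the remaining matrix.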

Lunch Break
12:00-13:20 Lobby

Section 2: Performance Models
13:20-13:40 Reiji Suda (The University of Tokyo)

4DAC and One-Step Approximation: Mathematical Formulation and Algorithm for Automatic Tuning
We are investigating mathematical formulations and algorithms for automatic tuning (autotuning), and this talk introduces that work. First, we present our formulation of automatic tuning, with an emphasis on its mathematical concerns. We advocate "4DAC" as a methodology for analyzing those concerns and for constructing the mathematical components of automatic tuning. Second, we introduce "One-Step Approximation", our experimental design algorithm for automatic tuning. It is based on a Bayesian model of uncertainty and derives several approximately optimal experimental designs.

13:40-14:00 Teruo Tanaka, Ryo Otsuka, Akihiro Fujii (Kogakuin University), and Takahiro Katagiri (The University of Tokyo)

An Incremental Parameter Estimation Method Applying d-Spline for Software Automatic Tuning
Reducing the time needed to estimate performance parameters is an important problem in software automatic tuning, which computes the optimal parameters for a given computing environment. To reduce this estimation time, we propose a new method that estimates the optimal performance parameters by inserting suitable sampling points based on the values computed by a fitting function. As the fitting function, we introduce d-Spline, which is highly adaptable and requires little estimation time. As an evaluation, the complexity of d-Spline is discussed.
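The incremental idea, measuring only where a fitted curve suggests the optimum lies, can be sketched as follows. This sketch does not implement d-Spline: it swaps in a much simpler fitting function, a parabola through the three sampled points around the current best, and assumes a roughly convex cost. The `measure` function and the parameter range are hypothetical.

```python
def parabola_vertex(p0, p1, p2):
    """x-coordinate of the vertex of the parabola through three points."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    num = (x1 - x0) ** 2 * (y1 - y2) - (x1 - x2) ** 2 * (y1 - y0)
    den = (x1 - x0) * (y1 - y2) - (x1 - x2) * (y1 - y0)
    if den == 0:  # the three points are collinear, so no vertex exists
        return None
    return x1 - 0.5 * num / den

def incremental_tune(measure, lo, hi, max_samples=20):
    """Add one sample at a time at the minimizer of the current fit."""
    samples = {p: measure(p) for p in (lo, (lo + hi) // 2, hi)}
    while len(samples) < max_samples:
        xs = sorted(samples)
        i = xs.index(min(samples, key=samples.get))
        i = min(max(i, 1), len(xs) - 2)  # keep a full triple of points
        triple = [(x, samples[x]) for x in xs[i - 1:i + 2]]
        vertex = parabola_vertex(*triple)
        if vertex is None:
            break
        candidate = min(max(round(vertex), lo), hi)
        if candidate in samples:  # the fit proposes nothing new: stop
            break
        samples[candidate] = measure(candidate)
    best = min(samples, key=samples.get)
    return best, samples

# Hypothetical cost of running a kernel with integer tuning parameter p.
def measure(p):
    return (p - 37) ** 2 + 5

best, samples = incremental_tune(measure, 1, 100)
print(best, len(samples))
```

The point of the sketch is the sampling loop: far fewer than the 100 possible parameter values are measured before the search stops.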

14:00-14:20 Chenhan D. Yu and Weichung Wang (National Taiwan University)

Modeling and Optimizing Performance of Symmetric Positive Definite Multifrontal on Hybrid CPU-GPU Systems
Solving a large, sparse symmetric positive definite (SPD) linear system is at the heart of many scientific and engineering computations. Focusing on the multifrontal method, we investigate how a GPU can be used to accelerate the computations. A multifrontal method transforms a large, sparse linear system into a sequence of smaller, dense sub-problems that are defined on the frontals. It performs the Cholesky factorization of the SPD coefficient matrix by symbolic factorization followed by numerical factorization. The elimination tree of the SPD matrix can be determined in the symbolic factorization step, which provides the dependencies and the size of each frontal. Thus, we may assign a static schedule for factoring each frontal by analyzing the elimination tree. We first propose simple yet effective computation and communication models for a hybrid CPU-GPU system. The models lead to several CPU/GPU workload distribution schemes that aim to achieve the shortest elapsed time. Although these schemes do not necessarily achieve the minimal execution time, we provide the difference in time between the optimal and approximate schedules. Analytical and numerical results are presented to demonstrate the efficiency of the proposed algorithms and implementations.
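For readers unfamiliar with elimination trees, the sketch below computes one from the nonzero pattern of a symmetric matrix using the standard ancestor-compression algorithm. It is a generic illustration, not the authors' implementation; the input format (`lower_rows`) and the small example pattern are assumptions made here.

```python
def elimination_tree(lower_rows):
    """Return parent[] of the elimination tree (-1 marks a root).

    lower_rows[i] lists the columns j < i where the symmetric matrix has
    a nonzero entry A[i][j] (strictly lower triangular pattern).
    """
    n = len(lower_rows)
    parent = [-1] * n
    ancestor = [-1] * n  # shortcut links that compress repeated walks
    for i in range(n):
        for j in lower_rows[i]:
            # Climb from j toward its current root, relinking the path to i.
            while j != -1 and j < i:
                next_j = ancestor[j]
                ancestor[j] = i
                if next_j == -1:
                    parent[j] = i
                j = next_j
    return parent

# Hypothetical 5x5 pattern with off-diagonal nonzeros at
# (2,0), (3,1), (4,2) and (4,3).
lower_rows = [[], [], [0], [1], [2, 3]]
parent = elimination_tree(lower_rows)
print(parent)
```

In this example, columns 0 and 2 form one branch and columns 1 and 3 another, so the two branches have no dependency on each other until they meet at the root, column 4. That independence is the kind of information a static schedule can exploit.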

Closing Remarks:
14:20-14:25 Weichung Wang (National Taiwan University)

Time:
June 27, 2012

Place:
Room 202, Astronomy-Mathematics Building, National Taiwan University

Organizers:
Takahiro Katagiri (Information Technology Center, The University of Tokyo)
Weichung Wang (Department of Mathematics, National Taiwan University)

Contact Person:
Ms. Chia-Chen Chiang, tassist5@math.ntu.edu.tw, (02) 3366-4469

Weichung Wang

Weichung Wang: National Taiwan University: Institute of Applied Mathematical Sciences and Department of Mathematics
