# Dr. Bartosz Kostrzewa

## Fields of expertise

In the HPC/A team I can support current and prospective users on the following topics, among others:

- Lattice field theory & lattice QCD
- Computational physics
- Statistical data analysis, including the usage of high-performance shared and distributed-memory frameworks
- Optimisation of stencil, matrix-vector, matrix-matrix and tensor kernels and solvers for various architectures
- Code profiling and understanding of performance bottlenecks and their relation to computer architecture and/or network configuration and topology
- Approaches for performance-portability
- Proper usage of batch systems and schedulers, in particular in relation to process and thread-pinning and problem / machine topology
- Authorship of proposals for computing projects (up to and including Tier-0 and PRACE / EuroHPC)

Dr. Bartosz Kostrzewa

Friedrich-Hirzebruch-Allee 8

53115 Bonn

## Research Interests

I pursue lattice QCD research in the context of the Extended Twisted Mass Collaboration (ETMC) and have been deeply involved both in our road towards simulations at the physical point and in various related physics projects. With my collaborators in Bonn, I've performed ab initio calculations of multi-particle scattering on the lattice, studying interactions in both weakly-interacting and resonant channels with light- and strange-quark contributions, extracted from large correlator matrices computed using stochastic distillation techniques.

Together with collaborators from Bonn, Rome, Pisa and Cyprus, I also employ lattice field theory methods to study Physics beyond the Standard Model (SM) of Particle Physics in the Frezzotti-Rossi model for non-perturbative elementary particle mass generation. This model offers a mechanism to (partially) resolve the hierarchy and naturalness problems of the SM while offering a way to give mass to all massive elementary particles. It can be understood as a dynamical alternative to the Higgs mechanism without the problems of Technicolor. The eventual goal of our research is a prediction of a clear bound (in the few-TeV range) on the energy region in which to search for the new particles the model requires. As such, the model is falsifiable and can serve as a basis for model builders seeking extensions of the SM that survive all electroweak precision tests.

In the context of contemporary simulations in lattice field theory, large problem sizes and complicated observables have increased both the size and complexity of the datasets that are to be analysed. In addition, the availability of faster machines has led to improvements in statistical precision, which in turn increase the relative impact of systematic uncertainties on analysis results, requiring these systematic errors to be studied in detail. These factors have turned data analysis in lattice field theory into a major endeavour requiring clean frameworks and significant computational resources. I've been involved in the design and implementation of several such analysis packages, mostly written in the R programming language, and continue to contribute to them actively.

I've been involved in the design and optimisation of computational kernels and solvers occurring in lattice QCD which make use of hybrid OpenMP/MPI parallelisation as well as GPU accelerators, and have contributed to and co-authored software packages employed for this purpose on Europe's largest supercomputers. While the algorithms employed in lattice field theory have become substantially more complicated over the last few years, the hardware that these algorithms must run on has also become more diverse and more difficult to program for, to the extent that optimal strategies for one architecture can be essentially orthogonal to those for another. In addition, many parallel execution units, deep memory hierarchies and diverse inter-process communication strategies mean that auto-tuning computational kernels (over some space of launch parameters) is now mandatory to achieve the highest possible performance across the problem sizes and machine configurations encountered in practice.
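The auto-tuning idea can be sketched as follows. This is a deliberately minimal illustration in Python rather than the CUDA/C++ used in practice; the toy kernel, the candidate "block sizes" and the timing harness are all placeholders, not any real library's API.

```python
import time

def run_kernel(n, block):
    """Toy stand-in for a tunable kernel: sum n values in chunks of `block`."""
    total = 0.0
    for start in range(0, n, block):
        total += sum(range(start, min(start + block, n)))
    return total

def autotune(n, candidates, reps=3):
    """Time each candidate launch parameter and keep the fastest one."""
    best_block, best_time = None, float("inf")
    for block in candidates:
        t0 = time.perf_counter()
        for _ in range(reps):
            run_kernel(n, block)
        elapsed = (time.perf_counter() - t0) / reps
        if elapsed < best_time:
            best_block, best_time = block, elapsed
    return best_block

block = autotune(100_000, candidates=[64, 256, 1024, 4096])
print("selected block size:", block)
```

In a real tuner the measured timings would be cached per problem size and machine configuration, so that the search cost is paid only once per (kernel, geometry, hardware) combination.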

This combination of factors means that the historical practice of hand-optimising computational kernels for each new architecture is increasingly unrealistic; ideally, one would develop strategies that minimise the amount of platform-specific code in software packages. To this end, I've been exploring frameworks such as Kokkos and HPX, various domain-specific languages, as well as the possibility of designing such a framework specifically for lattice field theory applications.
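The core pattern behind such frameworks is to express a kernel once, as a body applied over an index range, and let a swappable execution policy decide how it runs. The following Python sketch mimics that separation with two interchangeable toy "backends"; the names and structure are illustrative only and do not reproduce the API of Kokkos, HPX or any other framework.

```python
def serial_backend(n, body):
    """Execute the kernel body sequentially over the index range."""
    for i in range(n):
        body(i)

def chunked_backend(n, body, chunk=4):
    """Stand-in for a threaded/offloaded backend: same semantics,
    different iteration schedule."""
    for start in range(0, n, chunk):
        for i in range(start, min(start + chunk, n)):
            body(i)

def parallel_for(n, body, backend=serial_backend):
    """The kernel is written once; the execution policy is swapped
    per architecture without touching the kernel itself."""
    backend(n, body)

# An axpy-like kernel expressed once, independent of the backend:
x = [1.0] * 8
y = [2.0] * 8
parallel_for(len(x), lambda i: x.__setitem__(i, x[i] + 3.0 * y[i]),
             backend=chunked_backend)
print(x)  # every element is now 1.0 + 3.0 * 2.0 = 7.0
```

The design choice being illustrated: platform-specific knowledge lives entirely in the backends, so adding a new architecture means adding one backend rather than rewriting every kernel.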

As we move towards exascale machines and ever larger problem sizes and complexities, the balance between achievable floating-point performance and memory, network and I/O bandwidth is changing substantially. In addition, larger machine partitions and higher power densities increase the likelihood of node failures. On the one hand, large problem sizes make it essentially impossible to store most intermediate results due to unacceptable I/O overheads. On the other hand, long dependency chains, many subexpressions and relatively small memory spaces, especially on GPUs, make it difficult to organise complicated calculations by hand (via many nested loops and logic trees, for example), as such calculations, represented as a graph, can have many hundreds of thousands of vertices. The situation is especially complicated when not all intermediate results fit into some small memory space in the hierarchy, as it may then be necessary to force recomputation of intermediate objects to be able to perform the computation at all.

In order to optimise these workflows without writing complicated nested loops and logic trees by hand, I'm interested in automatically building dependency hierarchies for these sorts of calculations as directed graphs with (automatically determined) costs as edge weights, subject to various constraints. A goal might be, for example, to decide automatically whether it's better to keep an intermediate result resident in GPU memory, move it to CPU memory temporarily, write it out to disk and read it back in, or simply to recompute it. An added benefit is that a large, complicated calculation can naturally be split into subgraphs, which can then be partitioned into completed and uncompleted subsets, providing automatic checkpointing and some level of fault-tolerance upon node failure.
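The keep/spill/recompute decision for a single intermediate result can be sketched as a cost comparison. The bandwidths and costs below are hypothetical placeholders, not measured numbers, and a real scheduler would evaluate such costs as edge weights over the whole dependency graph rather than per object.

```python
def plan_intermediate(size_gb, recompute_s, gpu_free_gb,
                      cpu_bw_gbs=20.0, disk_bw_gbs=2.0):
    """Pick the cheapest way to make an intermediate result available
    again at its next use. Costs are in seconds; bandwidths are
    illustrative round numbers, not benchmarks."""
    options = {}
    if size_gb <= gpu_free_gb:
        options["keep_on_gpu"] = 0.0  # stays resident: no transfer needed
    # Spilling costs a copy out now and a copy back at next use:
    options["spill_to_cpu"] = 2 * size_gb / cpu_bw_gbs
    options["spill_to_disk"] = 2 * size_gb / disk_bw_gbs
    options["recompute"] = recompute_s
    return min(options, key=options.get)

# A cheap-to-recompute object that does not fit in GPU memory:
print(plan_intermediate(size_gb=40, recompute_s=0.5, gpu_free_gb=16))
# -> recompute
# An expensive object that does fit:
print(plan_intermediate(size_gb=8, recompute_s=120.0, gpu_free_gb=16))
# -> keep_on_gpu
```

Checkpointing then falls out naturally: once a subgraph's decisions and completed vertices are recorded, a restarted job only needs to re-evaluate the uncompleted subset.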

A further prospect might be the organisation of such calculations on completely heterogeneous machines, where a single job might employ different types of partitions (CPU, GPU, FPGA, high-performance I/O) for parts of the workload, requesting and freeing resources as required, adding another level of complexity.

## Curriculum vitae

- Permanent Research Associate HPC/A-Lab University of Bonn
- Postdoctoral Researcher HISKP University of Bonn
- Research Associate Humboldt-University Berlin / DESY Zeuthen

- PhD Physics Humboldt-University Berlin
- MSci Physics Imperial College

- Extended Twisted Mass Collaboration

## Publications

## Selected Software Development

### hadron R package for the analysis of data produced in lattice field theory

Co-author

### tmLQCD software suite for Hybrid Monte Carlo simulations of Wilson lattice QCD

Co-author

### QPhiX high performance kernel library for Wilson lattice QCD on x86 architectures including Intel KNC and Intel KNL

Contributor

### QUDA high performance lattice QCD library for NVIDIA and other GPUs

Contributor