Dr. Bartosz Kostrzewa
High Performance Computing & Analytics Lab (HPC/A)
Digital Science Center (DiCe)
Institut für Informatik, Universität Bonn
Friedrich-Hirzebruch-Allee 8, 53115 Bonn, Germany
E-Mail: kostrzewab [ at ] informatik.uni-bonn.de
Phone: +49 (0) 228 73 69692
Fields of Expertise
In the HPC/A team I can support current and potential users in the following topics, amongst others:
 lattice field theory & lattice QCD
 computational physics
 statistical data analysis, including the usage of high-performance shared- and distributed-memory frameworks
 optimisation of sparse stencil, matrix-vector, matrix-matrix and tensor kernels and solvers for various architectures
 code profiling and understanding of performance bottlenecks and their relation to computer architecture and/or network configuration and topology
 approaches for performance portability
 proper usage of batch systems and schedulers, in particular in relation to process and thread pinning and problem / machine topology
 authorship of proposals for computing projects (up to and including Tier-0 and PRACE / EuroHPC)
Research Interests
 Lattice Field Theory & Lattice QCD
 I pursue lattice QCD research in the context of the Extended Twisted Mass Collaboration (ETMC) and have been deeply involved both in our road towards simulations at the physical point and in various related physics projects. With my collaborators in Bonn, I've performed ab-initio calculations of multi-particle scattering on the lattice, studying the properties of interactions in both weakly-interacting and resonant channels with light and strange quark contributions, extracted from large correlator matrices computed using stochastic distillation techniques.
 Together with collaborators from Bonn, Rome, Pisa and Cyprus, I also employ lattice field theory methods to study Physics beyond the Standard Model (SM) of Particle Physics in the Frezzotti-Rossi model for non-perturbative elementary particle mass generation. This model offers a mechanism to (partially) resolve the hierarchy and naturalness problems of the SM while providing a way to give mass to all massive elementary particles. It can be understood as a dynamical alternative to the Higgs mechanism without the problems of Technicolor. The eventual goal of our research is to predict a clear bound (in the few-TeV range) on the energy region in which to look for the new particles that the model requires. As such, the model is falsifiable and can serve as a basis for model builders to find appropriate extensions of the SM which satisfy all electroweak precision tests.
 Performance Engineering & Performance Portability
 I've been involved in the design and optimisation of computational kernels and solvers occurring in lattice QCD which make use of hybrid OpenMP/MPI parallelisation as well as GPU accelerators, and have contributed to and co-authored software packages employed for this purpose on Europe's largest supercomputers. While the algorithms employed in lattice field theory have become substantially more complicated over the last few years, the hardware that these algorithms must run on has also become more diverse and more difficult to program for, to the extent that optimal strategies for one architecture can be exactly orthogonal to optimal strategies for another. In addition, many parallel execution units, deep memory hierarchies and diverse inter-process communication strategies have resulted in a situation where autotuning computational kernels (over some space of optimisation parameters) is now mandatory to achieve the highest possible performance for all problem sizes and machine configurations that are encountered in practice.
 This combination of factors means that the historical practice of hand-optimising computational kernels for each new architecture is increasingly unrealistic, and one would ideally like to develop strategies that minimise the amount of platform-specific code in software packages. To this end, I've been exploring frameworks such as Kokkos and HPX, various domain-specific languages, as well as the possibility of designing such a framework specifically for lattice field theory applications.
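The autotuning idea mentioned above can be sketched very compactly: benchmark a kernel over a small space of tuning parameters and keep the fastest variant. The following is a minimal, illustrative Python sketch, not code from any of the packages named here; the toy blocked matrix-vector kernel and the candidate block sizes are assumptions chosen purely for demonstration.

```python
import time

def blocked_matvec(A, x, block):
    """Toy dense matrix-vector product, processing rows in blocks of `block`."""
    n = len(A)
    y = [0.0] * n
    for i0 in range(0, n, block):
        for i in range(i0, min(i0 + block, n)):
            y[i] = sum(A[i][j] * x[j] for j in range(n))
    return y

def autotune(kernel, args, candidates, repeats=3):
    """Time `kernel` for each candidate tuning parameter and return the fastest.

    Taking the minimum over several repeats reduces timer and OS noise."""
    timings = {}
    for c in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            kernel(*args, c)
            best = min(best, time.perf_counter() - t0)
        timings[c] = best
    return min(timings, key=timings.get)

if __name__ == "__main__":
    n = 128
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    x = [1.0] * n
    block = autotune(blocked_matvec, (A, x), candidates=(8, 16, 32, 64))
    print("selected block size:", block)
```

In production settings the parameter space is of course multi-dimensional (tile sizes, thread counts, communication strategies) and the selected configuration is typically cached per problem size and machine, but the structure of the search is the same.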
 Large-scale Data Analysis
 In the context of contemporary simulations in lattice field theory, large problem sizes and complicated observables have increased both the size and complexity of the datasets to be analysed. In addition, the availability of faster machines has led to improvements in statistical precision, which in turn increase the relative impact of systematic uncertainties on analysis results, requiring these systematic errors to be studied in detail. These factors have turned data analysis in lattice field theory into a major endeavour requiring clean frameworks and significant computational resources. I've been involved in the design and implementation of several such analysis packages, mostly written in the R programming language, and continue to actively contribute to them.
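A workhorse technique in these analyses is resampling-based error estimation. As a hedged illustration (the packages referred to above are written in R; this is a generic Python sketch of a non-parametric bootstrap, with all data and parameter choices invented for the example):

```python
import random
import statistics

def bootstrap_error(samples, estimator=statistics.mean, n_boot=1000, seed=1234):
    """Non-parametric bootstrap: resample the data with replacement many times
    and take the spread of the estimator over the resampled datasets as the
    statistical error of the estimate on the original data."""
    rng = random.Random(seed)
    n = len(samples)
    estimates = []
    for _ in range(n_boot):
        resample = [samples[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    return estimator(samples), statistics.stdev(estimates)

if __name__ == "__main__":
    # Synthetic "measurements" standing in for, e.g., a correlator at fixed time.
    rng = random.Random(7)
    data = [rng.gauss(1.0, 0.1) for _ in range(200)]
    mean, err = bootstrap_error(data)
    print(f"mean = {mean:.4f} +/- {err:.4f}")
```

The appeal of the bootstrap is that the same machinery works for arbitrarily complicated derived observables (fits, ratios, interpolations): one simply applies the full analysis to each resampled dataset, which is also what makes these analyses computationally demanding at scale.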
 Workflow Automation & Fault-tolerance
 As we move towards exascale machines and ever larger problem sizes and complexities, the balance between achievable floating point performance and memory, network as well as I/O bandwidth is changing substantially. In addition, larger machine partitions and higher power densities increase the probability of node failures. On the one hand, large problem sizes make it essentially impossible to store intermediate results due to unacceptable I/O overheads. On the other hand, long dependency chains, many sub-expressions and relatively small memory spaces, especially on GPUs, make it difficult to organise complicated calculations by hand (via many nested loops and logic trees, for example) as these calculations, if represented by a graph, can have many hundreds of thousands of vertices. The situation is especially complicated if not all intermediate results fit into some small memory space in the memory hierarchy, as it might be necessary to actually force recomputation of intermediate objects in this case to even be able to perform the computation in the first place.
 In order to optimise these workflows without writing complicated nested loops and logic trees by hand, I'm interested in automatically building dependency hierarchies for these sorts of calculations as directed graphs with (automatically determined) costs as edge weights, subject to various constraints. A goal might be, for example, to decide automatically whether it's better to keep an intermediate result resident in GPU memory, move it to CPU memory temporarily, write it out to disk and read it back in, or simply to recompute it. An added benefit is that a large, complicated calculation can naturally be split into subgraphs, which can then be partitioned into completed and uncompleted subsets, providing automatic checkpointing and some level of fault-tolerance upon node failure.
 A further prospect might be the organisation of such calculations on completely heterogeneous machines, where a single job might employ different types of partitions (CPU, GPU, FPGA, high-performance I/O) for parts of the workload, requesting and freeing resources as required, adding another level of complexity.
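The keep / spill / recompute decision described above can be illustrated on a tiny scale: represent the calculation as a directed graph, order it topologically, and for each intermediate that is consumed later pick the cheapest retention strategy. A minimal Python sketch; all node names and cost numbers are invented for illustration and a real scheduler would derive the costs automatically from bandwidths and kernel timings.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical per-intermediate costs (arbitrary units):
#   keep      - occupy fast (e.g. GPU) memory until the last consumer runs
#   spill     - write to slower memory / disk and read back when needed
#   recompute - discard and rebuild from retained inputs when needed
costs = {
    "B": {"keep": 5.0, "spill": 3.0, "recompute": 8.0},
    "C": {"keep": 2.0, "spill": 3.0, "recompute": 1.0},
}

# Dependency graph: node -> set of prerequisites.
deps = {
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
}

def schedule(deps, costs):
    """Topologically order the graph and choose the cheapest retention
    strategy for every intermediate that has a choice attached."""
    order = list(TopologicalSorter(deps).static_order())
    plan = {node: min(costs[node], key=costs[node].get)
            for node in order if node in costs}
    return order, plan

if __name__ == "__main__":
    order, plan = schedule(deps, costs)
    print("execution order:", order)
    print("retention plan: ", plan)
```

Completed subgraphs of such a plan are natural checkpointing units: after a node failure, only the uncompleted subset needs to be (re)scheduled, which is the fault-tolerance angle mentioned above.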
Publications
Selected Software Development
 (co-author) hadron R package for the analysis of data produced in lattice field theory
 hadron on github
 hadron on CRAN
 (co-author) tmLQCD software suite for Hybrid Monte Carlo simulations of Wilson lattice QCD
 tmLQCD on github
 (contributor) QPhiX high performance kernel library for Wilson lattice QCD on x86 architectures including Intel KNC and Intel KNL
 QPhiX on github
 (contributor) QUDA high performance lattice QCD library for NVIDIA and other GPUs
 QUDA on github