BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//pretalx//pretalx.com//juliacon-2026//speaker//7KTZYQ
BEGIN:VTIMEZONE
TZID:CET
BEGIN:STANDARD
DTSTART:20001029T040000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=10
TZNAME:CET
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:20000326T030000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=3
TZNAME:CEST
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
UID:pretalx-juliacon-2026-EFQ8YD@pretalx.com
DTSTART;TZID=CET:20260813T160000
DTEND;TZID=CET:20260813T161500
DESCRIPTION:GPU vendor libraries like cuBLAS deliver excellent performance 
 but come with hard constraints: limited type support\, fixed operators\, a
 nd single-vendor hardware. The Julia GPU ecosystem addresses portability t
 hrough an abstraction layer: KernelAbstractions.jl lets developers write k
 ernels that compile across CUDA\, AMD\, Intel\, and Apple backends. But ab
 straction currently comes at a cost: KA.jl lacks the intrinsics needed for
  fully optimized performance. Warp operations on extended types\, vectoriz
 ed memory access\, and explicit memory ordering for inter-workgroup commun
 ication are missing. We introduce [KernelForge.jl](https://epilliat.github
 .io/KernelForge.jl)\, a Julia package proving that portable GPU code can m
 atch vendor-optimized performance. To make this possible\, we developed [K
 ernelIntrinsics.jl](https://epilliat.github.io/KernelIntrinsics.jl)\, whic
 h exposes the missing primitives (currently CUDA-only\, though the approac
 h extends to other backends). KernelForge.jl provides kernels for matrix-v
 ector and vector-matrix products with arbitrary operators and bitstype ele
 ments\, mapreduce over 1D and 2D arrays\, prefix scan\, and copy operation
s. Each is implemented as a single kernel using vectorized loads/stores to
  saturate memory bandwidth\, warp-level reductions\, and strong memory ord
 ering for correct inter-workgroup synchronization. Ben
 chmarks show that KernelForge.jl matches or exceeds both proprietary CUDA 
 functions and NVIDIA's CUB library. The kernels are stable and tested\, th
 ough views and strided arrays are not yet supported. The goal is straightf
 orward: open-source GPU code that is efficient\, flexible\, and eventually
  portable.
DTSTAMP:20260502T093959Z
LOCATION:Room 3
SUMMARY:KernelForge.jl: Fast\, Flexible GPU Computing Toward Portability - 
 Emmanuel Pilliat
URL:https://pretalx.com/juliacon-2026/talk/EFQ8YD/
END:VEVENT
END:VCALENDAR
