BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//pretalx//pretalx.com//scipy-2026//talk//3GRQ87
BEGIN:VTIMEZONE
TZID:CST
BEGIN:STANDARD
DTSTART:20001029T030000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=10;UNTIL=20061029T080000Z
TZNAME:CST
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
END:STANDARD
BEGIN:STANDARD
DTSTART:20071104T030000
RRULE:FREQ=YEARLY;BYDAY=1SU;BYMONTH=11
TZNAME:CST
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:20000402T030000
RRULE:FREQ=YEARLY;BYDAY=1SU;BYMONTH=4;UNTIL=20060402T090000Z
TZNAME:CDT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
END:DAYLIGHT
BEGIN:DAYLIGHT
DTSTART:20070311T030000
RRULE:FREQ=YEARLY;BYDAY=2SU;BYMONTH=3
TZNAME:CDT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
UID:pretalx-scipy-2026-3GRQ87@pretalx.com
DTSTART;TZID=CST:20260716T112500
DTEND;TZID=CST:20260716T115500
DESCRIPTION:Scientists apply rigorous methods to their research\, but rarel
 y to the AI tools they use to write code. We tested different LLMs in com
 bination with domain-specific tools (including MCP servers and skills
 ) to find the optimal combination for writing complex domain-specific code
 . We created a quantitative proficiency test for Starsim\, a disease model
 ing framework\, and evaluated different combinations of models and tools. 
 While Claude Opus outperformed other models\, access to tools improved per
 formance more than choosing the best model. Thus\, to improve LLM performa
 nce on domain-specific problems\, we recommend developing a set of tools w
 ith the help of quantitative evaluation.
DTSTAMP:20260622T110114Z
LOCATION:Johnson Great Room
SUMMARY:Vibes\, meet rigor: Evaluating and improving AI performance on comp
 lex scientific code - Cliff Kerr
URL:https://pretalx.com/scipy-2026/talk/3GRQ87/
END:VEVENT
END:VCALENDAR