With smaller "AI" models becoming easier to build, better chips, and on-device learning, is it now possible to build and train your own models for your own use? In this keynote, we'll explore lessons from small, medium and large-sized model personalization, driven by yourself and for yourself: a walk through what's possible, what's not, and what we should prioritize if we'd like AI & ML to be made for everyone.
In the past few years, web-based engineering software has been steadily gaining momentum over traditional desktop-based applications. It represents a significant shift in how engineers access, collaborate, and utilize software tools for design, analysis, and simulation tasks. However, converting desktop-based applications to web applications presents considerable challenges, especially in translating the functionality of desktop interfaces to the web. It requires careful planning and design expertise to ensure intuitive navigation and responsiveness.
JupyterLab provides a flexible, interactive environment for scientific computing. Despite its popularity among data scientists and researchers, the full potential of JupyterLab as a platform for building scientific web applications has yet to be realized.
In this talk, we will explore how its modular architecture and extensive ecosystem facilitate the seamless integration of components for diverse functionalities: from rich user interfaces, accessibility, and real-time collaboration to cloud deployment options. To illustrate the platform's capabilities, we will demo JupyterCAD, a parametric 3D modeler built on top of JupyterLab components.
Python 3.12 introduced a new low-impact monitoring API with PEP 669, which can be used to implement far faster debuggers than ever before. This talk covers the main advantages of this API and how you can use it to develop small tools.
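To make the shift concrete, here is a minimal sketch of the PEP 669 API (sys.monitoring, Python 3.12+) counting executed lines; it is an illustrative toy, not the tooling presented in the talk.

```python
import sys

# Minimal sketch of the PEP 669 API (Python 3.12+): register a LINE callback
# with sys.monitoring instead of the older, much slower sys.settrace hook.
mon = sys.monitoring
TOOL_ID = mon.DEBUGGER_ID  # one of the predefined tool ids

line_counts = {}

def on_line(code, line_number):
    """Called for every line executed while LINE events are enabled."""
    key = (code.co_filename, line_number)
    line_counts[key] = line_counts.get(key, 0) + 1

mon.use_tool_id(TOOL_ID, "demo-tracer")
mon.register_callback(TOOL_ID, mon.events.LINE, on_line)
mon.set_events(TOOL_ID, mon.events.LINE)

def work():
    total = 0
    for i in range(3):
        total += i
    return total

work()

mon.set_events(TOOL_ID, mon.events.NO_EVENTS)
mon.free_tool_id(TOOL_ID)
print(line_counts)
```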
Over the last year, not a day has passed without us hearing about a new generative AI innovation that will enhance some aspect of our lives. On a number of tasks, large probabilistic systems are now outperforming humans, or at least they do so "on average". "On average" means most of the time, but in many real-life scenarios "average" performance is not enough: we need correctness ALL of the time, for example when you ask the system to dial 911.
In this talk we will explore the synergy between deterministic and probabilistic models to enhance the robustness and controllability of machine learning systems. Tailored for ML engineers, data scientists, and researchers, the presentation delves into the necessity of using both deterministic algorithms and probabilistic model types across various ML systems, from straightforward classification to advanced Generative AI models.
You will learn about the unique advantages each paradigm offers and gain insights into how to most effectively combine them for optimal performance in real-world applications. I will walk you through my past and current experiences in working with simple and complex NLP models, and show you what kind of pitfalls, shortcuts, and tricks are possible to deliver models that are both competent and reliable.
The session will be structured into a brief introduction to both model types, followed by case studies in classification and generative AI, concluding with a Q&A segment.
The Jupyter stack has undergone a significant transformation in recent years with the integration of collaborative editing features: users can now modify a shared document and see each other's changes in real time, with a user experience akin to that of Google Docs. The underlying technology uses a class of data structures called Conflict-free Replicated Data Types (CRDTs), which automatically resolve conflicts when concurrent changes are made. This allows data to be distributed rather than centralized in a server, letting clients work as if the data were local rather than remote.
In this talk, we look at new possibilities that CRDTs can unlock, and how they are redefining Jupyter's architecture. Different use cases are presented: a suggestion system similar to Google Docs', a chat system allowing collaboration with an AI agent, an execution model allowing full notebook state recovery, and a collaborative widget model. We also look at the benefits of using CRDTs in JupyterLite, where users can interact without a server. This may be a great example of a distributed system where every user owns their data and shares it with their peers.
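As a toy illustration of the conflict-free idea (much simpler than the Yjs-style document CRDTs Jupyter actually uses), here is a grow-only counter in plain Python; merging replicas in any order converges to the same state.

```python
# Toy G-Counter: each replica only increments its own slot; merging takes the
# element-wise maximum, so concurrent updates never conflict. This is far
# simpler than the rich-text documents Jupyter synchronizes, but the
# "merge without coordination" property is the same.
class GCounter:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other):
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    @property
    def value(self):
        return sum(self.counts.values())

# Two replicas diverge offline, then merge in any order to the same state.
a, b = GCounter("alice"), GCounter("bob")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
assert a.value == b.value == 5
```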
Retrieval-augmented generation (RAG) has become a key application for large language models (LLMs), enhancing their responses with information from external databases. However, RAG systems are prone to errors, and their complexity has made evaluation a critical and challenging area. Various libraries (like RAGAS and TruLens) have introduced evaluation tools and metrics for RAGs, but these evaluations involve using one LLM to assess another, raising questions about their reliability. Our study examines the stability and usefulness of these evaluation methods across different datasets and domains, focusing on the effects of the choice of the evaluation LLM, query reformulation, and dataset characteristics on RAG performance. It also assesses the stability of the metrics on multiple runs of the evaluation and how metrics correlate with each other. The talk aims to guide users in selecting and interpreting LLM-based evaluations effectively.
Are you looking for a high-performance visualization component for the web? Need to filter, sort, pivot, and aggregate static/streaming data in real time? Daunted by the massive JS ecosystem? In this talk, we’ll build a high-performance web frontend using the open source library Perspective.
Embark on a journey to explore how Quarto Dashboard can enhance the narrative of your analysis from your Jupyter Notebook. This talk will show how to create cool interactive charts and graphs that bring your data to life, by using Quarto - an open-source scientific and technical publishing system.
Learn how to make your data communications more engaging and dynamic using Quarto Dashboard. Practical examples and simple explanations will guide you through the process, making it easy to understand and apply to your projects.
Graph Retrieval Augmented Generation (Graph RAG) is emerging as a powerful addition to traditional vector search retrieval methods. Graphs are great at representing and storing heterogeneous and interconnected information in a structured manner, effortlessly capturing complex relationships and attributes across different data types. Using open-weights LLMs removes the dependency on an external LLM provider while retaining complete control over data flows and over how the data is shared and stored. In this talk, we construct and leverage the structured nature of graph databases, which organize data as nodes and relationships, to enrich the depth and context of the retrieved information in RAG-based applications built with open-weights LLMs. We will show these capabilities with a demo.
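As a rough sketch of the retrieval step, the snippet below pulls a node's neighbourhood from Neo4j with the official Python driver and serializes it as context for a locally hosted open-weights LLM; the connection details, labels, and Cypher query are illustrative assumptions, not the schema used in the talk.

```python
# Hypothetical retrieval step for a Graph RAG pipeline: fetch a node and its
# immediate neighbourhood from Neo4j, then hand the serialized subgraph to an
# open-weights LLM as context. URI, credentials, labels and properties below
# are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def retrieve_context(entity_name: str) -> str:
    query = """
    MATCH (e:Entity {name: $name})-[r]-(neighbour)
    RETURN e.name AS entity, type(r) AS relation, neighbour.name AS neighbour
    LIMIT 25
    """
    with driver.session() as session:
        records = session.run(query, name=entity_name)
        return "\n".join(
            f"{rec['entity']} -[{rec['relation']}]- {rec['neighbour']}" for rec in records
        )

context = retrieve_context("Jupyter")
prompt = f"Answer using only this graph context:\n{context}\n\nQuestion: ..."
# `prompt` would then be sent to a locally hosted open-weights LLM.
```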
Jupyter-based environments are getting a lot of traction for teaching computing, programming, and data science. The narrative structure of notebooks has proven its value for guiding each student at their own pace to the discovery and understanding of new concepts or new idioms (e.g. how do I extract a column in pandas?). But these new pieces of knowledge tend to fade quickly and be forgotten: long-term acquisition of knowledge and skills takes reinforcement by repetition. This is the foundation of many online learning platforms like Webwork or WIMS that offer exercises with randomization and automatic feedback, and of popular "AI-powered" apps -- e.g. for learning foreign languages -- that use spaced-repetition algorithms designed by educational and neuroscience research to deliver just the right amount of repetition.
What if you could author such exercises as notebooks, to benefit from everything that Jupyter can offer (think rich narratives, computations, visualization, interactions)? What if you could integrate such exercises right into your Jupyter-based course? What if a learner could get personalized exercise recommendations based on their past learning records, without having to give away these sensitive pieces of information?
That's Jupylates (work in progress). And thanks to the open source scientific stack, it's just a small Jupyter extension.
JupyterLite is a JupyterLab distribution that runs entirely in the web browser, backed by in-browser language kernels. Unlike standard JupyterLab, where kernels run in separate processes and communicate with the client by message passing, JupyterLite uses kernels that run entirely in the browser, based on JavaScript and WebAssembly.
This means JupyterLite deployments can be scaled to millions of users without the need for individual containers for each user session: only static files need to be served, which can be done with a simple web server or a service like GitHub Pages.
This opens up new possibilities for large-scale deployments, eliminating the need for complex cloud computing infrastructure. JupyterLite is versatile and supports a wide range of languages, with the majority of its kernels implemented using Xeus, a C++ library for developing language-specific kernels.
In conjunction with JupyterLite, we present Emscripten-forge, a conda/mamba based distribution for WebAssembly packages. Conda-forge is a community effort and a GitHub organization which contains repositories of conda recipes and thus provides conda packages for a wide range of software and platforms. However, targeting WebAssembly is not supported by conda-forge. Emscripten-forge addresses this gap by providing conda packages for WebAssembly, making it possible to create custom JupyterLite deployments with tailored conda environments containing the required kernels and packages.
In this talk, we delve deep into the JupyterLite ecosystem, exploring its integration with Xeus Mamba and Emscripten-forge.
We will demonstrate how this can be used to create sophisticated JupyterLite deployments with custom conda environments and give an outlook for future developments like R packages and runtime package resolution.
For some natural language processing (NLP) tasks, depending on your production constraints, a simpler custom model can be a good contender to off-the-shelf large language models (LLMs), as long as you have enough high-quality data to build it. The stumbling block: how do you obtain such data? Going over some practical cases, we will see how we can leverage the help of LLMs during this phase of an NLP project. How can they help us select the data to work on, or (pre)annotate it? Which model is suitable for which task? What are common pitfalls, and where should you put your efforts and focus?
Many Python frameworks are suitable for creating basic dashboards or prototypes but struggle with more complex ones. Taking lessons from the JavaScript community, the experts at building UIs, we created a new framework called Solara. Solara scales to much more complex apps and compute-intensive dashboards. Built on the Jupyter stack, Solara apps and their reusable components run in the Jupyter notebook and on Solara's own production-quality server based on Starlette/FastAPI.
Solara has a declarative API that is designed for dynamic and complex UIs yet is easy to write. Reactive variables power our state management, which automatically triggers rerenders. Our component-centric architecture stimulates code reusability, and hot reloading promotes efficient workflows. With our rich set of UI and data-focused components, Solara spans the entire spectrum from rapid prototyping to robust, complex dashboards.
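A minimal sketch of what this looks like in code, based on Solara's public API (solara.reactive, the component decorator, and built-in widgets); it is a toy example rather than one from the talk.

```python
# Minimal Solara sketch: a reactive variable plus a component. Clicking the
# button updates `clicks`, which automatically triggers a rerender.
import solara

clicks = solara.reactive(0)

@solara.component
def Page():
    solara.Button(label=f"Clicked {clicks.value} times",
                  on_click=lambda: clicks.set(clicks.value + 1))
    solara.Markdown(f"Current count: **{clicks.value}**")

# In a notebook, displaying Page() renders the component; with the Solara
# server, running "solara run app.py" serves this same file as a web app.
```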
Pixi goes further than existing conda-based package managers in many ways:
- Implemented from scratch in Rust and ships as a single binary
- Integrates a new SAT solver called resolvo
- Supports lockfiles like poetry/yarn/cargo do
- Cross-platform task system (simple bash-like syntax)
- Interoperability with PyPI packages by integrating uv
- It's 100% open-source with a permissive licence
We’re looking forward to taking a deep dive together into what conda and PyPI packages are and how we seamlessly integrate the two worlds in pixi.
We will show you how you can easily set up your new project using just one configuration file and always have a reproducible setup in your pocket, meaning it will always run the same for your contributors, users, and CI machines (no more "but it worked on my machine!").
Using pixi's powerful cross-platform task system, you can replace your Makefile and a ton of developer documentation with just pixi run task!
We’ll also look at benchmarks and explain more about the differences between the conda and PyPI ecosystems.
This talk is for everyone who ever dealt with dependency hell.
More information about Pixi:
https://pixi.sh
https://prefix.dev
https://github.com/prefix-dev/pixi
Building scalable ETL pipelines and deploying them in the cloud can seem daunting. It shouldn't be. Leveraging the right technologies can make this process easy. We will discuss the whole process of developing a composable and scalable ETL pipeline centred around Dask, fully built with open-source tools, and how to deploy it to the cloud.
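As a flavour of what such a pipeline can look like, here is a minimal Dask sketch of an extract-transform-load step; the bucket paths and column names are placeholders, and the cloud deployment part (cluster provisioning) is only hinted at.

```python
# Minimal ETL sketch with Dask: read partitioned Parquet, transform lazily,
# and write the result back out. Paths and column names are placeholders.
import dask.dataframe as dd

df = dd.read_parquet("s3://my-bucket/raw/events/")          # extract
cleaned = (
    df[df["status"] == "ok"]
    .assign(amount_eur=lambda d: d["amount_cents"] / 100)    # transform
    .groupby("customer_id")["amount_eur"]
    .sum()
    .reset_index()
)
cleaned.to_parquet("s3://my-bucket/curated/revenue/")        # load

# Swapping the default local scheduler for a cloud cluster (e.g. via
# dask.distributed or a managed service) is what makes the same code scale.
```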
Spatial Transcriptomics, named the method of the year by Nature in 2020, offers remarkable visuals of gene expression across tissues and organs, providing valuable insights into biological processes. This talk presents the Squidpy library for analyzing and visualizing spatial molecular data, including demonstrations of gene expression visualization in mouse brain tissue.
When scaling geoscience workloads to large datasets, many scientists and developers reach for Dask, a library for distributed computing that plugs seamlessly into Xarray and offers an Array API that wraps NumPy. Featuring a distributed environment capable of running your workload on large clusters, Dask promises to make it easy to scale from prototyping on your laptop to analyzing petabyte-scale datasets.
Dask has been the de-facto standard for scaling geoscience, but it hasn’t entirely lived up to its promise of operating effortlessly at massive scale. This comes up in a few ways:
- Correctly chunking your dataset has a significant impact on Dask’s ability to scale
- Workers accidentally run out of memory due to:
  - Data being loaded too eagerly
  - Rechunking
  - Unmanaged memory
Over the last few months, Dask has addressed many of those pains and continues to do so through:
- Improvements to its scheduling algorithms
- A faster and more memory-stable method for rechunking
- First-of-its-kind logical optimization layer for a distributed array framework (ongoing)
Join us as we dive into real-world geoscience workloads, exploring how Dask empowers scientists and developers to run their analyses at massive scale. Discover the impact of improvements made to Dask, ongoing challenges, and future plans for making it truly effortless to scale from your laptop to the cloud.
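To make the chunking point above concrete, here is a small, hedged Xarray + Dask sketch; the file, variable name, and chunk sizes are placeholders.

```python
# How chunking enters a typical Xarray + Dask workflow. The file path, variable
# name and chunk sizes below are placeholders; good chunk sizes depend on the
# dataset layout and the analysis being run.
import xarray as xr

ds = xr.open_dataset("surface_temperature.nc", chunks={"time": 365})

# A reduction over time keeps the work parallel per chunk...
monthly = ds["t2m"].groupby("time.month").mean("time")

# ...whereas an operation along a different axis may require rechunking,
# one of the memory-heavy steps recent Dask releases have improved.
rechunked = ds["t2m"].chunk({"time": -1, "latitude": 100})

monthly.compute()
```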
Apache Arrow has become a de-facto standard for efficient in-memory columnar data representation. Beyond the standardized and language-independent columnar memory format for tabular data, the Apache Arrow project also has a growing set of supplementary specifications and language implementations. This talk will give an overview of the recent developments in the Apache Arrow ecosystem, including ADBC, nanoarrow, new data types, and the Arrow PyCapsule protocol.
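As a small illustration of the PyCapsule protocol, assuming a recent pyarrow (version 14 or later), the snippet below shows the dunder methods that let Arrow-aware libraries exchange data without copies.

```python
# The Arrow PyCapsule protocol in practice (assuming pyarrow >= 14): any object
# exposing __arrow_c_stream__ / __arrow_c_schema__ can be handed to another
# Arrow-aware library without copying or going through pandas.
import pyarrow as pa

table = pa.table({"city": ["Paris", "Lille"], "temp": [21.5, 18.0]})

# The dunder methods return PyCapsule objects wrapping the Arrow C interfaces.
schema_capsule = table.schema.__arrow_c_schema__()
stream_capsule = table.__arrow_c_stream__()

# Consumers (e.g. polars, duckdb, nanoarrow bindings) can build their own
# structures directly from objects implementing this protocol.
print(type(schema_capsule), type(stream_capsule))
```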
In this presentation, we introduce Mamba 2.0, the latest version of the multi-platform, language-agnostic package manager that has garnered significant adoption within the scientific open-source community for its speed and efficiency.
Did you know that all top PyPI packages declare their 3rd party dependencies? In contrast, only about 53% of scientific projects do the same. The question arises: How can we reproduce Python-based scientific experiments if we're unaware of the necessary libraries for our environment?
In this talk, we delve into the Python packaging ecosystem and employ a data-driven approach to analyze the structure and reproducibility of packages. We compare two distinct groups of Python packages: the most popular ones on PyPI, which we anticipate to adhere more closely to best practices, and a selection from biomedical experiments. Through our analysis, we uncover common development patterns in Python projects and utilize our open-source library, FawltyDeps, to identify undeclared dependencies and assess the reproducibility of these projects.
This discussion is especially valuable for enthusiasts of clean Python code, as well as for data scientists and engineers eager to adopt best practices and enhance reproducibility. Attendees will depart with actionable insights on enhancing the transparency and reliability of their Python projects, thereby advancing the cause of reproducible scientific research.
Polars is a dataframe library taking the world by storm. It is very runtime and memory efficient and comes with a clean and expressive API. Sometimes, however, the built-in API isn't enough. And that's where its killer feature comes in: plugins. You can extend Polars, and solve practically any problem.
No prior Rust experience is required; intermediate Python and general programming experience is assumed. By the end of the talk, you will know how to write your own Polars plugin! This talk is aimed at data practitioners.
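As a lighter-weight taste of extending Polars, and not the compiled Rust plugin mechanism the talk focuses on, here is the pure-Python expression-namespace API; it illustrates the same idea of hanging your own logic off Polars expressions.

```python
# A pure-Python way of extending Polars: a custom expression namespace.
# The talk's compiled Rust plugins go further (native speed, no Python in the
# hot loop), but the "attach your own logic to an expression" idea is the same.
import polars as pl

@pl.api.register_expr_namespace("units")
class UnitConversions:
    def __init__(self, expr: pl.Expr):
        self._expr = expr

    def celsius_to_fahrenheit(self) -> pl.Expr:
        return self._expr * 9 / 5 + 32

df = pl.DataFrame({"temp_c": [0.0, 21.5, 100.0]})
print(df.with_columns(temp_f=pl.col("temp_c").units.celsius_to_fahrenheit()))
```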
Almost all modern CPUs have a vector processing unit, making it possible to write faster code for a large category of problems, at the cost of portability - there are many different instruction sets in the wild! The xsimd library makes it possible to write portable C++ code that targets different architectures and sub-architectures. The specialization choice can be made at compile time or at runtime, using a provided dispatching mechanism. Intel, ARM, RISC-V and WebAssembly are supported, and the library has already been adopted by xtensor, Pythran, Apache Arrow and Firefox.
In this talk, we will go through everything open-source AI: the state of open-source AI, why it matters, the future of it and how you can get started with it.
In the rapidly evolving landscape of Artificial Intelligence (AI), open source and openness have emerged as crucial factors in fostering innovation, transparency, and accountability. Mistral AI's release of the open-weight Mistral 7B model has sparked significant adoption and demand, highlighting the importance of open source and customization in building AI applications. This talk focuses on the Mistral AI model landscape, the benefits of open source and customization, and the opportunities for building AI applications using Mistral models.
The astronomical community has built a good amount of software to visualize and analyze the images obtained with the James Webb Space Telescope (JWST). In this talk, I will present the open-source Python package Jdaviz. I will show you how to visualize publicly available JWST images and build the pretty color images that we have all seen in the media. Half the talk will be an introduction to JWST and Jdaviz, and half will be a hands-on session on a cloud platform (you will only need to create an account) or on your own machine (the package is available on PyPI).
NetworkX is arguably the most popular graph analytics library available today, but one of its greatest strengths - the pure-Python implementation - is also possibly its biggest weakness. Whether you're a seasoned data scientist or a new student of the fascinating field of graph analytics, you're probably familiar with NetworkX and interested in how to make this extremely easy-to-use library powerful enough to handle realistically large graph workflows that often exceed the limits of its pure-Python implementation.
This talk will describe a relatively new capability of NetworkX: support for accelerated backends, and how they benefit NetworkX users by allowing the library to finally be both easy to use and fast. Through the use of backends, NetworkX can also be incorporated into workflows that take advantage of similar accelerators, such as accelerated pandas (cudf.pandas), to finally make these easy-to-use solutions scale to larger problems.
Attend this talk to learn about how you can leverage the various backends available to NetworkX today to seamlessly run graph analytics on GPUs, use GraphBLAS implementations, and more, all without leaving the comfort and convenience of the most popular graph analytics library available.
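A minimal sketch of what backend dispatching looks like, assuming NetworkX 3.2+; the GPU-backed call requires the nx-cugraph package and a GPU, so it is shown commented out.

```python
# Dispatching a NetworkX algorithm to an accelerated backend (NetworkX >= 3.2).
# The "cugraph" backend assumes the nx-cugraph package and a GPU are available;
# without them, the same call runs on the default pure-Python implementation.
import networkx as nx

G = nx.erdos_renyi_graph(10_000, 0.001, seed=42)

# Default pure-Python execution:
pr = nx.pagerank(G)

# Same call, explicitly routed to a GPU backend (if installed):
# pr_gpu = nx.pagerank(G, backend="cugraph")
```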
One of the more mundane tasks in the business analytics world is to measure KPIs: averages, sums, ratios, etc. Typically, these are measured period over period, to see how they trend. If you're a data analyst, you've likely been asked to debug/explain a metric, because a stakeholder wants to understand why a number has changed.
This topic isn't well grounded in theory, and the answers we come up with can be lacklustre. In this talk, we discuss solutions to this very common problem. We will look at a methodology we have developed at Carbonfact, and the open-source Python tool we are sharing.
Transformers are everywhere: NLP, computer vision, sound generation and even protein folding. Why not in forecasting? After all, what ChatGPT does is predict the next word. So why isn't this architecture state-of-the-art in the time series domain?
In this talk, you will understand how Amazon Chronos and Salesforce's Moirai transformer-based forecasting models work, the datasets used to train them, and how to evaluate them to see if they are a good fit for your use case.
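As a hedged taste of the zero-shot workflow, here is a minimal Chronos sketch following the chronos-forecasting project's public examples; the toy series stands in for your own data, and the exact call signature may evolve with the package.

```python
# Zero-shot forecasting sketch with Amazon Chronos (chronos-forecasting
# package). The model id and call pattern follow the project's public
# examples; the toy series below stands in for real data.
import torch
from chronos import ChronosPipeline

pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map="cpu",
    torch_dtype=torch.float32,
)

context = torch.tensor([112.0, 118, 132, 129, 121, 135, 148, 148, 136, 119])
# forecast has shape (series, samples, horizon); take the median across samples.
forecast = pipeline.predict(context=context, prediction_length=6)
median = forecast[0].quantile(0.5, dim=0)
print(median)
```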
Understanding the effectiveness of various marketing channels is crucial to maximise the return on investment (ROI). However, the limitations of third-party cookies and an ever-growing focus on privacy make it difficult to rely on basic analytics. This talk discusses a pioneering project where a Bayesian model was employed to assess the marketing media mix effectiveness of WeRoad, the fastest-growing Italian tour operator.
The Bayesian approach allows for the incorporation of prior knowledge, seamlessly updating it with new data to provide robust, actionable insights. This project leveraged a Bayesian model to unravel the complex interactions between marketing channels such as online ads, social media, and promotions. We'll dive deep into how the Bayesian model was designed, discussing how we provided the AI system with expert knowledge, and presenting how delays and saturation were modelled.
We will also tackle aspects of the technical implementation, discussing how Python, PyMC, and Streamlit provided us with all the tools we needed to develop an effective, efficient, and user-friendly system.
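To give a flavour of the modelling involved, here is a toy PyMC sketch of a single channel with a saturating response curve; the priors, the saturation form, and the synthetic data are illustrative assumptions, not WeRoad's actual model.

```python
# Toy Bayesian marketing-response sketch: spend on one channel passed through
# a saturation transform inside a regression. Priors, the saturation form and
# the synthetic data are all illustrative assumptions.
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
spend = rng.uniform(0, 100, size=200)
sales = 50 * (1 - np.exp(-0.05 * spend)) + rng.normal(0, 2, size=200)

with pm.Model() as mmm:
    ceiling = pm.HalfNormal("ceiling", sigma=100)      # maximum incremental sales
    rate = pm.HalfNormal("rate", sigma=0.1)            # how fast the channel saturates
    sigma = pm.HalfNormal("sigma", sigma=5)
    mu = ceiling * (1 - pm.math.exp(-rate * spend))    # saturating response curve
    pm.Normal("sales", mu=mu, sigma=sigma, observed=sales)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=1)
```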
Attendees will walk away with:
- A simple understanding of the Bayesian approach and why it matters.
- Concrete examples of the transformative impact on WeRoad's marketing strategy.
- A blueprint to harness predictive models in their business strategies.
Aladin lets you visualize images of the sky or planetary surfaces, much like an astronomical "OpenStreetMap" app. The view can be panned and explored interactively. The ipyaladin widget -- which brings Aladin into the Jupyter Notebook environment -- extends these abilities with a Python API. Users can send astronomical data in standard formats back and forth between the viewer and their Python code. Such data can be images of the sky at different wavelengths, but also tabular data, complex shapes that characterize telescope observation regions, or even special sky features (such as the probability region for the provenance of a gravitational wave event).
With these existing features, and the work we are currently doing with the new development framework anywidget, ipyaladin is very close to a version 1.0.0. It is already used in its beta version in different experimental science platforms, for example in the ESCAPE European Science Cluster of Astronomy & Particle Physics project and in the experimental SKA (Square Kilometre Array, a telescope for radio astronomy) analysis platform.
In this presentation, we will share our feedback on developing a widget with anywidget compared to the bare ipywidget framework, and we will demonstrate the functionalities of the widget through scientific use cases.
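A minimal sketch of driving the widget from a notebook is shown below; the trait names (target, fov) follow the project's examples and may differ slightly between the ipywidget-based and anywidget-based versions.

```python
# Minimal ipyaladin sketch for a notebook cell. The trait names used here
# (target, fov) follow the project's examples and may differ slightly between
# widget versions.
from ipyaladin import Aladin

aladin = Aladin(target="M 31", fov=2)  # centre on the Andromeda galaxy, 2 deg field
aladin  # displaying the widget renders the interactive sky view

# Later cells can drive the same view from Python, e.g. re-pointing it:
# aladin.target = "Crab Nebula"
```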
Discover metalearners, a cutting-edge Python library designed for causal inference with particularly flexible and user-friendly MetaLearner implementations. metalearners leverages the power of conventional machine learning estimators and molds them into causal treatment effect estimators. This talk is targeted at data professionals with some Python and machine learning competence, guiding them toward optimizing interventions such as 'Which potential customers should receive a voucher to optimally allocate a voucher budget?' or 'Which patients should receive which medical treatment?' based on causal interpretations.
The video is available here: https://www.youtube.com/watch?v=yn1bR-BVfn8&list=PLGVZCDnMOq0pKya8gksd00ennKuyoH7v7&index=37
This talk introduces DataLab, a unique open-source platform for signal and image processing, seamlessly integrating scientific and industrial applications.
The main objective of this talk is to show how DataLab may be used as a complementary tool alongside Jupyter notebooks or an IDE (e.g., Spyder), and how it can be extended with custom Python scripts or applications.
sktime is a widely used scikit-learn compatible library for learning with time series. sktime is easily extensible by anyone, and interoperable with the pydata/numfocus stack.
This talk presents progress, challenges, and the newest features hot off the press in extending the sktime framework to deep learning and foundation models.
Recent progress in generative AI and deep learning is leading to an ever-exploding number of popular “next generation AI” models for time series tasks like forecasting, classification, segmentation.
Particular challenges of the new AI ecosystem are inconsistent formal interfaces, different deep learning backends, and vendor-specific APIs and architectures which do not match sklearn-like patterns well; every practitioner who has tried to use at least two such models at the same time (outside sktime) will have their own painful memories.
We show how sktime brings its unified interface architecture for time series modelling to the brave new AI frontier, using novel design patterns building on ideas from Hugging Face and scikit-learn to provide modular, extensible building blocks with a simple specification language.
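For readers new to sktime, the unified interface in its simplest form looks like this; the deep-learning and foundation-model wrappers discussed in the talk plug into the same fit/predict pattern, with NaiveForecaster used here only as a stand-in.

```python
# sktime's unified interface in its simplest form: the foundation-model and
# deep-learning wrappers discussed in the talk plug into this same
# fit/predict pattern. NaiveForecaster is used here as a stand-in.
from sktime.datasets import load_airline
from sktime.forecasting.naive import NaiveForecaster

y = load_airline()                                    # monthly airline passengers series
forecaster = NaiveForecaster(strategy="last", sp=12)  # seasonal naive baseline
forecaster.fit(y)
y_pred = forecaster.predict(fh=[1, 2, 3])             # forecast 3 steps ahead
print(y_pred)
```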
Markov chain Monte Carlo (MCMC) methods, a class of iterative algorithms that allow sampling from almost arbitrary probability distributions, have become increasingly popular and accessible to statisticians and scientists. But they run into difficulties when applied to multimodal probability distributions. These occur, for example, in Bayesian data analysis, when multiple regions in the parameter space explain the data equally well or when some parameters are redundant. Inaccurate sampling then results in incomplete and misleading parameter estimates.
In this talk, intended for data scientists and statisticians with basic knowledge of MCMC and probabilistic programming, I present Chainsail, an open-source web service written entirely in Python. It implements Replica Exchange, an advanced MCMC method designed specifically to improve sampling of multimodal distributions.
Chainsail makes this algorithm easily accessible to users of probabilistic programming libraries by automatically tuning important parameters and exploiting easy on-demand provisioning of the (increased) computing resources necessary for running Replica Exchange.
In their seminal paper "Why propensity scores should not be used for matching," King and Nielsen (2019) highlighted the shortcomings of Propensity Score Matching (PSM). Despite these concerns, PSM remains prevalent in mitigating selection bias across numerous retrospective medical studies each year and continues to be endorsed by health authorities. Guidelines for mitigating these issues have been proposed, but many researchers encounter difficulties in both adhering to these guidelines and in thoroughly documenting the entire process.
In this presentation, I show the inherent variability in outcomes resulting from the commonly accepted validation condition of Standardized Mean Difference (SMD) below 10%. This variability can significantly impact treatment comparisons, potentially leading to misleading conclusions. To address this issue, I introduce A2A, a novel metric computed on a task specifically designed for the problem at hand. By integrating A2A with SMD, our approach substantially reduces the variability of predicted Average Treatment Effects (ATE) by up to 90% across validated matching techniques.
These findings collectively enhance the reliability of PSM outcomes and lay the groundwork for a comprehensive automated bias correction procedure. Additionally, to facilitate seamless adoption across programming languages, I have integrated these methods into "popmatch," a Python package that not only incorporates these techniques but also offers a convenient Python interface for R's MatchIt methods.
In this talk, we'll look into why Insee had to go beyond the usual tools like JupyterHub. With data science growing, it has become important to have tools that are easy to use, can evolve as needed, and help people work together. The open-source software Onyxia brings a new answer by offering a user-friendly way to boost creativity in a data environment that makes heavy use of containerization and object storage.
MAPIE (Model Agnostic Prediction Interval Estimator) is your go-to solution for managing uncertainties and risks in machine learning models. This Python library, nestled within scikit-learn-contrib, offers a way to calculate prediction intervals with controlled coverage rates for regression, classification, and even time series analysis. But it doesn't stop there - MAPIE can also be used to handle more complex tasks like multi-label classification and semantic segmentation in computer vision, ensuring probabilistic guarantees on crucial metrics like recall and precision. MAPIE can be integrated with any model - whether it's scikit-learn, TensorFlow, or PyTorch. Join us as we delve into the world of conformal predictions and how to quickly manage your uncertainties using MAPIE.
Link to Github: https://github.com/scikit-learn-contrib/MAPIE
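A minimal example of MAPIE's regression interface, using synthetic data and the documented MapieRegressor API:

```python
# Minimal MAPIE regression example: wrap any scikit-learn regressor and get
# prediction intervals with a target coverage of 90% (alpha=0.1).
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from mapie.regression import MapieRegressor

X, y = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

mapie = MapieRegressor(estimator=LinearRegression(), method="plus", cv=5)
mapie.fit(X_train, y_train)
y_pred, y_pis = mapie.predict(X_test, alpha=0.1)  # y_pis holds lower/upper bounds
print(y_pred[:3], y_pis[:3, :, 0])
```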
Open Source Software, the backbone of today’s digital infrastructure, must be sustainable for the long term. Qureshi and Fang (2011) find that motivating, engaging, and retaining new contributors is what makes open source projects sustainable.
Yet, as Steinmacher, et al. (2015) identifies, first-time open source contributors often lack timely answers to questions, newcomer orientation, mentors, and clear documentation. Moreover, since the term was first coined in 1998, open source lags far behind other technical domains in participant diversity. Trinkenreich, et al. (2022) reports that only about 5% of projects were reported to have women as core developers, and women authored less than 5% of pull requests, but had similar or even higher rates of pull request acceptances to men. So, how can we achieve more diversity in open source communities and projects?
Bloomberg’s Women in Technology (BWIT) community, Open Source Program Office (OSPO), and Corporate Philanthropy team collaborated with NumFOCUS to develop a volunteer incentive model that aligns business value, philanthropic impact, and individual technical growth. Through it, participating Bloomberg engineers were given the opportunity to convert their hours spent contributing to the pandas open source project into a charitable donation to a non-profit of their choice.
The presenters will discuss how we wove together differing viewpoints: non-profit foundation and for-profit corporation, corporate philanthropy and engineers, first-time contributors and core devs. They will showcase why and how we converted technical contributions into charitable dollars, the difference this community-building model had in terms of creating a diverse and sustained group of new open source contributors, and the viability of extending this to other open source projects and corporate partners to contribute to the long-term sustainability of open source—thereby demonstrating the true convergence of tech and social impact.
NOTE:
[1] Qureshi, I, and Fang, Y. "Socialization in open source software projects: A growth mixture modeling approach." 2011.
[2] Steinmacher, I., et al. "Social barriers faced by newcomers placing their first contribution in open source software projects." 2015.
[3] Trinkenreich, B., et al. "Women’s participation in open source software: A survey of the literature." 2022.
Retrieval is the process of searching a large database for items (images, text, …) that are similar to one or more query items. A classical approach is to transform the database items and the query item into vectors (also called embeddings) with a trained model so that they can be compared via a distance metric. It has many applications in various fields, e.g. to build a visual recommendation system like Google Lens, or RAG (Retrieval-Augmented Generation), a technique used to inject specific knowledge into LLMs depending on the query.
Vector databases ease the management, serving and retrieval of the vectors in production and implement efficient indexes, to rapidly search through millions of vectors. They gained a lot of attention over the past year, due to the rise of LLMs and RAGs.
Although people working with LLMs are increasingly familiar with the basic principles of vector databases, the finer details and nuances often remain obscure. This lack of clarity hinders the ability to make optimal use of these systems.
In this talk, we will detail two examples of real-life projects (deduplication of real-estate adverts using the image embedding model DINOv2, and RAG for a medical company using the text embedding model Ada-2) and take a deep dive into retrieval and vector databases to demystify the key aspects and highlight the limitations: the HNSW index, a comparison of providers, metadata filtering (the related plunge in performance when filtering out too many nodes and how indexing partially helps), partitioning, reciprocal rank fusion, the performance and limitations of the representations created by SOTA image and text embedding models, …
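Stripped of any database, the core retrieval step looks like the toy snippet below: embed, normalize, and rank by cosine similarity. A vector database replaces this brute-force scan with an approximate index such as HNSW, and the random vectors stand in for embeddings from models like DINOv2 or Ada-2.

```python
# The core of vector retrieval, stripped down: embed database items and a
# query, then rank by cosine similarity. The random "embeddings" below stand
# in for outputs of a real embedding model.
import numpy as np

rng = np.random.default_rng(0)
db_embeddings = rng.normal(size=(10_000, 384))             # database vectors
query = rng.normal(size=384)                               # query vector

db_norm = db_embeddings / np.linalg.norm(db_embeddings, axis=1, keepdims=True)
q_norm = query / np.linalg.norm(query)

scores = db_norm @ q_norm                                  # cosine similarities
top_k = np.argsort(scores)[::-1][:5]                       # indices of 5 best matches
print(top_k, scores[top_k])
```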
Adaptive prediction intervals, which represent prediction uncertainty, are crucial for practitioners involved in decision-making. Having an adaptivity feature is challenging yet essential, as an uncertainty measure must reflect the model's confidence for each observation. Attendees will learn about state-of-the-art algorithms for constructing adaptive prediction intervals, which is an active area of research.
The EU Commission is likely to vote on the Cyber Resilience Act (CRA) later this year. In this talk we will look at the timeline for the new legislation, any critical discussions happening around implementation and most importantly, the new responsibilities outlined by the CRA. We’ll also discuss what the PSF is doing for CPython and for PyPI and what each of us in the Python ecosystem might want to do to get ready for a new era of increased certainty – and liability – around security.
The MedTech industry is undergoing a revolutionary transformation, with continuous innovations promising greater precision, efficiency, and accessibility. Oncology in particular, the branch of medicine that focuses on cancer, will benefit immensely from these new technologies, which may enable clinicians to detect cancer earlier and increase chances of survival. Detecting cancerous cells in microscopic photographs of cells (Whole Slide Images, aka WSIs) is usually done with segmentation algorithms, which neural networks (NNs) are very good at. While using ML and NNs for image segmentation is a fairly standard task with established solutions, doing it on WSIs is a different kettle of fish. Most training pipelines and systems have been designed for analytics, meaning huge columns of small individual records. In the case of WSIs, a single image is so huge that its file can be up to dozens of gigabytes. To allow innovation in medical imaging with AI, we need efficient and affordable ways to store and process these WSIs at scale.
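To illustrate why WSIs need special handling, here is a hedged sketch of tiled access with openslide-python; the file path is a placeholder, and openslide is just one possible reader, not necessarily the stack presented in the talk.

```python
# Why WSIs need special handling: the full image is far too large to load at
# once, so readers like openslide-python expose tiled, multi-resolution access.
# The file path is a placeholder.
import openslide

slide = openslide.OpenSlide("sample_slide.svs")
print(slide.dimensions)        # full-resolution size in pixels
print(slide.level_count)       # number of pyramid levels

# Read a single 512x512 tile at full resolution without touching the rest.
tile = slide.read_region(location=(10_000, 20_000), level=0, size=(512, 512))
tile = tile.convert("RGB")     # read_region returns an RGBA PIL image
```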
In this talk, we provide an update on the latest scikit-learn features that have been implemented in versions 1.4 and 1.5. We will particularly discuss the following features:
- the metadata routing API, which allows metadata to be passed around estimators;
- the TunedThresholdClassifierCV, which allows tuning operational decisions through a custom metric;
- better support for categorical features and missing values;
- interoperability of arrays and dataframes.
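As a hedged example of the threshold tuning feature, assuming scikit-learn 1.5 and synthetic, deliberately imbalanced data:

```python
# TunedThresholdClassifierCV (scikit-learn 1.5): tune the decision threshold
# of a classifier against a chosen metric instead of the default 0.5 cut-off.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TunedThresholdClassifierCV, train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = TunedThresholdClassifierCV(
    LogisticRegression(max_iter=1_000),
    scoring="balanced_accuracy",
)
model.fit(X_train, y_train)
print(model.best_threshold_, model.score(X_test, y_test))
```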
Scaling machine learning at large organizations like Renault Group presents unique challenges in terms of scale, legal requirements, and diversity of use cases. Data scientists require streamlined workflows and automated processes to efficiently deploy models into production. We present an MLOps pipeline based on Python, Kubeflow, and the GCP Vertex AI API, designed specifically for this purpose. It enables data scientists to focus on code development for pre-processing, training, evaluation, and prediction. This MLOps pipeline is a cornerstone of the AI@Scale program, which aims to roll out AI across the Group.
We chose a Python-first approach, allowing data scientists to focus purely on writing preprocessing- or ML-oriented Python code, while also allowing data retrieval through SQL queries. The pipeline addresses key questions such as prediction type (batch or API), model versioning, resource allocation, drift monitoring, and alert generation. It favors faster time to market with automated deployment and infrastructure management. Although we encountered pitfalls and design difficulties, which we will discuss during the presentation, this pipeline integrates with a CI/CD process, ensuring efficient and automated model deployment and serving.
Finally, this MLOps solution empowers Renault data scientists to seamlessly translate innovative models into production, and smooths the development of scalable and impactful AI-driven solutions.
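As a rough sketch of how such a pipeline is declared with the Kubeflow Pipelines v2 SDK, the snippet below defines two placeholder components, wires them into a pipeline, and compiles a spec that could then be submitted to Vertex AI Pipelines; component names and logic are illustrative, not Renault's actual code.

```python
# Minimal KFP v2 sketch: each step is a Python component, the pipeline wires
# them together, and the compiled spec can be submitted to Vertex AI Pipelines.
# Component names and logic are placeholders.
from kfp import dsl, compiler

@dsl.component(base_image="python:3.11")
def preprocess(raw_table: str) -> str:
    # placeholder preprocessing step
    return f"{raw_table}_cleaned"

@dsl.component(base_image="python:3.11")
def train(dataset: str) -> str:
    # placeholder training step returning a model identifier
    return f"model_trained_on_{dataset}"

@dsl.pipeline(name="demo-training-pipeline")
def training_pipeline(raw_table: str = "sales_raw"):
    cleaned = preprocess(raw_table=raw_table)
    train(dataset=cleaned.output)

compiler.Compiler().compile(training_pipeline, "training_pipeline.yaml")
```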
Rising concerns over IT's carbon footprint necessitate tools that gauge and mitigate these impacts. This session introduces CodeCarbon, an open-source tool that estimates computing's carbon emissions by measuring energy use across hardware components. Aimed at AI researchers and data scientists, CodeCarbon provides actionable insights into the environmental costs of computational projects, supporting efforts towards sustainability without requiring deep technical expertise.
This talk, from the main contributors of CodeCarbon, will cover the environmental impact of IT, the possibilities for estimating it, and a demo of CodeCarbon.
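A minimal example of the tracker API (the workload and project name are placeholders):

```python
# Minimal CodeCarbon usage: wrap a piece of work with an EmissionsTracker and
# get an estimate of the CO2-equivalent emitted (also written by default to an
# emissions.csv file).
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="demo")
tracker.start()

total = sum(i * i for i in range(10_000_000))   # the workload being measured

emissions_kg = tracker.stop()
print(f"Estimated emissions: {emissions_kg:.6f} kg CO2eq")
```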
Machine Learning practitioners build predictive models from "noisy" data, resulting in uncertain predictions. But what does "noise" mean in a machine learning context?