PyCon Lithuania 2024
"Data Processing with Apache Spark and Apache Iceberg" is a dynamic workshop designed to equip data professionals with advanced skills in managing and processing large-scale data. Participants will be introduced to the essential table formats before delving into Apache Iceberg's integration with Apache Spark. This session focuses on practical applications, including schema evolution and efficient file management, to enhance data processing efficiency and scalability. Ideal for data engineers and data scientists alike.
Polars is the new dataframe on the block taking the world by storm.
You'll learn:
- what Polars is, and what it can do for you
- Polars basics and core concepts (including expressions and lazy computation)
- how to work with different datatypes, and how the List datatype gives you superpowers
- interoperability with other tools: NumPy, SciPy, Arrow, pandas, Numba
- migrating from pandas
What better way to learn it than by attending a PyCon Lithuania tutorial, delivered by a Polars core dev?
Large language models are very widely used today for various applications: one can detect fraud, analyze and converse with own documents or create a commercial chatbot.
This "Aligning and Using an Open-Source LLM" workshop is an intensive four-hour exploration designed to demystify the world of Large Language Models (LLMs) and open-source frameworks. In the rapidly evolving landscape of artificial intelligence, the effective use of LLMs has become a crucial skill for machine learning engineers.
Introduction to the event and the day.
My mother loves Šaltibarščiai.
Raw Django doesn't take the first places when comparing the performance of Python web frameworks. However, it can be pretty fast if we identify the bottlenecks and find ways to avoid them. Comparing performance and implementation complexity before and after gives us an understanding of which features should be implemented and what can be skipped.
In this presentation, we will be exploring Observability on a Python web application. We will delve into a real-world application and discuss the importance of observability for services. We will focus on the three foundations of Observability: Logs, Metrics, and Tracing.
Discover some tools for observing and monitoring, in particular a demo of how to integrate DataDog into a Python service. The presentation will show examples of logs and metrics, and display how to trace a request.
In today's data-driven world, knowing how to gather and analyze information is more critical than ever. Join us for a compact session on using Python for crawling the web and solving real-world problems. We'll cover the basics, and then dive into a practical example of collecting data from the internet.
Join us for a presentation where we unravel the mysteries of anti-bot systems guarding websites, APIs, and mobile applications! 🌐📲
🛠️ What's in Store:
1/ Exploring the Defence Layers
2/ Anti-Bot Reputation Score Demystified
3/ Strategies for Evasion
After this talk, you'll emerge well-equipped with knowledge to navigate and comprehend the nuances of these protective measures! 🚀🔒
Enhance Django with HTMX: Elevate your web applications with seamless client-side interactivity and build dynamic, engaging experiences without page reloads - no React / Vue / Angular required!
Frameworks like Django use advanced Python features to provide devs with the magical tools they know and love.
In this live-coded talk we’ll take a look at a couple of Django snippets that use descriptors under the hood and we’ll use them as motivating examples for why Python needs descriptors.
By the end of the talk, you’ll understand how descriptors work and how they power Django behind the scenes.
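The core mechanism is small: a descriptor is an object whose class defines __get__ and/or __set__, placed as a class attribute. A minimal sketch (the Positive/Product names are illustrative, not from the talk):

```python
class Positive:
    """A data descriptor that validates values on assignment."""

    def __set_name__(self, owner, name):
        # Called once when the owning class is created
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self  # accessed on the class, not an instance
        return obj.__dict__[self.name]

    def __set__(self, obj, value):
        if value <= 0:
            raise ValueError(f"{self.name} must be positive")
        obj.__dict__[self.name] = value


class Product:
    price = Positive()  # attribute access is routed through the descriptor

    def __init__(self, price):
        self.price = price  # triggers Positive.__set__
```

Django model fields work along these lines: attribute access on a model instance is intercepted by descriptor protocol methods rather than plain attribute lookup.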
Django's async capabilities and batteries-included tooling make it an ideal framework for quickly building MVPs and iterating. This talk demonstrates building a document search MVP with Django templates, ChromaDB, and hosted large language models. It then shows how to refactor and scale it using Elasticsearch, Celery/RabbitMQ workers, React, self-hosted vLLM, and auth. With Django async, you can rapidly build, constantly improve, and deploy the latest AI models in your product.
Join Tadas Gedgaudas in an enlightening talk on revolutionizing web scraping with machine learning. Uncover how ChatGPT can adapt to website layout changes, making scraping more efficient and reducing maintenance needs. Delve into data structurization with ML, the seamless integration of ChatGPT for parsing, and its practical impact for developers.
Facing challenges with search capabilities in your web applications? Discover how the combination of OpenSearch, Python, and serverless architecture can be your solution. This talk provides hands-on examples, from building efficient queries to implementing production-ready practices. You'll gain actionable insights and the practical know-how to build and deploy robust, query-efficient search applications that solve real-world challenges.
Dive into the world of modern web development by fusing the power of Django and FastAPI. This talk will guide you through the process of building robust, scalable, and efficient APIs using Django Ninja, a web framework that combines Django's reliability and FastAPI's speed. We'll explore how to leverage Django's ORM and user authentication while enjoying FastAPI's performance and type checking. Whether you're a Django veteran looking to supercharge your APIs or a beginner eager to learn cutting-edge techniques, this talk is for you.
At Corner Case Technologies, we offer clients a service to migrate from on-premises infrastructure to AWS for various purposes, including high availability, cost optimization, and maintainability. Each migration is unique and necessitates thorough preparation for planning, execution, and subsequent development.
In this talk, I will present a specific use case of a migration that we conducted, with a particular focus on the lessons learned during the planning and execution phases.
A standard Django project involves working with multiple files and folders from the start. Let's see how working with a Django project changes when we have only one file. This solution automatically transforms Django into a microservice-oriented async framework with a "batteries included" philosophy.
A glance behind the curtain at the execution phase of Danske Bank's Lift and Shift journey to the public cloud. Let's dive deep into some of the technical challenges and the snek Python stack standing right in front, helping orchestrate cloud migration at scale.
So you've built an AI startup using Async Django - the MVP looks great and your handful of users love it. Now you need to clean up the MVP, so you can scale.
This is Part Two of building an AI startup with Async Django - we talk about moving from ChromaDB to OpenSearch/Elasticsearch, moving document processing steps to Celery/RabbitMQ, self-hosting via vLLM, migrating from Django templates to a React app, and better monitoring and logging.
At Mozilla, we maintain services that are used by millions of users daily. These services are the backbone of expanding Firefox and providing users with useful features, all while protecting privacy.
Learn about how one service, Merino, was planned to meet user needs at scale. This service provides users with search recommendations and suggestions from local and remote providers. Get some insights into how we develop, deploy, monitor, and maintain this modern Python service.
Unlock the full potential of web scraping with this session! From novice to virtuoso, join us on an exciting journey of data extraction as we unravel secrets and advanced techniques.
🔍 Session Highlights:
1/ Building Web Scrapers - The Art Unveiled 🛠️
2/ Proxy and Browser Farms Adventure 🌐
3/ Scrapoxy Orchestration - Elevate Your Scalability 🚀
4/ Protection Measures Disclosed 🔒
This concise session will immerse you in the fascinating world of web scraping.
Join us, and discover how Alexa's ability to recognize and convert speech into text can be used to create applications that break the monotony of your daily routine without the need to use a keyboard at all. We will teach you about the main components of Alexa, how to get started with the Developer console, and how to customize Alexa using our favorite language, Python, in a serverless way. We will also demonstrate how to incorporate Alexa into your daily developer life, and you might find that, after this talk, you'll want to build Alexa skills of your own.
Being an Open Source citizen. I'll be talking about the motivation behind Encode OSS. How we can work towards properly funding open source development, why that's valuable, and what we've been working on lately.
Reception drinks.
Look at your system's design! Are the major structures and technology choices the result of conscious decisions, or have they emerged as the system has evolved? Is the design stuck in a local minimum while ever more features are piled into the system? How can we design systems which withstand the major forces acting on a solution?
We’ll see why system designers should focus deliberately on the constraints and qualities of system design, and avoid getting too distracted by features.
In May 2023, there was a big buzz in the AI community as a brand-new programming language called 'Mojo' made its debut. People were talking about it in blog posts like: 'Mojo may be the biggest programming language advance in decades'.
In this talk, we'll dive into Mojo, checking out what it promises and where it stands right now, and also pondering what the future could hold for it.
Target Audience: Software Developers
Prerequisites: General knowledge about programming languages
Debuggers are indispensable tools for all Python developers, empowering them to conquer bugs and unravel complex systems. Let's create our own.
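At the heart of a home-grown Python debugger is the interpreter's tracing hook. A minimal sketch using sys.settrace (the demo function is illustrative; a real debugger would add breakpoints and stepping on top of this):

```python
import sys

executed = []  # (function name, line number) pairs we observed


def tracer(frame, event, arg):
    # The global trace function receives "call" events; returning itself
    # enables per-line tracing inside the called function.
    if event == "line":
        executed.append((frame.f_code.co_name, frame.f_lineno))
    return tracer


def demo():
    a = 1
    b = a + 1
    return b


sys.settrace(tracer)   # install the trace hook
result = demo()        # every executed line is reported to tracer
sys.settrace(None)     # uninstall it again
```

From here, a debugger is "just" bookkeeping: pausing at recorded positions, inspecting frame.f_locals, and resuming.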
We will create a new Python package from scratch using best practices and deploy it to pypi.org. We will also learn the benefits of, and how to use, bleeding-edge tools for code linting, unit testing, and deployment. Let's make the Python ecosystem even more awesome!
The data amount and the complexity of the queries are not particularly large in the semiconductor test industry. The challenge comes from using the STDF format, a binary file format with roots in the 1980s.
A method to make this data source available to modern data analysis tools (jupyter/streamlit) using the construct library will be discussed. The focus is on how the data can be collected, converted and made available in a fast and efficient way, using both pypy and cpython.
The speech will address Python's limitations in AI and how MAX Platform can overcome them by offering superior speed, seamless Python code execution, and hardware compatibility. It will inspire Pythonistas to explore MAX Platform and unlock new possibilities in AI development and beyond.
The hype for GenAI keeps rising. Nowadays, almost every company wants to adopt this technology in their business, but in order to successfully deliver a GenAI project, it takes much more than just figuring out what to ask ChatGPT. During the presentation, I'll introduce you to an AI platform that allows users to deliver GenAI projects with confidence.
New tools are changing how people program, and even who programs. Type hints, modern editor support and, more recently, AI-powered tools like GitHub Copilot and ChatGPT are truly transforming our workflows and improving developer productivity. But what does this mean for how we should be writing and designing our APIs and libraries?
Crafting scalable event-driven applications using Python can be a tricky endeavor, requiring careful consideration of various factors, from understanding synchronous and asynchronous network calls to tackling the Python Global Interpreter Lock (GIL) bottleneck and implementing robust auto-scaling strategies. This talk delves into advanced techniques and concepts for designing and implementing scalable event-driven applications with Python, empowering you to overcome these challenges effectively.
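One building block the abstract mentions, asynchronous handling of events, can be sketched with asyncio's queue-and-consumer pattern (the doubling "work" is purely illustrative):

```python
import asyncio


async def consumer(queue, results):
    # Pull events off the queue until a sentinel (None) arrives
    while True:
        item = await queue.get()
        if item is None:
            break
        results.append(item * 2)  # stand-in for real event processing


async def main():
    queue = asyncio.Queue()
    results = []
    task = asyncio.create_task(consumer(queue, results))
    for i in range(3):
        await queue.put(i)      # producer emits events
    await queue.put(None)       # signal the consumer to stop
    await task
    return results


output = asyncio.run(main())
```

Because the consumer awaits on the queue, the event loop stays free for other coroutines, which is what lets a single process multiplex many I/O-bound event streams despite the GIL.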
Let’s see how to build an SDK that works for years and is used by other developers. We’ll learn which patterns actually work, how mistakes made in the early stage affect the software years later, and how to make sure we don’t break users’ code when introducing changes.
Lunch
A newly developed Python package, deadcode, which detects and automatically fixes unused Python code, will be introduced. Real-world scenarios in which deadcode saves development time will be provided. The main features and options of the deadcode package will be presented, and it will be shown why this tool is superior to vulture. Some implementation details and complexities will also be discussed.
While Functional Programming gains traction, I'll showcase how OOP, done right, yields clean, efficient code. Explore a fresh perspective, gain insights, and reshape your coding approach.
Python's ecosystem is one of the best out there, and this is mainly due to its community and what lies inside its core, a C API.
Being partially written in C enables Python to interact with many other languages you might know, like C++, Rust, or Zig. But how does it work?
In this talk, you will come to understand how Python can embrace the power and performance of other languages in order to expose modules that improve the whole ecosystem.
Agenda:
- What is mutation testing?
- Why isn't test coverage enough?
- What are its pros and cons?
- How does it work (overview and details)?
- Simple example (finding and fixing bad test)
- Complex example (finding and fixing bad/missing test)
- Complex example (finding and fixing redundant code)
- FAQs -- history, why it's so CPU/RAM intensive, and more if time allows
- Unusual applications, if time allows
- Wrapup
- Q&A
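The core idea from the agenda can be shown in a few lines: a mutation tool makes a small change (a "mutant") and checks whether your tests notice. This hand-written sketch only illustrates the concept; real tools like mutmut generate mutants automatically:

```python
def add(a, b):
    return a + b


# A hypothetical mutant: the tool replaced "+" with "-"
def add_mutant(a, b):
    return a - b


def weak_test(fn):
    # Passes for both original and mutant: the mutant "survives",
    # revealing a weak test despite 100% line coverage
    return fn(0, 0) == 0


def strong_test(fn):
    # Fails for the mutant: the mutant is "killed"
    return fn(2, 3) == 5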
In this talk we'll review some of the changes we've made to Pydantic since 2.0 to push performance even further. This is possible largely because Pydantic chose to implement the core in Rust. We'll focus on two main topics:
- Come learn about optimizations Pydantic has been working on since 2.0
- Come see our draft ideas how Pydantic v3 could be even faster than v2
You should leave this talk excited about performance wins for your apps using Pydantic and inspired to try Rust in your own code.
I've been working full-time on a Python FOSS project for 503 days, so what did I learn?
Am I a better (Python) programmer?
Better teammate?
Better person?
In this talk I will share some lessons I learned over the course of these 503 days:
- how to get a tech job in this day & age
- how to put your ego aside
- how to deal with mistakes
- how to interact with users & contributors online
- how it feels to collaborate on a large codebase
As for the first 3 questions... Ask my colleagues!
Performing climate science within the context of climate change requires creative solutions to challenges such as data collection and storage management, optimizations for better memory and CPU usage, in addition to ensuring that analysis outputs are trustworthy. This talk will showcase xclim and finch, two pieces of software built for performing climate analyses on large datasets using Python, WPS, and the PANGEO software stack of technologies.
Learn about Python's memory handling, including:
- what pointers are, and why it matters
- what object IDs are, and what they mean
- how CPython can tell when you're done with an object, and what happens next
No C knowledge required!
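The ideas in the list above can be poked at from pure Python. A small sketch of identity and reference counting in CPython (getrefcount always reports one extra reference, for its own argument):

```python
import sys

a = [1, 2, 3]
b = a                    # 'b' is a second name for the same object
same_object = a is b     # identity comparison, like comparing pointers
print(id(a), id(b))      # id() exposes the object's identity (its address in CPython)

before = sys.getrefcount(a)
c = a                    # one more reference to the list
after = sys.getrefcount(a)

del c                    # dropping a reference; when the count hits zero,
                         # CPython frees the object immediately
```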
An overcomplicated project increases development and maintenance time.
If a complete redesign is not possible, we can distribute the complexity across the existing codebase.
Since AI assistants cannot help us with this task yet, we will discuss manual methods and tools that can be useful.
Using examples of real large projects, we will see that, despite different business types and geographical and social contexts, these projects share similar architectural mistakes, and we will discuss how they can be redesigned.
Columnar databases are on the rise! They provide an efficient and scalable data warehouse for many use cases including time series data. The problem? Many conventional database drivers and querying methods become the bottleneck for data processing and analytics within our client-side applications. Learn how to leverage open-source projects like Apache Arrow Flight and Apache Parquet alongside industry-standard analytics tools to build the foundations of a performant analytics application.
SQLAlchemy is one of the most popular ORM libraries in Python. In this talk, I will present caveats and gotchas that other Pythonistas may encounter while writing asynchronous backend applications using SQLAlchemy as an ORM. Mainly, we will focus on how SQLAlchemy handles transactions and connections to the database and what issues we may face because of it.
Sometimes you have a Python object and you want it somewhere else: maybe you want to save your data to disk and load it again tomorrow; or you want to send some complex parameters over the network.
I'll talk about pickle, the usual way to do this, including ways it can go wrong and how to extend it; compare it to other approaches like JSON or storing in a database; and I'll stick a little bit of theory into my talk too.
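The basic round trip looks like this (the data here is illustrative; remember that pickle should never be used on untrusted input, since unpickling can execute arbitrary code):

```python
import pickle

data = {"params": {"lr": 0.01}, "weights": [0.1, 0.2, 0.3]}

blob = pickle.dumps(data)        # serialize the object graph to bytes
restored = pickle.loads(blob)    # reconstruct an equivalent object
```

The bytes in blob can just as well be written to disk with pickle.dump or sent over the network, which is exactly the "somewhere else" use case above.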
At ArjanCodes, we use LLMs in various ways. They are part of the content we produce, we use them in platforms we develop, such as Learntail, they are integrated in automations that streamline our internal processes, and they’re part of our personal workflows, whether that’s for sales and marketing, operations, or software development.
In this talk, I’ll go over all of these use cases and share the things that we learned from working with LLMs and where LLMs provide us with the most value. Hopefully this will give you ideas for getting value out of LLMs in your own work.
Polars is an OLAP query engine that focuses on the DataFrame use case. Machines have changed a lot in the last decade, and Polars is a query engine written from scratch in Rust to benefit from modern hardware.
Effective parallelism, cache efficient data structures and algorithms are ingrained in its design. This talk will go through recent changes and plans of the project.
In 2023, we saw several libraries - which had previously only supported pandas - add support for other dataframe libraries such as Polars, Modin, and cuDF.
- How did they do it?
- Are there any drawbacks to how they did it?
- What comes next, and what other solutions are there?
This talk could be of interest to anyone working with dataframes. In particular, those maintaining or contributing to libraries which use dataframes will learn about how they can best support multiple dataframe libraries.
Machine learning (ML) model serialization helps to optimize inference latency, memory, and disk space requirements and provides more options for model deployment. We will explore the use cases that benefit the most from this technique and some drawbacks.
Learn to make practical decisions in data engineering with Python's vast ecosystem. Avoid blindly following market guidelines and consider the reality of your situation for better performance and architecture
Developer tools power many LLM-based chat and Retrieval Augmented Generation applications today. However, there is a non-trivial knowledge barrier for entrants that could hinder developer experience. Our discussion intends to offer actionable insights into building and maintaining generative AI solutions in a secure and economical way, thereby improving the developer experience in this Generative AI wave.
Data is the new oil, and like oil, one of its hazards is leaking and poisoning the surrounding environment. What happens if you pollute one of the datasets behind a decision-maker-facing dashboard? In this talk, I will explain the reemergence of the Write-Audit-Publish pattern and how you can achieve it using Apache Iceberg and Apache Spark.
Polars conquered dataframes, and now it is coming for machine learning! With Polars-powered feature-extraction and a best-of-the-class set of diagnostic tools, functime enables forecasting thousands of time series all at once, from the comfort of your laptop.
Though forecasting practitioners are the intended audience, the talk has something for every data scientist. With Polars, we can push the boundary for what "reasonable scale" means - and build a new generation of tools for machine learning.
A presentation about how we (a few local NLP enthusiasts) trained a Transformer language model to generate meaningful text in Lithuanian. Everything was based on volunteer work with a huge R&D flavor.
During this presentation, I will not only cover what kind of data we used to train this model and what results we got, but also present other initiatives we drive in the NLP field. I will try to make the presentation both technical and interactive.
Passing metadata such as sample_weight and groups through a scikit-learn cross_validate, GridSearchCV, or Pipeline to the right estimators, scorers, and CV splitters has been either cumbersome, hacky, or impossible. The new metadata routing mechanism in scikit-learn enables you to pass metadata through these objects. As a use case, we study how you can implement revenue-sensitive scoring while doing a hyperparameter search within a GridSearchCV object.
Introducing an open source library in Python: Quix Streams. It solves all the complexities of stream processing in a cloud-native package with a familiar pandas DataFrame API. This library lets you work with data as if it were static in your Jupyter Notebook, without any of the hassle associated with streaming technologies. Our mission is to bring masses of Python developers into streaming and make the journey as smooth as possible, so that building real-time applications using ML is not so difficult.
In 2023, vector databases are attracting great interest, as evidenced by Google Trends search statistics. This type of database has a direct link with Large Language Models (LLMs), such as ChatGPT, by enabling "Retrieval Augmented Generation" (RAG), for example. This approach offers the possibility of exploiting the power of a conversational agent using our own data.
But... Do you really need a vector database?
Machine learning models are a new kind of artifact to build, version, and deploy; explore their impacts on your architecture.
If you are GPU-poor, you need to become data-rich. I will give an overview of what we learned from looking at Alpaca, LIMA, Dolly, UltraFeedback, and Zephyr, and how we applied that to fine-tuning state-of-the-art open source LLMs called Notus and Notux by becoming data-rich.
In today's world, large language models (LLMs) are revolutionizing how we interact with technology, allowing us to have conversations, organize data, and write text with minimal human effort. However, it is likely that when using an LLM, you have received incorrect or insufficiently specialized answers.
For this reason, fine-tuning models that have been pre-trained with this large corpus of data is crucial to: (1) obtain better performance in the quality of responses, and (2) tune the model to a specific domain.
Python is a leading language of choice for the Databricks and ML ecosystem, alongside a delta tables stack leveraging Unity catalog to manage petabytes of structured data. To build and experiment with ML data and models, version control has become the backbone of modern machine learning (ML) projects, bringing critical aspects of reproducibility and experimentation to teams who are able to experiment in isolation, while still collaborating on projects.
Large language models (LLMs) often require huge compute resources to serve. This is a common challenge for those who want to avoid sharing their data with cloud API providers, or to deploy their stack in air-gapped environments. We will take a look at how the open source llama-cpp-python library opens the door to lower hardware requirements and simplifies deployment significantly.
Unlocking the value of data often hinges on the ability to communicate insights effectively to non-technical audiences. What if you could go beyond static charts and captivate your audience with animated data stories? Join us in this workshop to discover the power of animated storytelling using ipyvizzu-story, an innovative open-source presentation tool designed to work seamlessly within Jupyter Notebook and similar platforms.
Fast and accurate search results are a crucial component of any e-shop and thus can make the difference between high user satisfaction and user frustration. With recent advancements in vector search technologies, enhanced search systems have become more efficient, leading to better user experiences and improved conversion rates. In this talk, we’ll explore how to implement a hybrid search system for a non-English e-commerce site using Pinecone, a high-performance vector search engine.
How to use KDTree from sklearn library to prototype RAG (Retrieval-Augmented Generation) applications.
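The retrieval step of such a prototype fits in a few lines. A sketch with toy two-dimensional "embeddings" (in a real RAG prototype these would come from an embedding model):

```python
import numpy as np
from sklearn.neighbors import KDTree

# Toy document store; the embedding values are illustrative
docs = ["cats purr", "dogs bark", "cars drive"]
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])

tree = KDTree(embeddings)  # index the document embeddings

# Embed the user question the same way, then fetch nearest documents
query = np.array([[1.0, 0.05]])
dist, idx = tree.query(query, k=2)
retrieved = [docs[i] for i in idx[0]]
```

The retrieved texts would then be stuffed into the LLM prompt as context, which is the "augmented generation" half of RAG.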
With the latest advancements in Natural Language Processing and Large Language Models (LLMs), and big companies like OpenAI dominating the space, many people wonder: Are we heading further into a black box era with larger and larger models, obscured behind APIs controlled by big tech monopolies? I don’t think so, and in this talk, I’ll show you why.