PyConDE & PyData Berlin 2024

Encoding Charactersets - may the force be with you
2024-04-23 , A1

Despite Unicode, understanding and repairing garbled text (Mojibake)
remains an ongoing task in IT projects.
Garbled text is the result of text being decoded using an unintended character encoding.

Example: Die UTF-8 Selbsthilfegruppe trifft sich heute Abend im grünen Saal
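
How such garbling arises can be reproduced in a few lines. The following is a minimal sketch, not taken from the talk itself: it encodes the example sentence as UTF-8, decodes the bytes with the wrong codec (ISO-8859-1 is assumed here as the unintended encoding), and then reverses the mistake.

```python
# Minimal sketch: produce Mojibake and repair it again.
# Assumption: the unintended encoding is ISO-8859-1 (latin-1).
original = "Die UTF-8 Selbsthilfegruppe trifft sich heute Abend im grünen Saal"

utf8_bytes = original.encode("utf-8")      # 'ü' becomes the two bytes C3 BC
garbled = utf8_bytes.decode("latin-1")     # each byte is read as one character

# Reverse the wrong step: back to the raw bytes, then decode as UTF-8.
repaired = garbled.encode("latin-1").decode("utf-8")

print(garbled)
print(repaired)
```

The round trip only works this cleanly because latin-1 maps every byte to exactly one character; with other wrong codecs, some bytes may already be lost and the repair can be partial.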

This talk explains how to analyze and fix such encoding problems with Python.
The talk covers the following topics:

  • the difference between graphemes and code points
  • Unicode vs. UTF-8
  • decoding and encoding files, database result sets, and REST API calls
  • the unicodedata module
  • handling of ISO charsets in the Unicode world

This talk shows short code examples for real-world problems and solutions.
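
To illustrate the distinction between graphemes and code points, and between Unicode and UTF-8, here is a minimal sketch that uses only the standard unicodedata module. The sample character is an assumption chosen for illustration, not an example from the talk.

```python
import unicodedata

# One visible character (grapheme) can be stored as one or two code points.
composed = "\u00fc"        # ü as a single precomposed code point
decomposed = "u\u0308"     # u followed by COMBINING DIAERESIS

print(len(composed), len(decomposed))
print([unicodedata.name(c) for c in decomposed])

# Normalization maps both spellings to the same form.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, nfd == decomposed)

# Unicode assigns code points; UTF-8 is one way to serialize them as bytes.
print(composed.encode("utf-8"))
```

Both strings render as the same grapheme, yet they compare unequal until they are normalized, which is a common source of subtle bugs when comparing or deduplicating text.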


The talk covers the following points, each with code examples:

  • Explore the nuances of text representation: graphemes vs. code points. Unravel the essence of characters in computing.
  • Delve into the realm of character encoding: Unicode vs. UTF-8. Decipher the key distinctions shaping text globalization.
  • Master the art of data interchange. Decode and encode files, database results, and REST API responses seamlessly for universal communication.
  • Unlock the power of the unicodedata module. Learn how it aids in character information retrieval and manipulation in Python.
  • Navigate the challenges of ISO charsets in the Unicode era. Gain insights into effective strategies for handling diverse character sets.
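
One possible strategy for input whose charset is unknown can be sketched as follows. This is an illustrative assumption, not the speaker's method: the helper name decode_with_fallback and the candidate list are hypothetical. UTF-8 is tried first because it rejects most non-UTF-8 byte sequences, and ISO-8859-1 comes last because it accepts every byte.

```python
def decode_with_fallback(data: bytes, encodings=("utf-8", "cp1252", "iso-8859-1")):
    """Return (text, encoding) for the first codec that decodes strictly.

    Hypothetical helper for illustration; the candidate order is an assumption.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


utf8_data = "grünen Saal".encode("utf-8")
latin1_data = "grünen Saal".encode("iso-8859-1")

print(decode_with_fallback(utf8_data))
print(decode_with_fallback(latin1_data))
```

For files, the same idea applies by reading in binary mode and passing the bytes to the helper. A successful decode does not prove the guess was right, so the result still deserves a plausibility check.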

Expected audience expertise: Domain:

Intermediate

Expected audience expertise: Python:

Novice

Abstract as a tweet (X) or toot (Mastodon):

Understanding and repairing garbled text (Mojibake) with Python

See also: Encoding Charsets (2.2 MB)

Has worked for over 25 years at ORDIX AG as a consultant on databases and programming, focusing on Python programming in recent years. Gives lectures for beginners and advanced customers, and has lots of fun turning difficult but everyday problems into edutainment.