Dictionaries.jl - for improved productivity and performance
2021-07-28 , Purple

Dictionaries.jl presents an alternative interface for dictionaries in Julia, for improved productivity and performance. During this talk we'll learn how to use Julia's data manipulation tools (such as indexing, broadcasting, map, filter, reduce, etc) with dictionaries and explore some implementation decisions made in this package. We will end with applications, including recent work on tabular data with primary and/or grouping keys.


This talk will be divided into roughly three sections.

Motivation

Julia is an awesome language for manipulating data, with excellent built-in functionality that is easy to extend by packages and users. Arrays, sets and dictionaries form the basis of core data structures necessary for a wide range of workloads. Of the three, Julia's AbstractArray interface is most extensive — supporting a wide range of data structures and a rich set of operations, designed around the same core set of functionality (primarily, indexing and iteration).

On the other hand, AbstractSet and AbstractDict do not support such a wide range of operations (like broadcasting, map, or reduce) and the interface for a user to create a new, fully-functional AbstractDict is not as clear cut or simple as it is for an array. Dictionaries.jl is an attempt to remedy this situation, by applying the learnings of the AbstractArray interface to create a new AbstractDictionary interface. By using Dictionaries.jl, users can experience improved programmer productivity as well as signficantly faster execution for many operations (especially with analytics-style workloads).

Implementation

A Dictionaries.jl AbstractDictionary differs from Julia's AbstractDict in three main ways:

  1. Dictionaries iterate values, like an array, instead of key-value pairs.
  2. The indices (or keys) of dictionary are a special kind of dictionary, much like the indices (or keys) of an array is a special kind of array.
  3. Dictionaries (and their indices) by default iterate in a well-defined order based on insertion, rather than quasi-randomly based on how the hashmap was constructed.

Since the indices of a dictionary are distinct, they naturally form a set (in the mathematical sense). In Dictionaries.jl this type is represented by AbstractIndices <: AbstractDictionary (unlike for Dict, there was no obvious alternative spelling of Set available). Every AbstractIndices has the special property that it's values are the same as it's keys, so if i ∈ indices then indices[i] is just i.

From these alone, natural definitions of broadcast, map, filter and reduce follow, for both indices (i.e. "sets") or dictionaries. For example, the map operation preserves the indices and maps the values to new values. You can even map a set of indices/keys into a new dictionary.

This property leads to Dictionaries.jl's primary efficiency gain. The indices of dictionaries (i.e. the expensive part) can be shared between different dictionaries. For example, the provided hashmap Dictionary can share its hash Indices with other dictionaries. The values are stored in a (mostly) dense array, so operating on all the values with an operation like map, filter and reduce is as fast as it is for a similarly sized array. One can use map or broadcast with multiple similar dictionaries (i.e. those that share compatible "tokens"), co-iterating values together at speed and with zero hash lookups.

Applications

We will turn our attention to some example applications of this interface, first highlighting the convenience and speed of Dictionary for some common analytics tasks.

We will then see how Dictionaries.jl is used in conjunction with other packages, for example how SplitApplyCombine.jl's group operation now supports a simple and easy split-apply-combine workflow.

Finally, we will explore recent work in TypedTables.jl which uses Dictionaries.jl to provide a table with a primary key (enabling easy lookup of rows based on data rather than an array index). The idea is similarly extended to grouped/partitioned data tables and their grouping keys.

I first used Julia to perform research during postdocs in quantum simulation, before using it in industry with a team working on various geospatial applications. Nowadays we use Julia at ELARA AI to simulate, analyze and optimize real-world businesses. I have contributed packages including StaticArrays, CoordinateTransformations, Rotations, TypedTables, SplitApplyCombine, AcceleratedArrays and Dictionaries.