JuliaCon 2020 (times are in UTC)

How similar do two strings look? Visual distances in Julia
07-31, 19:20–19:30 (UTC), Purple Track

We will describe VisualStringDistances.jl, a Julia package that provides notions of distance between two strings based on how similar they look when printed, and discuss a possible application: an automated check in the General registry's auto-merge process to help prevent malicious lookalike registrations.


The Julia package VisualStringDistances.jl provides several notions of distance between two strings based on how they are rendered by GNU Unifont. For example, capital-eye (“I”) and lowercase-ell (“l”) are close together, while “a” and “X” are far apart, even though each pair differs by exactly one character. By comparing strings visually, this package provides a means of quantifying how easily two strings might be confused when read by a human.

This measure of distance is calculated by means of “unbalanced optimal transport” via the package UnbalancedOptimalTransport.jl, which will also be discussed. Loosely speaking, this measures the cost of moving “mass” (i.e. black pixels in the printed representation of a string) from one place to another in order to transform the printed representation of one string into that of another, while allowing mass to be destroyed or created at some cost. This will be illustrated visually in the talk to provide an understanding of this interesting technique, which has been applied in a variety of fields (image registration, economics, traffic flows, etc.).
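
The idea of moving, creating and destroying pixel mass can be sketched on a very small scale. The Python snippet below is only a toy stand-in, not the algorithm used by VisualStringDistances.jl or UnbalancedOptimalTransport.jl: it assumes hand-drawn 3x5 bitmaps (the hypothetical TOY_GLYPHS, not GNU Unifont glyphs), treats every black pixel as one indivisible unit of mass, and solves the resulting partial-matching problem exactly by dynamic programming. The names ink_pixels and transport_cost and the penalty parameter rho are illustrative assumptions.

```python
from functools import lru_cache
from math import hypot

# Hypothetical hand-drawn 3x5 bitmaps ('#' = black pixel). These are NOT
# GNU Unifont glyphs; they only serve to illustrate the idea.
TOY_GLYPHS = {
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "l": ("##.", ".#.", ".#.", ".#.", "###"),
    "a": ("...", ".##", "#.#", "#.#", ".##"),
    "X": ("#.#", "#.#", ".#.", "#.#", "#.#"),
}


def ink_pixels(rows):
    """Return the (row, column) coordinates of the black pixels."""
    return [(r, c) for r, row in enumerate(rows)
            for c, ch in enumerate(row) if ch == "#"]


def transport_cost(src, dst, rho=1.0):
    """Minimum cost of turning pixel set `src` into pixel set `dst`.

    Each source pixel is either moved onto an unused target pixel
    (cost: Euclidean distance) or destroyed (cost: rho). Target pixels
    left uncovered are created (cost: rho each). Exact, but exponential
    in len(dst), so only suitable for tiny examples.
    """
    n, m = len(src), len(dst)

    @lru_cache(maxsize=None)
    def best(i, used):
        if i == n:
            # Create every target pixel that received no mass.
            return rho * (m - bin(used).count("1"))
        # Option 1: destroy source pixel i.
        cost = rho + best(i + 1, used)
        # Option 2: move source pixel i onto any unused target pixel j.
        for j in range(m):
            if not used & (1 << j):
                d = hypot(src[i][0] - dst[j][0], src[i][1] - dst[j][1])
                cost = min(cost, d + best(i + 1, used | (1 << j)))
        return cost

    return best(0, 0)


def glyph_distance(x, y, rho=1.0):
    """Toy visual distance between two single characters in TOY_GLYPHS."""
    return transport_cost(ink_pixels(TOY_GLYPHS[x]),
                          ink_pixels(TOY_GLYPHS[y]), rho)


print("I vs l:", glyph_distance("I", "l"))
print("a vs X:", glyph_distance("a", "X"))
```

With these toy bitmaps, “I” and “l” differ by a single pixel and so come out much closer than “a” and “X”, mirroring the behaviour described above. A real implementation would work with fractional mass on full string renderings, which is why an optimal transport solver is needed rather than brute-force matching.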

The motivating application of VisualStringDistances.jl is establishing automated checks for Julia’s General registry of packages in order to flag new packages for manual review. A malicious agent might try to register a package whose name looks very similar to that of some popular package, and then recommend it to users in online posts or tutorials. A user who copy-pastes the name, or the code that adds the package, might not realize that the name differs from that of the popular package. To help prevent this scenario, an automated check can be added to the General registry that blocks automated merging of new packages whose names look similar to those of existing packages.

A related task is that of measuring “typo-similarity” to prevent automerging of packages whose names are likely to be entered by mistake when typing the name of another package. This will be discussed as well, time permitting.
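
The abstract does not say which metric is used for typo-similarity. As one plausible illustration only, the Python sketch below uses the optimal string alignment variant of Damerau-Levenshtein distance, which counts insertions, deletions, substitutions and swaps of adjacent characters. The function names osa_distance and typo_lookalikes and the threshold max_distance are hypothetical, and the package names are just sample inputs.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between strings a and b.

    Counts single-character insertions, deletions, substitutions and
    transpositions of two adjacent characters.
    """
    prev2 = None                      # row i - 2 of the DP table
    prev = list(range(len(b) + 1))    # row i - 1 of the DP table
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            substitution = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,                  # deletion
                         cur[j - 1] + 1,               # insertion
                         prev[j - 1] + substitution)   # match or substitution
            # Adjacent transposition, e.g. "ul" typed instead of "lu".
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]


def typo_lookalikes(new_name, existing_names, max_distance=1):
    """Existing names within max_distance edits of new_name (hypothetical check)."""
    return [name for name in existing_names
            if name != new_name and osa_distance(new_name, name) <= max_distance]


print(typo_lookalikes("Fulx", ["Flux", "Plots", "DataFrames"]))
```

A registry check along these lines would hold back a new name for manual review whenever the returned list is non-empty; the right threshold would need to be tuned against real package names.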

PhD student studying quantum information theory at Cambridge University
