JuliaCon 2023

Exploring synthesis of flexible neural machines with Zygote.jl
07-28, 15:00–15:30 (US/Eastern), 32-124

We were able to successfully synthesize simple compact high-level neural machines via a novel algorithm for neural architecture search using flexible differentiable programming capabilities of Zygote.jl.

We are studying an interesting class of flexible neural machines where single neurons process streams of vectors with tree-shaped indices rather than streams of numbers (see https://anhinga.github.io/ and https://arxiv.org/abs/1712.07447, "Dataflow Matrix Machines and V-values: a Bridge between Programs and Neural Nets"). These machines are expressive enough to be used for general-purpose stream-oriented programming (dataflow or functional reactive programs). They are also expressive enough to flexibly and compositionally modify their own weight-connectivity matrices on the fly. They can be thought of as flexible attention machines, with neuron inputs computing linear combinations of vectors with tree-shaped indices and thus working as attention devices.

We would like to explore a variety of machine learning experiments using this class of neural machines. In particular, it would be nice to be able to use differentiable programming and gradient methods in those experiments. Traditional machine learning frameworks would necessitate difficult engineering work reshaping those flexible tree-shaped vectors into flat tensors.

Zygote.jl provides flexible differentiable programming facilities and, in particular, allows users to take gradients with respect to variables aggregated inside a tree. That makes it a perfect fit for our machine learning experiments with flexible neural machines.

We consider the following novel algorithm for neural architecture search. Consider neural architectures where connections between neural modules are gated with scalar multipliers. Start with a sufficiently large network and perform sparsifying training of this network on a set of test problems using sparsifying regularization and gradually pruning inter-module connections with low gating weights.

In our case, the neural modules are powerful "super-neurons" of dataflow matrix machines. In our exploratory set of experiments we consider a duplicate character detector problem (a hand-crafted compact neural machine solving this problem is described in Section 3 of https://arxiv.org/abs/1606.09470, "Programming Patterns in Dataflow Matrix Machines and Generalized Recurrent Neural Nets").

The history of our preliminary experiments is documented at https://github.com/anhinga/DMM-synthesis-lab-journal/blob/main/history.md

These are moderate scale experiments successfully performed on a home laptop CPU.

Zygote.jl is quite awesome, but using it is still not without some difficulties. In particular, while "backpropagation theorem" promises us reverse-mode gradients at cost of the corresponding forward computation multiplied by a small constant, in practice we often observe larger slowdowns (presumably this happens when the forward computation is friendly to compiler optimizations, while the induced gradient computation is not). When this slowdown is still moderate one can successfully use Zygote.jl. When the slowdown starts to exceed two-three orders of magnitude, it makes sense to switch to the stochastic approximation of gradient computations via OpenAI flavor of evolution strategies even if one is computing sequentially and not on a cluster (see https://arxiv.org/abs/1712.06564, "On the Relationship Between the OpenAI Evolution Strategy and Stochastic Gradient Descent" by Xingwen Zhang, Jeff Clune, and Kenneth O. Stanley).

In our case, we have initially bumped into heavy slowdowns and memory swaps when trying to perform this neural synthesis in the most ambitious prior-free manner starting with a fully-connected recurrent machine with long backpropagation through time.

Because we have been aiming to synthesize feedforward circuits with local recurrences, we decided to scale our ambition down and rely on some priors and start with fully connected feedforward machines with skip connections and local recurrences.

Then we have obtained acceptable performance and have been able to synthesize very compact, human-readable neural machines with rather remarkable generalization properties using a tiny training set.

These are very exciting preliminary results where nice-looking compact, human-readable neural circuits approximating a hand-written neural program have been automatically synthesized for the first time.

We hope that this line of exploration will be continued, by ourselves and by other people. (The speaker is looking for collaboration opportunities.)

main theme

My main research focus has been to find, study, and develop a high-level computer programming formalism
allowing to deform programs in a continuous fashion
(just as one can deform recurrent neural networks in a continuous fashion).

I was trying to approach this problem from various angles: doing research in the mathematics of continuous domains
for denotational semantics of programming languages, studying theoretical neuroscience, and so on.

Finally, our research collaboration was starting to see the hints of the possible solution from approximately Fall of 2012,
and the formalism for continuously deformable programs was developed by our research collaborations in 2015-2016.

These days I am continuing to focus on studying and experimenting with this formalism and I am hoping that it will
eventually stop being a purely research subject and will become a technology.

I maintain a Web site for this formalism here: https://anhinga.github.io/

I also maintain a list of open problems and promising research and technological directions and interdisciplinary
connections related to this formalism: https://www.cs.brandeis.edu/~bukatin/dmm-collaborative-research-agenda.pdf

brief timeline

My background in software, mathematics, and science goes back to Soviet Union, to machine code, Algol-60, Fortran-4,
and to punched cards; to Pushchino, the Biological Center of the Soviet Academy of Sciences, and to
the Mathematical class of Moscow High School number 7.

I started to focus on continuous models of computations in college, then emigrated to USA, worked as
a scientific programmer for Alex Rashin at Biosym Technologies doing computational geometry and computational chemistry
(I was the second author on several papers in The Journal of Physical Chemistry and Biophysical Chemistry
from that period), then did a PhD in Computer Science at Brandeis University focusing of mathematics
of continuous domains for denotational semantics (this is a copy of my 2002 PhD thesis: https://arxiv.org/abs/1512.03868).

In parallel, I worked in various places in the software industry. There I had a chance to first touch
dataflow programming, Common Lisp, and actor model of programming.

This century I have been working at a geographic software company (ownership of it went through acquisitions, spin-offs,
and such, so one very long employment looks like several shorter ones from a formal viewpoint),
while doing research in parallel. My research focus was mostly on theoretical neuroscience for a while,
then a research collaboration on deep connections between partial metrics and fuzzy equalities,
and finally (from approximately Fall of 2012) a research collaboration
on deep connections between partial contradictions and vector semantics of programming languages
and, from 2014-2015 on, a series of research collaborations on neuromophic computations with linear streams.

Starting from about 2011 I was gradually moving from just being a lover of computer animation and electronic music to
my first attempts to make some visual, audio, and audio-visual art of my own, and I am continuing to make new computer art every few months or so.
It involved playing a bit with MilkDrop 2 for WinAmp, mixing music a bit with Serato DJ,
doing a lot of animations and a bit of sound work in Processing, doing a tiny bit of that in Clojure,
and finally working a bit with shader-based GLSL animations.


Linear streams are streams for which linear combinations of several streams are defined. If one makes sure that
linear computations and general (often non-linear) computations are interleaved, then one gets continuously deformable programs which
we call Dataflow matrix machines (DMMs). Another way to obtain DMMs is to start with recurrent neural networks
and replace streams of numbers with linear streams and allow complicated "activation functions"
(that is, transformations of linear streams) with arbitrary arity.

This setup also allows these neural machines to have very natural and flexible self-modification facilities.
There are toy implementations in Processing with mutable matrices, and the reference implementation in Clojure with
immutable streams of tree-shaped "flexible-rank tensors". The reference paper on DMMs is https://arxiv.org/abs/1712.07447

I hope to create the next application of this formalism in Julia
(both Julia Flux and JAX are the machine learning frameworks which finally have sufficient flexibility
we need to take full advantage of the flexibility of DMMs). I started to switch to Julia in the early 2020.
I recently sketched a three-page note outlining my hopes in this sense: https://www.cs.brandeis.edu/~bukatin/towards-practical-dmms.pdf


In June 2022 I was able to perform the first successful experiments in DMM training and in program synthesis/circuit synthesis/DMM synthesis
via neural architecture search using Zygote.jl. The synthesized DMMs had pretty
impressive generalization properties. In September 2022 I was able to open source those experiments:


Those experiments are the subject of the proposed talk.

I am looking for collaborators. Creating an issue at one of my GitHub repositories is the easiest way to contact me.