The idea of a "\(\mathsf{C}\)set" was introduced in an earlier post. To recap, a \(\mathsf{C}\)set is a functor from a small category \(\mathsf{C}\) to \(\mathsf{Set}\). When implementing this idea on a computer, we restrict it to a functor from a finitely presented category \(\mathsf{C}\) to \(\mathsf{FinSet}\). The object of this blog post is to explain how we implement such \(\mathsf{C}\)sets as a data structure in the Julia programming language.
To describe a functor from a finitely presented category, it suffices to give the value of the functor on the objects of the category and on the generating morphisms. Knowing this data, we can derive the value of the functor on all other morphisms, because all other morphisms are finite compositions of the generating morphisms. Notice that this is a nonunique way of serializing the data of a Cset; we could be working with a presentation of \(\mathsf{C}\) with redundant generators, for instance.
In Catlab, an object of \(\mathsf{FinSet}\) is just a wrapper around a natural number, which we will write in mathematical notation as \([n]\). A map from \([n]\) to \([m]\) is a length\(n\) array of integers ranging from \(1\) to \(m\). Therefore, in order to represent a functor from the category defined bytes
@present TheoryGraph(FreeSchema) begin
V::Ob
E::Ob
src::Hom(E,V)
tgt::Hom(E,V)
end
we could use a data structure that looks like this:
struct Graph
nv::Int
ne::Int
src::Vector{Int}
tgt::Vector{Int}
end
The vertices of g::Graph
are 1:g.nv
and the edges are 1:g.ne
. If j
is in 1:g.ne
, then g.src[j]
is the source of j
, and g.tgt[j]
is the target. This is actually the way that a common data structure for graphs stores the data of a graph (an edge list).
However, if that was all there was to Csets, there would not be much reason to use them. Sure, from an aesthetic standpoint it is elegant to unify the implementation of many different data structures, but handwriting a data structure like Graph
is not so hard. Writing functions to add and modify vertices and edges is also not so hard. The advantages of implementing Csets generically start to really shine when we talk upgrade the edge list implementation to an adjacency list implementation, which adds an index that caches the inverses of src
and tgt
. This data structure looks like
struct Graph
nv::Int
ne::Int
src::Vector{Int}
tgt::Vector{Int}
src_inv::Vector{Vector{Int}}
tgt_inv::Vector{Vector{Int}}
end
If i
is between 1 and g.nv
(where g::Graph
), then g.src_inv[i]
is a list of all of the edges that have i
as their source, and g.tgt_inv[i]
is a list of all the edges with i
as their target. Now, whenever you add or modify a vertex or edge, you have to update the indexes as well, which can be somewhat tricky to do correctly. However, with the generic Cset implementation in Catlab.jl
, the code to add and modify each data structure is automatically generated, and the only thing that an implementor of a new Csetbased structure has to do is give memorable names to the automatically generated functions (here as in the rest of CS, the one thing that cannot be automated is naming!).
To sum up, while from a mathematical perspective the purpose of Csets is that they generalize many interesting mathematical objects, from a scientific computing perspective the purpose of Csets is that tedious and error prone code that keeps track of indexes can be automatically generated.
There's an old computer science joke which says that there are two problems in computer science:
Cache invalidation
Naming
Offbyone errors
CSets solves the first of the two.
The question we will answer in this section is how do automatically derive a data structure like Graph
from a presentation like TheoryGraph
.^{[1]}
Our approach is to look at the structure like a database (the viewpoint in Databases are categories). Each object in the category corresponds to a table, and each morphism out of that object corresponds to a column in that table. Therefore, for a graph, we have two tables. The table for vertices has no columns, and the table for edges has two columns. To store the tables, we turn to the StructArrays.jl package.^{[2]} For instance, the table for edges is stored using the type
StructVector{NamedTuple{(:src,:tgt),Tuple{Int,Int}}}
This type of data structure should be familiar to most data scientists. In Python, it is what is provided by pandas, and in R it is called a data frame. Logically, it is a list of records, however it is actually implemented by a list of columns, each column holding one field of the record.
All of the tables for the objects are stored in a NamedTuple with keys of the names of the objects. So, the entire type that stores that data of a graph looks like
NamedTuple{(:V,:E),Tuple{
StructVector{NamedTuple{(),Tuple{}}},
StructVector{NamedTuple{(:src,:tgt),Tuple{Int,Int}}}}}
The indexes we store in a more straightforward way, just a NamedTuple with a field corresponding to each indexed morphism, so for graphs it would be
NamedTuple{(:src,:tgt), Tuple{Vector{Vector{Int}},Vector{Vector{Int}}}}
I suppose that we could store the indexes in StructArrays grouped by codomain, just like the data is stored in StructArrays grouped by domain, but we haven't done that yet, for no particular reason.
To actually implement this, we use the following data type (note: this does not appear in Catlab because in Catlab we also deal with attributes).
struct CSet{CD <: CatDesc, Indexed, TablesType <: NamedTuple, IndexType <: NamedTuple}
tables::TablesType
index::IndexType
function CSet{CD, Indexed}()
tables = make_empty_tables(CD)
index = make_empty_index(CD,Indexed)
CSet{CD,Indexed,typeof(tables),typeof(index)}(tables,index)
end
end
The idea is that TablesType
and IndexType
are uniquely specified by CD
and Indexed
, however it is not possible to do sophisticated enough type wrangling to write this into the struct, so instead the types are generated whenever we construct a new CSet. The types of TablesType and IndexType are in the type of a concrete CSet so that when the compiler encounters a function that handles CSets, it has all the type information it needs to compile the function to something very fast.
But what is CatDesc
? In order to parameterize CSet
by a (presentation of a) category, we need a parameter that represents a (presentation of a) category. One choice for this would be a struct that looked like this:
struct CatDesc1
ob::Vector{Symbol}
hom::Vector{Symbol}
dom::Vector{Int}
codom::Vector{Int}
end
In this structure, the domain and codomain of hom[i]
are ob[dom[i]]
and ob[codom[i]]
, respectively. However, in Julia not all values can parameterize types. For reasons I do not quite understand, a tuple of ints or a tuple of symbols can parameterize a type, but a tuple of tuples cannot, and anything with vectors is Right Out (this at least makes sense because vectors are mutable). One approach would be to have a separate parameter in CSet
for tuples of ob
, hom
, dom
, and codom
, and indeed the first iteration of Csets did exactly that. But with some cleverness, we can fit that into single parameter.
struct CatDesc{Ob,Hom,Dom,Codom}
end
We do this because Julia allows types to be parameterized by other types, so while Julia rejects
const Graph = CSet{CatDesc1([:V,:E],[:src,:tgt],[2,2],[1,1]),(:src,:tgt)}
it accepts
const Graph = CSet{CatDesc{(:V,:E),(:src,:tgt),(2,2),(1,1)},(:src,:tgt)}
We can even recover the behavior of the first example by overloading getproperty
:
function Base.getproperty(T::Type{CatDesc{Ob,Hom,Dom,Codom}},key::Symbol)
where {Ob,Hom,Dom,Codom}
@match key begin
:ob => Ob
:hom => Hom
:dom => Dom
:codom => Codom
_ => getfield(T,key)
end
end
Then in functions that take in a CSet
, we can inspect the category \(\mathsf{C}\) using those properties, as in the following example
function print_tables(cs::CSet{CD}) where {CD}
for ob in CD.ob
println("$ob = ")
println(cs.tables[ob])
end
end
Now that you have the definition of CatDesc
, we leave it as an exercise to the reader to implement make_empty_tables
and make_empty_index
. Or you could look at the source code for Catlab!
[1]  In Catlab, Csets come in two flavors: those with attributes, and those without attributes. Instead of categories, attributed Csets are based off of things called "schemas". In later posts, we will discuss attributes, so this is not relevant now; we mention it so that the reader will not be caught offguard while reading the Catlab source code. 
[2]  One problem with StructVector s is that they do not support structs with no fields. So, for instance, the table for vertices, StructVector{NamedTuple{(),Tuple{}}} does not work. From the point of view of StructArrays.jl , this is perfectly reasonable: there was no reason for them to expect that zerosized tuples would be a desirable thing to have. In any case, we get around this in Catlab with a bit of a hack, but for the purposes of this blog post we will assume that StructVectors s support empty tuples.

CSet
sIn this section, we will give an example of how we write generic code to operate on Csets.
The basic operation of modifying a CSet
is adding data to it. To be more precise, let \(A\) be an object of \(\mathsf{C}\), and let \(F : C \to \mathsf{FinSet}\). We call an element of \(F(A)\) a part. In order to add a part to \(F(A)\), we must also modify the definition of \(F(g)\) for all maps \(g\) with domain \(A\). In other words, we must add an entire row to the table in slot \(A\) of \(F\). If \(x \in F(A)\), we call \(F(g)(x)\) a subpart of \(x\); in the data structure, the subparts are the elements in the rows of the tables. With all this in mind, the function to add a part has the following signature (or at least, this is one of the signatures, more signatures are available in the actual implementation).
function add_part!(cs::CSet, type::Symbol, subparts::NamedTuple)
To call this on a graph, we might invoke it like this:
add_edge!(g::Graph,s::Int,t::Int) = add_part!(g, :E, (src=s,tgt=t))
add_vertex!(g::Graph) = add_part!(g, :V, (;))
Now, we want to make this as efficient as handwritten functions. Therefore, before we talk about the implementation of add_part!
, let's look at how we might handwrite add_edge!
and add_vertex!
.
function add_edge!(g::Graph,s::Int,t::Int)
# make sure that the source and target exist
@assert 1 <= s <= length(g.tables.V) && 1 <= t <= length(g.tables.V)
push!(g.tables.E, (src=s,tgt=t))
part_num = length(g.tables.E)
# The indices are kept sorted for easy access
insertsorted!(g.index.src[s],part_num)
insertsorted!(g.index.tgt[t],part_num)
return part_num
end
function add_vertex!(g::Graph)
push!(g.tables.V, (;))
part_num = length(g.tables.V)
push!(g.index.src,[])
push!(g.index.tgt,[])
return part_num
end
That wasn't too bad, but there are some subtleties that could easily trip you up. For instance, in add_vertex!
you have to add new empty arrays to the index, so that when you add edges, the edges can be added to those arrays. Also, we want to additionally have functions that add many edges at once, or many vertices, and those come with more complexity. And in real library code (LightGraph.jl) there are even more things that must be dealt with. There are clear benefits to writing this once, generically.
On the other hand, it's unclear how we could write generic code that approaches the performance of this handwritten code. We would need to loop over the morphisms, and execute conditionals to check whether morphisms were indexed. While we might get the same asymptotic complexity, this would be a significant performance cost.
Luckily, Julia provides us a mechanism to have the best of both worlds: the performance of the handwritten function written in a generic way. To do this, we use an @generated
function to automatically generate different code for each type that a function could be called with. This is similar to a macro, but while a macro only has access to the syntax of its arguments, a @generated
function has access to the type of its arguments. This is best illustrated by an example, but also if you have not encountered macros in Julia, I encourage you to read up a little on them, because many of the techniques carry over to generated functions.
@generated function unrolled_addto!(x::StaticVector{T,n},y::StaticVector{T,n}) where {T,n}
code = Expr(:block)
for i in 1:n
push!(code.args, :(x[$i] += y[$i]))
end
code
end
StaticVector{T,n}
is a vector with element type T
and length n
. If x::StaticVector{Int,3}
and y::StaticVector{Int,3}
, then calling unrolled_addto!(x,y)
is equivalent to running
x[1] += y[1]
x[2] += y[2]
x[3] += y[3]
To summarize, an @generated
function generates code at compiletime (or in Julia terms, typedispatch time) based on the types of its arguments, and then that code is run at runtime. The code is then cached, so the next time the function is called with those types, it just runs the compiled code immediately.
Now we will define add_part!
. First of all, there is a trick that we use to get access to the object that we are adding a part to:
add_part!(cs::CSet,ob::Symbol,subparts::NamedTuple) = _add_part!(cs,Val(ob),subparts)
@generated function _add_part!(cs::CSet{CD,Indexed},::Val{ob},subparts::NamedTuple{columns}) where
{CD, Indexed, ob, columns}
...
end
Val is a useful little type with definition
struct Val{x}
Val(x) = new{x}()
end
_add_part!
itself has 5 parts.
We check that the subparts exist.
We add the new part with its subparts to the right table.
We add new empty indices for all of the indexed morphisms with codomain of the type of the part.
We add the new part to the indices for all of the indexed morphisms with domain of the type of the part.
We return the number of the new part.
Without further ado, we implement _add_part!
. We use some functions which do not exist, their definition is left to the reader.
@generated function _add_part!(cs::CSet{CD,Indexed},::Val{ob},subparts::NamedTuple{columns}) where
{CD, Indexed, ob, columns}
# Make sure that we have all the subparts
@assert get_outgoing_morphisms(CD,ob) == columns
code = Expr(:block)
# Part 1
for hom in columns
push!(code.args, :(@assert 1 <= subparts.$hom <= length(cs.tables.$(codom(CD,hom)))))
end
# Part 2
push!(code.args, :(push!(cs.tables.$ob, subparts)))
push!(code.args, :(part_num = length(cs.tables.$ob)))
# Part 3
for hom in CD.hom
if codom(CD,hom) == ob && hom ∈ Indexed
push!(code.args, :(push!(cs.index.$hom,[])))
end
end
# Part 4
for hom in columns
if hom ∈ Indexed
push!(code.args, :(insertsorted!(cs.index.$hom,part_num)))
end
end
# Part 5
push!(code.args, :(return part_num))
code
end
This generates code very similar to the handwritten code that we wrote above, automatically. We use this strategry to implement almost all of the generic operations on CSet
s, and this is what makes CSet
s competitive with handwritten data structures.
We have illustrated how the theory of Csets can be implemented in a concrete data structure, and then manipulated both generically and efficiently with @generated
functions. We hope that the reader has gained an appreciation for the advantages of the Cset data structure, and also more generally for how @generated
code brings new tools to the table. In future posts, we will talk about how we add attributes to Csets, and what that means both categorically and technically. To give a teaser, attributes allow us to talk about decorated graphs, where each edge has, for instance, an associated real number. So stay tuned!