On the implementation of C-sets

By Owen Lynch on September 7, 2020

The idea of a "\(\mathsf{C}\)-set" was introduced in an earlier post. To recap, a \(\mathsf{C}\)-set is a functor from a small category \(\mathsf{C}\) to \(\mathsf{Set}\). When implementing this idea on a computer, we restrict it to a functor from a finitely presented category \(\mathsf{C}\) to \(\mathsf{FinSet}\). The object of this blog post is to explain how we implement such \(\mathsf{C}\)-sets as a data structure in the Julia programming language.

To describe a functor from a finitely presented category, it suffices to give the value of the functor on the objects of the category and on the generating morphisms. Knowing this data, we can derive the value of the functor on all other morphisms, because all other morphisms are finite compositions of the generating morphisms. Notice that this is a non-unique way of serializing the data of a C-set; we could be working with a presentation of \(\mathsf{C}\) with redundant generators, for instance.

In Catlab, an object of \(\mathsf{FinSet}\) is just a wrapper around a natural number, which we will write in mathematical notation as \([n]\). A map from \([n]\) to \([m]\) is a length-\(n\) array of integers ranging from \(1\) to \(m\). Therefore, in order to represent a functor from the category defined bytes

@present TheoryGraph(FreeSchema) begin
  V::Ob
  E::Ob
  src::Hom(E,V)
  tgt::Hom(E,V)
end

we could use a data structure that looks like this:

struct Graph
  nv::Int
  ne::Int
  src::Vector{Int}
  tgt::Vector{Int}
end

The vertices of g::Graph are 1:g.nv and the edges are 1:g.ne. If j is in 1:g.ne, then g.src[j] is the source of j, and g.tgt[j] is the target. This is actually the way that a common data structure for graphs stores the data of a graph (an edge list).

However, if that was all there was to C-sets, there would not be much reason to use them. Sure, from an aesthetic standpoint it is elegant to unify the implementation of many different data structures, but handwriting a data structure like Graph is not so hard. Writing functions to add and modify vertices and edges is also not so hard. The advantages of implementing C-sets generically start to really shine when we talk upgrade the edge list implementation to an adjacency list implementation, which adds an index that caches the inverses of src and tgt. This data structure looks like

struct Graph
  nv::Int
  ne::Int
  src::Vector{Int}
  tgt::Vector{Int}
  src_inv::Vector{Vector{Int}}
  tgt_inv::Vector{Vector{Int}}
end

If i is between 1 and g.nv (where g::Graph), then g.src_inv[i] is a list of all of the edges that have i as their source, and g.tgt_inv[i] is a list of all the edges with i as their target. Now, whenever you add or modify a vertex or edge, you have to update the indexes as well, which can be somewhat tricky to do correctly. However, with the generic C-set implementation in Catlab.jl, the code to add and modify each data structure is automatically generated, and the only thing that an implementor of a new C-set-based structure has to do is give memorable names to the automatically generated functions (here as in the rest of CS, the one thing that cannot be automated is naming!).

To sum up, while from a mathematical perspective the purpose of C-sets is that they generalize many interesting mathematical objects, from a scientific computing perspective the purpose of C-sets is that tedious and error prone code that keeps track of indexes can be automatically generated.

There's an old computer science joke which says that there are two problems in computer science:

  1. Cache invalidation

  2. Naming

  3. Off-by-one errors

CSets solves the first of the two.

The core data structure

The question we will answer in this section is how do automatically derive a data structure like Graph from a presentation like TheoryGraph.[1]

Our approach is to look at the structure like a database (the viewpoint in Databases are categories). Each object in the category corresponds to a table, and each morphism out of that object corresponds to a column in that table. Therefore, for a graph, we have two tables. The table for vertices has no columns, and the table for edges has two columns. To store the tables, we turn to the StructArrays.jl package.[2] For instance, the table for edges is stored using the type

StructVector{NamedTuple{(:src,:tgt),Tuple{Int,Int}}}

This type of data structure should be familiar to most data scientists. In Python, it is what is provided by pandas, and in R it is called a data frame. Logically, it is a list of records, however it is actually implemented by a list of columns, each column holding one field of the record.

All of the tables for the objects are stored in a NamedTuple with keys of the names of the objects. So, the entire type that stores that data of a graph looks like

NamedTuple{(:V,:E),Tuple{
  StructVector{NamedTuple{(),Tuple{}}},
  StructVector{NamedTuple{(:src,:tgt),Tuple{Int,Int}}}}}

The indexes we store in a more straightforward way, just a NamedTuple with a field corresponding to each indexed morphism, so for graphs it would be

NamedTuple{(:src,:tgt), Tuple{Vector{Vector{Int}},Vector{Vector{Int}}}}

I suppose that we could store the indexes in StructArrays grouped by codomain, just like the data is stored in StructArrays grouped by domain, but we haven't done that yet, for no particular reason.

To actually implement this, we use the following data type (note: this does not appear in Catlab because in Catlab we also deal with attributes).

struct CSet{CD <: CatDesc, Indexed, TablesType <: NamedTuple, IndexType <: NamedTuple}
  tables::TablesType
  index::IndexType
  function CSet{CD, Indexed}()
    tables = make_empty_tables(CD)
    index = make_empty_index(CD,Indexed)
    CSet{CD,Indexed,typeof(tables),typeof(index)}(tables,index)
  end
end

The idea is that TablesType and IndexType are uniquely specified by CD and Indexed, however it is not possible to do sophisticated enough type wrangling to write this into the struct, so instead the types are generated whenever we construct a new CSet. The types of TablesType and IndexType are in the type of a concrete CSet so that when the compiler encounters a function that handles CSets, it has all the type information it needs to compile the function to something very fast.

But what is CatDesc? In order to parameterize CSet by a (presentation of a) category, we need a parameter that represents a (presentation of a) category. One choice for this would be a struct that looked like this:

struct CatDesc1
  ob::Vector{Symbol}
  hom::Vector{Symbol}
  dom::Vector{Int}
  codom::Vector{Int}
end

In this structure, the domain and codomain of hom[i] are ob[dom[i]] and ob[codom[i]], respectively. However, in Julia not all values can parameterize types. For reasons I do not quite understand, a tuple of ints or a tuple of symbols can parameterize a type, but a tuple of tuples cannot, and anything with vectors is Right Out (this at least makes sense because vectors are mutable). One approach would be to have a separate parameter in CSet for tuples of ob, hom, dom, and codom, and indeed the first iteration of C-sets did exactly that. But with some cleverness, we can fit that into single parameter.

struct CatDesc{Ob,Hom,Dom,Codom}
end

We do this because Julia allows types to be parameterized by other types, so while Julia rejects

const Graph = CSet{CatDesc1([:V,:E],[:src,:tgt],[2,2],[1,1]),(:src,:tgt)}

it accepts

const Graph = CSet{CatDesc{(:V,:E),(:src,:tgt),(2,2),(1,1)},(:src,:tgt)}

We can even recover the behavior of the first example by overloading getproperty:

function Base.getproperty(T::Type{CatDesc{Ob,Hom,Dom,Codom}},key::Symbol)
  where {Ob,Hom,Dom,Codom}
  @match key begin
    :ob => Ob
    :hom => Hom
    :dom => Dom
    :codom => Codom
    _ => getfield(T,key)
  end
end

Then in functions that take in a CSet, we can inspect the category \(\mathsf{C}\) using those properties, as in the following example

function print_tables(cs::CSet{CD}) where {CD}
  for ob in CD.ob
    println("$ob = ")
    println(cs.tables[ob])
  end
end

Now that you have the definition of CatDesc, we leave it as an exercise to the reader to implement make_empty_tables and make_empty_index. Or you could look at the source code for Catlab!

[1] In Catlab, C-sets come in two flavors: those with attributes, and those without attributes. Instead of categories, attributed C-sets are based off of things called "schemas". In later posts, we will discuss attributes, so this is not relevant now; we mention it so that the reader will not be caught off-guard while reading the Catlab source code.
[2] One problem with StructVectors is that they do not support structs with no fields. So, for instance, the table for vertices, StructVector{NamedTuple{(),Tuple{}}} does not work. From the point of view of StructArrays.jl, this is perfectly reasonable: there was no reason for them to expect that zero-sized tuples would be a desirable thing to have. In any case, we get around this in Catlab with a bit of a hack, but for the purposes of this blog post we will assume that StructVectorss support empty tuples.

Operations on CSets

In this section, we will give an example of how we write generic code to operate on C-sets.

The basic operation of modifying a CSet is adding data to it. To be more precise, let \(A\) be an object of \(\mathsf{C}\), and let \(F : C \to \mathsf{FinSet}\). We call an element of \(F(A)\) a part. In order to add a part to \(F(A)\), we must also modify the definition of \(F(g)\) for all maps \(g\) with domain \(A\). In other words, we must add an entire row to the table in slot \(A\) of \(F\). If \(x \in F(A)\), we call \(F(g)(x)\) a subpart of \(x\); in the data structure, the subparts are the elements in the rows of the tables. With all this in mind, the function to add a part has the following signature (or at least, this is one of the signatures, more signatures are available in the actual implementation).

function add_part!(cs::CSet, type::Symbol, subparts::NamedTuple)

To call this on a graph, we might invoke it like this:

add_edge!(g::Graph,s::Int,t::Int) = add_part!(g, :E, (src=s,tgt=t))
add_vertex!(g::Graph) = add_part!(g, :V, (;))

Now, we want to make this as efficient as handwritten functions. Therefore, before we talk about the implementation of add_part!, let's look at how we might handwrite add_edge! and add_vertex!.

function add_edge!(g::Graph,s::Int,t::Int)
  # make sure that the source and target exist
  @assert 1 <= s <= length(g.tables.V) && 1 <= t <= length(g.tables.V)
  push!(g.tables.E, (src=s,tgt=t))
  part_num = length(g.tables.E)
  # The indices are kept sorted for easy access
  insertsorted!(g.index.src[s],part_num)
  insertsorted!(g.index.tgt[t],part_num)
  return part_num
end

function add_vertex!(g::Graph)
  push!(g.tables.V, (;))
  part_num = length(g.tables.V)
  push!(g.index.src,[])
  push!(g.index.tgt,[])
  return part_num
end

That wasn't too bad, but there are some subtleties that could easily trip you up. For instance, in add_vertex! you have to add new empty arrays to the index, so that when you add edges, the edges can be added to those arrays. Also, we want to additionally have functions that add many edges at once, or many vertices, and those come with more complexity. And in real library code (LightGraph.jl) there are even more things that must be dealt with. There are clear benefits to writing this once, generically.

On the other hand, it's unclear how we could write generic code that approaches the performance of this handwritten code. We would need to loop over the morphisms, and execute conditionals to check whether morphisms were indexed. While we might get the same asymptotic complexity, this would be a significant performance cost.

Luckily, Julia provides us a mechanism to have the best of both worlds: the performance of the hand-written function written in a generic way. To do this, we use an @generated function to automatically generate different code for each type that a function could be called with. This is similar to a macro, but while a macro only has access to the syntax of its arguments, a @generated function has access to the type of its arguments. This is best illustrated by an example, but also if you have not encountered macros in Julia, I encourage you to read up a little on them, because many of the techniques carry over to generated functions.

@generated function unrolled_addto!(x::StaticVector{T,n},y::StaticVector{T,n}) where {T,n}
  code = Expr(:block)
  for i in 1:n
    push!(code.args, :(x[$i] += y[$i]))
  end
  code
end

StaticVector{T,n} is a vector with element type T and length n. If x::StaticVector{Int,3} and y::StaticVector{Int,3}, then calling unrolled_addto!(x,y) is equivalent to running

x[1] += y[1]
x[2] += y[2]
x[3] += y[3]

To summarize, an @generated function generates code at compile-time (or in Julia terms, type-dispatch time) based on the types of its arguments, and then that code is run at run-time. The code is then cached, so the next time the function is called with those types, it just runs the compiled code immediately.

Now we will define add_part!. First of all, there is a trick that we use to get access to the object that we are adding a part to:

add_part!(cs::CSet,ob::Symbol,subparts::NamedTuple) = _add_part!(cs,Val(ob),subparts)

@generated function _add_part!(cs::CSet{CD,Indexed},::Val{ob},subparts::NamedTuple{columns}) where
  {CD, Indexed, ob, columns}
  ...
end

Val is a useful little type with definition

struct Val{x}
  Val(x) = new{x}()
end

_add_part! itself has 5 parts.

  1. We check that the subparts exist.

  2. We add the new part with its subparts to the right table.

  3. We add new empty indices for all of the indexed morphisms with codomain of the type of the part.

  4. We add the new part to the indices for all of the indexed morphisms with domain of the type of the part.

  5. We return the number of the new part.

Without further ado, we implement _add_part!. We use some functions which do not exist, their definition is left to the reader.

@generated function _add_part!(cs::CSet{CD,Indexed},::Val{ob},subparts::NamedTuple{columns}) where
  {CD, Indexed, ob, columns}
  # Make sure that we have all the subparts
  @assert get_outgoing_morphisms(CD,ob) == columns
  code = Expr(:block)
  # Part 1
  for hom in columns
    push!(code.args, :(@assert 1 <= subparts.$hom <= length(cs.tables.$(codom(CD,hom)))))
  end
  # Part 2
  push!(code.args, :(push!(cs.tables.$ob, subparts)))
  push!(code.args, :(part_num = length(cs.tables.$ob)))
  # Part 3
  for hom in CD.hom
    if codom(CD,hom) == ob && hom ∈ Indexed
      push!(code.args, :(push!(cs.index.$hom,[])))
    end
  end
  # Part 4
  for hom in columns
    if hom ∈ Indexed
      push!(code.args, :(insertsorted!(cs.index.$hom,part_num)))
    end
  end
  # Part 5
  push!(code.args, :(return part_num))
  code
end

This generates code very similar to the handwritten code that we wrote above, automatically. We use this strategry to implement almost all of the generic operations on CSets, and this is what makes CSets competitive with handwritten data structures.

Wrapup

We have illustrated how the theory of C-sets can be implemented in a concrete data structure, and then manipulated both generically and efficiently with @generated functions. We hope that the reader has gained an appreciation for the advantages of the C-set data structure, and also more generally for how @generated code brings new tools to the table. In future posts, we will talk about how we add attributes to C-sets, and what that means both categorically and technically. To give a teaser, attributes allow us to talk about decorated graphs, where each edge has, for instance, an associated real number. So stay tuned!