So far on the blog we have motivated \(\mathsf{C}\)sets mainly as a unifying abstraction that encompasses graphs, wiring diagrams, Petri nets, and other graphlike objects. But what makes attributed \(\mathsf{C}\)sets so useful is that they are a joint generalization of two essential data structures in computer science: graphs, as we have seen, but also data frames, a mainstay in any modern environment for data analysis.
A data frame represents a table of data as a set of named columns, where different columns may have different data types. Here is an excerpt from a data frame with four columns, provided by the Sitka
dataset in the R package MASS.
size  time  tree  treat  

Float64  Int64  Int64  Cat…  
1  4.51  152  1  ozone 
2  4.98  174  1  ozone 
3  5.41  201  1  ozone 
4  5.9  227  1  ozone 
5  6.15  258  1  ozone 
6  4.24  152  2  ozone 
7  4.2  174  2  ozone 
8  4.68  201  2  ozone 
9  4.92  227  2  ozone 
10  4.96  258  2  ozone 
⋮  ⋮  ⋮  ⋮  ⋮ 
Mathematically, a data frame is a multispan of sets whose apex is a finite set. A multispan, in turn, is a span that may have a number of legs different than two. The Sitka
data frame is a multispan with four legs:
The apex is the set \([n] := \{1,\dots,n\}\) tabulating the \(n = 395\) rows of the data frame. The first leg of the multispan, the function \(\mathrm{size}: [n] \to \mathbb{R}\), encodes the first column of the table, a vector in \(\mathbb{R}^n\).
Although data frames are useful in isolation, many datasets have a relational structure that is better expressed by a collection of interlinked tables than by a single table. Arguably that is true even of this simple example. The Sitka
dataset is longitudinal data, consisting of repeated measurements over time of the sizes of Sitka spruce trees, some of which were grown in ozoneenriched chambers (ozone
) and others not (control
). Thus, the data has a twolevel structure: there is a set of trees and, for each tree, a set of measurements. The two types of units, trees and measurements, should be recorded in separate tables, where each measurement is linked to the tree that it measures. Graphically, the schema \(\mathsf{C}\) is:
Let's use Catlab to set up the corresponding attributed \(\mathsf{C}\)set. The complete schema, including data attributes, is:
@present SitkaSchema(FreeSchema) begin
(Numerical, Time, Treatment)::Data
Tree::Ob
treat::Attr(Tree, Treatment)
Measurement::Ob
tree::Hom(Measurement, Tree)
size::Attr(Measurement, Numerical)
time::Attr(Measurement, Time)
end
const SitkaData = ACSetType(SitkaSchema, index=[:tree])
Notice that the assigned treatment (treat
) is properly an attribute of the trees, not of the measurements as the original data frame would suggest. Having defined the schema and corresponding Julia type, we load the data from the data frame (sitka_df
) into the attributed \(\mathsf{C}\)set (sitka
):
sitka = SitkaData{Float64,Int,String}()
# Add trees and measurements, including data attributes of the latter.
trees = unique!(sort(sitka_df.tree))
@assert trees == 1:last(trees)
add_parts!(sitka, :Tree, length(trees))
add_parts!(sitka, :Measurement, nrow(sitka_df),
tree=sitka_df.tree, size=sitka_df.size, time=sitka_df.time)
# Assign treatment to each tree, checking for data consistency.
for tree in trees
measurements = incident(sitka, tree, :tree)
sitka[tree, :treat] = only(unique(sitka_df.treat[measurements]))
end
map(length, tables(sitka))
(Tree = 79, Measurement = 395)
The resulting tables, excerpts of which are shown below, comprise 395 measurements spread across 79 trees.
pretty_tables(sitka, crop=:both, display_size=(15,80))
┌──────┬───────┐
│ Tree │ treat │
├──────┼───────┤
│ 1 │ ozone │
│ 2 │ ozone │
│ 3 │ ozone │
│ 4 │ ozone │
│ 5 │ ozone │
│ 6 │ ozone │
│ 7 │ ozone │
│ ⋮ │ ⋮ │
└──────┴───────┘
72 rows omitted
┌─────────────┬──────┬──────┬──────┐
│ Measurement │ tree │ size │ time │
├─────────────┼──────┼──────┼──────┤
│ 1 │ 1 │ 4.51 │ 152 │
│ 2 │ 1 │ 4.98 │ 174 │
│ 3 │ 1 │ 5.41 │ 201 │
│ 4 │ 1 │ 5.9 │ 227 │
│ 5 │ 1 │ 6.15 │ 258 │
│ 6 │ 2 │ 4.24 │ 152 │
│ 7 │ 2 │ 4.2 │ 174 │
│ ⋮ │ ⋮ │ ⋮ │ ⋮ │
└─────────────┴──────┴──────┴──────┘
388 rows omitted
The data munging was simple enough, with one qualification. In order to consistently set the treat
attribute of the trees, we had to verify that each tree is actually assigned a unique treatment value in the original table. Nothing about the structure of the table guarantees that. This subtlety illustrates a key strength of the relational data model: even in the absence of equational axioms, the schema can enforce logical constraints that a single table cannot, eliminating redundancy and reducing the risk of transcription errors.
Of course, flattened views of relational data are often needed for particular data analyses. These views are easily obtained by making queries against the \(\mathsf{C}\)set. For example, the following query recovers the original data frame.
select_all = @relation (size=size, time=time, tree=t, treat=treat) begin
Tree(_id=t, treat=treat)
Measurement(tree=t, size=size, time=time)
end
query(sitka, select_all, table_type=DataFrame) == sitka_df
true
The query joins the Tree
and Measurement
tables along their respective trees, carrying along the data attributes of both tables.
Readers familiar with relational databases will recognize the foregoing ideas. Relational data is ubiquitous, and relational databases, in the form of SQL databases like MySQL and SQLite, constitute a large industry. The mathematics of relational databases as \(\mathsf{C}\)sets or similar categorical structures is also now well established.^{[1]} Despite this, relational databases play only a minor role in the data analysis environments offered by languages like Python, R, and Julia. A typical analysis might begin with a query against a SQL database, returning a single large table, but will then proceed by ad hoc manipulation of data frames. If a query involving multiple tables (or multiple copies of the same table) is needed, it is carried out manually through a sequence of joins. With relational databases, by contrast, queries are specified declaratively and the database system does the work of translating the query into an efficient algorithm.
Important though they are, data frames are the lowest common denominator of data science tooling. We can see no reason for this state of affairs besides tradition and the tendency of SQL databases to be siloed from other systems. We hope that \(\mathsf{C}\)sets will be a useful tool for data analysis, serving as an inmemory relational database that interoperates smoothly with the JuliaData ecoystem. Indeed, the \(\mathsf{C}\)sets in Catlab do not compete with this ecosystem but rather depend on it: the tables of a \(\mathsf{C}\)set are provided by TypedTables.jl and may be exchanged for any other columnar table compliant with Tables.jl.
In the remainder of this post, we explain how conjunctive queries work and illustrate them on a dataset with more complex relational structure than the Sitka data. We then explain the categorical underpinnings of conjunctive queries, particularly the close connection between joins in relational databases and finite limits in the category of sets.
[1]  See the line of research by David Spivak and collaborators (Spivak 2012, Spivak 2014, Schultz et al 2016) and also the earlier blog post by Owen Lynch on the mathematics of attributed \(\mathsf{C}\)sets. 
The mutagenesis dataset, available through the Relational Dataset Repository, is a collection of molecular graphs from a study on the mutagenicity of certain chemical compounds. The dataset has a threelevel relational structure: there is a set of molecules, each of which has a set of atoms, which may be connected pairwise by chemical bonds.
The complete schema includes data attributes for each of the three types.
@present MutagenesisSchema(FreeSchema) begin
(Indicator, Categorical, Name, Numerical)::Data
Molecule::Ob
molecule_id::Attr(Molecule, Name)
(ind1, inda)::Attr(Molecule, Indicator)
(logp, lumo)::Attr(Molecule, Numerical)
mutagenic::Attr(Molecule, Indicator)
Atom::Ob
molecule::Hom(Atom, Molecule)
atom_id::Attr(Atom, Name)
atom_type::Attr(Atom, Categorical)
element::Attr(Atom, Name)
charge::Attr(Atom, Numerical)
Bond::Ob
(atom1, atom2)::Hom(Bond, Atom)
bond_type::Attr(Bond, Categorical)
end
const MutagenesisData = ACSetType(MutagenesisSchema,
index=[:atom1, :atom2, :molecule],
unique_index=[:atom_id, :molecule_id])
The attributes molecule_id
and atom_id
do not appear to be meaningful but are carried over from the primary keys in the original SQL database.
The data is loaded into Catlab by making three SELECT * from [table];
queries on the SQL database and mapping the primary keys to integer IDs. We omit this routine code, which should in any case be automated in the future. The result is an ACSet mutagenesis
having the following dimensions.
map(length, tables(mutagenesis))
(Molecule = 188, Atom = 4893, Bond = 5243)
Here are excerpts from each of the tables:
pretty_tables(mutagenesis, crop=:both, display_size=(15,80))
┌──────────┬─────────────┬───────┬───────┬──────┬────────┬───────────┐
│ Molecule │ molecule_id │ ind1 │ inda │ logp │ lumo │ mutagenic │
├──────────┼─────────────┼───────┼───────┼──────┼────────┼───────────┤
│ 1 │ d1 │ true │ false │ 4.23 │ 1.246 │ true │
│ 2 │ d10 │ true │ false │ 4.62 │ 1.387 │ true │
│ 3 │ d100 │ false │ false │ 2.68 │ 1.034 │ false │
│ 4 │ d101 │ true │ false │ 6.26 │ 1.598 │ true │
│ 5 │ d102 │ true │ false │ 2.4 │ 3.172 │ true │
│ 6 │ d103 │ true │ false │ 4.69 │ 1.487 │ true │
│ 7 │ d104 │ true │ false │ 6.26 │ 1.546 │ true │
│ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │
└──────────┴─────────────┴───────┴───────┴──────┴────────┴───────────┘
181 rows omitted
┌──────┬──────────┬─────────┬───────────┬─────────┬────────┐
│ Atom │ molecule │ atom_id │ atom_type │ element │ charge │
├──────┼──────────┼─────────┼───────────┼─────────┼────────┤
│ 1 │ 3 │ d100_1 │ 22 │ c │ 0.128 │
│ 2 │ 3 │ d100_10 │ 3 │ h │ 0.132 │
│ 3 │ 3 │ d100_11 │ 29 │ c │ 0.002 │
│ 4 │ 3 │ d100_12 │ 22 │ c │ 0.128 │
│ 5 │ 3 │ d100_13 │ 22 │ c │ 0.128 │
│ 6 │ 3 │ d100_14 │ 22 │ c │ 0.128 │
│ 7 │ 3 │ d100_15 │ 22 │ c │ 0.202 │
│ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │ ⋮ │
└──────┴──────────┴─────────┴───────────┴─────────┴────────┘
4886 rows omitted
┌──────┬───────┬───────┬───────────┐
│ Bond │ atom1 │ atom2 │ bond_type │
├──────┼───────┼───────┼───────────┤
│ 1 │ 1 │ 12 │ 7 │
│ 2 │ 1 │ 24 │ 1 │
│ 3 │ 3 │ 4 │ 7 │
│ 4 │ 4 │ 5 │ 7 │
│ 5 │ 4 │ 9 │ 1 │
│ 6 │ 5 │ 6 │ 7 │
│ 7 │ 5 │ 10 │ 1 │
│ ⋮ │ ⋮ │ ⋮ │ ⋮ │
└──────┴───────┴───────┴───────────┘
5236 rows omitted
We are now going to make queries against the mutagenesis data. Generally speaking, Catlab supports evaluating conjunctive queries on attributed \(\mathsf{C}\)sets. Although more restrictive than arbitrary SQL queries, conjunctive queries play in a central role in the theory and practice of relational databases. They are precisely the queries that can be expressed in regular logic, the subsystem of firstorder logic with connectives \(\exists, \wedge, \top\). Alternatively, as we will see later, they are the queries that can be computed as finite limits in \(\mathsf{Set}\).
In Catlab, conjunctive queries are most easily specified using the @relation
macro, which has the general form:
@relation (out1=var1, out2=var2, …) [where (x,y,z,…)] begin
T₁(col1=var3, col2=var4, …)
⋮
Tₙ(col1=var5, col2=var6, …)
end
Every relation statement in the macro body specifies a table \(T_i\) and a subset of its columns, with each column assigned to a variable. To evaluate the query, you take the cartesian product of the selected tables \(T_1, \dots, T_n\), then filter the rows so that any two columns assigned to the same variable are equal. Finally, you return a table that exposes a subset of the variables, given by the output names in the first clause. The full set of variables is optionally stated in the where
clause; if it is omitted, the set of variables is inferred.
Conjunctive queries are more intuitive than this description may make them appear. Returning to the mutagenesis data, we might assume that molecular graphs are undirected, so that for every for bond between atoms \(a_1\) and \(a_2\), there is also a bond between atoms \(a_2\) and \(a_1\). We can check this assumption using the following query.
symmetric_pairs = @relation (atom1=a1, atom2=a2) begin
Bond(atom1=a1, atom2=a2)
Bond(atom1=a2, atom2=a1)
end
length(query(mutagenesis, symmetric_pairs))
0
In fact, no bond in this dataset has a corresponding bond in the opposite direction. The creators of the data presumably left the oppositely oriented bonds implicit.
A more fundamental constraint is that any two bonded atoms must belong to the same molecule.
bond_within_molecule = @relation (Bond=b) begin
Bond(_id=b, atom1=a1, atom2=a2)
Atom(_id=a1, molecule=m)
Atom(_id=a2, molecule=m)
end
result = query(mutagenesis, bond_within_molecule)
length(result) == nparts(mutagenesis, :Bond)
true
At least this assumption is satisfied by the data! The schema presented earlier has no equations, but we could have added an equation explitictly stating the assumption:
@present MutagenesisSchema(FreeSchema) begin
[...]
compose(atom1, molecule) == compose(atom2, molecule)
end
By default, the variables in a query may take any values, but variables can be fixed to specific values using the optional third argument to query
. For example, the following query returns all carbon atoms belonging to a mutagenic molecule, together with the charges of the atoms.
atoms_in_molecule = @relation (Atom=a, molecule=m, charge=charge) begin
Atom(_id=a, element=element, charge=charge, molecule=m)
Molecule(_id=m, mutagenic=mutagenic)
end
result = query(mutagenesis, atoms_in_molecule, (element="c", mutagenic=true))
Table with 3 columns and 1833 rows:
Atom molecule charge
┌───────────────────────
1 │ 27 4 0.121
2 │ 28 4 0.091
3 │ 29 4 0.009
4 │ 30 4 0.009
5 │ 31 4 0.009
6 │ 32 4 0.121
7 │ 33 4 0.121
8 │ 34 4 0.122
⋮ │ ⋮ ⋮ ⋮
The mutagenesis data is essentially a collection of graphs, one for each molecule, where the vertices are atoms and the edges are bonds. The conventional way to process such data would be to convert it into a graph, using a specialized data structure for directed graphs. In Catlab, this conversion is easily achieved by functorial data migration in its basic, contravariant form:
using Catlab.Graphs
g = Graph(mutagenesis, Dict(:V => :Atom, :E => :Bond),
Dict(:src => :atom1, :tgt => :atom2))
(V = nv(g), E = ne(g))
(V = 4893, E = 5243)
The graph's connected components correspond to the molecules. An advantage of the graph representation is that you can directly apply any graph analytics procedure that expects a graph as input. However, in contrast to other systems, in Catlab there are no further advantages to performing this conversion. Traversing the derived graph will be no faster or slower than traversing the MutagenesisData
object, because graphs in Catlab are \(\mathsf{C}\)sets isomorphic to the atombond substructure of the mutagenesis structure. This what we meant at the outset when we said that attributed \(\mathsf{C}\)sets are a joint generalization of graphs and data frames.
The unification of graphs with relational data cuts in both directions. On the one hand, since graphs in Catlab are \(\mathsf{C}\)sets, we can efficiently evaluate conjunctive queries against them. As a simple example, here are all the directed paths of length two in the derived graph:
paths2 = @relation (vert1=v1, vert2=v2, vert3=v3) begin
E(src=v1, tgt=v2)
E(src=v2, tgt=v3)
end
result = query(g, paths2)
Table with 3 columns and 5539 rows:
vert1 vert2 vert3
┌────────────────────
1 │ 1 12 20
2 │ 1 12 25
3 │ 3 4 5
4 │ 3 4 9
5 │ 4 5 6
6 │ 4 5 10
7 │ 5 6 7
8 │ 5 6 11
⋮ │ ⋮ ⋮ ⋮
On the other hand, we can formulate richer queries against the mutagenesis
data, where more data attributes are available. The following query refines the previous one to find all directed paths from a carbon atom to a nitrogen atom to another carbon atom. The resulting table is much smaller.
paths2 = @relation (atom1=a1, atom2=a2, atom3=a3) begin
Bond(atom1=a1, atom2=a2)
Bond(atom1=a2, atom2=a3)
Atom(_id=a1, element=elem1)
Atom(_id=a2, element=elem2)
Atom(_id=a3, element=elem3)
end
result = query(mutagenesis, paths2, (elem1="c", elem2="n", elem3="c"))
Table with 3 columns and 41 rows:
atom1 atom2 atom3
┌────────────────────
1 │ 163 164 165
2 │ 169 170 163
3 │ 620 624 608
4 │ 768 769 787
5 │ 786 766 767
6 │ 1105 1104 1086
7 │ 1315 1319 1320
8 │ 1316 1321 1320
⋮ │ ⋮ ⋮ ⋮
The theory of relational databases is category theory. This statement is true in many ways and from many perspectives, but we will focus on just one: the equivalence between conjunctive queries on relational databases and finite limits in the category of sets. This equivalence is almost immediate if you are familiar with the relevant concepts; it is certainly not new (Alagic & Alagic 1991). Nevertheless, it is a fruitful observation for both category theory and relational databases. In one direction, conjunctive queries inherit an intuitive diagrammatic syntax through the operad of undirected wiring diagrams. Conversely, any generalpurpose package for computational category theory, which Catlab aims to be, ought to be able to efficiently compute finite limits of sets. Algorithms for doing so have been extensively developed by database theorists and engineers.
Here is a dictionary between conjunctive queries of different kinds and finite limits in \(\mathsf{Set}\) of different shapes.
cross joins of tables \(X\) and \(Y\) are products
joins^{[2]} along attributes \(a_X: X \to A\) and \(a_Y: Y \to A\) are pullbacks
multiway joins along attributes \(a_i: X_i \to A\), for \(i=1,\dots,n\), are wide pullbacks
general conjunctive queries are limits of finite bipartite diagrams
The last statement, about conjunctive queries and bipartite diagrams, requires elaboration. We will break it down into two steps. First, conjunctive queries can be represented by undirected wiring diagrams (UWDs); in Catlab, they actually are UWDs.^{[3]} Second, the UWD translates into a bipartite diagram, whose limit is the result of the query.
As an illustration, recall this conjunctive query on the mutagenesis data.
bond_within_molecule = @relation (Bond=b) begin
Bond(_id=b, atom1=a1, atom2=a2)
Atom(_id=a1, molecule=m)
Atom(_id=a2, molecule=m)
end
bond_within_molecule isa UndirectedWiringDiagram
true
The tables in the query correspond to boxes (unfilled circles) in the UWD and the variables correspond to junctions (filled circles).
We won't say much more about UWDs here. For additional examples of diagrammatic query syntax, see the related earlier post by Andrew Baas on UWDs and SQL queries.
In the second step, the UWD translates into a bipartite diagram in \(\mathsf{Set}\), where the boxes of the UWD become vertices in the diagram's top layer and the junctions become vertices in the bottom layer.
The result \(R\) of the conjunctive query is then the limit of this bipartite diagram, which can be thought of as a "generalized pullback."
In technical terms, isomorphism classes of data frames (multispans) constitute an algebra of the operad of UWDs, where composition is given by limits of the induced bipartite diagrams.
All this is to say what a conjunctive query is but not how to compute it. There are many ways to evaluate a conjunctive query. The simplest, sketched in the previous section, is to take the cartesian product of the sets in the top layer of the bipartite diagram, then filter the tuples, keeping only those that project to equal values under the functions over any given set in the bottom layer. In a special case, this procedure is known to category theorists as "computing a pullback by taking an equalizer of a product." However, it is probably the worst possible algorithm for computing a limit, generally being extremely inefficient.
The inefficiency arises from having to iterate over the elements of a cartesian product \(X_1 \times \cdots \times X_n\), whose cardinality grows exponentially in \(n\). To avoid this, the conjunctive query should be decomposed into a sequence of joins, a problem known as query planning. Catlab currently employs a simple greedy heuristic for choosing joins.
Joins, or equivalently pullbacks, may themselves be computed in many ways, of which the most common are (Mishra & Eich 1992):
Nested loop join: the naive algorithm that iterates over all pairs of elements
Sortmerge join: both vectors of attributes are sorted, then merged together by synchronized iteration
Hash join: a hash map (index of preimages) is built for one attribute, then the other attribute vector is iterated, looking up preimages for each element
All three join algorithms are implemented in Catlab, for twoway and multiway joins. Both the sortmerge join and the hash join are generally much faster than the nested loop join. The hash join performs especially well when an attribute is already indexed in the attributed \(\mathsf{C}\)set, as happens for the source and target functions in a graph.
[2]  In this post, by "join" we always mean both "inner join" and "equijoin." Catlab does not yet support outer joins or joins with predicates other than equality. 
[3]  And UWDs are, in turn, \(\mathsf{C}\)sets for a schema \(\mathsf{C}\) with tables Box , Junction , Port , and OuterPort . This is a circularity but not a vicious one. It does imply that we could evaluate conjunctive queries on conjunctive queries (!), for whatever that is worth.

Despite its simplistic query planning, Catlab's conjunctive queries on \(\mathsf{C}\)sets are already fairly usable. Over time we will implement better query planning algorithms, as well as more general kinds of queries. Possibilities include unions of conjunctive queries, the class of queries expressible in coherent logic, and recursive queries, as in Datalog.
As we have seen, relational databases enjoy two key advantages over isolated data frames. First, relational schemas enable the data's encoding to mirror its underlying relational structure, which eliminates redundancy in storage and promotes internal consistency. Second, declarative query languages make it easy to ask questions about the data and view it through different lenses. On the other hand, SQL databases and logic programming languages like Prolog and Datalog have generally suffered from the Achilles' heel of all domainspecific languages (DSLs): you cannot step outside the DSL without abandoning the system entirely. Yet stepping outside the DSL is often necessary, since even the most expressive query languages have their limitations.
We think that \(\mathsf{C}\)sets can offer the best of both worlds to data analysis, providing the benefits of the relational data model while existing as ordinary data structures in a generalpurpose programming language. If you cannot express your query using the @relation
macro, you can always just write the for
loops or the map
calls! Due to the design of the Julia programming language and the \(\mathsf{C}\)sets in Catlab, doing so should cause no performance degradations. In this way, we hope that \(\mathsf{C}\)sets will eventually become a valuable component of the JuliaData ecosystem.
Suad Alagic and Mara Alagic, 1991. Joins as pullbacks. DBLP:AlagicA91
Priti Mishra and Margaret Eich, 1992. Join processing in relational databases. DOI:10.1145/128762.128764
Patrick Schultz, David Spivak, Christina Vasilakopoulou, and Ryan Wisnesky, 2016. Algebraic databases. arXiv:1602.03501
David Spivak, 2012. Functorial data migration. DOI:10.1016/j.ic.2012.05.001 arXiv:1009.1166
David Spivak, 2014. Category Theory for the Sciences. Online