C-sets for data analysis: relational data and conjunctive queries

By Evan Patterson on December 28, 2020

So far on the blog we have motivated \(\mathsf{C}\)-sets mainly as a unifying abstraction that encompasses graphs, wiring diagrams, Petri nets, and other graph-like objects. But what makes attributed \(\mathsf{C}\)-sets so useful is that they are a joint generalization of two essential data structures in computer science: graphs, as we have seen, but also data frames, a mainstay in any modern environment for data analysis.

A data frame represents a table of data as a set of named columns, where different columns may have different data types. Here is an excerpt from a data frame with four columns, provided by the Sitka dataset in the R package MASS.

sizetimetreetreat
Float64Int64Int64Cat…
14.511521ozone
24.981741ozone
35.412011ozone
45.92271ozone
56.152581ozone
64.241522ozone
74.21742ozone
84.682012ozone
94.922272ozone
104.962582ozone

Mathematically, a data frame is a multispan of sets whose apex is a finite set. A multispan, in turn, is a span that may have a number of legs different than two. The Sitka data frame is a multispan with four legs:

The apex is the set \([n] := \{1,\dots,n\}\) tabulating the \(n = 395\) rows of the data frame. The first leg of the multispan, the function \(\mathrm{size}: [n] \to \mathbb{R}\), encodes the first column of the table, a vector in \(\mathbb{R}^n\).

Although data frames are useful in isolation, many datasets have a relational structure that is better expressed by a collection of interlinked tables than by a single table. Arguably that is true even of this simple example. The Sitka dataset is longitudinal data, consisting of repeated measurements over time of the sizes of Sitka spruce trees, some of which were grown in ozone-enriched chambers (ozone) and others not (control). Thus, the data has a two-level structure: there is a set of trees and, for each tree, a set of measurements. The two types of units, trees and measurements, should be recorded in separate tables, where each measurement is linked to the tree that it measures. Graphically, the schema \(\mathsf{C}\) is:

Let's use Catlab to set up the corresponding attributed \(\mathsf{C}\)-set. The complete schema, including data attributes, is:

@present SitkaSchema(FreeSchema) begin
  (Numerical, Time, Treatment)::Data

  Tree::Ob
  treat::Attr(Tree, Treatment)
  
  Measurement::Ob
  tree::Hom(Measurement, Tree)
  size::Attr(Measurement, Numerical)
  time::Attr(Measurement, Time)
end

const SitkaData = ACSetType(SitkaSchema, index=[:tree])

Notice that the assigned treatment (treat) is properly an attribute of the trees, not of the measurements as the original data frame would suggest. Having defined the schema and corresponding Julia type, we load the data from the data frame (sitka_df) into the attributed \(\mathsf{C}\)-set (sitka):

sitka = SitkaData{Float64,Int,String}()

# Add trees and measurements, including data attributes of the latter.
trees = unique!(sort(sitka_df.tree))
@assert trees == 1:last(trees)
add_parts!(sitka, :Tree, length(trees))
add_parts!(sitka, :Measurement, nrow(sitka_df),
           tree=sitka_df.tree, size=sitka_df.size, time=sitka_df.time)

# Assign treatment to each tree, checking for data consistency.
for tree in trees
  measurements = incident(sitka, tree, :tree)
  sitka[tree, :treat] = only(unique(sitka_df.treat[measurements]))
end

map(length, tables(sitka))
(Tree = 79, Measurement = 395)

The resulting tables, excerpts of which are shown below, comprise 395 measurements spread across 79 trees.

pretty_tables(sitka, crop=:both, display_size=(15,80))
┌──────┬───────┐
│ Tree │ treat │
├──────┼───────┤
│    1 │ ozone │
│    2 │ ozone │
│    3 │ ozone │
│    4 │ ozone │
│    5 │ ozone │
│    6 │ ozone │
│    7 │ ozone │
│  ⋮   │   ⋮   │
└──────┴───────┘
 72 rows omitted
┌─────────────┬──────┬──────┬──────┐
│ Measurement │ tree │ size │ time │
├─────────────┼──────┼──────┼──────┤
│           1 │    1 │ 4.51 │  152 │
│           2 │    1 │ 4.98 │  174 │
│           3 │    1 │ 5.41 │  201 │
│           4 │    1 │  5.9 │  227 │
│           5 │    1 │ 6.15 │  258 │
│           6 │    2 │ 4.24 │  152 │
│           7 │    2 │  4.2 │  174 │
│      ⋮      │  ⋮   │  ⋮   │  ⋮   │
└─────────────┴──────┴──────┴──────┘
                    388 rows omitted

The data munging was simple enough, with one qualification. In order to consistently set the treat attribute of the trees, we had to verify that each tree is actually assigned a unique treatment value in the original table. Nothing about the structure of the table guarantees that. This subtlety illustrates a key strength of the relational data model: even in the absence of equational axioms, the schema can enforce logical constraints that a single table cannot, eliminating redundancy and reducing the risk of transcription errors.

Of course, flattened views of relational data are often needed for particular data analyses. These views are easily obtained by making queries against the \(\mathsf{C}\)-set. For example, the following query recovers the original data frame.

select_all = @relation (size=size, time=time, tree=t, treat=treat) begin
  Tree(_id=t, treat=treat)
  Measurement(tree=t, size=size, time=time)
end

query(sitka, select_all, table_type=DataFrame) == sitka_df
true

The query joins the Tree and Measurement tables along their respective trees, carrying along the data attributes of both tables.

Readers familiar with relational databases will recognize the foregoing ideas. Relational data is ubiquitous, and relational databases, in the form of SQL databases like MySQL and SQLite, constitute a large industry. The mathematics of relational databases as \(\mathsf{C}\)-sets or similar categorical structures is also now well established.[1] Despite this, relational databases play only a minor role in the data analysis environments offered by languages like Python, R, and Julia. A typical analysis might begin with a query against a SQL database, returning a single large table, but will then proceed by ad hoc manipulation of data frames. If a query involving multiple tables (or multiple copies of the same table) is needed, it is carried out manually through a sequence of joins. With relational databases, by contrast, queries are specified declaratively and the database system does the work of translating the query into an efficient algorithm.

Important though they are, data frames are the lowest common denominator of data science tooling. We can see no reason for this state of affairs besides tradition and the tendency of SQL databases to be siloed from other systems. We hope that \(\mathsf{C}\)-sets will be a useful tool for data analysis, serving as an in-memory relational database that interoperates smoothly with the JuliaData ecoystem. Indeed, the \(\mathsf{C}\)-sets in Catlab do not compete with this ecosystem but rather depend on it: the tables of a \(\mathsf{C}\)-set are provided by TypedTables.jl and may be exchanged for any other columnar table compliant with Tables.jl.

In the remainder of this post, we explain how conjunctive queries work and illustrate them on a dataset with more complex relational structure than the Sitka data. We then explain the categorical underpinnings of conjunctive queries, particularly the close connection between joins in relational databases and finite limits in the category of sets.

[1] See the line of research by David Spivak and collaborators (Spivak 2012, Spivak 2014, Schultz et al 2016) and also the earlier blog post by Owen Lynch on the mathematics of attributed \(\mathsf{C}\)-sets.

Conjunctive queries on C-sets

The mutagenesis dataset, available through the Relational Dataset Repository, is a collection of molecular graphs from a study on the mutagenicity of certain chemical compounds. The dataset has a three-level relational structure: there is a set of molecules, each of which has a set of atoms, which may be connected pairwise by chemical bonds.

The complete schema includes data attributes for each of the three types.

@present MutagenesisSchema(FreeSchema) begin
  (Indicator, Categorical, Name, Numerical)::Data
  
  Molecule::Ob
  molecule_id::Attr(Molecule, Name)
  (ind1, inda)::Attr(Molecule, Indicator)
  (logp, lumo)::Attr(Molecule, Numerical)
  mutagenic::Attr(Molecule, Indicator)
  
  Atom::Ob
  molecule::Hom(Atom, Molecule)
  atom_id::Attr(Atom, Name)
  atom_type::Attr(Atom, Categorical)
  element::Attr(Atom, Name)
  charge::Attr(Atom, Numerical)
  
  Bond::Ob
  (atom1, atom2)::Hom(Bond, Atom)
  bond_type::Attr(Bond, Categorical)
end

const MutagenesisData = ACSetType(MutagenesisSchema, 
  index=[:atom1, :atom2, :molecule],
  unique_index=[:atom_id, :molecule_id])

The attributes molecule_id and atom_id do not appear to be meaningful but are carried over from the primary keys in the original SQL database.

The data is loaded into Catlab by making three SELECT * from [table]; queries on the SQL database and mapping the primary keys to integer IDs. We omit this routine code, which should in any case be automated in the future. The result is an ACSet mutagenesis having the following dimensions.

map(length, tables(mutagenesis))
(Molecule = 188, Atom = 4893, Bond = 5243)

Here are excerpts from each of the tables:

pretty_tables(mutagenesis, crop=:both, display_size=(15,80))
┌──────────┬─────────────┬───────┬───────┬──────┬────────┬───────────┐
│ Molecule │ molecule_id │  ind1 │  inda │ logp │   lumo │ mutagenic │
├──────────┼─────────────┼───────┼───────┼──────┼────────┼───────────┤
│        1 │          d1 │  true │ false │ 4.23 │ -1.246 │      true │
│        2 │         d10 │  true │ false │ 4.62 │ -1.387 │      true │
│        3 │        d100 │ false │ false │ 2.68 │ -1.034 │     false │
│        4 │        d101 │  true │ false │ 6.26 │ -1.598 │      true │
│        5 │        d102 │  true │ false │  2.4 │ -3.172 │      true │
│        6 │        d103 │  true │ false │ 4.69 │ -1.487 │      true │
│        7 │        d104 │  true │ false │ 6.26 │ -1.546 │      true │
│    ⋮     │      ⋮      │   ⋮   │   ⋮   │  ⋮   │   ⋮    │     ⋮     │
└──────────┴─────────────┴───────┴───────┴──────┴────────┴───────────┘
                                                      181 rows omitted
┌──────┬──────────┬─────────┬───────────┬─────────┬────────┐
│ Atom │ molecule │ atom_id │ atom_type │ element │ charge │
├──────┼──────────┼─────────┼───────────┼─────────┼────────┤
│    1 │        3 │  d100_1 │        22 │       c │ -0.128 │
│    2 │        3 │ d100_10 │         3 │       h │  0.132 │
│    3 │        3 │ d100_11 │        29 │       c │  0.002 │
│    4 │        3 │ d100_12 │        22 │       c │ -0.128 │
│    5 │        3 │ d100_13 │        22 │       c │ -0.128 │
│    6 │        3 │ d100_14 │        22 │       c │ -0.128 │
│    7 │        3 │ d100_15 │        22 │       c │  0.202 │
│  ⋮   │    ⋮     │    ⋮    │     ⋮     │    ⋮    │   ⋮    │
└──────┴──────────┴─────────┴───────────┴─────────┴────────┘
                                           4886 rows omitted
┌──────┬───────┬───────┬───────────┐
│ Bond │ atom1 │ atom2 │ bond_type │
├──────┼───────┼───────┼───────────┤
│    1 │     1 │    12 │         7 │
│    2 │     1 │    24 │         1 │
│    3 │     3 │     4 │         7 │
│    4 │     4 │     5 │         7 │
│    5 │     4 │     9 │         1 │
│    6 │     5 │     6 │         7 │
│    7 │     5 │    10 │         1 │
│  ⋮   │   ⋮   │   ⋮   │     ⋮     │
└──────┴───────┴───────┴───────────┘
                   5236 rows omitted

We are now going to make queries against the mutagenesis data. Generally speaking, Catlab supports evaluating conjunctive queries on attributed \(\mathsf{C}\)-sets. Although more restrictive than arbitrary SQL queries, conjunctive queries play in a central role in the theory and practice of relational databases. They are precisely the queries that can be expressed in regular logic, the subsystem of first-order logic with connectives \(\exists, \wedge, \top\). Alternatively, as we will see later, they are the queries that can be computed as finite limits in \(\mathsf{Set}\).

In Catlab, conjunctive queries are most easily specified using the @relation macro, which has the general form:

@relation (out1=var1, out2=var2, …) [where (x,y,z,…)] begin
  T₁(col1=var3, col2=var4, …)
  ⋮
  Tₙ(col1=var5, col2=var6, …)
end

Every relation statement in the macro body specifies a table \(T_i\) and a subset of its columns, with each column assigned to a variable. To evaluate the query, you take the cartesian product of the selected tables \(T_1, \dots, T_n\), then filter the rows so that any two columns assigned to the same variable are equal. Finally, you return a table that exposes a subset of the variables, given by the output names in the first clause. The full set of variables is optionally stated in the where clause; if it is omitted, the set of variables is inferred.

Conjunctive queries are more intuitive than this description may make them appear. Returning to the mutagenesis data, we might assume that molecular graphs are undirected, so that for every for bond between atoms \(a_1\) and \(a_2\), there is also a bond between atoms \(a_2\) and \(a_1\). We can check this assumption using the following query.

symmetric_pairs = @relation (atom1=a1, atom2=a2) begin
  Bond(atom1=a1, atom2=a2)
  Bond(atom1=a2, atom2=a1)
end

length(query(mutagenesis, symmetric_pairs))
0

In fact, no bond in this dataset has a corresponding bond in the opposite direction. The creators of the data presumably left the oppositely oriented bonds implicit.

A more fundamental constraint is that any two bonded atoms must belong to the same molecule.

bond_within_molecule = @relation (Bond=b) begin
  Bond(_id=b, atom1=a1, atom2=a2)
  Atom(_id=a1, molecule=m)
  Atom(_id=a2, molecule=m)
end

result = query(mutagenesis, bond_within_molecule)
length(result) == nparts(mutagenesis, :Bond)
true

At least this assumption is satisfied by the data! The schema presented earlier has no equations, but we could have added an equation explitictly stating the assumption:

@present MutagenesisSchema(FreeSchema) begin
  [...]
  
  compose(atom1, molecule) == compose(atom2, molecule)
end

By default, the variables in a query may take any values, but variables can be fixed to specific values using the optional third argument to query. For example, the following query returns all carbon atoms belonging to a mutagenic molecule, together with the charges of the atoms.

atoms_in_molecule = @relation (Atom=a, molecule=m, charge=charge) begin
  Atom(_id=a, element=element, charge=charge, molecule=m)
  Molecule(_id=m, mutagenic=mutagenic)
end

result = query(mutagenesis, atoms_in_molecule, (element="c", mutagenic=true))
Table with 3 columns and 1833 rows:
     Atom  molecule  charge
   ┌───────────────────────
 1 │ 27    4         -0.121
 2 │ 28    4         -0.091
 3 │ 29    4         0.009
 4 │ 30    4         0.009
 5 │ 31    4         0.009
 6 │ 32    4         -0.121
 7 │ 33    4         -0.121
 8 │ 34    4         -0.122
 ⋮ │  ⋮       ⋮        ⋮

The mutagenesis data is essentially a collection of graphs, one for each molecule, where the vertices are atoms and the edges are bonds. The conventional way to process such data would be to convert it into a graph, using a specialized data structure for directed graphs. In Catlab, this conversion is easily achieved by functorial data migration in its basic, contravariant form:

using Catlab.Graphs

g = Graph(mutagenesis, Dict(:V => :Atom, :E => :Bond),
                       Dict(:src => :atom1, :tgt => :atom2))
(V = nv(g), E = ne(g))
(V = 4893, E = 5243)

The graph's connected components correspond to the molecules. An advantage of the graph representation is that you can directly apply any graph analytics procedure that expects a graph as input. However, in contrast to other systems, in Catlab there are no further advantages to performing this conversion. Traversing the derived graph will be no faster or slower than traversing the MutagenesisData object, because graphs in Catlab are \(\mathsf{C}\)-sets isomorphic to the atom-bond substructure of the mutagenesis structure. This what we meant at the outset when we said that attributed \(\mathsf{C}\)-sets are a joint generalization of graphs and data frames.

The unification of graphs with relational data cuts in both directions. On the one hand, since graphs in Catlab are \(\mathsf{C}\)-sets, we can efficiently evaluate conjunctive queries against them. As a simple example, here are all the directed paths of length two in the derived graph:

paths2 = @relation (vert1=v1, vert2=v2, vert3=v3) begin
  E(src=v1, tgt=v2)
  E(src=v2, tgt=v3)
end

result = query(g, paths2)
Table with 3 columns and 5539 rows:
     vert1  vert2  vert3
   ┌────────────────────
 1 │ 1      12     20
 2 │ 1      12     25
 3 │ 3      4      5
 4 │ 3      4      9
 5 │ 4      5      6
 6 │ 4      5      10
 7 │ 5      6      7
 8 │ 5      6      11
 ⋮ │   ⋮      ⋮      ⋮

On the other hand, we can formulate richer queries against the mutagenesis data, where more data attributes are available. The following query refines the previous one to find all directed paths from a carbon atom to a nitrogen atom to another carbon atom. The resulting table is much smaller.

paths2 = @relation (atom1=a1, atom2=a2, atom3=a3) begin
  Bond(atom1=a1, atom2=a2)
  Bond(atom1=a2, atom2=a3)
  Atom(_id=a1, element=elem1)
  Atom(_id=a2, element=elem2)
  Atom(_id=a3, element=elem3)
end

result = query(mutagenesis, paths2, (elem1="c", elem2="n", elem3="c"))
Table with 3 columns and 41 rows:
     atom1  atom2  atom3
   ┌────────────────────
 1 │ 163    164    165
 2 │ 169    170    163
 3 │ 620    624    608
 4 │ 768    769    787
 5 │ 786    766    767
 6 │ 1105   1104   1086
 7 │ 1315   1319   1320
 8 │ 1316   1321   1320
 ⋮ │   ⋮      ⋮      ⋮

Conjunctive queries as limits

The theory of relational databases is category theory. This statement is true in many ways and from many perspectives, but we will focus on just one: the equivalence between conjunctive queries on relational databases and finite limits in the category of sets. This equivalence is almost immediate if you are familiar with the relevant concepts; it is certainly not new (Alagic & Alagic 1991). Nevertheless, it is a fruitful observation for both category theory and relational databases. In one direction, conjunctive queries inherit an intuitive diagrammatic syntax through the operad of undirected wiring diagrams. Conversely, any general-purpose package for computational category theory, which Catlab aims to be, ought to be able to efficiently compute finite limits of sets. Algorithms for doing so have been extensively developed by database theorists and engineers.

Here is a dictionary between conjunctive queries of different kinds and finite limits in \(\mathsf{Set}\) of different shapes.



The last statement, about conjunctive queries and bipartite diagrams, requires elaboration. We will break it down into two steps. First, conjunctive queries can be represented by undirected wiring diagrams (UWDs); in Catlab, they actually are UWDs.[3] Second, the UWD translates into a bipartite diagram, whose limit is the result of the query.

As an illustration, recall this conjunctive query on the mutagenesis data.

bond_within_molecule = @relation (Bond=b) begin
  Bond(_id=b, atom1=a1, atom2=a2)
  Atom(_id=a1, molecule=m)
  Atom(_id=a2, molecule=m)
end

bond_within_molecule isa UndirectedWiringDiagram
true

The tables in the query correspond to boxes (unfilled circles) in the UWD and the variables correspond to junctions (filled circles).

We won't say much more about UWDs here. For additional examples of diagrammatic query syntax, see the related earlier post by Andrew Baas on UWDs and SQL queries.

In the second step, the UWD translates into a bipartite diagram in \(\mathsf{Set}\), where the boxes of the UWD become vertices in the diagram's top layer and the junctions become vertices in the bottom layer.

The result \(R\) of the conjunctive query is then the limit of this bipartite diagram, which can be thought of as a "generalized pullback."

In technical terms, isomorphism classes of data frames (multispans) constitute an algebra of the operad of UWDs, where composition is given by limits of the induced bipartite diagrams.

All this is to say what a conjunctive query is but not how to compute it. There are many ways to evaluate a conjunctive query. The simplest, sketched in the previous section, is to take the cartesian product of the sets in the top layer of the bipartite diagram, then filter the tuples, keeping only those that project to equal values under the functions over any given set in the bottom layer. In a special case, this procedure is known to category theorists as "computing a pullback by taking an equalizer of a product." However, it is probably the worst possible algorithm for computing a limit, generally being extremely inefficient.

The inefficiency arises from having to iterate over the elements of a cartesian product \(X_1 \times \cdots \times X_n\), whose cardinality grows exponentially in \(n\). To avoid this, the conjunctive query should be decomposed into a sequence of joins, a problem known as query planning. Catlab currently employs a simple greedy heuristic for choosing joins.

Joins, or equivalently pullbacks, may themselves be computed in many ways, of which the most common are (Mishra & Eich 1992):

  1. Nested loop join: the naive algorithm that iterates over all pairs of elements

  2. Sort-merge join: both vectors of attributes are sorted, then merged together by synchronized iteration

  3. Hash join: a hash map (index of preimages) is built for one attribute, then the other attribute vector is iterated, looking up preimages for each element

All three join algorithms are implemented in Catlab, for two-way and multi-way joins. Both the sort-merge join and the hash join are generally much faster than the nested loop join. The hash join performs especially well when an attribute is already indexed in the attributed \(\mathsf{C}\)-set, as happens for the source and target functions in a graph.

[2] In this post, by "join" we always mean both "inner join" and "equi-join." Catlab does not yet support outer joins or joins with predicates other than equality.
[3] And UWDs are, in turn, \(\mathsf{C}\)-sets for a schema \(\mathsf{C}\) with tables Box, Junction, Port, and OuterPort. This is a circularity but not a vicious one. It does imply that we could evaluate conjunctive queries on conjunctive queries (!), for whatever that is worth.

Outlook

Despite its simplistic query planning, Catlab's conjunctive queries on \(\mathsf{C}\)-sets are already fairly usable. Over time we will implement better query planning algorithms, as well as more general kinds of queries. Possibilities include unions of conjunctive queries, the class of queries expressible in coherent logic, and recursive queries, as in Datalog.

As we have seen, relational databases enjoy two key advantages over isolated data frames. First, relational schemas enable the data's encoding to mirror its underlying relational structure, which eliminates redundancy in storage and promotes internal consistency. Second, declarative query languages make it easy to ask questions about the data and view it through different lenses. On the other hand, SQL databases and logic programming languages like Prolog and Datalog have generally suffered from the Achilles' heel of all domain-specific languages (DSLs): you cannot step outside the DSL without abandoning the system entirely. Yet stepping outside the DSL is often necessary, since even the most expressive query languages have their limitations.

We think that \(\mathsf{C}\)-sets can offer the best of both worlds to data analysis, providing the benefits of the relational data model while existing as ordinary data structures in a general-purpose programming language. If you cannot express your query using the @relation macro, you can always just write the for loops or the map calls! Due to the design of the Julia programming language and the \(\mathsf{C}\)-sets in Catlab, doing so should cause no performance degradations. In this way, we hope that \(\mathsf{C}\)-sets will eventually become a valuable component of the JuliaData ecosystem.

References

Suad Alagic and Mara Alagic, 1991. Joins as pullbacks. DBLP:AlagicA91

Priti Mishra and Margaret Eich, 1992. Join processing in relational databases. DOI:10.1145/128762.128764

Patrick Schultz, David Spivak, Christina Vasilakopoulou, and Ryan Wisnesky, 2016. Algebraic databases. arXiv:1602.03501

David Spivak, 2012. Functorial data migration. DOI:10.1016/j.ic.2012.05.001 arXiv:1009.1166

David Spivak, 2014. Category Theory for the Sciences. Online