C-sets for data analysis: relational data and conjunctive queries

databases

c-sets

attributed-c-sets

Not just useful for graphs, C-sets are a general-purpose tool for data analysis offering the functionality of an in-memory relational database. In this post, we illustrate Catlab’s new capabilities for querying C-sets and we explain the categorical underpinnings of conjunctive queries.

Author

Evan Patterson

Published

December 28, 2020

So far on the blog we have motivated \mathsf{C}-sets mainly as a unifying abstraction that encompasses graphs, wiring diagrams, Petri nets, and other graph-like objects. But what makes attributed \mathsf{C}-sets so useful is that they are a joint generalization of two essential data structures in computer science: graphs, as we have seen, but also data frames, a mainstay in any modern environment for data analysis.

A data frame represents a table of data as a set of named columns, where different columns may have different data types. Here is an excerpt from a data frame with four columns, provided by the Sitka dataset in the R package MASS.

395×4 DataFrame

370 rows omitted

Row	size	time	tree	treat
	Float64	Int64	Int64	Cat…
1	4.51	152	1	ozone
2	4.98	174	1	ozone
3	5.41	201	1	ozone
4	5.9	227	1	ozone
5	6.15	258	1	ozone
6	4.24	152	2	ozone
7	4.2	174	2	ozone
8	4.68	201	2	ozone
9	4.92	227	2	ozone
10	4.96	258	2	ozone
11	3.98	152	3	ozone
12	4.36	174	3	ozone
13	4.79	201	3	ozone
⋮	⋮	⋮	⋮	⋮
384	4.28	227	77	control
385	4.54	258	77	control
386	4.5	152	78	control
387	4.8	174	78	control
388	5.28	201	78	control
389	5.83	227	78	control
390	6.16	258	78	control
391	2.99	152	79	control
392	3.61	174	79	control
393	4.48	201	79	control
394	4.91	227	79	control
395	5.06	258	79	control

Mathematically, a data frame is a multispan of sets whose apex is a finite set. A multispan, in turn, is a span that may have a number of legs different than two. The Sitka data frame is a multispan with four legs:

The apex is the set [n] := \{1,\dots,n\} tabulating the n = 395 rows of the data frame. The first leg of the multispan, the function \mathrm{size}: [n] \to \mathbb{R}, encodes the first column of the table, a vector in \mathbb{R}^n.

Although data frames are useful in isolation, many datasets have a relational structure that is better expressed by a collection of interlinked tables than by a single table. Arguably that is true even of this simple example. The Sitka dataset is longitudinal data, consisting of repeated measurements over time of the sizes of Sitka spruce trees, some of which were grown in ozone-enriched chambers (ozone) and others not (control). Thus, the data has a two-level structure: there is a set of trees and, for each tree, a set of measurements. The two types of units, trees and measurements, should be recorded in separate tables, where each measurement is linked to the tree that it measures. Graphically, the schema \mathsf{C} is:

Let’s use Catlab to set up the corresponding attributed \mathsf{C}-set. The complete schema, including data attributes, is:

Code

@present SchSitka(FreeSchema) begin
  (Numerical, Time, Treatment)::AttrType

  Tree::Ob
  treat::Attr(Tree, Treatment)
  
  Measurement::Ob
  tree::Hom(Measurement, Tree)
  size::Attr(Measurement, Numerical)
  time::Attr(Measurement, Time)
end

@acset_type SitkaData(SchSitka, index=[:tree])

Notice that the assigned treatment (treat) is properly an attribute of the trees, not of the measurements as the original data frame would suggest. Having defined the schema and corresponding Julia type, we load the data from the data frame (sitka_df) into the attributed \mathsf{C}-set (sitka):

Code

sitka = SitkaData{Float64,Int,String}()

# Add trees and measurements, including data attributes of the latter.
trees = unique!(sort(sitka_df.tree))
@assert trees == 1:last(trees)
add_parts!(sitka, :Tree, length(trees))
add_parts!(sitka, :Measurement, nrow(sitka_df),
           tree=sitka_df.tree, size=sitka_df.size, time=sitka_df.time)

# Assign treatment to each tree, checking for data consistency.
for tree in trees
  measurements = incident(sitka, tree, :tree)
  sitka[tree, :treat] = only(unique(sitka_df.treat[measurements]))
end

import Tables
map(Tables.rowcount, tables(sitka))

(Tree = 79, Measurement = 395)

The resulting tables, excerpts of which are shown below, comprise 395 measurements spread across 79 trees.

Code

pretty_tables(sitka, crop=:both, display_size=(15,80))

┌──────┬───────┐
│ Tree │ treat │
├──────┼───────┤
│    1 │ ozone │
│    2 │ ozone │
│    3 │ ozone │
│    4 │ ozone │
│    5 │ ozone │
│    6 │ ozone │
│    7 │ ozone │
│  ⋮   │   ⋮   │
└──────┴───────┘
 72 rows omitted
┌─────────────┬──────┬──────┬──────┐
│ Measurement │ tree │ size │ time │
├─────────────┼──────┼──────┼──────┤
│           1 │    1 │ 4.51 │  152 │
│           2 │    1 │ 4.98 │  174 │
│           3 │    1 │ 5.41 │  201 │
│           4 │    1 │  5.9 │  227 │
│           5 │    1 │ 6.15 │  258 │
│           6 │    2 │ 4.24 │  152 │
│           7 │    2 │  4.2 │  174 │
│      ⋮      │  ⋮   │  ⋮   │  ⋮   │
└─────────────┴──────┴──────┴──────┘
                    388 rows omitted

The data munging was simple enough, with one qualification. In order to consistently set the treat attribute of the trees, we had to verify that each tree is actually assigned a unique treatment value in the original table. Nothing about the structure of the table guarantees that. This subtlety illustrates a key strength of the relational data model: even in the absence of equational axioms, the schema can enforce logical constraints that a single table cannot, eliminating redundancy and reducing the risk of transcription errors.

Of course, flattened views of relational data are often needed for particular data analyses. These views are easily obtained by making queries against the \mathsf{C}-set. For example, the following query recovers the original data frame.

Code

select_all = @relation (size=size, time=time, tree=t, treat=treat) begin
  Tree(_id=t, treat=treat)
  Measurement(tree=t, size=size, time=time)
end

query(sitka, select_all, table_type=DataFrame) == sitka_df

true

The query joins the Tree and Measurement tables along their respective trees, carrying along the data attributes of both tables.

Readers familiar with relational databases will recognize the foregoing ideas. Relational data is ubiquitous, and relational databases, in the form of SQL databases like MySQL and SQLite, constitute a large industry. The mathematics of relational databases as \mathsf{C}-sets or similar categorical structures is also now well established.¹ Despite this, relational databases play only a minor role in the data analysis environments offered by languages like Python, R, and Julia. A typical analysis might begin with a query against a SQL database, returning a single large table, but will then proceed by ad hoc manipulation of data frames. If a query involving multiple tables (or multiple copies of the same table) is needed, it is carried out manually through a sequence of joins. With relational databases, by contrast, queries are specified declaratively and the database system does the work of translating the query into an efficient algorithm.

Important though they are, data frames are the lowest common denominator of data science tooling. We can see no reason for this state of affairs besides tradition and the tendency of SQL databases to be siloed from other systems. We hope that \mathsf{C}-sets will be a useful tool for data analysis, serving as an in-memory relational database that interoperates smoothly with the JuliaData ecoystem. Indeed, the \mathsf{C}-sets in Catlab do not compete with this ecosystem but rather depend on it: the tables of a \mathsf{C}-set are provided by TypedTables.jl and may be exchanged for any other columnar table compliant with Tables.jl.

In the remainder of this post, we explain how conjunctive queries work and illustrate them on a dataset with more complex relational structure than the Sitka data. We then explain the categorical underpinnings of conjunctive queries, particularly the close connection between joins in relational databases and finite limits in the category of sets.

Conjunctive queries on C-sets

The mutagenesis dataset, available through the Relational Dataset Repository, is a collection of molecular graphs from a study on the mutagenicity of certain chemical compounds. The dataset has a three-level relational structure: there is a set of molecules, each of which has a set of atoms, which may be connected pairwise by chemical bonds.

The complete schema includes data attributes for each of the three types.

Code

@present SchMutagenesis(FreeSchema) begin
  (Indicator, Categorical, Name, Numerical)::AttrType
  
  Molecule::Ob
  molecule_id::Attr(Molecule, Name)
  (ind1, inda)::Attr(Molecule, Indicator)
  (logp, lumo)::Attr(Molecule, Numerical)
  mutagenic::Attr(Molecule, Indicator)
  
  Atom::Ob
  molecule::Hom(Atom, Molecule)
  atom_id::Attr(Atom, Name)
  atom_type::Attr(Atom, Categorical)
  element::Attr(Atom, Name)
  charge::Attr(Atom, Numerical)
  
  Bond::Ob
  (atom1, atom2)::Hom(Bond, Atom)
  bond_type::Attr(Bond, Categorical)
end

@acset_type MutagenesisData(SchMutagenesis,
  index=[:atom1, :atom2, :molecule],
  unique_index=[:atom_id, :molecule_id])

The attributes molecule_id and atom_id do not appear to be meaningful but are carried over from the primary keys in the original SQL database.

The data is loaded into Catlab by making three SELECT * from [table]; queries on the SQL database and mapping the primary keys to integer IDs. We omit this routine code, which should in any case be automated in the future. The result is an ACSet mutagenesis having the following dimensions.

Code

map(Tables.rowcount, tables(mutagenesis))

(Molecule = 188, Atom = 4893, Bond = 5243)

Here are excerpts from each of the tables:

Code

pretty_tables(mutagenesis, crop=:both, display_size=(15,80))

┌──────────┬─────────────┬───────┬───────┬──────┬────────┬───────────┐
│ Molecule │ molecule_id │  ind1 │  inda │ logp │   lumo │ mutagenic │
├──────────┼─────────────┼───────┼───────┼──────┼────────┼───────────┤
│        1 │          d1 │  true │ false │ 4.23 │ -1.246 │      true │
│        2 │         d10 │  true │ false │ 4.62 │ -1.387 │      true │
│        3 │        d100 │ false │ false │ 2.68 │ -1.034 │     false │
│        4 │        d101 │  true │ false │ 6.26 │ -1.598 │      true │
│        5 │        d102 │  true │ false │  2.4 │ -3.172 │      true │
│        6 │        d103 │  true │ false │ 4.69 │ -1.487 │      true │
│        7 │        d104 │  true │ false │ 6.26 │ -1.546 │      true │
│    ⋮     │      ⋮      │   ⋮   │   ⋮   │  ⋮   │   ⋮    │     ⋮     │
└──────────┴─────────────┴───────┴───────┴──────┴────────┴───────────┘
                                                      181 rows omitted
┌──────┬──────────┬─────────┬───────────┬─────────┬────────┐
│ Atom │ molecule │ atom_id │ atom_type │ element │ charge │
├──────┼──────────┼─────────┼───────────┼─────────┼────────┤
│    1 │        3 │  d100_1 │        22 │       c │ -0.128 │
│    2 │        3 │ d100_10 │         3 │       h │  0.132 │
│    3 │        3 │ d100_11 │        29 │       c │  0.002 │
│    4 │        3 │ d100_12 │        22 │       c │ -0.128 │
│    5 │        3 │ d100_13 │        22 │       c │ -0.128 │
│    6 │        3 │ d100_14 │        22 │       c │ -0.128 │
│    7 │        3 │ d100_15 │        22 │       c │  0.202 │
│  ⋮   │    ⋮     │    ⋮    │     ⋮     │    ⋮    │   ⋮    │
└──────┴──────────┴─────────┴───────────┴─────────┴────────┘
                                           4886 rows omitted

┌──────┬───────┬───────┬───────────┐
│ Bond │ atom1 │ atom2 │ bond_type │
├──────┼───────┼───────┼───────────┤
│    1 │     1 │    12 │         7 │
│    2 │     1 │    24 │         1 │
│    3 │     3 │     4 │         7 │
│    4 │     4 │     5 │         7 │
│    5 │     4 │     9 │         1 │
│    6 │     5 │     6 │         7 │
│    7 │     5 │    10 │         1 │
│  ⋮   │   ⋮   │   ⋮   │     ⋮     │
└──────┴───────┴───────┴───────────┘
                   5236 rows omitted

We are now going to make queries against the mutagenesis data. Generally speaking, Catlab supports evaluating conjunctive queries on attributed \mathsf{C}-sets. Although more restrictive than arbitrary SQL queries, conjunctive queries play in a central role in the theory and practice of relational databases. They are precisely the queries that can be expressed in regular logic, the subsystem of first-order logic with connectives \exists, \wedge, \top. Alternatively, as we will see later, they are the queries that can be computed as finite limits in \mathsf{Set}.

In Catlab, conjunctive queries are most easily specified using the @relation macro, which has the general form:

@relation (out1=var1, out2=var2, …) [where (x,y,z,…)] begin
  T₁(col1=var3, col2=var4, …)
  ⋮
  Tₙ(col1=var5, col2=var6, …)
end

Every relation statement in the macro body specifies a table T_i and a subset of its columns, with each column assigned to a variable. To evaluate the query, you take the cartesian product of the selected tables T_1, \dots, T_n, then filter the rows so that any two columns assigned to the same variable are equal. Finally, you return a table that exposes a subset of the variables, given by the output names in the first clause. The full set of variables is optionally stated in the where clause; if it is omitted, the set of variables is inferred.

Conjunctive queries are more intuitive than this description may make them appear. Returning to the mutagenesis data, we might assume that molecular graphs are undirected, so that for every for bond between atoms a_1 and a_2, there is also a bond between atoms a_2 and a_1. We can check this assumption using the following query.

Code

symmetric_pairs = @relation (atom1=a1, atom2=a2) begin
  Bond(atom1=a1, atom2=a2)
  Bond(atom1=a2, atom2=a1)
end

nrow(query(mutagenesis, symmetric_pairs))

In fact, no bond in this dataset has a corresponding bond in the opposite direction. The creators of the data presumably left the oppositely oriented bonds implicit.

A more fundamental constraint is that any two bonded atoms must belong to the same molecule.

Code

bond_within_molecule = @relation (Bond=b) begin
  Bond(_id=b, atom1=a1, atom2=a2)
  Atom(_id=a1, molecule=m)
  Atom(_id=a2, molecule=m)
end

result = query(mutagenesis, bond_within_molecule)
nrow(result) == nparts(mutagenesis, :Bond)

true

At least this assumption is satisfied by the data! The schema presented earlier has no equations, but we could have added an equation explitictly stating the assumption:

@present SchMutagenesis(FreeSchema) begin
  [...]
  
  compose(atom1, molecule) == compose(atom2, molecule)
end

By default, the variables in a query may take any values, but variables can be fixed to specific values using the optional third argument to query. For example, the following query returns all carbon atoms belonging to a mutagenic molecule, together with the charges of the atoms.

Code

atoms_in_molecule = @relation (Atom=a, molecule=m, charge=charge) begin
  Atom(_id=a, element=element, charge=charge, molecule=m)
  Molecule(_id=m, mutagenic=mutagenic)
end

query(mutagenesis, atoms_in_molecule, (element="c", mutagenic=true))

1833×3 DataFrame

1808 rows omitted

Row	Atom	molecule	charge
	Int64	Int64	Float64
1	27	4	-0.121
2	28	4	-0.091
3	29	4	0.009
4	30	4	0.009
5	31	4	0.009
6	32	4	-0.121
7	33	4	-0.121
8	34	4	-0.122
9	35	4	-0.091
10	36	4	-0.091
11	37	4	-0.091
12	38	4	-0.121
13	39	4	-0.091
⋮	⋮	⋮	⋮
1822	4855	188	-0.116
1823	4856	188	-0.086
1824	4857	188	0.014
1825	4858	188	-0.086
1826	4866	188	-0.116
1827	4870	188	0.014
1828	4871	188	-0.086
1829	4872	188	-0.086
1830	4873	188	-0.116
1831	4874	188	-0.116
1832	4875	188	-0.116
1833	4876	188	-0.116

The mutagenesis data is essentially a collection of graphs, one for each molecule, where the vertices are atoms and the edges are bonds. The conventional way to process such data would be to convert it into a graph, using a specialized data structure for directed graphs. In Catlab, this conversion is easily achieved by functorial data migration in its basic, contravariant form:

Code

using Catlab.Graphs

g = Graph(mutagenesis, Dict(:V => :Atom, :E => :Bond),
                       Dict(:src => :atom1, :tgt => :atom2))
(V = nv(g), E = ne(g))

(V = 4893, E = 5243)

The graph’s connected components correspond to the molecules. An advantage of the graph representation is that you can directly apply any graph analytics procedure that expects a graph as input. However, in contrast to other systems, in Catlab there are no further advantages to performing this conversion. Traversing the derived graph will be no faster or slower than traversing the MutagenesisData object, because graphs in Catlab are \mathsf{C}-sets isomorphic to the atom-bond substructure of the mutagenesis structure. This what we meant at the outset when we said that attributed \mathsf{C}-sets are a joint generalization of graphs and data frames.

The unification of graphs with relational data cuts in both directions. On the one hand, since graphs in Catlab are \mathsf{C}-sets, we can efficiently evaluate conjunctive queries against them. As a simple example, here are all the directed paths of length two in the derived graph:

Code

paths2 = @relation (vert1=v1, vert2=v2, vert3=v3) begin
  E(src=v1, tgt=v2)
  E(src=v2, tgt=v3)
end

query(g, paths2)

5539×3 DataFrame

5514 rows omitted

Row	vert1	vert2	vert3
	Int64	Int64	Int64
1	1	12	20
2	1	12	25
3	3	4	5
4	3	4	9
5	4	5	6
6	4	5	10
7	5	6	7
8	5	6	11
9	6	7	8
10	6	7	14
11	7	8	3
12	7	8	13
13	7	14	18
⋮	⋮	⋮	⋮
5528	4888	4889	4890
5529	4888	4891	4892
5530	4889	4890	4877
5531	4889	4890	4878
5532	4890	4877	4881
5533	4890	4877	4886
5534	4890	4878	4879
5535	4891	4892	4884
5536	4891	4892	4893
5537	4892	4893	4885
5538	4892	4893	4887
5539	4893	4887	4888

On the other hand, we can formulate richer queries against the mutagenesis data, where more data attributes are available. The following query refines the previous one to find all directed paths from a carbon atom to a nitrogen atom to another carbon atom. The resulting table is much smaller.

Code

paths2 = @relation (atom1=a1, atom2=a2, atom3=a3) begin
  Bond(atom1=a1, atom2=a2)
  Bond(atom1=a2, atom2=a3)
  Atom(_id=a1, element=elem1)
  Atom(_id=a2, element=elem2)
  Atom(_id=a3, element=elem3)
end

query(mutagenesis, paths2, (elem1="c", elem2="n", elem3="c"))

41×3 DataFrame

16 rows omitted

Row	atom1	atom2	atom3
	Int64	Int64	Int64
1	163	164	165
2	169	170	163
3	620	624	608
4	768	769	787
5	786	766	767
6	1105	1104	1086
7	1315	1319	1320
8	1316	1321	1320
9	1347	1348	1366
10	1365	1345	1346
11	1403	1406	1407
12	1407	1408	1392
13	1407	1408	1412
⋮	⋮	⋮	⋮
30	3113	3102	3103
31	3185	3184	3166
32	3362	3368	3378
33	3377	3366	3357
34	3595	3601	3608
35	3607	3600	3590
36	4029	4019	4020
37	4398	4385	4402
38	4546	4550	4534
39	4837	4838	4839
40	4888	4891	4892
41	4892	4893	4887

Conjunctive queries as limits

The theory of relational databases is category theory. This statement is true in many ways and from many perspectives, but we will focus on just one: the equivalence between conjunctive queries on relational databases and finite limits in the category of sets. This equivalence is almost immediate if you are familiar with the relevant concepts; it is certainly not new (Kato 1983; Alagic and Alagic 1991). Nevertheless, it is a fruitful observation for both category theory and relational databases. In one direction, conjunctive queries inherit an intuitive diagrammatic syntax through the operad of undirected wiring diagrams. Conversely, any general-purpose package for computational category theory, which Catlab aims to be, ought to be able to efficiently compute finite limits of sets. Algorithms for doing so have been extensively developed by database theorists and engineers.

Here is a dictionary between conjunctive queries of different kinds and finite limits in \mathsf{Set} of different shapes.

cross joins of tables X and Y are products
joins² along attributes a_X: X \to A and a_Y: Y \to A are pullbacks
multi-way joins along attributes a_i: X_i \to A, for i=1,\dots,n, are wide pullbacks
general conjunctive queries are limits of finite bipartite diagrams

The last statement, about conjunctive queries and bipartite diagrams, requires elaboration. We will break it down into two steps. First, conjunctive queries can be represented by undirected wiring diagrams (UWDs); in Catlab, they actually are UWDs.³ Second, the UWD translates into a bipartite diagram, whose limit is the result of the query.

As an illustration, recall this conjunctive query on the mutagenesis data.

Code

bond_within_molecule = @relation (Bond=b) begin
  Bond(_id=b, atom1=a1, atom2=a2)
  Atom(_id=a1, molecule=m)
  Atom(_id=a2, molecule=m)
end

bond_within_molecule isa UndirectedWiringDiagram

true

The tables in the query correspond to boxes (unfilled circles) in the UWD and the variables correspond to junctions (filled circles).

We won’t say much more about UWDs here. For additional examples of diagrammatic query syntax, see the related earlier post by Andrew Baas on UWDs and SQL queries.

In the second step, the UWD translates into a bipartite diagram in \mathsf{Set}, where the boxes of the UWD become vertices in the diagram’s top layer and the junctions become vertices in the bottom layer.

The result R of the conjunctive query is then the limit of this bipartite diagram, which can be thought of as a “generalized pullback.”

In technical terms, isomorphism classes of data frames (multispans) constitute an algebra of the operad of UWDs, where composition is given by limits of the induced bipartite diagrams.

All this is to say what a conjunctive query is but not how to compute it. There are many ways to evaluate a conjunctive query. The simplest, sketched in the previous section, is to take the cartesian product of the sets in the top layer of the bipartite diagram, then filter the tuples, keeping only those that project to equal values under the functions over any given set in the bottom layer. In a special case, this procedure is known to category theorists as “computing a pullback by taking an equalizer of a product.” However, it is probably the worst possible algorithm for computing a limit, generally being extremely inefficient.

The inefficiency arises from having to iterate over the elements of a cartesian product X_1 \times \cdots \times X_n, whose cardinality grows exponentially in n. To avoid this, the conjunctive query should be decomposed into a sequence of joins, a problem known as query planning. Catlab currently employs a simple greedy heuristic for choosing joins.

Joins, or equivalently pullbacks, may themselves be computed in many ways, of which the most common are (Mishra and Eich 1992):

Nested loop join: the naive algorithm that iterates over all pairs of elements
Sort-merge join: both vectors of attributes are sorted, then merged together by synchronized iteration
Hash join: a hash map (index of preimages) is built for one attribute, then the other attribute vector is iterated, looking up preimages for each element

All three join algorithms are implemented in Catlab, for two-way and multi-way joins. Both the sort-merge join and the hash join are generally much faster than the nested loop join. The hash join performs especially well when an attribute is already indexed in the attributed \mathsf{C}-set, as happens for the source and target functions in a graph.

Outlook

Despite its simplistic query planning, Catlab’s conjunctive queries on \mathsf{C}-sets are already fairly usable. Over time we will implement better query planning algorithms, as well as more general kinds of queries. Possibilities include unions of conjunctive queries, the class of queries expressible in coherent logic, and recursive queries, as in Datalog.

As we have seen, relational databases enjoy two key advantages over isolated data frames. First, relational schemas enable the data’s encoding to mirror its underlying relational structure, which eliminates redundancy in storage and promotes internal consistency. Second, declarative query languages make it easy to ask questions about the data and view it through different lenses. On the other hand, SQL databases and logic programming languages like Prolog and Datalog have generally suffered from the Achilles’ heel of all domain-specific languages (DSLs): you cannot step outside the DSL without abandoning the system entirely. Yet stepping outside the DSL is often necessary, since even the most expressive query languages have their limitations.

We think that \mathsf{C}-sets can offer the best of both worlds to data analysis, providing the benefits of the relational data model while existing as ordinary data structures in a general-purpose programming language. If you cannot express your query using the @relation macro, you can always just write the for loops or the map calls! Due to the design of the Julia programming language and the \mathsf{C}-sets in Catlab, doing so should cause no performance degradations. In this way, we hope that \mathsf{C}-sets will eventually become a valuable component of the JuliaData ecosystem.

References

Alagic, Suad, and Mara Alagic. 1991. “Joins as Pullbacks.” In Third Workshop on Foundations of Models and Languages for Data and Objects, Aigen, Austria, 23.-27. September 1991, edited by Jutta Göers, Andreas Heuer, and Gunter Saake, 91/3:197–207. Informatik-Berichte Des IfI. Technische Universität Clausthal.

Kato, Akihiko. 1983. “AN ABSTRACT RELATIONAL MODEL AND NATURAL JOIN FUNCTORS .” Bulletin of Informatics and Cybernetics 20 (3/4): 95–106. https://doi.org/10.5109/13349.

Mishra, Priti, and Margaret H. Eich. 1992. “Join Processing in Relational Databases.” ACM Computing Surveys 24 (1): 63–113. https://doi.org/10.1145/128762.128764.

Schultz, Patrick, David I. Spivak, Christina Vasilakopoulou, and Ryan Wisnesky. 2016. “Algebraic Databases.”

Spivak, David I. 2014. Category Theory for the Sciences. The MIT Press. London, England: MIT Press.

Spivak, David I. 2012. “Functorial Data Migration.” Information and Computation 217 (August): 31–51. https://doi.org/10.1016/j.ic.2012.05.001.

Footnotes

See the line of research by David Spivak and collaborators (David I. Spivak 2012; David I. Spivak 2014; Schultz et al. 2016) and also the earlier blog post by Owen Lynch on the mathematics of attributed \mathsf{C}-sets.↩︎
In this post, by “join” we always mean both “inner join” and “equi-join.” Catlab does not yet support outer joins or joins with predicates other than equality.↩︎
And UWDs are, in turn, \mathsf{C}-sets for a schema \mathsf{C} with tables Box, Junction, Port, and OuterPort. This is a circularity but not a vicious one. It does imply that we could evaluate conjunctive queries on conjunctive queries (!), for whatever that is worth.↩︎