A categorical approach to scientific data management

By Andrew Baas on November 2, 2020

There seems to be constant conflict over data management in fields of scientific research. While rigorous data management can lead to improvements in the accuracy, reproducibility, and credibility of a scientist's published results, it can also seem irrelevant and difficult to put in place at the beginning of the research process. The research cycle is a dynamic process, and scientists often lack the time to adapt their data management systems to meet the needs of an evolving, experimental workflow.

In response to this apparent conflict between disciplined data storage and research agility, many scientists opt to manage data by hand. While this ad-hoc approach can cause difficulties with data upkeep and analysis, it also gives the scientist the freedom to rapidly iterate on their experiment without being held back by the need to fit data into a specific system.

We believe that this trade-off between rigor and research agility is unnecessary, and that tools can be created which not only remain unobtrusive but also provide a framework for rapid analysis of data within the iteration process. This allows the scientist to make informed decisions based on numerical data during research iteration, and results in a consistent and queryable historical record of the experimentation process.

While there are many excellent tools built specifically to address this problem,[1] we believe that category theory provides a particularly natural answer. Just as a categorical approach has provided a simultaneously intuitive and powerful method for describing Petri nets,[2] we will show that approaching data management from this perspective provides similar benefits.

[1] One such example is DrWatson.jl, a Julia package which provides extensive tooling for managing scientific scripts as well as the data those scripts produce.
[2] See the previous post by Micah Halter and Evan Patterson on compositional modeling with AlgebraicPetri.jl.

The AlgebraicRelations package

As a part of the AlgebraicJulia suite, AlgebraicRelations.jl is a tool which naturally connects a scientific workflow with a data management system (a SQL database), such that the definition of an experiment automatically generates both the database and an interface for querying it.

In order to showcase this process, we will step through it with an example experiment.[3] This experiment will investigate the hypothesis that Young's modulus can be calculated by stretching materials within their region of linear elasticity. This may not exactly be novel science but will serve as a useful example. Our experiment will involve taking several materials with known Young's moduli and stretching them to obtain a stress-strain curve. The stress will be measured directly by a force gauge, and the strain will be measured by a camera observing the movement of markings on the material. The resulting data will then be compared with the known Young's modulus to assess the accuracy of this method.

We begin by importing the necessary files from AlgebraicRelations.

# Necessary imports
using AlgebraicRelations.DB
using AlgebraicRelations.Queries
using AlgebraicRelations.Presentations
using AlgebraicRelations.Interface
[3] As this package is in active development, there are several features (like functorial data migration) which have not yet been implemented, and so will not appear in this example. These will be discussed in the Future Development section of this post.

Generating the primitives

The first step of this process is defining the primitives which will be involved in the experiment. These primitives fall into two groups: types of objects used in the experiment, and processes which convert between those types. For example, one of the first steps of our experiment is the process of calibrating our strain gauge and camera system. We call this process calibrate, and it accepts a Material object (some material with known properties under this experiment) and a ValueFile object (a file with calibration constants for this material). This process produces a Calibration object (a file containing the calibration information).

Each primitive type is associated with a Julia datatype (e.g. storing the folder name of Images as a String). The processes are in turn defined by their sets of input and output types. If a process accepts or returns multiple types of objects, this is shown as a product of the types, using the ⊗ symbol.

We will store the primitives in a structure called a Presentation.[4]

# Initialize Presentation object
present = Presentation()

# Add types to presentation
Material, ForceRange, Images,
ValueFile, Value, Calibration,
Metadata = add_types!(present, [(:Material, String),
                                (:ForceRange, Real),
                                (:Images, String),
                                (:ValueFile, String),
                                (:Value, Real),
                                (:Calibration, String),
                                (:Metadata, String)]);

# Add processes to presentation
calibrate, measure,
edge_detection, calc_strain,
lin_interp = add_processes!(present, [(:calibrate, Material⊗ValueFile, Calibration),
                                      (:measure, Material⊗ForceRange⊗Calibration, ValueFile⊗Images),
                                      (:edge_detection, Images, Images⊗Metadata),
                                      (:calc_strain, Images, ValueFile⊗Metadata),
                                      (:lin_interp, ValueFile⊗ValueFile, Value⊗Value⊗Value)]);

Once we define these primitives, we can visualize their relationship as a bipartite graph.

# Convert to Schema
ExpDB = present_to_schema(present);
sch = draw_schema(present)

The arrows pointing from circles (types of objects) to squares (processes) show an input relationship between the two, while the arrows from squares to circles show an output relationship. This visualization of the workflow primitives suggests possible experiments (represented as paths within this bipartite graph).

We can also create the SQL that generates the database schema for these primitives.

print(generate_schema_sql(ExpDB()))
CREATE TABLE calc_strain (Images1 text, ValueFile2 text, Metadata3 text);
CREATE TABLE calibrate (Material1 text, ValueFile2 text, Calibration3 text);
CREATE TABLE edge_detection (Images1 text, Images2 text, Metadata3 text);
CREATE TABLE measure (Material1 text, ForceRange2 real, Calibration3 text, ValueFile4 text, Images5 text);
CREATE TABLE lin_interp (ValueFile1 text, ValueFile2 text, Value3 real, Value4 real, Value5 real);

As you may notice if you are familiar with SQL, a table is created for each process in the presentation, and the columns of the table are the types of the input and output products of the process. You may also notice that this database has no internal relationships between tables. This lack of enforced relationships works to our benefit: it allows the database to contain data from any workflow built from the defined primitives.[5]

By generating a schema directly from the primitives, we create a database that remains flexible as the experiment matures. While a scientist will likely change the arrangement of processes in their workflow regularly, they will only change the primitives themselves a handful of times over the lifecycle of the research project.[6] This choice of generated schema allows scientists to develop their experiments within the primitives they have defined without the concern of generating a new database each time.

[4] We call this a "presentation" because we are actually defining the generators (types and processes) that present a free symmetric monoidal category (the category of our workflow).
[5] As hierarchical definitions are added, foreign keys may be inserted into a generated database schema in order to enforce relationships within components composed of primitives.
[6] The changing of primitives over the lifetime of a research project falls under the work on data migration, an active area of development in AlgebraicRelations.

Creating specific workflows

Using these previously defined primitives, workflows can be defined in an automatically generated domain-specific language (DSL). This DSL has a function for each process and a datatype for each type in the presentation. The DSL takes advantage of the paradigm that processes are connected through their data products in the same way that functions are connected through variables. Below is our workflow for the proposed experiment.

p = @program present (mat::Material, force::ForceRange, cal_mat::Material, cal_vals::ValueFile) begin
    cal_data = calibrate(cal_mat, cal_vals)
    stresses, strain_ims = measure(mat, force, cal_data)
    edge_ims, _ = edge_detection(strain_ims)
    strains, _ = calc_strain(edge_ims)
    young_slope, young_intercept, young_Rsq = lin_interp(stresses, strains)
    return young_slope, young_intercept, young_Rsq
end;

We can then visualize this workflow to get a sense of its correctness.

wf_viz = draw_workflow(p, labels=true)

It should also be noted that this workflow is sufficiently well defined that it can be directly connected to Julia functions, scheduled, and executed. This will allow for automated data collection from a workflow and insertion into a database (this functionality is currently in development).
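To give a feel for what such an execution might look like, here is a minimal pure-Julia sketch in which each process is bound to a placeholder function. All function bodies and file names below are hypothetical stand-ins for illustration only; this is not AlgebraicRelations functionality.

```julia
# Hypothetical stand-ins for the real analysis code; each process in the
# presentation is bound to a Julia function with the matching signature.
calibrate(cal_mat, cal_vals)  = "exp_1/calibration.csv"
measure(mat, force, cal_data) = ("exp_1/forces.csv", "exp_1/orig_strain_ims/")
edge_detection(ims)           = ("exp_1/edge_ims/", "exp_1/edge_md.json")
calc_strain(ims)              = ("exp_1/strains.csv", "exp_1/strain_md.json")
lin_interp(stresses, strains) = (70.0, 0.1, 0.98)  # slope, intercept, R-squared

# Executing the workflow is then ordinary function composition,
# mirroring the @program block above line by line.
function run_experiment(mat, force, cal_mat, cal_vals)
    cal_data = calibrate(cal_mat, cal_vals)
    stresses, strain_ims = measure(mat, force, cal_data)
    edge_ims, _ = edge_detection(strain_ims)
    strains, _ = calc_strain(edge_ims)
    return lin_interp(stresses, strains)
end

run_experiment("Aluminum", 500.0, "Aluminum", "cal/aluminum_vals.csv")
```

An execution framework would additionally checkpoint each intermediate value into the corresponding database table as it is produced.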

Constructing queries

While we have defined a schema (and soon a method for inserting data into it), all of this is useless if we have no way to answer questions about the data in the database! This is where the @query macro comes in. Using the @relation functionality provided by Catlab.jl, the @query macro allows queries to be generated relationally from the previously defined primitives.[7]

For the sake of having something to query, we have filled a SQLite database with sample experimental data from tests of five different materials.

The @query macro allows you to define a set of relational constraints and then generates a SQL query which enforces these constraints. Let's say, for example, that we want a table which contains all calibration materials, their calibration values, and their corresponding calibration results over all of our experiments. This would be written as

q_cal = @query ExpDB() (material, cal_val, cal_res) where (material, cal_val, cal_res) begin
    calibrate(material, cal_val, cal_res)
end

This can then be converted to a SQL statement as follows:[8]

println(to_sql(q_cal))
SELECT t1.Material1 AS material, t1.ValueFile2 AS cal_val, t1.Calibration3 AS cal_res
FROM calibrate AS t1

We can then execute this query on our database to get the following dataframe which contains the filenames of the relevant calibration files used and created.

5×3 DataFrame
│ Row │ material │ cal_val               │ cal_res               │
│     │ String   │ String                │ String                │
├─────┼──────────┼───────────────────────┼───────────────────────┤
│ 1   │ Aluminum │ cal/aluminum_vals.csv │ exp_1/calibration.csv │
│ 2   │ Aluminum │ cal/aluminum_vals.csv │ exp_2/calibration.csv │
│ 3   │ Aluminum │ cal/aluminum_vals.csv │ exp_3/calibration.csv │
│ 4   │ Aluminum │ cal/aluminum_vals.csv │ exp_4/calibration.csv │
│ 5   │ Aluminum │ cal/aluminum_vals.csv │ exp_5/calibration.csv │

Since we are only querying from a single table, this is a fairly simple query, but we will show below how this looks as the query grows. We are also able to visualize the query as an undirected wiring diagram, as shown below:

q_cal_viz = draw_query(q_cal)

This is a non-traditional way of representing queries, but one which we have found useful in understanding complex queries. In order to explain how these diagrams are useful, we will begin by explaining their components. The large, open circles are the relationships (processes from the presentation), and the small filled circles are variables (objects of the types from the presentation). The wires express equality, and wires without a connection (dangling wires) are the columns in the resulting table.

The utility of these diagrams derives from the fact that each path through the network between dangling wires is a constraint enforced between their respective columns. To interpret this diagram and understand the constraints, we perform wire tracing.

What follows is an example of wire tracing. We take a dangling wire (say, the one connected to material), follow it to the small black dot (the variable itself), and then follow that wire to the calibrate relationship. We can then leave calibrate, following the wire to the cal_res variable, and from that variable to its dangling wire. From this path, we can say

The column which corresponds to material is the material which was used in the calibrate process which produced the file cal_res

While this is a simple query example, we can easily add some complexity to it.[9]

q_imgs = @query ExpDB() (images, strains) where (images, edge_im, strains,
                                                 md1, md2) begin
  edge_detection(images, edge_im, md1)
  calc_strain(edge_im, strains, md2)
end

This query will enforce the edge_detection relationship between the input images and the images resulting from edge detection and will then enforce the calc_strain relationship between the edge images and the resulting calculated strain. Since images and strains are the only values in the signature, the resulting query will return a table of image folders and the strains calculated from them. The SQL, diagram, and results for this query look like this:

SELECT t1.Images1 AS images, t2.ValueFile2 AS strains
FROM edge_detection AS t1, calc_strain AS t2
WHERE t2.Images1=t1.Images2
5×2 DataFrame
│ Row │ images                 │ strains           │
│     │ String                 │ String            │
├─────┼────────────────────────┼───────────────────┤
│ 1   │ exp_1/orig_strain_ims/ │ exp_1/strains.csv │
│ 2   │ exp_2/orig_strain_ims/ │ exp_2/strains.csv │
│ 3   │ exp_3/orig_strain_ims/ │ exp_3/strains.csv │
│ 4   │ exp_4/orig_strain_ims/ │ exp_4/strains.csv │
│ 5   │ exp_5/orig_strain_ims/ │ exp_5/strains.csv │
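To see concretely what the WHERE clause is doing, the query can be mimicked in plain Julia as a filtered product of the two tables. This is a toy illustration, not AlgebraicRelations code, and the intermediate file paths below are hypothetical sample values.

```julia
# Toy illustration of the conjunctive query: each table is a vector of
# named tuples, and the WHERE clause filters the Cartesian product.
edge_detection_rows = [
    (Images1 = "exp_1/orig_strain_ims/", Images2 = "exp_1/edge_ims/", Metadata3 = "exp_1/edge_md.json"),
]
calc_strain_rows = [
    (Images1 = "exp_1/edge_ims/", ValueFile2 = "exp_1/strains.csv", Metadata3 = "exp_1/strain_md.json"),
]

# SELECT t1.Images1 AS images, t2.ValueFile2 AS strains
#   WHERE t2.Images1 = t1.Images2
result = [(images = t1.Images1, strains = t2.ValueFile2)
          for t1 in edge_detection_rows
          for t2 in calc_strain_rows
          if t2.Images1 == t1.Images2]
```

Rows whose Images2 value has no matching Images1 in calc_strain are simply dropped, which is exactly the equality constraint the wire between the two relations expresses.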

Here you may notice certain variables that are connected to only one relation and do not appear in the resulting table (namely md1 and md2). These variables are free to range over all valid values of their columns, and do not enforce any restriction in the query. Still, by following the relations along the single path between the dangling wires, you will read out our previous description of the query.

The column which corresponds to images is the image folder which has been processed by edge_detection to create the edge_im image folder, which was then passed to calc_strain, resulting in strains.

Finally, we will showcase the ability to add user-defined inputs to queries (generated as prepared queries) and to add comparison relationships.

The following query will generate a table containing all materials which had a Young's modulus calculated with an R-squared greater than a given R-squared _rsq (variable names that begin with an underscore are used as arguments in prepared statements).

q_ym = @query ExpDB() (material, youngs_mod) where (material, f_range, cal_res,
                                                    forces, o_imgs, edge_imgs,
                                                    strains, youngs_mod, intercept,
                                                    rsquared, _rsq, md1, md2) begin
  measure(material, f_range, cal_res, forces, o_imgs)

  edge_detection(o_imgs, edge_imgs, md1)
  calc_strain(edge_imgs, strains, md2)

  lin_interp(forces, strains, youngs_mod, intercept, rsquared)

  >=(rsquared, _rsq)
end

We can see how the generated SQL and visualization turn out from this query statement.

SELECT t1.Material1 AS material, t4.Value3 AS youngs_mod
FROM measure AS t1, edge_detection AS t2, calc_strain AS t3, lin_interp AS t4
WHERE t2.Images1=t1.Images5 AND t3.Images1=t2.Images2 AND t4.ValueFile1=t1.ValueFile4 AND t4.ValueFile2=t3.ValueFile2 AND t4.Value5>=$1

We'll evaluate this query first with _rsq = 0.0 to show all contents:

5×2 DataFrame
│ Row │ material   │ youngs_mod │
│     │ String     │ Float64    │
├─────┼────────────┼────────────┤
│ 1   │ Gold       │ 72.32      │
│ 2   │ Flax fiber │ 53.41      │
│ 3   │ Nylon      │ 3.41       │
│ 4   │ Bronze     │ 74.22      │
│ 5   │ Glass      │ 88.76      │

And then with _rsq = 0.95 to show just those with an R-squared greater than 95%:

3×2 DataFrame
│ Row │ material   │ youngs_mod │
│     │ String     │ Float64    │
├─────┼────────────┼────────────┤
│ 1   │ Gold       │ 72.32      │
│ 2   │ Flax fiber │ 53.41      │
│ 3   │ Nylon      │ 3.41       │
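The effect of the bound parameter can be sketched in plain Julia. The R-squared values below are made up, chosen only so that the filter reproduces the two result tables above; they are not the actual database contents.

```julia
# Hypothetical rows pairing each material's fit with an R-squared value.
fits = [
    (material = "Gold",       youngs_mod = 72.32, rsquared = 0.99),
    (material = "Flax fiber", youngs_mod = 53.41, rsquared = 0.97),
    (material = "Nylon",      youngs_mod = 3.41,  rsquared = 0.96),
    (material = "Bronze",     youngs_mod = 74.22, rsquared = 0.90),
    (material = "Glass",      youngs_mod = 88.76, rsquared = 0.85),
]

# The prepared query with $1 bound to rsq keeps only rows passing the cutoff.
run_query(rsq) = [(r.material, r.youngs_mod) for r in fits if r.rsquared >= rsq]

length(run_query(0.0))   # all five materials
length(run_query(0.95))  # only the three best fits
```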

As has been shown, this macro lets scientists define SQL queries in a DSL generated directly from their workflow. This allows scientists to focus on the actual semantic content of their queries without being concerned about the syntax of SQL.

[7] The construction of a query from directed wiring diagrams is inspired by Evan Patterson's work on bicategories of relations (Patterson 2017). The next step of conversion from undirected wiring diagrams was formalized for regular logic in work by Brendan Fong and David Spivak (Fong & Spivak 2018).
[8] These SQL queries are in the form of conjunctive queries. This method was mentioned (and its limitations explained) in David Spivak's work on the operad of wiring diagrams (Spivak 2013).
[9] At this point some may begin seeing similarities between this syntax and that of Datalog. While the @query syntax is not as powerful as Datalog, this does show a strong history of using declarative programming languages for querying databases.

Future development

Throughout this post, we have mentioned several areas of active development in the AlgebraicRelations library. These are summarized below, offering some open questions and possible areas of involvement in the development of this library!

  1. Connect data analysis workflows to an execution framework, which will support automatically checkpointing the resulting data into the database.

  2. Automated functorial data migration for when the experiment requires that the presentation be modified (primitives added/removed/renamed).

  3. Support for hierarchical modeling of both workflows and queries in order to scale to complex scientific experiments. The power of this hierarchical framework was showcased for Petri nets in the last blog post.

Appendix: ACSet-defined schema

As an interesting side note for those curious about the inner workings of AlgebraicRelations, the data structure which allows for smooth transitions between workflows, schemas, and relational queries is the attributed C-Set, described by Owen Lynch in a previous blog post.[10]

For example, the attributed C-Set which stores the schema of the presentation shown above looks like this:

@present ExpSchema <: TheorySQL begin
    # Tables
    calibrate::Ob
    measure::Ob
    edge_detection::Ob
    calc_strain::Ob
    lin_interp::Ob

    # Columns of tables
    calibrate_1_Material1::Attr(calibrate, String)
    calibrate_2_ValueFile2::Attr(calibrate, String)
    calibrate_3_Calibration3::Attr(calibrate, String)

    measure_1_Material1::Attr(measure, String)
    measure_2_ForceRange2::Attr(measure, Real)
    measure_3_Calibration3::Attr(measure, String)
    measure_4_ValueFile4::Attr(measure, String)
    measure_5_Images5::Attr(measure, String)

    edge_detection_1_Images1::Attr(edge_detection, String)
    edge_detection_2_Images2::Attr(edge_detection, String)
    edge_detection_3_Metadata3::Attr(edge_detection, String)

    calc_strain_1_Images1::Attr(calc_strain, String)
    calc_strain_2_ValueFile2::Attr(calc_strain, String)
    calc_strain_3_Metadata3::Attr(calc_strain, String)

    lin_interp_1_ValueFile1::Attr(lin_interp, String)
    lin_interp_2_ValueFile2::Attr(lin_interp, String)
    lin_interp_3_Value3::Attr(lin_interp, Real)
    lin_interp_4_Value4::Attr(lin_interp, Real)
    lin_interp_5_Value5::Attr(lin_interp, Real)
end;

Each process (a morphism in the presentation) defines an object in the ACSet, and each type (an object in the presentation) defines an attribute on each process that references it.

[10] This data structure is the tool which will power the hierarchical modeling and data migration that will be added to AlgebraicRelations. It also provides the structure that is necessary for converting between workflows, databases, and queries.

References

George Datseris et al., 2020. DrWatson: the perfect sidekick for your scientific inquiries. DOI:10.21105/joss.02673

Brendan Fong and David Spivak, 2018. Graphical regular logic. arXiv:1812.05765

Evan Patterson, 2017. Knowledge representation in bicategories of relations. arXiv:1706.00526

David Spivak, 2013. The operad of wiring diagrams: formalizing a graphical language for databases, recursion, and plug-and-play circuits. arXiv:1305.0297