Science data structure

About the project

During my work at the university I often had to work with large data-sets. These data-sets were challenging not only because of the sheer size of each variable, but also because of their complexity. About two years ago I sat down and took the time to develop a Python module that helped me organize this data. I had the following “demands”:

  • A tree. The data should be organized in a tree-like fashion, to cope with the same experiments performed on different days.
  • No external viewer. I wanted to edit and view the structure of the data-set with just my standard file-browser or terminal. I did not want to install any third-party apps to do this (as with, for instance, HDF5 and Matlab).
  • Read in parts. I wanted to be able to only read the data I required, not the entire data-set. This helped especially when the sets became very large.
  • Protected. I wanted to make sure a script could not easily overwrite a data-set. Nothing is written to disk until the write function data_set.write() is called explicitly. Even then, if a data-set already exists at the exact same location, a FileExistsError is raised unless explicitly silenced.
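
To make the first three demands concrete, here is a minimal sketch using only the Python standard library. It is not how science-data-structure stores its data: the helpers write_leaf and read_leaf, the JSON file format and the folder names are all assumptions made for illustration. Branches become folders, every variable becomes its own file, and reading one variable opens only that one file.

```python
import json
import tempfile
from pathlib import Path


def write_leaf(root: Path, branch: tuple, name: str, values: list) -> Path:
    """Store one variable as its own file below nested branch folders."""
    folder = root.joinpath(*branch)
    folder.mkdir(parents=True, exist_ok=True)
    leaf_path = folder / f"{name}.json"
    leaf_path.write_text(json.dumps(values))
    return leaf_path


def read_leaf(root: Path, branch: tuple, name: str) -> list:
    """Load a single variable; no other file in the tree is opened."""
    return json.loads(root.joinpath(*branch, f"{name}.json").read_text())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "experiment"  # hypothetical data-set folder

    # The same experiment on two different days becomes two sibling branches.
    write_leaf(root, ("day_1", "run"), "velocity", [0.1, 0.2, 0.3])
    write_leaf(root, ("day_2", "run"), "velocity", [0.4, 0.5, 0.6])

    # Only day_2/run/velocity.json is read here.
    day_2_velocity = read_leaf(root, ("day_2", "run"), "velocity")

    # The layout is plain folders and files, viewable in any file-browser.
    layout = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.json"))
```

Because the structure is nothing more than folders and files, renaming a branch or removing a variable can be done from the terminal without any special tooling.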

At the end of my PhD I had a working library that was easy to work with (at least for me). I then rewrote the code from the ground up to make it more flexible for the future, and added it to GitHub and PyPI.

Ambition

In the near future I want to add:

  • Documentation generation. By this I don’t mean the documentation I need to write in the code (which I should, by the way), but documentation included in the data-set itself. With data-sets from others I often found it hard to tell what a variable contained from its name alone.
  • Versioning. I want to log (on request) in each data-set who the last author was, and also include the ability to store the old data before inserting the new data. This might be handy if you want to check whether you performed the analysis correctly.
  • Inclusion of more data types. At the moment only numpy arrays are supported, but I want to include at least CSV, plain text, pandas and Excel. (If you have any more requests, do send me an email!)

Installing

(This documentation is also included on the GitHub page)

The package is available through pip, so that is the most straightforward way to install:

pip install science-data-structure

Examples

Simple data-set

In this simple example a data-set is created with a single branch, parabola. Two “leafs”, x and y, are added to this branch. At the end of the example the data_set is written to disk.

import science_data_structure.structures as structures
from pathlib import Path
import numpy


# Initialize an empty data-set
data_set = structures.StructuredDataSet(Path("./"), "example", {})

# add data to the data-set
data_set["parabola"]["x"] = numpy.linspace(-2, 2, 100)
data_set["parabola"]["y"] = data_set["parabola"]["x"].data ** 2

# write the data to disk
data_set.write()
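
Since one of the demands was to inspect a data-set without an external viewer, a few lines of pathlib are enough to print what ended up on disk. This sketch assumes the data-set is stored as a folder at ./example.struct (the path used when reading a data-set); the helper tree_lines is hypothetical, and nothing is assumed about the names of the files inside that folder.

```python
from pathlib import Path


def tree_lines(root: Path) -> list:
    """Return one indented line per file or folder below root."""
    lines = [root.name + "/"]
    # Sorted paths list every folder directly before its own contents.
    for path in sorted(root.rglob("*")):
        depth = len(path.relative_to(root).parts)
        suffix = "/" if path.is_dir() else ""
        lines.append("  " * depth + path.name + suffix)
    return lines


data_set_path = Path("./example.struct")  # assumed output location
if data_set_path.is_dir():
    print("\n".join(tree_lines(data_set_path)))
```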

Branch overriding

What will happen when a branch or a leaf is overwritten with another leaf or branch? This example extends the previous example:

data_set["parabola"]["x"] = None

The code above tries to delete the variable x, but it raises a PermissionError. This protection is in place to make sure that data in a data-set is not simply overwritten. The user must explicitly ask to overwrite the branch or leaf. In the case above, a simple solution is:

data_set.overwrite = True
data_set["parabola"]["x"] = None
data_set.overwrite = False

data_set.write(exists_ok=True)

The last protection in place is the exists_ok argument of the data_set.write() function. This makes sure that an existing data-set on disk is not accidentally overwritten.
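
The two layers of protection can be illustrated with a small stand-alone class. This is a toy sketch, not the library's implementation: the class ProtectedStore and its JSON file are made up for illustration, and only the overwrite flag, the exists_ok argument and the two exception types are taken from the examples above.

```python
import json
import tempfile
from pathlib import Path


class ProtectedStore:
    """Toy stand-in for a data-set; not the library's implementation."""

    def __init__(self, path: Path):
        self.path = path
        self.overwrite = False  # first guard: in-memory overwrites
        self._data = {}

    def __setitem__(self, key, value):
        if key in self._data and not self.overwrite:
            raise PermissionError(f"'{key}' exists; set overwrite = True first")
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]

    def write(self, exists_ok: bool = False):
        # Second guard: refuse to replace a file that is already on disk.
        if self.path.exists() and not exists_ok:
            raise FileExistsError(self.path)
        self.path.write_text(json.dumps(self._data))


with tempfile.TemporaryDirectory() as tmp:
    store = ProtectedStore(Path(tmp) / "example.json")
    store["x"] = [1, 2, 3]
    store.write()  # first write succeeds, nothing exists yet

    try:
        store["x"] = None  # blocked by the first guard
        blocked_in_memory = False
    except PermissionError:
        blocked_in_memory = True

    try:
        store.write()  # blocked by the second guard
        blocked_on_disk = False
    except FileExistsError:
        blocked_on_disk = True

    # Both guards have to be lifted explicitly.
    store.overwrite = True
    store["x"] = None
    store.overwrite = False
    store.write(exists_ok=True)
    final_value = store["x"]
```

Keeping the two guards separate means a script has to state its intent twice before existing data on disk is replaced: once for the change in memory and once for the write.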

Reading an existing data-set

Often you want to read a data-set, use it, adapt it, and write the results back to disk. The following script does just that.

import science_data_structure.structures as structures
from pathlib import Path
import numpy


# Read the existing data-set from disk
data_set = structures.StructuredDataSet.read(Path("./example.struct"))

a = 2
b = 4
data_set["linear"]["x"] = numpy.linspace(-2, 2, 100)
data_set["linear"]["y"] = data_set["linear"]["x"].data * a + b

data_set.write(exists_ok=True)

Note that we must again set exists_ok=True, because the data-set already exists on disk; otherwise it cannot be written.

Wouter G. van Veen
Aspiring developer & scientist

I use computational fluid mechanics to research the fundaments of insect flight.
