Efficient storage for Python

Did you ever run into the situation where your analysis or simulation data is too large to fit in memory? Has the flat file format you use for storing your data sets grown so big that everything slows to a crawl? If you answered yes, you may want to give the HDF5 library a spin. HDF5 files are not a replacement for relational databases; they cater to storing complex data objects and a wide variety of metadata, and they are optimized for efficiency of storage and retrieval. The underlying library is written in C. If you are a Python user, PyTables provides a very efficient wrapper for HDF5 files. It gives you access to the full HDF5 API, it is nicely integrated with NumPy, and it offers natural naming conventions. In other words, you can quickly store and retrieve your arrays and matrices in HDF5 files, giving you a very interesting persistence layer (a short sketch of the writing side follows the scan example below). For instance, you can do a simple table scan like this:

import tables

# Open an existing HDF5 file read-only (open_file replaces the
# pre-3.0 openFile spelling).
h5f = tables.open_file("test.h5f", "r")

# Natural naming: h5f.root.Table is the table named "Table" at the
# root. Iterating scans row by row, so the table need not fit in memory.
for r in h5f.root.Table:
    pass

h5f.close()

The code above opens an HDF5 file called “test.h5f” that contains a table named “Table” at the root (note the natural naming here), then scans it using plain Python iteration. You can also do grouped reads, read everything into memory, run filtered where() queries, and the like; a couple of sketches follow. Definitely worth giving it a spin if you have too much data for your available memory or your flat file format. Also, I am wondering how fast this may fly on an SSD…
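
For completeness, here is a minimal sketch of the writing side. The Particle description, its column names, and the row count are my own illustration rather than anything prescribed here; the point is that a table created this way becomes reachable through natural naming as h5f.root.Table, and a NumPy array can be persisted just as directly:

import numpy as np
import tables

# An illustrative table description: one string column, one float column.
class Particle(tables.IsDescription):
    name = tables.StringCol(16)     # fixed-width string
    energy = tables.Float64Col()    # 64-bit float

h5f = tables.open_file("test.h5f", "w")

# Create the table at the root; natural naming exposes it as h5f.root.Table.
table = h5f.create_table(h5f.root, "Table", Particle)

# Fill it row by row through the Row accessor, then flush to disk.
row = table.row
for i in range(1000):
    row["name"] = "p%d" % i
    row["energy"] = float(i)
    row.append()
table.flush()

# A NumPy array can be stored directly, no table description needed.
h5f.create_array(h5f.root, "Matrix", np.random.rand(100, 100))

h5f.close()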
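
And a quick sketch of reading it back in the other ways mentioned above; the energy column belongs to the illustrative table from the previous snippet:

import tables

h5f = tables.open_file("test.h5f", "r")
table = h5f.root.Table

# Read the whole table into memory as a NumPy structured array.
data = table.read()

# In-kernel query: the condition is evaluated inside the library and
# only matching rows are handed back, so it stays cheap on big tables.
hot = [r["name"] for r in table.where("energy > 990")]

h5f.close()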