<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Xavier Llorà &#187; data processing</title>
	<atom:link href="http://www.xavierllora.net/tag/data-processing/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.xavierllora.net</link>
	<description>A notebook on data-intensive computing, genetics-based machine learning &#38; more.</description>
	<lastBuildDate>Sun, 08 Jan 2012 19:39:15 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Efficient storage for Python</title>
		<link>http://www.xavierllora.net/2008/07/01/efficient-storage-for-python/</link>
		<comments>http://www.xavierllora.net/2008/07/01/efficient-storage-for-python/#comments</comments>
		<pubDate>Tue, 01 Jul 2008 21:44:57 +0000</pubDate>
		<dc:creator>Xavier</dc:creator>
				<category><![CDATA[Notes]]></category>
		<category><![CDATA[data processing]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://www.xavierllora.net/?p=254</guid>
		<description><![CDATA[Have you ever found that your analysis or simulation data is too large to fit in memory? Has the flat file format you use for storing your data sets become so big that it slows to a crawl? If you answered yes, you may want to give the HDF5 library a spin. HDF5 [...]
Related posts:<ol>
<li><a href='http://www.xavierllora.net/2009/08/13/easy-reliable-and-flexible-storage-for-python/' rel='bookmark' title='Easy, reliable, and flexible storage for Python'>Easy, reliable, and flexible storage for Python</a></li>
<li><a href='http://www.xavierllora.net/2009/09/29/temporary-storage-for-meandres-distribute-flow-execution/' rel='bookmark' title='Temporary storage for Meandre&#8217;s distributed flow execution'>Temporary storage for Meandre&#8217;s distributed flow execution</a></li>
<li><a href='http://www.xavierllora.net/2008/06/09/crash-course-on-threading-in-python/' rel='bookmark' title='Crash course on threading in Python'>Crash course on threading in Python</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Have you ever found that your analysis or simulation data is too large to fit in memory? Has the flat file format you use for storing your data sets become so big that it slows to a crawl? If you answered yes, you may want to give the <a href="http://www.hdfgroup.com/HDF5/whatishdf5.html">HDF5 library</a> a spin. HDF5 files are not a replacement for relational databases. They are designed for storing complex data objects along with a wide variety of metadata, and the format is optimized for efficient storage and retrieval. The underlying library is written in C.</p>
<p>If you are a Python user, <a href="http://www.pytables.org/">PyTables</a> provides a very efficient wrapper for HDF5 files. It gives you access to all of the HDF5 API, is nicely integrated with <a href="http://numpy.scipy.org/">NumPy</a>, and provides natural naming conventions. In other words, you can quickly store your arrays and matrices in HDF5 files and retrieve them again, which gives you a very interesting persistence layer. For instance, you can do a simple table scan with:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">import</span> tables
&nbsp;
# Open the HDF5 file read-only
h5f = tables.<span style="color: black;">open_file</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;test.h5f&quot;</span>, <span style="color: #483d8b;">&quot;r&quot;</span><span style="color: black;">&#41;</span>
# Scan every row of the table stored at the root
<span style="color: #ff7700;font-weight:bold;">for</span> r <span style="color: #ff7700;font-weight:bold;">in</span> h5f.<span style="color: black;">root</span>.<span style="color: black;">Table</span>:
    <span style="color: #ff7700;font-weight:bold;">pass</span>
h5f.<span style="color: black;">close</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></pre></div></div>

<p>The code above opens an HDF5 file called &#8220;test.h5f&#8221; that contains a table named &#8220;Table&#8221; at the root (note the natural naming: the table is reached as the plain attribute h5f.root.Table). It then scans the table using ordinary Python iteration. You can also do grouped reads, read the whole table into memory, run filtered queries (where clauses), and the like.</p>
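<p>As a rough illustration of the whole-table and filtered reads mentioned above, here is a minimal sketch. It assumes a recent PyTables release (3.x), where the calls are spelled open_file and create_table. The Reading row description, its run and value columns, and the temporary file are made up for this example; only the table name &#8220;Table&#8221; comes from the scan above.</p>

```python
import os
import tempfile

import tables


class Reading(tables.IsDescription):
    """Hypothetical row layout, used only for this illustration."""
    run = tables.Int32Col()
    value = tables.Float64Col()


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.h5f")

    # Write ten rows into a table called "Table" at the root of the file.
    with tables.open_file(path, mode="w") as h5f:
        table = h5f.create_table("/", "Table", Reading)
        row = table.row
        for i in range(10):
            row["run"] = i
            row["value"] = i * 0.5
            row.append()
        table.flush()

    # Read the data back in three different ways.
    with tables.open_file(path, mode="r") as h5f:
        table = h5f.root.Table
        # Whole table in memory, as a NumPy structured array.
        all_rows = table.read()
        # Only the first four rows.
        first_rows = table.read(start=0, stop=4)
        # Filtered read: the condition is evaluated by PyTables row by row.
        selected = [int(r["run"]) for r in table.where("value > 3.0")]

print(len(all_rows), len(first_rows), selected)
```

<p>The where condition is passed as a string, so the filtering happens inside PyTables rather than in a Python loop. That is usually what you want when the table does not fit in memory.</p>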
<p>It is definitely worth a spin if you have too much data for your available memory or for your flat file format. Also, I wonder how fast this would fly on an SSD&#8230;</p>
<p>Related posts:<ol>
<li><a href='http://www.xavierllora.net/2009/08/13/easy-reliable-and-flexible-storage-for-python/' rel='bookmark' title='Easy, reliable, and flexible storage for Python'>Easy, reliable, and flexible storage for Python</a></li>
<li><a href='http://www.xavierllora.net/2009/09/29/temporary-storage-for-meandres-distribute-flow-execution/' rel='bookmark' title='Temporary storage for Meandre&#8217;s distributed flow execution'>Temporary storage for Meandre&#8217;s distributed flow execution</a></li>
<li><a href='http://www.xavierllora.net/2008/06/09/crash-course-on-threading-in-python/' rel='bookmark' title='Crash course on threading in Python'>Crash course on threading in Python</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.xavierllora.net/2008/07/01/efficient-storage-for-python/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SVNKit or analyzing SVN content in Java</title>
		<link>http://www.xavierllora.net/2008/05/20/svnkit-or-analyzing-svn-content-in-java/</link>
		<comments>http://www.xavierllora.net/2008/05/20/svnkit-or-analyzing-svn-content-in-java/#comments</comments>
		<pubDate>Tue, 20 May 2008 14:47:52 +0000</pubDate>
		<dc:creator>Xavier</dc:creator>
				<category><![CDATA[Notes]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[data processing]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[svn]]></category>

		<guid isPermaLink="false">http://www.xavierllora.net/?p=220</guid>
		<description><![CDATA[The other day I was looking for a piece of software that could help me pull data out of an SVN repository so I could do a little analysis on it. Browsing the net, I ran into several tools, but the one that finally caught my eye was SVNKit. The package does a lot of [...]
Related posts:<ol>
<li><a href='http://www.xavierllora.net/2009/02/04/efficient-serialization-for-java-and-beyond/' rel='bookmark' title='Efficient serialization for Java (and beyond)'>Efficient serialization for Java (and beyond)</a></li>
<li><a href='http://www.xavierllora.net/2007/04/14/visualizing-content-from-metadata-stores/' rel='bookmark' title='Visualizing content from metadata stores'>Visualizing content from metadata stores</a></li>
<li><a href='http://www.xavierllora.net/2006/07/07/analyzing-active-interactive-genetic-algorithms-using-visual-analytics/' rel='bookmark' title='Analyzing active interactive genetic algorithms using visual analytics'>Analyzing active interactive genetic algorithms using visual analytics</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>The other day I was looking for a piece of software that could help me pull data out of an SVN repository so I could do a little analysis on it. Browsing the net, I ran into several tools, but the one that finally caught my eye was <a title="SVNKit" href="http://svnkit.com/">SVNKit</a>. The package does a lot of stuff, but if you are only interested in pulling information out, it makes your life quite easy. The code below is a simple example of how you can use it to pull information out of an SVN repository: it recursively lists every file under a repository URL at the HEAD revision.</p>
<pre class="brush: java">
import java.util.LinkedList;
import java.util.List;

import org.tmatesoft.svn.core.ISVNDirEntryHandler;
import org.tmatesoft.svn.core.SVNDirEntry;
import org.tmatesoft.svn.core.SVNException;
import org.tmatesoft.svn.core.SVNNodeKind;
import org.tmatesoft.svn.core.SVNURL;
import org.tmatesoft.svn.core.internal.io.svn.SVNRepositoryFactoryImpl;
import org.tmatesoft.svn.core.wc.SVNClientManager;
import org.tmatesoft.svn.core.wc.SVNLogClient;
import org.tmatesoft.svn.core.wc.SVNRevision;

public class SVNReader {

    public static void main(String[] args) throws SVNException {
        // Collects every file entry reported by the listing below
        final List&lt;SVNDirEntry&gt; files = new LinkedList&lt;SVNDirEntry&gt;();

        // Register the svn:// protocol implementation
        SVNRepositoryFactoryImpl.setup();

        SVNClientManager clientManager = SVNClientManager.newInstance();
        SVNLogClient logClient = clientManager.getLogClient();
        SVNURL svnUrl = SVNURL.parseURIDecoded(
                &quot;svn://some.server.com:3690/path/to/trunk&quot;);

        // List the HEAD revision recursively, without fetching lock information
        logClient.doList(svnUrl, SVNRevision.HEAD, SVNRevision.HEAD, false, true,
                new ISVNDirEntryHandler() {
                    public void handleDirEntry(SVNDirEntry entry) throws SVNException {
                        // Keep files only; directories are skipped
                        if (entry.getKind() == SVNNodeKind.FILE) {
                            files.add(entry);
                        }
                    }
                });

        // Print each collected file entry
        for (SVNDirEntry entry : files) {
            System.out.println(entry);
        }
    }
}
</pre>
<p>Related posts:<ol>
<li><a href='http://www.xavierllora.net/2009/02/04/efficient-serialization-for-java-and-beyond/' rel='bookmark' title='Efficient serialization for Java (and beyond)'>Efficient serialization for Java (and beyond)</a></li>
<li><a href='http://www.xavierllora.net/2007/04/14/visualizing-content-from-metadata-stores/' rel='bookmark' title='Visualizing content from metadata stores'>Visualizing content from metadata stores</a></li>
<li><a href='http://www.xavierllora.net/2006/07/07/analyzing-active-interactive-genetic-algorithms-using-visual-analytics/' rel='bookmark' title='Analyzing active interactive genetic algorithms using visual analytics'>Analyzing active interactive genetic algorithms using visual analytics</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.xavierllora.net/2008/05/20/svnkit-or-analyzing-svn-content-in-java/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

