<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Xavier Llorà &#187; data-intensive flows</title>
	<atom:link href="http://www.xavierllora.net/tag/data-intensive-flows/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.xavierllora.net</link>
	<description>A notebook about data-intensive computing, genetics-based machine learning, semantic-web technology, cloud computing,  and more.</description>
	<lastBuildDate>Thu, 15 Jul 2010 19:50:19 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>Meandre is going Scala</title>
		<link>http://www.xavierllora.net/2009/12/01/meandre-is-going-scala/</link>
		<comments>http://www.xavierllora.net/2009/12/01/meandre-is-going-scala/#comments</comments>
		<pubDate>Tue, 01 Dec 2009 20:01:47 +0000</pubDate>
		<dc:creator>Xavier</dc:creator>
				<category><![CDATA[Notes]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[clojure]]></category>
		<category><![CDATA[data-intensive flows]]></category>
		<category><![CDATA[erlang]]></category>
		<category><![CDATA[jruby]]></category>
		<category><![CDATA[jython]]></category>
		<category><![CDATA[meandre]]></category>
		<category><![CDATA[scala]]></category>
		<category><![CDATA[ZigZag]]></category>

		<guid isPermaLink="false">http://www.xavierllora.net/?p=637</guid>
		<description><![CDATA[After quite a bit of experimenting with different alternatives, Meandre is moving into Scala. Scala is a general purpose programming language designed to express common programming patterns in a concise, elegant, and type-safe way. This is not a radical process, but a gradual one while I am starting to revisit the infrastructure for the next [...]


Related posts:<ol><li><a href='http://www.xavierllora.net/2010/07/15/meandre-2-0-alpha-preview-scala-mongodb/' rel='bookmark' title='Permanent Link: Meandre 2.0 Alpha Preview = Scala + MongoDB'>Meandre 2.0 Alpha Preview = Scala + MongoDB</a></li>
<li><a href='http://www.xavierllora.net/2008/04/18/meandre-semantic-driven-data-intensive-flow-engine/' rel='bookmark' title='Permanent Link: Meandre: Semantic-Driven Data-Intensive Flow Engine'>Meandre: Semantic-Driven Data-Intensive Flow Engine</a></li>
<li><a href='http://www.xavierllora.net/2010/01/21/fast-rest-api-prototyping-with-crochet-and-scala/' rel='bookmark' title='Permanent Link: Fast REST API prototyping with Crochet and Scala'>Fast REST API prototyping with Crochet and Scala</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>After quite a bit of experimenting with different alternatives, <a href="http://seasr.org/meandre">Meandre</a> is moving to <a href="http://www.scala-lang.org/">Scala</a>. <a href="http://www.scala-lang.org/">Scala</a> is a general purpose programming language designed to express common programming patterns in a concise, elegant, and type-safe way. This is not a radical process, but a gradual one that starts as I revisit the infrastructure for the next major release. <a href="http://www.scala-lang.org/">Scala</a> also generates code for the JVM, which makes mixing and matching with the existing Java code trivial. I started fooling around with Scala back when I began developing <a href="http://seasr.org/meandre">Meandre</a> during the summer of 2007; however, I fell back to Java since that was what most of the people in the group were comfortable with. I was fascinated by <a href="http://www.scala-lang.org/">Scala</a>&#8217;s fusion of object-oriented programming and functional programming. Time went by, and the codebase has grown to the point that I can no longer stand cutting through the weeds of Java when I have to extend the infrastructure or fix bugs, not to mention its verbosity even for writing trivial code.</p>
<p>This summer I decided to go on a quest to get out of the woods. I do not mind relying on the JVM and the large collection of libraries available, but I would also like to get my sanity back. Yes, I tested some of the usual suspects for the JVM (<a href="http://www.jython.org/">Jython</a>, <a href="http://jruby.org/">JRuby</a>, <a href="http://clojure.org/">Clojure</a>, and <a href="http://groovy.codehaus.org/">Groovy</a>), but none was quite what I wanted. For instance, I wrote most of the <a href="http://seasr.org/meandre">Meandre</a> infrastructure services using <a href="http://www.jython.org/">Jython</a> (much more concise than Java), but I was still not quite happy to jump on that boat. <a href="http://clojure.org/">Clojure</a> is also interesting (functional programming), but it would be hard to justify moving the group to it since not everybody may feel comfortable with a predominantly functional language. I also toyed with some not-so-usual ones like <a href="http://www.erlang.org/">Erlang</a> and <a href="http://www.haskell.org/">Haskell</a>, but again, I ended up with no real argument that could justify such a decision.</p>
<p>So, as I started doing back in 2007, I went back to my original idea of using <a href="http://www.scala-lang.org/">Scala</a> and its mix of the object-oriented and functional programming paradigms. To test it seriously, I started developing the distributed execution engine for <a href="http://seasr.org/meandre">Meandre</a> in <a href="http://www.scala-lang.org/">Scala</a> using its Erlang-inspired actors. And, boom, suddenly I found myself spending more time thinking than writing/debugging threaded/networking code <img src='http://www.xavierllora.net/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' /> . Yes, I regret my 2007 decision instead of running with my original intuition, but better late than never. With a seed of the distributed engine working and tested (did I mention that <a href="http://code.google.com/p/scalacheck/">scalacheck</a> and <a href="http://code.google.com/p/specs/">specs</a> are really powerful tools for behavior driven development?), I finally decided to start shifting the Meandre infrastructure development effort from Java to <a href="http://www.scala-lang.org/">Scala</a> (did I mention that <a href="http://www.scala-lang.org/">Scala</a> is Martin Odersky&#8217;s child?). Yes, such a decision has some impact on my colleagues, but I envision that the benefits will eventually outweigh the initial resistance and the steep learning curve. At least, in the last two group meetings nobody jumped out of the window while I presented the key elements of <a href="http://www.scala-lang.org/">Scala</a> and demonstrated how concise and elegant it made the first working seed of the distributed execution engine <img src='http://www.xavierllora.net/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' /> . We even got into discussions about the benefits of using <a href="http://www.scala-lang.org/">Scala</a> if it delivered everything I showed. I am lucky to work with such smart guys.
If you want to take a peek, the <a href="http://dev-tools.seasr.org/fisheye/browse/Snowfield">distributed execution engine (a.k.a. Snowfield) is available at SEASR&#8217;s Fisheye</a>.</p>
<p>Oh, one last thing. Are you using Atlassian&#8217;s Fisheye? Do you want <a href='http://www.xavierllora.net/wp-content/uploads/2009/12/scala.def_.gz'>syntax highlighting for Scala</a>? I tweaked the Java definitions to make it highlight <a href="http://www.scala-lang.org/">Scala</a> code. Remember to drop the <code>scala.def</code> file in the <code>$FISHEYE_HOME/syntax</code> directory and add an entry to the <code>filename.map</code> file to make it highlight anything with the extension <code>.scala</code>.</p>


<p>Related posts:<ol><li><a href='http://www.xavierllora.net/2010/07/15/meandre-2-0-alpha-preview-scala-mongodb/' rel='bookmark' title='Permanent Link: Meandre 2.0 Alpha Preview = Scala + MongoDB'>Meandre 2.0 Alpha Preview = Scala + MongoDB</a></li>
<li><a href='http://www.xavierllora.net/2008/04/18/meandre-semantic-driven-data-intensive-flow-engine/' rel='bookmark' title='Permanent Link: Meandre: Semantic-Driven Data-Intensive Flow Engine'>Meandre: Semantic-Driven Data-Intensive Flow Engine</a></li>
<li><a href='http://www.xavierllora.net/2010/01/21/fast-rest-api-prototyping-with-crochet-and-scala/' rel='bookmark' title='Permanent Link: Fast REST API prototyping with Crochet and Scala'>Fast REST API prototyping with Crochet and Scala</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.xavierllora.net/2009/12/01/meandre-is-going-scala/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Temporary storage for Meandre&#8217;s distributed flow execution</title>
		<link>http://www.xavierllora.net/2009/09/29/temporary-storage-for-meandres-distribute-flow-execution/</link>
		<comments>http://www.xavierllora.net/2009/09/29/temporary-storage-for-meandres-distribute-flow-execution/#comments</comments>
		<pubDate>Tue, 29 Sep 2009 15:14:28 +0000</pubDate>
		<dc:creator>Xavier</dc:creator>
				<category><![CDATA[Notes]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[Data-Intensive Computing]]></category>
		<category><![CDATA[data-intensive flows]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[meandre]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[storage]]></category>
		<category><![CDATA[tokyo cabinet]]></category>

		<guid isPermaLink="false">http://www.xavierllora.net/?p=615</guid>
		<description><![CDATA[Designing the distributed execution of a generic Meandre flow involves several moving pieces. One of those is the temporary storage required by the computing nodes (think of one node as one isolated component of a flow) to keep up with the data generated by a component, and also be able to replicate such [...]


Related posts:<ol><li><a href='http://www.xavierllora.net/2009/08/13/easy-reliable-and-flexible-storage-for-python/' rel='bookmark' title='Permanent Link: Easy, reliable, and flexible storage for Python'>Easy, reliable, and flexible storage for Python</a></li>
<li><a href='http://www.xavierllora.net/2008/05/22/zookeeper-and-orchestrating-distributed-applications/' rel='bookmark' title='Permanent Link: ZooKeeper and distributed applications'>ZooKeeper and distributed applications</a></li>
<li><a href='http://www.xavierllora.net/2008/04/18/meandre-semantic-driven-data-intensive-flow-engine/' rel='bookmark' title='Permanent Link: Meandre: Semantic-Driven Data-Intensive Flow Engine'>Meandre: Semantic-Driven Data-Intensive Flow Engine</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Designing the distributed execution of a generic Meandre flow involves several moving pieces. One of those is the temporary storage required by the computing nodes (think of one node as one isolated component of a flow) to keep up with the data generated by a component, and also to be able to replicate such storage to the node containing the consumer to be fed. Such storage, local to each node, must guarantee at least three basic properties.</p>
<ul>
<li>Transaction ready</li>
<li>Light weight implementation</li>
<li>Efficient write and read to minimize the contention on ports</li>
</ul>
<p>Also, it is important to keep in mind that in a distributed execution scenario, each node needs its own separate, standalone storage system. Thus, it is also important to minimize the overhead of installing and maintaining such a storage subsystem. There are several alternatives available, ranging from traditional relational database systems to home-brewed solutions. Relational database systems provide a distributed, reliable, stable, and well tested environment, but they tend to require quite an involved installation and maintenance. Also, tuning those systems to optimize performance may require quite involved monitoring and tweaking. On the other hand, home-brewed solutions can be optimized for performance by dropping unneeded functionality and focusing on write and read performance. However, such solutions tend to be bug prone and time consuming to build, not to mention that proving transaction correctness can be quite involved.</p>
<p>Fortunately, there is a middle ground where efficient and stable transaction-aware solutions are available. They may not provide SQL interfaces, but they still provide transaction boundaries. Also, since they are oriented to maximize performance, they can provide better throughput and operation latency than having to traverse the SQL stack. Examples of such storage systems can be found in the areas of key-value stores and column stores. Several options were considered while writing these lines, but key-value stores were the ones that best matched the three requirements described above. Several options were informally tested, including solutions like HDF and Berkeley DB; however, the best performer by far, under stress test conditions similar to those expected for the sketched temporary storage subsystem, was <a href="http://1978th.net/tokyocabinet/">Tokyo Cabinet</a>. I already <a href="/2008/06/05/the-next-generation-of-data-bases/">introduced</a> and <a href="/2009/08/13/easy-reliable-and-flexible-storage-for-python/">tested</a> <a href="http://1978th.net/tokyocabinet/">Tokyo Cabinet</a> more than a year ago, but this time I was going to give it a stress test to convince myself that it was what I wanted to use as temporary storage for the distributed flow execution.</p>
<h2>The experiment</h2>
<p>Tokyo Cabinet is a collection of storage utilities including, among other facilities, key-value stores implemented as hash files or B-trees and flexible column stores. To illustrate the performance and throughput you can achieve, I implemented multiple queues on a single casket (the <a href="http://1978th.net/tokyocabinet/">Tokyo Cabinet</a> file containing the data store) using B-trees with duplicated keys. The duplicated keys are the queue names, and the values are the <a href="http://en.wikipedia.org/wiki/Universally_Unique_Identifier">UUID</a>s of the objects being stored. Objects are also stored in the same B-tree by using the <a href="http://en.wikipedia.org/wiki/Universally_Unique_Identifier">UUID</a> as the key, with the value being the payload to store (usually an array of bytes).</p>
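<p>The queue layout described above can be sketched without Tokyo Cabinet at all. The following is only an in-memory illustration, not the code used in the experiment: a <code>TreeMap</code> stands in for the B-tree, and the class name <code>QueueCasketSketch</code>, the helper <code>putDup</code>, and the dequeue logic are assumptions made for the example.</p>

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.TreeMap;
import java.util.UUID;

/** In-memory stand-in (hypothetical) for one B-tree casket that allows duplicated keys. */
public class QueueCasketSketch {
    // One sorted map plays the role of the B-tree; each key may hold several values.
    private final TreeMap<String, Deque<byte[]>> tree = new TreeMap<>();

    // Mimics a "put with duplicates": appends a value under a key, keeping insertion order.
    private void putDup(String key, byte[] value) {
        tree.computeIfAbsent(key, k -> new ArrayDeque<>()).addLast(value);
    }

    /** Enqueue: record the UUID under the queue name, and the payload under the UUID. */
    public String enqueue(String queue, byte[] payload) {
        String uuid = UUID.randomUUID().toString();
        putDup(queue, uuid.getBytes(StandardCharsets.UTF_8));
        putDup(uuid, payload);
        return uuid;
    }

    /** Dequeue (assumed behavior): pop the oldest UUID of the queue, then remove its payload. */
    public byte[] dequeue(String queue) {
        Deque<byte[]> ids = tree.get(queue);
        if (ids == null || ids.isEmpty()) {
            return null; // empty or unknown queue
        }
        String uuid = new String(ids.pollFirst(), StandardCharsets.UTF_8);
        if (ids.isEmpty()) {
            tree.remove(queue); // drop the queue key once it has no entries left
        }
        Deque<byte[]> payloads = tree.remove(uuid);
        return payloads == null ? null : payloads.peekFirst();
    }

    public static void main(String[] args) {
        QueueCasketSketch casket = new QueueCasketSketch();
        casket.enqueue("port-a", "first".getBytes(StandardCharsets.UTF_8));
        casket.enqueue("port-a", "second".getBytes(StandardCharsets.UTF_8));
        casket.enqueue("port-b", "other".getBytes(StandardCharsets.UTF_8));
        // Each queue hands back its own payloads in insertion order.
        System.out.println(new String(casket.dequeue("port-a"), StandardCharsets.UTF_8));
        System.out.println(new String(casket.dequeue("port-b"), StandardCharsets.UTF_8));
    }
}
```

<p>Keeping queue entries and payloads in the same tree means a single transaction covers both writes, which is presumably why one casket is enough for several queues.</p>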
<p>Previously, I have been heavily using the Python bindings to test <a href="http://1978th.net/tokyocabinet/">Tokyo Cabinet</a>, but this time I went down the Java route (since the Meandre infrastructure is written in Java). The Java bindings are basically built around JNI and statically link to the C version of the <a href="http://1978th.net/tokyocabinet/">Tokyo Cabinet</a> library, giving you the best of both worlds. To measure how fast I can write data out of a port into the local storage in transactional mode, I used the following piece of code.</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">	<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000066; font-weight: bold;">void</span> main <span style="color: #009900;">&#40;</span> <span style="color: #003399;">String</span> args <span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
		<span style="color: #000066; font-weight: bold;">int</span> MAX <span style="color: #339933;">=</span> <span style="color: #cc66cc;">10000000</span><span style="color: #339933;">;</span>
		<span style="color: #000066; font-weight: bold;">int</span> inc <span style="color: #339933;">=</span> <span style="color: #cc66cc;">10</span><span style="color: #339933;">;</span>
		<span style="color: #000066; font-weight: bold;">int</span> cnt <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span>
		<span style="color: #000066; font-weight: bold;">float</span> fa <span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #000066; font-weight: bold;">float</span><span style="color: #009900;">&#91;</span><span style="color: #cc66cc;">8</span><span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
		<span style="color: #000066; font-weight: bold;">int</span> reps <span style="color: #339933;">=</span> <span style="color: #cc66cc;">10</span><span style="color: #339933;">;</span>
&nbsp;
		<span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span> <span style="color: #000066; font-weight: bold;">int</span> i<span style="color: #339933;">=</span><span style="color: #cc66cc;">1</span> <span style="color: #339933;">;</span> i<span style="color: #339933;">&lt;=</span>MAX <span style="color: #339933;">;</span> i<span style="color: #339933;">*=</span>inc  <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
			<span style="color: #666666; font-style: italic;">//System.out.println(&quot;Size: &quot;+i);</span>
			<span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span> <span style="color: #000066; font-weight: bold;">int</span> j<span style="color: #339933;">=</span><span style="color: #cc66cc;">0</span> <span style="color: #339933;">;</span> j<span style="color: #339933;">&lt;</span>reps <span style="color: #339933;">;</span> j<span style="color: #339933;">++</span> <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>	
				<span style="color: #666666; font-style: italic;">//System.out.println(&quot;\tRepetition: &quot;+j);</span>
&nbsp;
				<span style="color: #666666; font-style: italic;">// open the database</span>
				BDB bdb <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> BDB<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
				<span style="color: #000000; font-weight: bold;">if</span><span style="color: #009900;">&#40;</span><span style="color: #339933;">!</span>bdb.<span style="color: #006633;">open</span><span style="color: #009900;">&#40;</span>TEST_CASKET_TCB, BDB.<span style="color: #006633;">OWRITER</span> <span style="color: #339933;">|</span> BDB.<span style="color: #006633;">OCREAT</span> <span style="color: #339933;">|</span> BDB.<span style="color: #006633;">OTSYNC</span> <span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#123;</span>
					<span style="color: #000066; font-weight: bold;">int</span> ecode <span style="color: #339933;">=</span> bdb.<span style="color: #006633;">ecode</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
					fail<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;open error: &quot;</span> <span style="color: #339933;">+</span> bdb.<span style="color: #006633;">errmsg</span><span style="color: #009900;">&#40;</span>ecode<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
				<span style="color: #009900;">&#125;</span>
&nbsp;
				<span style="color: #666666; font-style: italic;">// Add a bunch of duplicates</span>
				<span style="color: #000066; font-weight: bold;">long</span> start <span style="color: #339933;">=</span> <span style="color: #003399;">System</span>.<span style="color: #006633;">currentTimeMillis</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
				bdb.<span style="color: #006633;">tranbegin</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
				<span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span> <span style="color: #000066; font-weight: bold;">int</span> k<span style="color: #339933;">=</span><span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span> k<span style="color: #339933;">&lt;</span>i<span style="color: #339933;">;</span> k<span style="color: #339933;">++</span> <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
					<span style="color: #003399;">String</span> uuid <span style="color: #339933;">=</span> UUID.<span style="color: #006633;">randomUUID</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>.<span style="color: #006633;">toString</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
					bdb.<span style="color: #006633;">putdup</span><span style="color: #009900;">&#40;</span>QUEUE_KEY, uuid<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
					bdb.<span style="color: #006633;">putdup</span><span style="color: #009900;">&#40;</span>uuid.<span style="color: #006633;">getBytes</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>, uuid.<span style="color: #006633;">getBytes</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>	
				<span style="color: #009900;">&#125;</span>
				bdb.<span style="color: #006633;">trancommit</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
				fa<span style="color: #009900;">&#91;</span>cnt<span style="color: #009900;">&#93;</span> <span style="color: #339933;">+=</span> <span style="color: #003399;">System</span>.<span style="color: #006633;">currentTimeMillis</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">-</span>start<span style="color: #339933;">;</span>
&nbsp;
				<span style="color: #666666; font-style: italic;">// Clean up</span>
				bdb.<span style="color: #006633;">close</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
				<span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">File</span><span style="color: #009900;">&#40;</span>TEST_CASKET_TCB<span style="color: #009900;">&#41;</span>.<span style="color: #006633;">delete</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
			<span style="color: #009900;">&#125;</span>
			fa<span style="color: #009900;">&#91;</span>cnt<span style="color: #009900;">&#93;</span> <span style="color: #339933;">/=</span> reps<span style="color: #339933;">;</span>
			<span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;&quot;</span><span style="color: #339933;">+</span>i<span style="color: #339933;">+</span><span style="color: #0000ff;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span><span style="color: #339933;">+</span>fa<span style="color: #009900;">&#91;</span>cnt<span style="color: #009900;">&#93;</span><span style="color: #339933;">+</span><span style="color: #0000ff;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span><span style="color: #339933;">+</span><span style="color: #009900;">&#40;</span>fa<span style="color: #009900;">&#91;</span>cnt<span style="color: #009900;">&#93;</span><span style="color: #339933;">/</span>i<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
			cnt<span style="color: #339933;">++;</span>
		<span style="color: #009900;">&#125;</span>
	<span style="color: #009900;">&#125;</span></pre></div></div>

<p>The idea is very simple. Just go and start storing 1, 10, 100, 1,000, 10,000, 100,000, 1,000,000, and 10,000,000 pieces of data at once in a transaction. Measure the time. For each transaction size, repeat the operation 10 times and average the time, to mitigate the fact that the experiment was run on a laptop running all sorts of other concurrent applications. Plot the results to illustrate:</p>
<ol>
<li>time required to insert one piece of data as a function of the number of pieces of data involved in the transaction</li>
<li>number of pieces of data written per second as a function of the number of pieces of data involved in the transaction</li>
</ol>
<p>The idea is to expose the behavior of <a href="http://1978th.net/tokyocabinet/">Tokyo Cabinet</a> as more data is involved in a transaction, to check whether performance degrades as the volume increases. This is an important issue, since data-intensive flows can generate large volumes of data per firing event.</p>
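<p>Both plotted quantities follow directly from the averaged transaction time. The sketch below shows the arithmetic only; the class name <code>ThroughputSketch</code> and the timing used in <code>main</code> are made up for illustration and are not measurements from the experiment.</p>

```java
/** Hypothetical helper showing how the two plotted metrics derive from one timing. */
public class ThroughputSketch {
    /** Average milliseconds spent per item, given the average time of a whole transaction. */
    public static double millisPerItem(double avgTransactionMillis, long items) {
        return avgTransactionMillis / items;
    }

    /** Items written per second for a transaction of the given size. */
    public static double itemsPerSecond(double avgTransactionMillis, long items) {
        return items * 1000.0 / avgTransactionMillis;
    }

    public static void main(String[] args) {
        // Illustrative numbers only: 10,000 items committed in 250 ms on average.
        double avgMillis = 250.0;
        long items = 10_000;
        System.out.println(millisPerItem(avgMillis, items));  // prints 0.025
        System.out.println(itemsPerSecond(avgMillis, items)); // prints 40000.0
    }
}
```

<p>Note that the two metrics are reciprocal views of the same measurement, so a flat time-per-item curve implies a flat throughput curve.</p>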
<h2>The results</h2>
<p>Results are displayed on the figures below.</p>
<p><a href="http://www.xavierllora.net/wp-content/uploads/2009/09/tc_time.png"><img src="http://www.xavierllora.net/wp-content/uploads/2009/09/tc_time-400x400.png" alt="Time per data unit as a function of number of data involve in a transaction" title="Time per data unit as a function of number of data involve in a transaction" width="300" height="300" /></a><a href="http://www.xavierllora.net/wp-content/uploads/2009/09/tc_throughput.png"><img src="http://www.xavierllora.net/wp-content/uploads/2009/09/tc_throughput-400x400.png" alt="Throughput as a function of number of data in a transaction" title="Throughput as a function of number of data in a transaction" width="300" height="300" /></a></p>
<p>The first important element to highlight is that the time to insert one data element does not degrade as the volume increases. Actually, it is quite interesting that <a href="http://1978th.net/tokyocabinet/">Tokyo Cabinet</a> feels more comfortable as the volume per transaction grows. The throughput results are also interesting, since they show that it is able to sustain transfers of around 40K data units per second, and that the only bottleneck is the disk cache management and the bandwidth to the disk itself, which gets saturated after pushing more than 10K pieces of data.</p>
<h2>The lessons learned</h2>
<p><a href="http://1978th.net/tokyocabinet/">Tokyo Cabinet</a> is an excellent candidate to support the temporary transactional storage required in a distributed execution of a Meandre flow. Other alternatives like <a href="http://www.mysql.com/">MySQL</a>, embedded <a href="http://db.apache.org/derby/">Apache Derby</a>, the <a href="http://www.oracle.com/database/berkeley-db/je/index.html">Java edition of Berkeley DB</a>, and <a href="http://www.zentus.com/sqlitejdbc/">SQLite JDBC</a> could not even get close to such performance, falling at least one order of magnitude behind.</p>


<p>Related posts:<ol><li><a href='http://www.xavierllora.net/2009/08/13/easy-reliable-and-flexible-storage-for-python/' rel='bookmark' title='Permanent Link: Easy, reliable, and flexible storage for Python'>Easy, reliable, and flexible storage for Python</a></li>
<li><a href='http://www.xavierllora.net/2008/05/22/zookeeper-and-orchestrating-distributed-applications/' rel='bookmark' title='Permanent Link: ZooKeeper and distributed applications'>ZooKeeper and distributed applications</a></li>
<li><a href='http://www.xavierllora.net/2008/04/18/meandre-semantic-driven-data-intensive-flow-engine/' rel='bookmark' title='Permanent Link: Meandre: Semantic-Driven Data-Intensive Flow Engine'>Meandre: Semantic-Driven Data-Intensive Flow Engine</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.xavierllora.net/2009/09/29/temporary-storage-for-meandres-distribute-flow-execution/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Large Scale Data Mining using Genetics-Based Machine Learning</title>
		<link>http://www.xavierllora.net/2009/07/15/large-scale-data-mining-using-genetics-based-machine-learning/</link>
		<comments>http://www.xavierllora.net/2009/07/15/large-scale-data-mining-using-genetics-based-machine-learning/#comments</comments>
		<pubDate>Wed, 15 Jul 2009 21:56:17 +0000</pubDate>
		<dc:creator>Xavier</dc:creator>
				<category><![CDATA[Data-Intensive Computing]]></category>
		<category><![CDATA[GBML & LCS]]></category>
		<category><![CDATA[Learning Classifier Systems]]></category>
		<category><![CDATA[Presentations]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[data-intensive flows]]></category>
		<category><![CDATA[genetics-based machine learning]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[LCS]]></category>
		<category><![CDATA[map-reduce]]></category>

		<guid isPermaLink="false">http://www.xavierllora.net/?p=568</guid>
		<description><![CDATA[Below you may find the slides of the GECCO 2009 tutorial that Jaume Bacardit and I put together. Hope you enjoy it. Slides Abstract We are living in the peta-byte era. We have larger and larger data to analyze, process and transform into useful answers for the domain experts. Robust data mining tools, able to cope [...]


Related posts:<ol><li><a href='http://www.xavierllora.net/2006/12/13/observer-invariant-histopathology-using-genetics-based-machine-learning/' rel='bookmark' title='Permanent Link: Observer-Invariant Histopathology using Genetics-Based Machine Learning'>Observer-Invariant Histopathology using Genetics-Based Machine Learning</a></li>
<li><a href='http://www.xavierllora.net/2009/04/07/deadline-extended-for-special-issue-on-metaheuristics-for-large-scale-data-mining/' rel='bookmark' title='Permanent Link: Deadline extended for special issue on Metaheuristics for Large Scale Data Mining'>Deadline extended for special issue on Metaheuristics for Large Scale Data Mining</a></li>
<li><a href='http://www.xavierllora.net/2008/03/26/bdcsg2008-algorithmic-perspectives-on-large-scale-social-network-data-jon-kleinberg/' rel='bookmark' title='Permanent Link: [BDCSG2008] Algorithmic Perspectives on Large-Scale Social Network Data (Jon Kleinberg)'>[BDCSG2008] Algorithmic Perspectives on Large-Scale Social Network Data (Jon Kleinberg)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Below you may find the slides of the <a href="http://www.sigevo.org/gecco-2009/tutorials.html#lsdm">GECCO 2009 tutorial</a> that <a href="http://www.cs.nott.ac.uk/~jqb/">Jaume Bacardit</a> and I put together. Hope you enjoy it.</p>
<p><strong>Slides</strong></p>
<object width="425&type=s" height="348"><param name="movie" value="http://static.slideshare.net/swf/ssplayer2.swf?doc=gecco2009largegbmltutorial-090715163244-phpapp01"/><param name="allowFullScreen" value="true"/><param name="allowScriptAccess" value="always"/><embed src="http://static.slideshare.net/swf/ssplayer2.swf?doc=gecco2009largegbmltutorial-090715163244-phpapp01"  type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425&type=s" height="348"></embed></object>
<p><strong>Abstract</strong></p>
<p>We are living in the peta-byte era. We have larger and larger data to analyze, process and transform into useful answers for the domain experts. Robust data mining tools, able to cope with petascale volumes and/or high dimensionality while producing human-understandable solutions, are key in several domain areas. Genetics-based machine learning (GBML) techniques are perfect candidates for this task, among others, due to the recent advances in representations, learning paradigms, and theoretical modeling. If evolutionary learning techniques aspire to be a relevant player in this context, they need to have the capacity to process these vast amounts of data, and they need to process this data within reasonable time. Moreover, massive computation cycles are getting cheaper and cheaper every day, allowing researchers to have access to unprecedented degrees of parallelization. Several topics are interlaced in these two requirements; a few of them are: (1) having the proper learning paradigms and knowledge representations, (2) understanding them and knowing when they are suitable for the problem at hand, (3) using efficiency enhancement techniques, and (4) transforming and visualizing the produced solutions to give back as much insight as possible to the domain experts.</p>
<p>This tutorial will try to answer this question, following a roadmap that starts with the questions of what large means, and why large is a challenge for GBML methods. Afterwards, we will discuss different facets in which we can overcome this challenge: Efficiency enhancement techniques, representations able to cope with large dimensionality spaces, scalability of learning paradigms. We will also review a topic interlaced with all of them: how can we model the scalability of the components of our GBML systems to better engineer them to get the best performance out of them for large datasets. The roadmap continues with examples of real applications of GBML systems and finishes with an analysis of further directions.</p>


<p>Related posts:<ol><li><a href='http://www.xavierllora.net/2006/12/13/observer-invariant-histopathology-using-genetics-based-machine-learning/' rel='bookmark' title='Permanent Link: Observer-Invariant Histopathology using Genetics-Based Machine Learning'>Observer-Invariant Histopathology using Genetics-Based Machine Learning</a></li>
<li><a href='http://www.xavierllora.net/2009/04/07/deadline-extended-for-special-issue-on-metaheuristics-for-large-scale-data-mining/' rel='bookmark' title='Permanent Link: Deadline extended for special issue on Metaheuristics for Large Scale Data Mining'>Deadline extended for special issue on Metaheuristics for Large Scale Data Mining</a></li>
<li><a href='http://www.xavierllora.net/2008/03/26/bdcsg2008-algorithmic-perspectives-on-large-scale-social-network-data-jon-kleinberg/' rel='bookmark' title='Permanent Link: [BDCSG2008] Algorithmic Perspectives on Large-Scale Social Network Data (Jon Kleinberg)'>[BDCSG2008] Algorithmic Perspectives on Large-Scale Social Network Data (Jon Kleinberg)</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.xavierllora.net/2009/07/15/large-scale-data-mining-using-genetics-based-machine-learning/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Efficient serialization for Java (and beyond)</title>
		<link>http://www.xavierllora.net/2009/02/04/efficient-serialization-for-java-and-beyond/</link>
		<comments>http://www.xavierllora.net/2009/02/04/efficient-serialization-for-java-and-beyond/#comments</comments>
		<pubDate>Wed, 04 Feb 2009 15:10:38 +0000</pubDate>
		<dc:creator>Xavier</dc:creator>
				<category><![CDATA[Data-Intensive Computing]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[c]]></category>
		<category><![CDATA[data-intensive flows]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[meandre]]></category>
		<category><![CDATA[protocol buffers]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[serialization]]></category>
		<category><![CDATA[xstream]]></category>

		<guid isPermaLink="false">http://www.xavierllora.net/?p=431</guid>
		<description><![CDATA[I am currently working on the distributed execution of flows as part of the Meandre infrastructure&#8212;as a part of the SEASR project. One of the pieces to explore is how to push data between machines. No, I am not going to talk about network protocols and the like here, but how you can pass the [...]


Related posts:<ol><li><a href='http://www.xavierllora.net/2008/05/20/svnkit-or-analyzing-svn-content-in-java/' rel='bookmark' title='Permanent Link: SVNKit or analyzing SVN content in Java'>SVNKit or analyzing SVN content in Java</a></li>
<li><a href='http://www.xavierllora.net/2009/09/29/temporary-storage-for-meandres-distribute-flow-execution/' rel='bookmark' title='Permanent Link: Temporary storage for Meandre&#8217;s distributed flow execution'>Temporary storage for Meandre&#8217;s distributed flow execution</a></li>
<li><a href='http://www.xavierllora.net/2006/10/19/r-and-java/' rel='bookmark' title='Permanent Link: R and Java'>R and Java</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>I am currently working on the distributed execution of flows as part of the <a href="http://seasr.org/meandre/">Meandre infrastructure</a>&#8212;as a part of the <a href="http://seasr.org">SEASR project</a>. One of the pieces to explore is how to push data between machines. No, I am not going to talk about network protocols and the like here, but about how you can pass the data around. If you have ever programmed <a href="http://www-unix.mcs.anl.gov/mpi/">MPI</a> using C/C++, you remember the tedious effort required to pass complex data structures between processes. Serialization is a way to turn those complex structures into a form that can be easily stored or transmitted, and later retrieved or received to regenerate the original data structure. Some languages/platforms support this functionality (e.g. Java, Python), making it easy to use the serialized representation for persistence or transmission purposes.</p>
<p>Last Thursday I was talking to <a href="http://vermaabhishekp.googlepages.com/">Abhishek Verma</a>, and he pointed out <a title="Protocol buffers" href="http://code.google.com/p/protobuf/">Google&#8217;s Protocol Buffers</a> project&#8212;Google&#8217;s take on data interchange formats. Not a new idea&#8212;for instance <a href="http://en.wikipedia.org/wiki/Interface_definition_language">Corba&#8217;s IDL</a> has been around for a long time&#8212;but what caught my eye was their claims about: (1) efficiency, and (2) multiple language bindings. I was contemplating using <a href="http://xstream.codehaus.org/">XStream</a> for Meandre&#8217;s distributed flow execution needs, but the heavy weight of XML made me quite reluctant to walk down that path. The Java native serialization is not a bad choice in terms of efficiency, but it provides neither friendly mechanics for modifying data formats without rendering already serialized objects useless, nor a transparent mechanism to allow bindings for other languages/platforms. So <a title="Protocol buffers" href="http://code.google.com/p/protobuf/">Google&#8217;s Protocol Buffers</a> seemed an option worth trying. So there I went, and I prepared a simple comparison between the three: (1) Java serialization, (2) <a title="Protocol buffers" href="http://code.google.com/p/protobuf/">Google&#8217;s Protocol Buffers</a>, and (3) <a href="http://xstream.codehaus.org/">XStream</a>. Yes, you may guess the outcome, but I was more interested in getting my hands dirty, seeing how <a title="Protocol buffers" href="http://code.google.com/p/protobuf/">Google&#8217;s Protocol Buffers</a> performs, and how much overhead for the developer it requires.</p>
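<p>For readers who have not used it, the idea can be sketched with the Java native mechanism. This is only an illustration: <code>RoundTripSketch</code> and its two helper methods are hypothetical names, not part of Meandre, and the byte array stands in for whatever file or socket would carry the data.</p>

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Arrays;

class RoundTripSketch {

    // Flattens an object graph into a byte array that can be stored or sent.
    static byte[] serialize(Serializable obj) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(baos)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return baos.toByteArray();
    }

    // Rebuilds an equivalent object graph from the bytes.
    static Object deserialize(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] original = {"Dummy text", "More text"};
        byte[] wire = serialize(original);            // what would travel between machines
        String[] copy = (String[]) deserialize(wire); // what the receiver reconstructs
        System.out.println(wire.length + " bytes, equal after round trip: "
                + Arrays.equals(original, copy));
    }
}
```

<p>The receiver gets a new object with the same content, not the same instance, which is all that is needed to move data between processes.</p>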
<h2>The experiment</h2>
<p>Before getting into the description: this experiment does not try to be an exhaustive performance evaluation, just an afternoon diversion. Having said so, the experiment measured the serialization/deserialization time and the space used for a simple data structure containing just one array of integers and one array of strings. All the integers were initialized to zero, and the strings to <em>&#8220;Dummy text&#8221;</em>. To measure how the cost of serializing this simple object grows, the number of integers and strings was increased incrementally. The code below shows the data structure used.</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">package</span> <span style="color: #006699;">org.meandre.tools.serialization.xstream</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">class</span> TargetObject <span style="color: #009900;">&#123;</span>
&nbsp;
       <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #003399;">String</span> <span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> sa<span style="color: #339933;">;</span>
       <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000066; font-weight: bold;">int</span> <span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> ia<span style="color: #339933;">;</span>
&nbsp;
       <span style="color: #000000; font-weight: bold;">public</span> TargetObject <span style="color: #009900;">&#40;</span> <span style="color: #000066; font-weight: bold;">int</span> iStringElements, <span style="color: #000066; font-weight: bold;">int</span> iIntegerElements <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
             sa <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">String</span><span style="color: #009900;">&#91;</span>iStringElements<span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
             <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span> <span style="color: #000066; font-weight: bold;">int</span> i<span style="color: #339933;">=</span><span style="color: #cc66cc;">0</span> <span style="color: #339933;">;</span> i<span style="color: #339933;">&lt;</span>iStringElements <span style="color: #339933;">;</span> i<span style="color: #339933;">++</span> <span style="color: #009900;">&#41;</span>
                  sa<span style="color: #009900;">&#91;</span>i<span style="color: #009900;">&#93;</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">&quot;Dummy text&quot;</span><span style="color: #339933;">;</span>
             ia <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #000066; font-weight: bold;">int</span><span style="color: #009900;">&#91;</span>iIntegerElements<span style="color: #009900;">&#93;</span><span style="color: #339933;">;</span>
       <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>The experiment consisted of generating objects like the above containing from 100 to 10,000 elements in increments of 100. Each object was serialized 50 times, measuring the average serialization time and the space required (in bytes) per object generated. Below is the sample code I used to measure native Java serialization/deserialization times.</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">package</span> <span style="color: #006699;">org.meandre.tools.serialization.java</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.io.ByteArrayInputStream</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.io.ByteArrayOutputStream</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.io.IOException</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.io.ObjectInputStream</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.io.ObjectOutputStream</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.junit.Test</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">class</span> JavaSerializationTest <span style="color: #009900;">&#123;</span>
&nbsp;
       @Test
       <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000066; font-weight: bold;">void</span> testJavaSerialization <span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> 
       <span style="color: #000000; font-weight: bold;">throws</span> <span style="color: #003399;">IOException</span> <span style="color: #009900;">&#123;</span>
             <span style="color: #000000; font-weight: bold;">final</span> <span style="color: #000066; font-weight: bold;">int</span> MAX_SIZE <span style="color: #339933;">=</span> <span style="color: #cc66cc;">10000</span><span style="color: #339933;">;</span>
             <span style="color: #000000; font-weight: bold;">final</span> <span style="color: #000066; font-weight: bold;">int</span> REP <span style="color: #339933;">=</span> <span style="color: #cc66cc;">50</span><span style="color: #339933;">;</span>
             <span style="color: #000000; font-weight: bold;">final</span> <span style="color: #000066; font-weight: bold;">int</span> INC <span style="color: #339933;">=</span> <span style="color: #cc66cc;">100</span><span style="color: #339933;">;</span>
&nbsp;
             <span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;Java serialization times&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
             <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span> <span style="color: #000066; font-weight: bold;">int</span> i<span style="color: #339933;">=</span>INC <span style="color: #339933;">;</span> i<span style="color: #339933;">&lt;=</span>MAX_SIZE <span style="color: #339933;">;</span> i<span style="color: #339933;">+=</span>INC <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
                  TargetObjectSerializable tos <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> TargetObjectSerializable<span style="color: #009900;">&#40;</span>i,i<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
                  <span style="color: #000066; font-weight: bold;">long</span> lAccTime <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span>
                  <span style="color: #000066; font-weight: bold;">long</span> lSize <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span>
                  <span style="color: #000066; font-weight: bold;">long</span> lTmp<span style="color: #339933;">;</span>
                  <span style="color: #003399;">ByteArrayOutputStream</span> baos<span style="color: #339933;">;</span>
                  <span style="color: #003399;">ObjectOutputStream</span> out<span style="color: #339933;">;</span>
                  <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span> <span style="color: #000066; font-weight: bold;">int</span> j<span style="color: #339933;">=</span><span style="color: #cc66cc;">0</span> <span style="color: #339933;">;</span> j<span style="color: #339933;">&lt;</span>REP <span style="color: #339933;">;</span> j<span style="color: #339933;">++</span> <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
                      baos <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">ByteArrayOutputStream</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
                      out <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">ObjectOutputStream</span><span style="color: #009900;">&#40;</span>baos<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
                      lTmp <span style="color: #339933;">=</span> <span style="color: #003399;">System</span>.<span style="color: #006633;">currentTimeMillis</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
                      out.<span style="color: #006633;">writeObject</span><span style="color: #009900;">&#40;</span>tos<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
                      lTmp <span style="color: #339933;">-=</span> <span style="color: #003399;">System</span>.<span style="color: #006633;">currentTimeMillis</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
                      out.<span style="color: #006633;">close</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
                      lAccTime <span style="color: #339933;">-=</span> lTmp<span style="color: #339933;">;</span>
                      lSize <span style="color: #339933;">=</span> baos.<span style="color: #006633;">size</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> 
                  <span style="color: #009900;">&#125;</span>
                  <span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;&quot;</span><span style="color: #339933;">+</span>i<span style="color: #339933;">+</span><span style="color: #0000ff;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span><span style="color: #339933;">+</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">double</span><span style="color: #009900;">&#41;</span>lAccTime<span style="color: #009900;">&#41;</span><span style="color: #339933;">/</span>REP<span style="color: #009900;">&#41;</span><span style="color: #339933;">+</span><span style="color: #0000ff;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span><span style="color: #339933;">+</span>lSize<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
             <span style="color: #009900;">&#125;</span>
       <span style="color: #009900;">&#125;</span>
&nbsp;
&nbsp;
       @Test
       <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000066; font-weight: bold;">void</span> testJavaDeserialization <span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> 
       <span style="color: #000000; font-weight: bold;">throws</span> <span style="color: #003399;">IOException</span>, <span style="color: #003399;">ClassNotFoundException</span> <span style="color: #009900;">&#123;</span>
             <span style="color: #000000; font-weight: bold;">final</span> <span style="color: #000066; font-weight: bold;">int</span> MAX_SIZE <span style="color: #339933;">=</span> <span style="color: #cc66cc;">10000</span><span style="color: #339933;">;</span>
             <span style="color: #000000; font-weight: bold;">final</span> <span style="color: #000066; font-weight: bold;">int</span> REP <span style="color: #339933;">=</span> <span style="color: #cc66cc;">50</span><span style="color: #339933;">;</span>
             <span style="color: #000000; font-weight: bold;">final</span> <span style="color: #000066; font-weight: bold;">int</span> INC <span style="color: #339933;">=</span> <span style="color: #cc66cc;">100</span><span style="color: #339933;">;</span>
&nbsp;
             <span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;Java deserialization times&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
             <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span> <span style="color: #000066; font-weight: bold;">int</span> i<span style="color: #339933;">=</span>INC <span style="color: #339933;">;</span> i<span style="color: #339933;">&lt;=</span>MAX_SIZE <span style="color: #339933;">;</span> i<span style="color: #339933;">+=</span>INC <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
                  TargetObjectSerializable tos <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> TargetObjectSerializable<span style="color: #009900;">&#40;</span>i,i<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
                  <span style="color: #003399;">ByteArrayOutputStream</span> baos <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">ByteArrayOutputStream</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
                  <span style="color: #003399;">ObjectOutputStream</span> out <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">ObjectOutputStream</span><span style="color: #009900;">&#40;</span>baos<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
                  out.<span style="color: #006633;">writeObject</span><span style="color: #009900;">&#40;</span>tos<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
                  out.<span style="color: #006633;">close</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
                  <span style="color: #003399;">ByteArrayInputStream</span> bais<span style="color: #339933;">;</span>
                  <span style="color: #003399;">ObjectInputStream</span> ois<span style="color: #339933;">;</span>
                  <span style="color: #000066; font-weight: bold;">long</span> lAccTime <span style="color: #339933;">=</span> <span style="color: #cc66cc;">0</span><span style="color: #339933;">;</span>
                  <span style="color: #000066; font-weight: bold;">long</span> lTmp<span style="color: #339933;">;</span>
                  <span style="color: #000000; font-weight: bold;">for</span> <span style="color: #009900;">&#40;</span> <span style="color: #000066; font-weight: bold;">int</span> j<span style="color: #339933;">=</span><span style="color: #cc66cc;">0</span> <span style="color: #339933;">;</span> j<span style="color: #339933;">&lt;</span>REP <span style="color: #339933;">;</span> j<span style="color: #339933;">++</span> <span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
                      bais <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">ByteArrayInputStream</span><span style="color: #009900;">&#40;</span>baos.<span style="color: #006633;">toByteArray</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
                      ois <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">ObjectInputStream</span><span style="color: #009900;">&#40;</span>bais<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
                      lTmp <span style="color: #339933;">=</span> <span style="color: #003399;">System</span>.<span style="color: #006633;">currentTimeMillis</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
                      ois.<span style="color: #006633;">readObject</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
                      lTmp <span style="color: #339933;">-=</span> <span style="color: #003399;">System</span>.<span style="color: #006633;">currentTimeMillis</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
                      lAccTime <span style="color: #339933;">-=</span> lTmp<span style="color: #339933;">;</span>
                  <span style="color: #009900;">&#125;</span>
                  <span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;&quot;</span><span style="color: #339933;">+</span>i<span style="color: #339933;">+</span><span style="color: #0000ff;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span><span style="color: #339933;">+</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">double</span><span style="color: #009900;">&#41;</span>lAccTime<span style="color: #009900;">&#41;</span><span style="color: #339933;">/</span>REP<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
             <span style="color: #009900;">&#125;</span>
       <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>
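<p>The tests above instantiate <code>TargetObjectSerializable</code>, which is not listed in this post. A minimal sketch, assuming it simply mirrors <code>TargetObject</code> and adds the <code>java.io.Serializable</code> marker interface; the actual class may differ, and the <code>serialVersionUID</code> value and the <code>main</code> method are illustrative additions.</p>

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;

// Assumed shape: same fields as TargetObject, plus the Serializable marker.
class TargetObjectSerializable implements Serializable {

    // Hypothetical version id; pinning it avoids relying on the computed default.
    private static final long serialVersionUID = 1L;

    public String[] sa;
    public int[] ia;

    public TargetObjectSerializable(int iStringElements, int iIntegerElements) {
        sa = new String[iStringElements];
        Arrays.fill(sa, "Dummy text");
        ia = new int[iIntegerElements]; // int arrays are zero-initialized by default
    }

    // Serializes one instance in memory and prints the resulting size in bytes.
    public static void main(String[] args) throws IOException {
        TargetObjectSerializable tos = new TargetObjectSerializable(100, 100);
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(baos)) {
            out.writeObject(tos);
        }
        System.out.println("Serialized size in bytes: " + baos.size());
    }
}
```

<p>Without the marker interface, <code>writeObject</code> would throw a <code>NotSerializableException</code>, which is presumably why the plain <code>TargetObject</code> is not reused for the native test.</p>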

<p>Equivalent versions of the code shown above were used to measure <a title="Protocol buffers" href="http://code.google.com/p/protobuf/">Google&#8217;s Protocol Buffers</a> and <a href="http://xstream.codehaus.org/">XStream</a>. If you are interested in seeing the <a href="http://www.xavierllora.net/wp-content/uploads/2009/02/src-test-serialization.zip">full code you can download it as it is</a>&#8212;no guarantees provided. Also, for completeness, you can find below the <code>proto</code> file used for testing the Java implementation of <a title="Protocol buffers" href="http://code.google.com/p/protobuf/">Google&#8217;s Protocol Buffers</a>.</p>

<div class="wp_syntax"><div class="code"><pre class="proto" style="font-family:monospace;">package test;
&nbsp;
option java_package = &quot;org.meandre.tools.serialization.proto&quot;;
option java_outer_classname = &quot;TargetObjectProtoOuter&quot;;
&nbsp;
message TargetObjectProto { 
  repeated int32 ia = 1; 
  repeated string sa = 2;
}</pre></div></div>
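<p>To get a feel for what this message looks like on the wire, the sketch below hand-encodes it in plain Java. It assumes the default (non-packed) encoding of repeated fields, where every element is written as a one-byte tag followed by its value. <code>ProtoWireSketch</code> and its methods are illustrative names and not part of the protobuf library; real code would use the classes generated by <code>protoc</code>.</p>

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ProtoWireSketch {

    // Base-128 varint: 7 payload bits per byte, high bit set on all but the last byte.
    static void writeVarint(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    // Encodes the two repeated fields of TargetObjectProto (ia = 1, sa = 2).
    static byte[] encode(int[] ia, String[] sa) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int v : ia) {
            writeVarint(out, (1 << 3) | 0); // tag: field 1, wire type 0 (varint)
            writeVarint(out, v);            // an int32 is sign-extended to 64 bits
        }
        for (String s : sa) {
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            writeVarint(out, (2 << 3) | 2); // tag: field 2, wire type 2 (length-delimited)
            writeVarint(out, utf8.length);  // length prefix
            out.write(utf8, 0, utf8.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String[] sa = new String[100];
        Arrays.fill(sa, "Dummy text");
        System.out.println(encode(new int[100], sa).length + " bytes");
    }
}
```

<p>Under these assumptions each zero integer costs two bytes and each <em>&#8220;Dummy text&#8221;</em> string twelve, so 100 of each come to 1400 bytes, with no stream header or class description in front of them.</p>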

<p>In order to run the experiment, besides the <a title="Protocol buffers" href="http://code.google.com/p/protobuf/">Google&#8217;s Protocol Buffers</a> and <a href="http://xstream.codehaus.org/">XStream</a> libraries, you will also need <a href="http://www.junit.org/" title="JUnit">JUnit</a>.</p>
<h2>The results</h2>
<p>The experiments were run on a first-generation MacBook Pro with 2 GB of RAM using Apple&#8217;s Java 1.5 virtual machine. The figures below illustrate the different memory requirements for each of the three serialization methods compared. Figures and data processing were done using <a href="http://www.r-project.org/" title="R">R</a>.</p>
<p><a href="http://www.xavierllora.net/wp-content/uploads/2009/02/data-size.png"><img src="http://www.xavierllora.net/wp-content/uploads/2009/02/data-size.png" alt="Data size of the serialized object" title="Data size of the serialized object" width="220" height="220" class="aligncenter size-full wp-image-425" /></a><a href="http://www.xavierllora.net/wp-content/uploads/2009/02/data-size-ratio.png"><img src="http://www.xavierllora.net/wp-content/uploads/2009/02/data-size-ratio.png" alt="Serialized/original data size ratio" title="Serialized/original data size ratio" width="220" height="220" class="aligncenter size-full wp-image-424" /></a></p>
<p>The figures confirm the expected bloat of the XML-based XStream serialization, up to 6 times larger than the original data being serialized. On the other hand, the Java native serialization adds only a minimal increase to the serialized equivalent. Google&#8217;s Protocol Buffers requires slightly more space than the native Java serialization, but never doubled the original size. Moreover, it does not exhibit the constant initial payload overhead displayed by both XStream and the Java native serialization. The next question was how costly the serialization process was. The figures below show the amount of time required to serialize an object.</p>
<p><a href="http://www.xavierllora.net/wp-content/uploads/2009/02/serialization-time.png"><img src="http://www.xavierllora.net/wp-content/uploads/2009/02/serialization-time.png" alt="Serialization time" title="Serialization time" width="220" height="220" class="aligncenter size-full wp-image-430" /></a><a href="http://www.xavierllora.net/wp-content/uploads/2009/02/serialization-time-ratio.png"><img src="http://www.xavierllora.net/wp-content/uploads/2009/02/serialization-time-ratio.png" alt="Serialization time ratio" title="Serialization time ratio" width="220" height="220" class="aligncenter size-full wp-image-429" /></a></p>
<p>The Java native serialization was, as expected, the fastest; Google&#8217;s Protocol Buffers took, on average, only about four times as long as the Java native version. That is peanuts when compared to the XStream version, which was fifty times slower. Deserialization times of the encoded object present the same trends as serialization, as the figures below show.</p>
<p><a href="http://www.xavierllora.net/wp-content/uploads/2009/02/deserialization-time.png"><img src="http://www.xavierllora.net/wp-content/uploads/2009/02/deserialization-time.png" alt="Deserialization time" title="Deserialization time" width="220" height="220" class="aligncenter size-full wp-image-427" /></a><a href="http://www.xavierllora.net/wp-content/uploads/2009/02/deserialization-time-ratio.png"><img src="http://www.xavierllora.net/wp-content/uploads/2009/02/deserialization-time-ratio.png" alt="Deserialization time ratio" title="Deserialization time ratio" width="220" height="220" class="aligncenter size-full wp-image-426" /></a></p>
<p>It is also interesting to note that serialization&#8212;as the figure below shows&#8212;is faster than deserialization (as common sense would have suggested). Google&#8217;s Protocol Buffers is the method where this difference is most pronounced.</p>
<p><a href="http://www.xavierllora.net/wp-content/uploads/2009/02/serialization-deserialization-ratio.png"><img src="http://www.xavierllora.net/wp-content/uploads/2009/02/serialization-deserialization-ratio.png" alt="Serialization/deserialization ratio" title="Serialization/deserialization ratio" width="220" height="220" class="aligncenter size-full wp-image-428" /></a></p>
<h2>The lessons learned</h2>
<p>As I said, this is far from being an exhaustive or even representative example, just one afternoon&#8217;s exploration. However, the results show interesting trends. Yes, XStream could be tweaked to make the serialized XML leaner, and with the proper tinkering it would even be possible to deserialize the object on a different platform/language, but at an enormous cost&#8212;both in size and time. The Java native serialization is by far the fastest and the most size efficient, but it is made from and for Java. Also, changes to the serialized classes&#8212;imagine wanting to add or remove a field&#8212;may render the serialized objects unreadable. Google&#8217;s Protocol Buffers, on the other hand, delivers the best of both scenarios: (1) the ability to serialize/deserialize objects in a compact and relatively fast manner, and (2) the ability to serialize/deserialize across different languages and platforms. For these reasons, it seems to be a very interesting option to keep exploring if you need both.</p>


<p>Related posts:<ol><li><a href='http://www.xavierllora.net/2008/05/20/svnkit-or-analyzing-svn-content-in-java/' rel='bookmark' title='Permanent Link: SVNKit or analyzing SVN content in Java'>SVNKit or analyzing SVN content in Java</a></li>
<li><a href='http://www.xavierllora.net/2009/09/29/temporary-storage-for-meandres-distribute-flow-execution/' rel='bookmark' title='Permanent Link: Temporary storage for Meandre&#8217;s distributed flow execution'>Temporary storage for Meandre&#8217;s distributed flow execution</a></li>
<li><a href='http://www.xavierllora.net/2006/10/19/r-and-java/' rel='bookmark' title='Permanent Link: R and Java'>R and Java</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.xavierllora.net/2009/02/04/efficient-serialization-for-java-and-beyond/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Meandre 1.4.0 final release candidate tagged</title>
		<link>http://www.xavierllora.net/2009/01/11/meandre-140-final-release-candidate-tagged/</link>
		<comments>http://www.xavierllora.net/2009/01/11/meandre-140-final-release-candidate-tagged/#comments</comments>
		<pubDate>Mon, 12 Jan 2009 03:01:49 +0000</pubDate>
		<dc:creator>Xavier</dc:creator>
				<category><![CDATA[Software]]></category>
		<category><![CDATA[Data-Intensive Computing]]></category>
		<category><![CDATA[data-intensive flows]]></category>
		<category><![CDATA[meandre]]></category>

		<guid isPermaLink="false">http://www.xavierllora.net/?p=396</guid>
		<description><![CDATA[Today I just tagged the final release candidate RC6 for the Meandre Infrastructure. The list of improvements over the previous 1.3.2 release is long, but I just want to highlight a few of them. Standardization of web services and their response information. Also added new services to help manage users, repositories, and jobs. The infrastructure is [...]


Related posts:<ol><li><a href='http://www.xavierllora.net/2008/12/02/meandre-infrastructure-14-rc1-tagged/' rel='bookmark' title='Permanent Link: Meandre Infrastructure 1.4 RC1 tagged'>Meandre Infrastructure 1.4 RC1 tagged</a></li>
<li><a href='http://www.xavierllora.net/2009/01/15/meandre-140-released-141-coming-short-after/' rel='bookmark' title='Permanent Link: Meandre 1.4.0 released, 1.4.1 coming short after'>Meandre 1.4.0 released, 1.4.1 coming short after</a></li>
<li><a href='http://www.xavierllora.net/2008/04/18/meandre-semantic-driven-data-intensive-flow-engine/' rel='bookmark' title='Permanent Link: Meandre: Semantic-Driven Data-Intensive Flow Engine'>Meandre: Semantic-Driven Data-Intensive Flow Engine</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Today I just tagged the final release candidate RC6 for the Meandre Infrastructure. The list of improvements over the previous 1.3.2 release is long, but I just want to highlight a few of them.</p>
<ul>
<li><a href="http://dev-tools.seasr.org/confluence/display/MDR/Meandre+Infrastructure+Web+API+v.+1.4">Standardization of web services and their response information</a>. Also added new services to help manage users, repositories, and jobs.</li>
<li>The infrastructure is now able to provide a <a href="http://www.xavierllora.net/2008/11/15/meandre-semantic-driven-data-intensive-flows-in-the-clouds/">single system image cluster of Meandre servers</a>. </li>
<li>It also provides a <a href="http://dev-tools.seasr.org/confluence/display/MDR/Meandre+Infrastructure+HTML+interface+v.+1.4">new lightweight HTML interface</a> for easy administration.</li>
<li>Location addition has been sped up by minimizing network usage.</li>
<li>Component contexts are locally cached, drastically reducing the preparation time of a flow before executing.</li>
</ul>
<p>Also, all the bugs reported so far have been fixed. Hopefully, the release will be pushed out this coming week. Stay tuned <img src='http://www.xavierllora.net/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' /> . Some documentation for developers interested in the infrastructure can also be found on the <a href="http://dev-tools.seasr.org/confluence/">development tools Confluence wiki</a>.</p>


<p>Related posts:<ol><li><a href='http://www.xavierllora.net/2008/12/02/meandre-infrastructure-14-rc1-tagged/' rel='bookmark' title='Permanent Link: Meandre Infrastructure 1.4 RC1 tagged'>Meandre Infrastructure 1.4 RC1 tagged</a></li>
<li><a href='http://www.xavierllora.net/2009/01/15/meandre-140-released-141-coming-short-after/' rel='bookmark' title='Permanent Link: Meandre 1.4.0 released, 1.4.1 coming short after'>Meandre 1.4.0 released, 1.4.1 coming short after</a></li>
<li><a href='http://www.xavierllora.net/2008/04/18/meandre-semantic-driven-data-intensive-flow-engine/' rel='bookmark' title='Permanent Link: Meandre: Semantic-Driven Data-Intensive Flow Engine'>Meandre: Semantic-Driven Data-Intensive Flow Engine</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.xavierllora.net/2009/01/11/meandre-140-final-release-candidate-tagged/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ICEIS 2008: Blogging summary and final strings</title>
		<link>http://www.xavierllora.net/2008/06/18/iceis-2008-blogging-summary-and-final-strings/</link>
		<comments>http://www.xavierllora.net/2008/06/18/iceis-2008-blogging-summary-and-final-strings/#comments</comments>
		<pubDate>Wed, 18 Jun 2008 15:28:37 +0000</pubDate>
		<dc:creator>Xavier</dc:creator>
				<category><![CDATA[Events]]></category>
		<category><![CDATA[Conferences]]></category>
		<category><![CDATA[data-intensive flows]]></category>
		<category><![CDATA[discus project]]></category>
		<category><![CDATA[distributed computing]]></category>
		<category><![CDATA[meandre]]></category>

		<guid isPermaLink="false">http://www.xavierllora.net/?p=250</guid>
		<description><![CDATA[If you are looking for a list of the related blogging done during ICEIS 2008, just follow this link. On Sunday morning I ran into Angel A. Juan, an assistant professor at the Open University of Catalonia (UOC), who is interested in analyzing online teaching efforts and in tools that help professors monitor student performance in online media. I visited him yesterday at [...]


Related posts:<ol><li><a href='http://www.xavierllora.net/2008/06/16/iceis-2008-final-sprint-and-ricardo-baeza-yates/' rel='bookmark' title='Permanent Link: ICEIS 2008: Final sprint and Ricardo Baeza-Yates'>ICEIS 2008: Final sprint and Ricardo Baeza-Yates</a></li>
<li><a href='http://www.xavierllora.net/2008/06/13/iceis-2008-blogging-from-barcelona-friday-morning/' rel='bookmark' title='Permanent Link: ICEIS 2008: Blogging from Barcelona (Friday Morning)'>ICEIS 2008: Blogging from Barcelona (Friday Morning)</a></li>
<li><a href='http://www.xavierllora.net/2008/03/26/summary-of-bdcsg2008-blogging/' rel='bookmark' title='Permanent Link: Summary of BDCSG2008 blogging'>Summary of BDCSG2008 blogging</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>If you are looking for a <a title="ICEIS 2008 blogging" href="http://www.xavierllora.net/?s=iceis+2008">list of the related blogging done during ICEIS 2008</a>, just follow this <a title="ICEIS 2008" href="http://www.xavierllora.net/?s=iceis+2008">link</a>. On Sunday morning I ran into Angel A. Juan, an assistant professor at the <a title="UOC" href="http://www.uoc.edu/web/eng/index.html">Open University of Catalonia (UOC)</a>, who is interested in analyzing online teaching efforts and in tools that help professors monitor student performance in online media. I visited him yesterday at his office and we had an interesting exchange of ideas. Most of them revolved around the work we have conducted under <a title="the DISCUS project" href="http://www.i-discus.org">the DISCUS project</a>, and how similar our efforts on marketing focus groups are to their online teaching environment. His group, <a title="DPCS" href="http://dpcs.uoc.es/">Distributed, Parallel and Collaborative Systems</a>, was also interested in the work done under the <a title="SEASR" href="http://seasr.org">SEASR project</a>, mostly focusing on the <a title="Meandre" href="http://seasr.org/meandre">Meandre</a> infrastructure for data-intensive flow computing that we are getting close to releasing.</p>


<p>Related posts:<ol><li><a href='http://www.xavierllora.net/2008/06/16/iceis-2008-final-sprint-and-ricardo-baeza-yates/' rel='bookmark' title='Permanent Link: ICEIS 2008: Final sprint and Ricardo Baeza-Yates'>ICEIS 2008: Final sprint and Ricardo Baeza-Yates</a></li>
<li><a href='http://www.xavierllora.net/2008/06/13/iceis-2008-blogging-from-barcelona-friday-morning/' rel='bookmark' title='Permanent Link: ICEIS 2008: Blogging from Barcelona (Friday Morning)'>ICEIS 2008: Blogging from Barcelona (Friday Morning)</a></li>
<li><a href='http://www.xavierllora.net/2008/03/26/summary-of-bdcsg2008-blogging/' rel='bookmark' title='Permanent Link: Summary of BDCSG2008 blogging'>Summary of BDCSG2008 blogging</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.xavierllora.net/2008/06/18/iceis-2008-blogging-summary-and-final-strings/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ZooKeeper and distributed applications</title>
		<link>http://www.xavierllora.net/2008/05/22/zookeeper-and-orchestrating-distributed-applications/</link>
		<comments>http://www.xavierllora.net/2008/05/22/zookeeper-and-orchestrating-distributed-applications/#comments</comments>
		<pubDate>Thu, 22 May 2008 13:39:36 +0000</pubDate>
		<dc:creator>Xavier</dc:creator>
				<category><![CDATA[Notes]]></category>
		<category><![CDATA[data-intensive flows]]></category>
		<category><![CDATA[distributed computing]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[meandre]]></category>
		<category><![CDATA[zookeeper]]></category>

		<guid isPermaLink="false">http://www.xavierllora.net/?p=223</guid>
		<description><![CDATA[Lately I have been exploring different alternatives for coordinating the execution of distributed applications. Yes, you guessed it right, I am working on the distribution of the execution of Meandre flows. Chopping the data-intensive flow and mapping the chunks onto a set of distributed processors requires several elements (graph analysis, resource management, etc.). However, the [...]


Related posts:<ol><li><a href='http://www.xavierllora.net/2009/09/29/temporary-storage-for-meandres-distribute-flow-execution/' rel='bookmark' title='Permanent Link: Temporary storage for Meandre&#8217;s distributed flow execution'>Temporary storage for Meandre&#8217;s distributed flow execution</a></li>
<li><a href='http://www.xavierllora.net/2007/09/19/embedded-jetty-applications/' rel='bookmark' title='Permanent Link: Embedded Jetty applications'>Embedded Jetty applications</a></li>
<li><a href='http://www.xavierllora.net/2008/04/18/meandre-semantic-driven-data-intensive-flow-engine/' rel='bookmark' title='Permanent Link: Meandre: Semantic-Driven Data-Intensive Flow Engine'>Meandre: Semantic-Driven Data-Intensive Flow Engine</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Lately I have been exploring different alternatives for coordinating the execution of distributed applications. Yes, you guessed it right, I am working on the distribution of the execution of <a title="Meandre" href="http://seasr.org/meandre">Meandre flows</a>. Chopping the data-intensive flow and mapping the chunks onto a set of distributed processors requires several elements (graph analysis, resource management, etc.). However, the basic element that needs to be solved first is the need for a reliable and scalable coordination system.</p>
<p>During my trip to the <a title="Hadoop Summit" href="http://developer.yahoo.com/hadoop/summit/">Hadoop Summit</a> and the <a title="BDCSG" href="http://www.xavierllora.net/2008/03/26/summary-of-bdcsg2008-blogging/">Big Data Computing Study Group</a> I ran into the <a title="ZooKeeper project" href="http://zookeeper.sourceforge.net/">ZooKeeper project</a>, a <a title="Yahoo! Research" href="http://research.yahoo.com/node/1849">Yahoo Research</a> project. In their own words:</p>
<blockquote><p>ZooKeeper is a high available and reliable coordination system. Distributed applications use ZooKeeper to store and mediate updates key configuration information. ZooKeeper can be used for leader election, group membership, configuration maintenance, etc.</p></blockquote>
<p>There is also an interesting introductory lesson <a title="ZooKeeper" href="http://zookeeper.sourceforge.net/">here</a>, and some recipes for the most common data structures (queues, priority queues, distributed locks, etc.) are also available <a title="ZooKeeper recipes" href="http://zookeeper.wiki.sourceforge.net/ZooKeeperRecipes">here</a>. It looks promising, and may make coordinating the distributed execution of <a href="http://seasr.org/meandre">Meandre</a> flows easier. The other reason that pushed me to explore this direction was that one of the <a title="Hadoop Summit" href="http://developer.yahoo.com/hadoop/summit/">Hadoop Summit</a> highlights was the adoption of the <a title="ZooKeeper project" href="http://zookeeper.sourceforge.net/">ZooKeeper project</a> by <a title="Hadoop" href="http://hadoop.apache.org/core/">Hadoop</a> itself.</p>


<p>Related posts:<ol><li><a href='http://www.xavierllora.net/2009/09/29/temporary-storage-for-meandres-distribute-flow-execution/' rel='bookmark' title='Permanent Link: Temporary storage for Meandre&#8217;s distributed flow execution'>Temporary storage for Meandre&#8217;s distributed flow execution</a></li>
<li><a href='http://www.xavierllora.net/2007/09/19/embedded-jetty-applications/' rel='bookmark' title='Permanent Link: Embedded Jetty applications'>Embedded Jetty applications</a></li>
<li><a href='http://www.xavierllora.net/2008/04/18/meandre-semantic-driven-data-intensive-flow-engine/' rel='bookmark' title='Permanent Link: Meandre: Semantic-Driven Data-Intensive Flow Engine'>Meandre: Semantic-Driven Data-Intensive Flow Engine</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.xavierllora.net/2008/05/22/zookeeper-and-orchestrating-distributed-applications/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Meandre: Semantic-Driven Data-Intensive Flow Engine</title>
		<link>http://www.xavierllora.net/2008/04/18/meandre-semantic-driven-data-intensive-flow-engine/</link>
		<comments>http://www.xavierllora.net/2008/04/18/meandre-semantic-driven-data-intensive-flow-engine/#comments</comments>
		<pubDate>Sat, 19 Apr 2008 00:30:22 +0000</pubDate>
		<dc:creator>Xavier</dc:creator>
				<category><![CDATA[Notes]]></category>
		<category><![CDATA[data-intensive flows]]></category>
		<category><![CDATA[meandre]]></category>
		<category><![CDATA[semantic web]]></category>

		<guid isPermaLink="false">http://dita.ncsa.uiuc.edu/xllora/2008/04/18/meandre-semantic-driven-data-intensive-flow-engine/</guid>
		<description><![CDATA[Finally we have finished setting up the website for Meandre, a semantic-driven data-intensive flow engine. Meandre provides basic infrastructure for data-intensive computation. It provides, among other things, tools for creating components and flows, a high-level language to describe flows, and a multicore and distributed execution environment based on a service-oriented paradigm. We are currently working on getting [...]


Related posts:<ol><li><a href='http://www.xavierllora.net/2008/11/15/meandre-semantic-driven-data-intensive-flows-in-the-clouds/' rel='bookmark' title='Permanent Link: Meandre: Semantic-Driven Data-Intensive Flows in the Clouds'>Meandre: Semantic-Driven Data-Intensive Flows in the Clouds</a></li>
<li><a href='http://www.xavierllora.net/2009/01/29/data-intensive-computing-for-competent-genetic-algorithms-a-pilot-study-using-meandre/' rel='bookmark' title='Permanent Link: Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study using  Meandre'>Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study using  Meandre</a></li>
<li><a href='http://www.xavierllora.net/2009/01/11/meandre-140-final-release-candidate-tagged/' rel='bookmark' title='Permanent Link: Meandre 1.4.0 final release candidate tagged'>Meandre 1.4.0 final release candidate tagged</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Finally we have finished setting up the website for Meandre, a semantic-driven data-intensive flow engine. Meandre provides basic infrastructure for data-intensive computation. It provides, among other things, tools for creating components and flows, a high-level language to describe flows, and a multicore and distributed execution environment based on a service-oriented paradigm. We are currently gearing up for a first alpha release. You can visit the Meandre site <a href="http://seasr.org/meandre">here</a>. I will be posting in the Meandre blog about our current steps toward getting the release out the door. The Meandre infrastructure is being built to support the <a href="http://seasr.org">SEASR project</a>.</p>


<p>Related posts:<ol><li><a href='http://www.xavierllora.net/2008/11/15/meandre-semantic-driven-data-intensive-flows-in-the-clouds/' rel='bookmark' title='Permanent Link: Meandre: Semantic-Driven Data-Intensive Flows in the Clouds'>Meandre: Semantic-Driven Data-Intensive Flows in the Clouds</a></li>
<li><a href='http://www.xavierllora.net/2009/01/29/data-intensive-computing-for-competent-genetic-algorithms-a-pilot-study-using-meandre/' rel='bookmark' title='Permanent Link: Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study using  Meandre'>Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study using  Meandre</a></li>
<li><a href='http://www.xavierllora.net/2009/01/11/meandre-140-final-release-candidate-tagged/' rel='bookmark' title='Permanent Link: Meandre 1.4.0 final release candidate tagged'>Meandre 1.4.0 final release candidate tagged</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.xavierllora.net/2008/04/18/meandre-semantic-driven-data-intensive-flow-engine/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
