<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Xavier Llorà &#187; hadoop</title>
	<atom:link href="http://www.xavierllora.net/tag/hadoop/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.xavierllora.net</link>
	<description>A notebook on data-intensive computing, genetics-based machine learning &#38; more.</description>
	<lastBuildDate>Sun, 08 Jan 2012 19:39:15 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Scaling eCGA Model Building via Data-Intensive Computing</title>
		<link>http://www.xavierllora.net/2010/04/08/scaling-ecga-model-building-via-data-intensive-computing/</link>
		<comments>http://www.xavierllora.net/2010/04/08/scaling-ecga-model-building-via-data-intensive-computing/#comments</comments>
		<pubDate>Thu, 08 Apr 2010 16:17:39 +0000</pubDate>
		<dc:creator>Xavier</dc:creator>
				<category><![CDATA[Data-Intensive Computing]]></category>
		<category><![CDATA[Estimation of Distribution Algorithms]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[eCGA]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[map-reduce]]></category>
		<category><![CDATA[mongodb]]></category>
		<category><![CDATA[pro]]></category>

		<guid isPermaLink="false">http://www.xavierllora.net/?p=664</guid>
		<description><![CDATA[I just uploaded the technical report of the paper we put together for CEC 2010 on how we can scale up eCGA using a MapReduce approach. Besides exploring the Hadoop implementation, the paper also presents some very compelling results obtained with MongoDB (a document-based store able to perform parallel MapReduce tasks via sharding). [...]
Related posts:<ol>
<li><a href='http://www.xavierllora.net/2009/10/09/scaling-genetic-algorithms-using-mapreduce/' rel='bookmark' title='Scaling Genetic Algorithms using MapReduce'>Scaling Genetic Algorithms using MapReduce</a></li>
<li><a href='http://www.xavierllora.net/2009/07/13/data-intensive-computing-for-competent-genetic-algorithms-a-pilot-study-using-meandre-2/' rel='bookmark' title='Data-Intensive Computing for  Competent Genetic Algorithms:  A Pilot Study using Meandre'>Data-Intensive Computing for  Competent Genetic Algorithms:  A Pilot Study using Meandre</a></li>
<li><a href='http://www.xavierllora.net/2008/03/26/data-intensive-scalable-computing-randy-bryant/' rel='bookmark' title='[BDCSG2008] Data-Intensive Scalable Computing (Randy Bryant)'>[BDCSG2008] Data-Intensive Scalable Computing (Randy Bryant)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>I just uploaded the technical report of the paper we put together for <a href="http://www.wcci2010.org/">CEC 2010</a> on how we can scale up eCGA using a MapReduce approach. Besides exploring the <a href="http://hadoop.apache.org/">Hadoop</a> implementation, the paper also presents some very compelling results obtained with <a href="http://www.mongodb.org/display/DOCS/Home">MongoDB</a> (a document-based store able to perform parallel MapReduce tasks via sharding). The paper is available as <a href="http://www.illigal.uiuc.edu/pub/papers/IlliGALs/2010001.pdf">PDF</a> and <a href="http://www.illigal.uiuc.edu/pub/papers/IlliGALs/2010001.ps.Z">PS</a>.</p>
<p><strong>Abstract:</strong><br />
This paper shows how the extended compact genetic algorithm can be scaled using data-intensive computing techniques such as MapReduce. Two different frameworks (Hadoop and MongoDB) are used to deploy MapReduce implementations of the compact and extended compact genetic algorithms. Results show that both are good choices to deal with large-scale problems as they can scale with the number of commodity machines, as opposed to previous efforts with other techniques that either required specialized high-performance hardware or shared memory environments.</p>
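<p>To give a rough feel for the idea, here is a minimal in-process sketch (not the code from the report) of how the per-gene frequencies that compact genetic algorithms rely on could be gathered in one map/reduce round. The function names <code>map_individual</code>, <code>reduce_counts</code>, and <code>marginal_frequencies</code> are hypothetical, and the grouping step stands in for the shuffle that a framework such as Hadoop or MongoDB would perform.</p>

```python
from collections import defaultdict


def map_individual(individual):
    """Map: emit one (gene position, bit value) pair per gene."""
    for position, bit in enumerate(individual):
        yield position, bit


def reduce_counts(values):
    """Reduce: sum the bits observed at one gene position."""
    return sum(values)


def marginal_frequencies(population):
    """Estimate the probability of a 1 at each position in one map/reduce round."""
    # Assumption: this dictionary plays the role of the framework's shuffle step.
    grouped = defaultdict(list)
    for individual in population:
        for position, bit in map_individual(individual):
            grouped[position].append(bit)
    size = len(population)
    return [reduce_counts(grouped[position]) / size for position in sorted(grouped)]
```

<p>For instance, <code>marginal_frequencies([[1, 0, 1], [1, 1, 0], [0, 0, 0], [1, 0, 0]])</code> returns <code>[0.75, 0.25, 0.25]</code>. Because each position is reduced independently, the counting can be spread over as many machines as there are keys.</p>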
<p>Related posts:<ol>
<li><a href='http://www.xavierllora.net/2009/10/09/scaling-genetic-algorithms-using-mapreduce/' rel='bookmark' title='Scaling Genetic Algorithms using MapReduce'>Scaling Genetic Algorithms using MapReduce</a></li>
<li><a href='http://www.xavierllora.net/2009/07/13/data-intensive-computing-for-competent-genetic-algorithms-a-pilot-study-using-meandre-2/' rel='bookmark' title='Data-Intensive Computing for  Competent Genetic Algorithms:  A Pilot Study using Meandre'>Data-Intensive Computing for  Competent Genetic Algorithms:  A Pilot Study using Meandre</a></li>
<li><a href='http://www.xavierllora.net/2008/03/26/data-intensive-scalable-computing-randy-bryant/' rel='bookmark' title='[BDCSG2008] Data-Intensive Scalable Computing (Randy Bryant)'>[BDCSG2008] Data-Intensive Scalable Computing (Randy Bryant)</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.xavierllora.net/2010/04/08/scaling-ecga-model-building-via-data-intensive-computing/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Soaring the Clouds with Meandre</title>
		<link>http://www.xavierllora.net/2010/03/15/soaring-the-clouds-with-meandre/</link>
		<comments>http://www.xavierllora.net/2010/03/15/soaring-the-clouds-with-meandre/#comments</comments>
		<pubDate>Mon, 15 Mar 2010 22:55:11 +0000</pubDate>
		<dc:creator>Xavier</dc:creator>
				<category><![CDATA[Data-Intensive Computing]]></category>
		<category><![CDATA[Notes]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[meandre]]></category>
		<category><![CDATA[ZigZag]]></category>

		<guid isPermaLink="false">http://www.xavierllora.net/?p=659</guid>
		<description><![CDATA[You may find the slide deck and the abstract for the presentation we delivered today at the &#8220;Data-Intensive Research: how should we improve our ability to use data&#8221; workshop in Edinburgh. Abstract This talk will focus on a highly scalable data-intensive infrastructure being developed at the National Center for Supercomputing Applications (NCSA) at the University [...]
Related posts:<ol>
<li><a href='http://www.xavierllora.net/2008/11/15/meandre-semantic-driven-data-intensive-flows-in-the-clouds/' rel='bookmark' title='Meandre: Semantic-Driven Data-Intensive Flows in the Clouds'>Meandre: Semantic-Driven Data-Intensive Flows in the Clouds</a></li>
<li><a href='http://www.xavierllora.net/2008/04/18/meandre-semantic-driven-data-intensive-flow-engine/' rel='bookmark' title='Meandre: Semantic-Driven Data-Intensive Flow Engine'>Meandre: Semantic-Driven Data-Intensive Flow Engine</a></li>
<li><a href='http://www.xavierllora.net/2009/01/29/data-intensive-computing-for-competent-genetic-algorithms-a-pilot-study-using-meandre/' rel='bookmark' title='Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study using  Meandre'>Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study using  Meandre</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>You may find the slide deck and the abstract for the presentation we delivered today at the <a href="http://wikis.nesc.ac.uk/escienvoy/Data-Intensive_Research:_how_should_we_improve_our_ability_to_use_data">&#8220;Data-Intensive Research: how should we improve our ability to use data&#8221;</a> workshop in Edinburgh.</p>
<p><center><iframe src="http://www.slideshare.net/slideshow/embed_code/3440242" width="425&type=s" height="356" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe><br/><br/></center></p>
<p><strong>Abstract</strong></p>
<p>This talk will focus on a highly scalable data-intensive infrastructure being developed at the National Center for Supercomputing Applications (NCSA) at the University of Illinois and will introduce current research efforts to tackle the challenges presented by big data. Research efforts include exploring potential ways of integration between cloud computing concepts, such as Hadoop or Meandre, and traditional HPC technologies and assets. These architecture models contrast significantly, but can be leveraged by building cloud conduits that connect these resources to provide even greater flexibility and scalability on demand. Orchestrating the physical computational environment requires innovative and sophisticated software infrastructure that can transparently take advantage of the functional features and negotiate the constraints imposed by this diversity of computational resources. Research conducted during the development of the Meandre infrastructure has led to the production of an agile conductor able to leverage the particular advantages in the physical diversity. It can also be implemented as services and/or in the context of another application benefiting from its reusability, flexibility, and high scalability. Some example applications and an introduction to the data-intensive infrastructure architecture will be presented to provide an overview of the diverse scope of Meandre usages. Finally, a case will be presented showing how software developers and system designers can easily transition to these new paradigms to address the primary data-deluge challenges and to soar to new heights with extreme application scalability using cloud computing concepts.</p>
<p>Related posts:<ol>
<li><a href='http://www.xavierllora.net/2008/11/15/meandre-semantic-driven-data-intensive-flows-in-the-clouds/' rel='bookmark' title='Meandre: Semantic-Driven Data-Intensive Flows in the Clouds'>Meandre: Semantic-Driven Data-Intensive Flows in the Clouds</a></li>
<li><a href='http://www.xavierllora.net/2008/04/18/meandre-semantic-driven-data-intensive-flow-engine/' rel='bookmark' title='Meandre: Semantic-Driven Data-Intensive Flow Engine'>Meandre: Semantic-Driven Data-Intensive Flow Engine</a></li>
<li><a href='http://www.xavierllora.net/2009/01/29/data-intensive-computing-for-competent-genetic-algorithms-a-pilot-study-using-meandre/' rel='bookmark' title='Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study using  Meandre'>Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study using  Meandre</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.xavierllora.net/2010/03/15/soaring-the-clouds-with-meandre/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Scaling Genetic Algorithms using MapReduce</title>
		<link>http://www.xavierllora.net/2009/10/09/scaling-genetic-algorithms-using-mapreduce/</link>
		<comments>http://www.xavierllora.net/2009/10/09/scaling-genetic-algorithms-using-mapreduce/#comments</comments>
		<pubDate>Fri, 09 Oct 2009 15:51:19 +0000</pubDate>
		<dc:creator>Xavier</dc:creator>
				<category><![CDATA[Conferences]]></category>
		<category><![CDATA[Data-Intensive Computing]]></category>
		<category><![CDATA[Estimation of Distribution Algorithms]]></category>
		<category><![CDATA[Publications]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[Technical Reports]]></category>
		<category><![CDATA[genetic algorithms]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[map-reduce]]></category>

		<guid isPermaLink="false">http://www.xavierllora.net/?p=634</guid>
		<description><![CDATA[Below you may find the abstract of, and a link to, the technical report for the paper entitled &#8220;Scaling Genetic Algorithms using MapReduce&#8221;, which will be presented at the Ninth International Conference on Intelligent Systems Design and Applications (ISDA) 2009 by Verma, A., Llorà, X., Campbell, R.H., and Goldberg, D.E. next month. Abstract: Genetic algorithms (GAs) are increasingly [...]
Related posts:<ol>
<li><a href='http://www.xavierllora.net/2010/04/08/scaling-ecga-model-building-via-data-intensive-computing/' rel='bookmark' title='Scaling eCGA Model Building via Data-Intensive Computing'>Scaling eCGA Model Building via Data-Intensive Computing</a></li>
<li><a href='http://www.xavierllora.net/2009/07/13/data-intensive-computing-for-competent-genetic-algorithms-a-pilot-study-using-meandre-2/' rel='bookmark' title='Data-Intensive Computing for  Competent Genetic Algorithms:  A Pilot Study using Meandre'>Data-Intensive Computing for  Competent Genetic Algorithms:  A Pilot Study using Meandre</a></li>
<li><a href='http://www.xavierllora.net/2009/01/29/data-intensive-computing-for-competent-genetic-algorithms-a-pilot-study-using-meandre/' rel='bookmark' title='Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study using  Meandre'>Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study using  Meandre</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Below you may find the abstract of, and a link to, the technical report for the paper entitled <em>&#8220;Scaling Genetic Algorithms using MapReduce&#8221;</em>, which will be presented at the <a href="">Ninth International Conference on Intelligent Systems Design and Applications (ISDA) 2009</a> by Verma, A., Llorà, X., Campbell, R.H., and Goldberg, D.E. next month.</p>
<p><strong>Abstract:</strong> Genetic algorithms (GAs) are increasingly being applied to large-scale problems. The traditional MPI-based parallel GAs do not scale very well. MapReduce is a powerful abstraction developed by Google for making scalable and fault-tolerant applications. In this paper, we mould genetic algorithms into the MapReduce model. We describe the algorithm design and implementation of GAs on Hadoop, the open source implementation of MapReduce. Our experiments demonstrate the convergence and scalability up to 10<sup>5</sup> variable problems. Adding more resources would enable us to solve even larger problems without any changes in the algorithms and implementation.</p>
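<p>As an illustration only, the following in-process sketch shows one way a GA generation might be phrased as a map step (fitness evaluation) and a reduce step (selection and crossover). It is an assumption-laden toy, not the Hadoop implementation from the paper: the OneMax fitness function, the random partitioning, and every function name here are hypothetical choices made for the example.</p>

```python
import random
from collections import defaultdict


def onemax(bits):
    """Toy fitness function (an assumption): the number of ones in the bit string."""
    return sum(bits)


def map_phase(population, n_partitions, rng):
    """Map: score each individual and emit (partition, (fitness, individual))."""
    for individual in population:
        # A random key spreads individuals across the reducers.
        yield rng.randrange(n_partitions), (onemax(individual), individual)


def shuffle(pairs):
    """Group the mapped values by key, as a MapReduce framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(scored, rng):
    """Reduce: binary tournament selection followed by uniform crossover."""
    def tournament():
        # max() on (fitness, individual) tuples keeps the fitter contestant.
        return max(rng.choice(scored), rng.choice(scored))[1]

    offspring = []
    for _ in scored:
        mother, father = tournament(), tournament()
        offspring.append([rng.choice(pair) for pair in zip(mother, father)])
    return offspring


def generation(population, n_partitions=4, seed=0):
    """Run one generation as a single map, shuffle, and reduce round."""
    rng = random.Random(seed)
    groups = shuffle(map_phase(population, n_partitions, rng))
    return [child for key in sorted(groups) for child in reduce_phase(groups[key], rng)]
```

<p>Each reducer emits as many offspring as it received individuals, so the population size stays constant from one generation to the next, and adding reducers only changes how the work is split.</p>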
<p>The draft of the paper can be downloaded as <a href="http://www.illigal.uiuc.edu/pub/papers/IlliGALs/2009007.pdf">IlliGAL TR. No. 2009007</a>. For more information see the <a href="http://www.illigal.uiuc.edu/web/technical-reports/2009/10/09/scaling-genetic-algorithms-using-mapreduce/">IlliGAL technical reports web site</a>.</p>
<p>Related posts:<ol>
<li><a href='http://www.xavierllora.net/2010/04/08/scaling-ecga-model-building-via-data-intensive-computing/' rel='bookmark' title='Scaling eCGA Model Building via Data-Intensive Computing'>Scaling eCGA Model Building via Data-Intensive Computing</a></li>
<li><a href='http://www.xavierllora.net/2009/07/13/data-intensive-computing-for-competent-genetic-algorithms-a-pilot-study-using-meandre-2/' rel='bookmark' title='Data-Intensive Computing for  Competent Genetic Algorithms:  A Pilot Study using Meandre'>Data-Intensive Computing for  Competent Genetic Algorithms:  A Pilot Study using Meandre</a></li>
<li><a href='http://www.xavierllora.net/2009/01/29/data-intensive-computing-for-competent-genetic-algorithms-a-pilot-study-using-meandre/' rel='bookmark' title='Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study using  Meandre'>Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study using  Meandre</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.xavierllora.net/2009/10/09/scaling-genetic-algorithms-using-mapreduce/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Large Scale Data Mining using Genetics-Based Machine Learning</title>
		<link>http://www.xavierllora.net/2009/07/15/large-scale-data-mining-using-genetics-based-machine-learning/</link>
		<comments>http://www.xavierllora.net/2009/07/15/large-scale-data-mining-using-genetics-based-machine-learning/#comments</comments>
		<pubDate>Wed, 15 Jul 2009 21:56:17 +0000</pubDate>
		<dc:creator>Xavier</dc:creator>
				<category><![CDATA[Data-Intensive Computing]]></category>
		<category><![CDATA[GBML & LCS]]></category>
		<category><![CDATA[Learning Classifier Systems]]></category>
		<category><![CDATA[Presentations]]></category>
		<category><![CDATA[Research]]></category>
		<category><![CDATA[data-intensive flows]]></category>
		<category><![CDATA[genetics-based machine learning]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[LCS]]></category>
		<category><![CDATA[map-reduce]]></category>

		<guid isPermaLink="false">http://www.xavierllora.net/?p=568</guid>
		<description><![CDATA[Below you may find the slides of the GECCO 2009 tutorial that Jaume Bacardit and I put together. Hope you enjoy it. Slides Abstract We are living in the petabyte era. We have larger and larger data to analyze, process and transform into useful answers for the domain experts. Robust data mining tools, able to cope [...]
Related posts:<ol>
<li><a href='http://www.xavierllora.net/2006/12/13/observer-invariant-histopathology-using-genetics-based-machine-learning/' rel='bookmark' title='Observer-Invariant Histopathology using Genetics-Based Machine Learning'>Observer-Invariant Histopathology using Genetics-Based Machine Learning</a></li>
<li><a href='http://www.xavierllora.net/2009/04/07/deadline-extended-for-special-issue-on-metaheuristics-for-large-scale-data-mining/' rel='bookmark' title='Deadline extended for special issue on Metaheuristics for Large Scale Data Mining'>Deadline extended for special issue on Metaheuristics for Large Scale Data Mining</a></li>
<li><a href='http://www.xavierllora.net/2007/04/17/machine-learning-statistical-learning-in-r/' rel='bookmark' title='Machine learning &amp; Statistical Learning in R'>Machine learning &#38; Statistical Learning in R</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Below you may find the slides of the <a href="http://www.sigevo.org/gecco-2009/tutorials.html#lsdm">GECCO 2009 tutorial</a> that <a href="http://www.cs.nott.ac.uk/~jqb/">Jaume Bacardit</a> and I put together. Hope you enjoy it.</p>
<p><strong>Slides</strong></p>
<iframe src="http://www.slideshare.net/slideshow/embed_code/1727172" width="425&type=s" height="356" frameborder="0" marginwidth="0" marginheight="0" scrolling="no"></iframe><br/><br/>
<p><strong>Abstract</strong></p>
<p>We are living in the petabyte era. We have larger and larger data to analyze, process and transform into useful answers for the domain experts. Robust data mining tools, able to cope with petascale volumes and/or high dimensionality while producing human-understandable solutions, are key in several domain areas. Genetics-based machine learning (GBML) techniques are perfect candidates for this task, among others, due to the recent advances in representations, learning paradigms, and theoretical modeling. If evolutionary learning techniques aspire to be a relevant player in this context, they need to have the capacity of processing these vast amounts of data and they need to process this data within reasonable time. Moreover, massive computation cycles are getting cheaper and cheaper every day, allowing researchers to have access to unprecedented parallelization degrees. Several topics are interlaced in these two requirements, among them: (1) having the proper learning paradigms and knowledge representations, (2) understanding them and knowing when they are suitable for the problem at hand, (3) using efficiency enhancement techniques, and (4) transforming and visualizing the produced solutions to give back as much insight as possible to the domain experts.</p>
<p>This tutorial will try to answer this question, following a roadmap that starts with the questions of what large means, and why large is a challenge for GBML methods. Afterwards, we will discuss different facets in which we can overcome this challenge: efficiency enhancement techniques, representations able to cope with large dimensionality spaces, and scalability of learning paradigms. We will also review a topic interlaced with all of them: how we can model the scalability of the components of our GBML systems to better engineer them and get the best performance out of them for large datasets. The roadmap continues with examples of real applications of GBML systems and finishes with an analysis of further directions.</p>
<p>Related posts:<ol>
<li><a href='http://www.xavierllora.net/2006/12/13/observer-invariant-histopathology-using-genetics-based-machine-learning/' rel='bookmark' title='Observer-Invariant Histopathology using Genetics-Based Machine Learning'>Observer-Invariant Histopathology using Genetics-Based Machine Learning</a></li>
<li><a href='http://www.xavierllora.net/2009/04/07/deadline-extended-for-special-issue-on-metaheuristics-for-large-scale-data-mining/' rel='bookmark' title='Deadline extended for special issue on Metaheuristics for Large Scale Data Mining'>Deadline extended for special issue on Metaheuristics for Large Scale Data Mining</a></li>
<li><a href='http://www.xavierllora.net/2007/04/17/machine-learning-statistical-learning-in-r/' rel='bookmark' title='Machine learning &amp; Statistical Learning in R'>Machine learning &#38; Statistical Learning in R</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.xavierllora.net/2009/07/15/large-scale-data-mining-using-genetics-based-machine-learning/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>ICEIS 2008: Final sprint and Ricardo Baeza-Yates</title>
		<link>http://www.xavierllora.net/2008/06/16/iceis-2008-final-sprint-and-ricardo-baeza-yates/</link>
		<comments>http://www.xavierllora.net/2008/06/16/iceis-2008-final-sprint-and-ricardo-baeza-yates/#comments</comments>
		<pubDate>Mon, 16 Jun 2008 11:10:46 +0000</pubDate>
		<dc:creator>Xavier</dc:creator>
				<category><![CDATA[Events]]></category>
		<category><![CDATA[Conferences]]></category>
		<category><![CDATA[distributed computing]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[innovation]]></category>
		<category><![CDATA[search]]></category>

		<guid isPermaLink="false">http://www.xavierllora.net/?p=249</guid>
		<description><![CDATA[This is the final sprint for ICEIS. I have been mostly focusing on posters this morning. It is hard to pick one. I would just say that there was some interesting work on personalized recommender systems (paper 219). But as I said, there were a bunch of interesting ones and quite a few interesting by-the-poster [...]
Related posts:<ol>
<li><a href='http://www.xavierllora.net/2008/06/18/iceis-2008-blogging-summary-and-final-strings/' rel='bookmark' title='ICEIS 2008: Blogging summary and final strings'>ICEIS 2008: Blogging summary and final strings</a></li>
<li><a href='http://www.xavierllora.net/2008/06/13/iceis-2008-first-friday-afternoon-session/' rel='bookmark' title='ICEIS 2008: First Friday afternoon session'>ICEIS 2008: First Friday afternoon session</a></li>
<li><a href='http://www.xavierllora.net/2008/06/13/iceis-2008-blogging-from-barcelona-friday-morning/' rel='bookmark' title='ICEIS 2008: Blogging from Barcelona (Friday Morning)'>ICEIS 2008: Blogging from Barcelona (Friday Morning)</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>This is the final sprint for ICEIS. I have been mostly focusing on posters this morning. It is hard to pick one. I would just say that there was some interesting work on personalized recommender systems (<a title="Papers" href="http://www.iceis.org/program/Program_Monday.htm">paper 219</a>). But as I said, there were a bunch of interesting ones and quite a few interesting by-the-poster conversations. Actually, I am having a very interesting time thanks to the mix of attendees&#8217; profiles.</p>
<p>The morning finally meandered into <a title="Ricardo Baeza-Yates" href="http://research.yahoo.com/bouncer_user/70" target="_blank">Ricardo Baeza-Yates</a>&#8217;s keynote talk. After the initial technical problems (the presentation mode of OpenOffice running on Ubuntu 8.04 showed just 75% of the slide area), they finally succeeded in getting something up and got the talk started. This was a pretty technical talk about the Yahoo! Research effort on caching to improve performance, help scalability, and contain costs in the coming years. Besides several caching techniques, he also presented a bunch of possible parallelization models based on document/term partitions. One thing he breezed over was the machine learning model for classifying queries. It surfaced in several places, from predicting common and rare content to classifying frequent, infrequent, and rare queries. I was glad that the technical problems were solved and we could enjoy it.</p>
<p>And the conference is finally closed. Next year, Milan.</p>
<p>Related posts:<ol>
<li><a href='http://www.xavierllora.net/2008/06/18/iceis-2008-blogging-summary-and-final-strings/' rel='bookmark' title='ICEIS 2008: Blogging summary and final strings'>ICEIS 2008: Blogging summary and final strings</a></li>
<li><a href='http://www.xavierllora.net/2008/06/13/iceis-2008-first-friday-afternoon-session/' rel='bookmark' title='ICEIS 2008: First Friday afternoon session'>ICEIS 2008: First Friday afternoon session</a></li>
<li><a href='http://www.xavierllora.net/2008/06/13/iceis-2008-blogging-from-barcelona-friday-morning/' rel='bookmark' title='ICEIS 2008: Blogging from Barcelona (Friday Morning)'>ICEIS 2008: Blogging from Barcelona (Friday Morning)</a></li>
</ol></p>]]></content:encoded>
			<wfw:commentRss>http://www.xavierllora.net/2008/06/16/iceis-2008-final-sprint-and-ricardo-baeza-yates/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The next generation of data bases</title>
		<link>http://www.xavierllora.net/2008/06/05/the-next-generation-of-data-bases/</link>
		<comments>http://www.xavierllora.net/2008/06/05/the-next-generation-of-data-bases/#comments</comments>
		<pubDate>Thu, 05 Jun 2008 11:39:34 +0000</pubDate>
		<dc:creator>Xavier</dc:creator>
				<category><![CDATA[Notes]]></category>
		<category><![CDATA[couchDB]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[HBase]]></category>
		<category><![CDATA[Metadata]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://www.xavierllora.net/?p=238</guid>
		<description><![CDATA[Yesterday I was reading an interview with Brian Aker (MySQL director of technology) I found via Slashdot when something caught my attention. On the second side of this which may actually be more exciting is the issue of&#8211;instead of the structured data world of the relational database but the semi&#8211;the semi-structured world. You look at what is being [...]
No related posts.]]></description>
			<content:encoded><![CDATA[<p>Yesterday I was reading an <a title="Brian Aker interview" href="http://news.oreilly.com/2008/06/brian-akers-vision-for-a-livab.html">interview with Brian Aker</a> (MySQL director of technology) I found via <a title="Slashdot" href="http://developers.slashdot.org/article.pl?sid=08/06/03/210224&amp;from=rss">Slashdot</a> when something caught my attention.</p>
<blockquote><p>On the second side of this which may actually be more exciting is the issue of&#8211;instead of the structured data world of the relational database but the semi&#8211;the semi-structured world. You look at what is being done today with <a title="CouchDB" href="http://incubator.apache.org/couchdb/">CouchDB</a>, you look at Amazon ScaleDB, to a lesser extent but to a similar extent you&#8211;not ScaleDB, SimpleDB&#8211;to a lesser extent or a similar extent <a title="Tokyo cabinet" href="http://tokyocabinet.sourceforge.net/">Tokyo Cabinet</a>, those databases are really kind of fascinating because those databases are redefining really how we access data and how we are going to be searching and using data. So there&#8217;s a whole world out there that&#8217;s just starting to open up in that direction.</p></blockquote>
<p>For a while now, <a title="Previous post on metadata stores" href="http://www.xavierllora.net/?s=metadata+stores">I have been using different flavors of metadata stores</a>. Everything tends to work nice and dandy as long as you do not push the storage volume too far. For instance, together with <a title="Bernie Acs" href="http://www.ncsa.uiuc.edu/AboutUs/People/contact.php?id=885">Bernie Acs</a> at NCSA, we have run experiments where we could deal with up to <a title="Virtuoso" href="http://virtuoso.openlinksw.com/wiki/main/">280 million triples using Virtuoso</a> or up to 60 million triples using <a title="Jena" href="http://jena.sourceforge.net/">Jena</a> with a <a title="MySQL" href="http://www.mysql.com/">MySQL</a> back end without much trouble, and still be able to run arbitrary SPARQL queries in a reasonable time. However, these were relatively small tests. The first one represented only 120 documents in a collection, whereas the second one was only a subset of the Wikipedia link graph. Yes, there are ways to move beyond that via proper striping and replication of the data, but that is not by default a key concern of such engines. Another sad note was that we had to drop Mulgara because we had a pretty hard time pushing it that far (most of the bugs we ran into have been fixed since last year and they have started a push toward getting SPARQL in, so it may be time to revisit it).</p>
<p>However, none of the above approaches was born out of a distributed environment. Lately, I have started looking for alternative large-scale storage born from the distributed environment&#8217;s soup. After the <a title="Hadoop Summit and BDCSG" href="http://www.xavierllora.net/2008/03/26/summary-of-bdcsg2008-blogging/">Hadoop Summit/BDCSG 2008</a> trip I started looking into <a title="HBase" href="http://wiki.apache.org/hadoop/Hbase">HBase</a> (an open source take on <a title="Big Table" href="http://labs.google.com/papers/bigtable.html">Google&#8217;s Big Table</a> by the <a title="Hadoop" href="http://hadoop.apache.org/core/">Hadoop</a> community). Not a bad alternative if you can fit your application data needs into its structure. Since it runs on <a title="Hadoop" href="http://hadoop.apache.org/core/">Hadoop FS</a>, you get all its benefits for free. After reading the above-mentioned interview, however, I found myself intrigued by <a title="CouchDB" href="http://incubator.apache.org/couchdb/">CouchDB</a> and <a title="Tokyo cabinet" href="http://tokyocabinet.sourceforge.net/">Tokyo Cabinet</a>. I guess I better go and take a look at them <img src='http://www.xavierllora.net/wp-includes/images/smilies/icon_biggrin.gif' alt=':D' class='wp-smiley' /> </p>
<p>No related posts.</p>]]></content:encoded>
			<wfw:commentRss>http://www.xavierllora.net/2008/06/05/the-next-generation-of-data-bases/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ZooKeeper and distributed applications</title>
		<link>http://www.xavierllora.net/2008/05/22/zookeeper-and-orchestrating-distributed-applications/</link>
		<comments>http://www.xavierllora.net/2008/05/22/zookeeper-and-orchestrating-distributed-applications/#comments</comments>
		<pubDate>Thu, 22 May 2008 13:39:36 +0000</pubDate>
		<dc:creator>Xavier</dc:creator>
				<category><![CDATA[Notes]]></category>
		<category><![CDATA[data-intensive flows]]></category>
		<category><![CDATA[distributed computing]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[meandre]]></category>
		<category><![CDATA[zookeeper]]></category>

		<guid isPermaLink="false">http://www.xavierllora.net/?p=223</guid>
		<description><![CDATA[Lately I have been exploring different alternatives for coordinating the execution of distributed applications. Yes, you guessed it right, I am working on the distribution of the execution of Meandre flows. Chopping the data-intensive flow and mapping the chunks onto a set of distributed processors requires several elements (graph analysis, resource management, etc.). However, the [...]
Related posts:<ol>
<li><a href='http://www.xavierllora.net/2009/09/29/temporary-storage-for-meandres-distribute-flow-execution/' rel='bookmark' title='Temporary storage for Meandre&#8217;s distributed flow execution'>Temporary storage for Meandre&#8217;s distributed flow execution</a></li>
<li><a href='http://www.xavierllora.net/2007/09/19/embedded-jetty-applications/' rel='bookmark' title='Embedded Jetty applications'>Embedded Jetty applications</a></li>
<li><a href='http://www.xavierllora.net/2010/09/13/parallel-and-distributed-computational-intelligence-book-is-out/' rel='bookmark' title='Parallel and Distributed Computational Intelligence book is out for pre-order'>Parallel and Distributed Computational Intelligence book is out for pre-order</a></li>
</ol>]]></description>
			<content:encoded><![CDATA[<p>Lately I have been exploring different alternatives for coordinating the execution of distributed applications. Yes, you guessed it right, I am working on the distribution of the execution of <a title="Meandre" href="http://seasr.org/meandre">Meandre flows</a>. Chopping the data-intensive flow and mapping the chunks onto a set of distributed processors requires several elements (graph analysis, resource management, etc.). However, the basic problem that needs to be solved first is reliable and scalable coordination.</p>
<p>During my trip to the <a title="Hadoop Summit" href="http://developer.yahoo.com/hadoop/summit/">Hadoop Summit</a> and the <a title="BDCSG" href="http://www.xavierllora.net/2008/03/26/summary-of-bdcsg2008-blogging/">Big Data Computing Study Group</a> I ran into the <a title="ZooKeeper project" href="http://zookeeper.sourceforge.net/">ZooKeeper project</a>, a <a title="Yahoo! Research" href="http://research.yahoo.com/node/1849">Yahoo Research</a> project. In their own words:</p>
<blockquote><p>ZooKeeper is a high available and reliable coordination system. Distributed applications use ZooKeeper to store and mediate updates key configuration information. ZooKeeper can be used for leader election, group membership, configuration maintenance, etc.</p></blockquote>
<p>There is also an interesting introductory lesson <a title="ZooKeeper" href="http://zookeeper.sourceforge.net/">here</a>, and some recipes for the most common data structures (queues, priority queues, distributed locks, etc.) are also available <a title="ZooKeeper recipes" href="http://zookeeper.wiki.sourceforge.net/ZooKeeperRecipes">here</a>. It looks promising, and may make coordinating the distributed execution of <a href="http://seasr.org/meandre">Meandre</a> flows easier. The other reason that pushed me to explore this direction was that one of the <a title="Hadoop Summit" href="http://developer.yahoo.com/hadoop/summit/">Hadoop Summit</a> highlights was the adoption of the <a title="ZooKeeper project" href="http://zookeeper.sourceforge.net/">ZooKeeper project</a> by <a title="Hadoop" href="http://hadoop.apache.org/core/">Hadoop</a> itself.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.xavierllora.net/2008/05/22/zookeeper-and-orchestrating-distributed-applications/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

