While preparing for the upcoming Meandre release, part of the larger SEASR one, I have been exploring the inner machinery of WordPress to create a simple plugin for the SEASR site that could render Meandre component and flow descriptions straight out of their RDF descriptors. It turned out to be a breeze. After being a heavy user of WordPress in several projects for a few years now, its extensibility keeps sparking new possibilities in my mind every time I look into any of its facets. Anyway, I am rambling. Back to the point. To hit the ground running, I started by creating a simple generic plugin that replaces a parameterized tag with something else.

I looked into several freely available plugins; the one that got me going was a very simple one, the SlideShare plugin by Joost de Valk. The plugin transforms a formatted tag into code that embeds a Flash player for a SlideShare presentation. Below, I reproduce the skeleton of a plugin that replaces the tags in your posts, pages, and comments with whatever you like. A tag takes the form [tag-mnemonic param1 param2], and the plugin replaces it with a plain HTML list of the two parameters.

<?php
/*
Plugin Name: Simple Plugin
Plugin URI: http://host.org/
Description: Replaces a tag for a list.
Version: 0.1
Author: Xavier Llorà
Author URI: http://www.xavierllora.net

Based on SlideShare plugin (http://wordpress.org/extend/plugins/slideshare/)

Installation: copy simple-plugin.php to the wp-content/plugins directory of
your WordPress installation, then activate the plugin. Use the following syntax
for your tag

[tag-mnemonic Param1 Param2]

When the plugin is active, the tag will be replaced by a simple HTML list
containing the parameter values.

*/

define("SP_REGEXP", "/\[tag-mnemonic ([[:print:]]+) ([[:print:]]+)\]/");
define("SP_TARGET", "<ul><li>###Param1###</li><li>###Param2###</li></ul>" );

function sp_plugin_callback($match)
{
	$output = SP_TARGET;
	$output = str_replace("###Param1###", $match[1], $output);
	$output = str_replace("###Param2###", $match[2], $output);
	return ($output);
}

function sp_plugin($content)
{
	return (preg_replace_callback(SP_REGEXP, 'sp_plugin_callback', $content));
}

add_filter('the_content', 'sp_plugin');
add_filter('comment_text', 'sp_plugin');

?>

That means that if you write the tag [tag-mnemonic http://foo.org http://example.org], the plugin will reformat the tag and inject the following HTML code

<ul>
<li>http://foo.org</li>
<li>http://example.org</li>
</ul>

where the tag originally was. And that’s it!

One more thing. The only thing you may want to be aware of is the filtering chain. The add_filter function takes a third, numeric parameter: the priority (default 10; lower numbers run earlier). When you have multiple plugins that rely on filters to do their job, sometimes you need to force a certain order of execution to avoid stepping on each other's toes. If that is your case, you had better check the add_filter documentation. It may save you some time, trust me :D
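
To see why the order matters, here is a minimal Python sketch of a priority-ordered filter chain. It is only an analogy, not WordPress code: the registry, the Python add_filter and apply_filters functions, and the shout filter are all made up for illustration, and the regular expression uses \S+ rather than the plugin's [[:print:]]+ to keep the example short.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for a filter registry:
# hook name -> list of (priority, registration order, callback).
_filters = defaultdict(list)


def add_filter(hook, callback, priority=10):
    """Register callback on hook; lower priority numbers run earlier."""
    seq = len(_filters[hook])
    _filters[hook].append((priority, seq, callback))


def apply_filters(hook, content):
    """Run every callback registered on hook, ordered by priority, then by registration."""
    for _priority, _seq, callback in sorted(_filters[hook], key=lambda entry: entry[:2]):
        content = callback(content)
    return content


# Simplified version of the tag pattern: two whitespace-free parameters.
TAG = re.compile(r"\[tag-mnemonic (\S+) (\S+)\]")


def expand_tag(content):
    """Replace each tag with an HTML list of its two parameters."""
    return TAG.sub(r"<ul><li>\1</li><li>\2</li></ul>", content)


def shout(content):
    """A made-up second filter that would break the tag if it ran first."""
    return content.upper()


# Registered in the "wrong" order on purpose; the priorities fix the execution order.
add_filter("the_content", shout, priority=20)
add_filter("the_content", expand_tag, priority=5)

if __name__ == "__main__":
    print(apply_filters("the_content", "See [tag-mnemonic http://foo.org http://example.org]"))
```

If shout ran before expand_tag, the tag would be upper-cased and the pattern would no longer match it, which is exactly the kind of toe-stepping that an explicit priority avoids.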

PS: Joost had some comments on improving this simple plugin. Check the comments below


Jul 01

Efficient storage for Python

Did you ever run into the situation where your analysis/simulation data is too large to fit in memory? Has the flat file format you use for storing your data sets become so big that it slows everything to a crawl? If you answered yes, you may want to give the HDF5 library a spin. HDF5 files are not a replacement for relational databases. They cater to storing complex data objects and a wide variety of metadata. The format is also optimized for efficient storage and retrieval. The underlying library is written in C.

If you are a Python user, PyTables provides a very efficient wrapper for HDF5 files. It gives you access to all the HDF5 API, plus it is nicely integrated with NumPy and provides natural naming conventions. In other words, you can quickly store and retrieve your arrays and matrices to and from HDF5 files, giving you a very interesting persistence layer. For instance, you can do a simple table scan with:

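import tables

# open_file is the current name of the camelCase openFile from PyTables 2.x
h5f = tables.open_file("test.h5f", "r")
# Iterate over every row of the table named "Table" at the file root
for r in h5f.root.Table:
    pass
h5f.close()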

The code above opens an HDF5 file called “test.h5f” which contains a table at the root named “Table” (see the natural notation here). Then it just scans the table using the common Python syntax. You can also do grouped reads, read everything into memory, run filtered where queries, and the like.
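
As a rough sketch of the filtered reads and read-everything cases, the snippet below first writes a small table and then queries it. The Sample row layout, its column names, and the helper function names are invented for the example; the calls assume the PyTables 3.x naming (open_file, create_table).

```python
import os
import tempfile

import tables


class Sample(tables.IsDescription):
    # Hypothetical two-column row layout, for illustration only.
    run = tables.Int32Col()
    score = tables.Float64Col()


def write_samples(path, n=10):
    """Create an HDF5 file with a table named 'Table' at the root."""
    with tables.open_file(path, mode="w") as h5f:
        table = h5f.create_table("/", "Table", Sample)
        row = table.row
        for i in range(n):
            row["run"] = i
            row["score"] = i * 0.5
            row.append()
        table.flush()


def high_scores(path, threshold):
    """Filtered read: return the run ids whose score exceeds threshold."""
    with tables.open_file(path, mode="r") as h5f:
        table = h5f.root.Table
        query = table.where("score > threshold", condvars={"threshold": threshold})
        return [int(r["run"]) for r in query]


def read_all(path):
    """Read the whole table into memory as a NumPy structured array."""
    with tables.open_file(path, mode="r") as h5f:
        return h5f.root.Table.read()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "test.h5f")
    write_samples(path)
    print(high_scores(path, 3.0))
    print(len(read_all(path)))
```

The where condition is evaluated by PyTables while it scans the file, so only the matching rows reach the Python loop; read() is the opposite trade-off and pulls the whole table into memory at once.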

Definitely worth giving it a spin if you have too much data for your available memory or flat file format. Also, I am wondering how fast this may fly on an SSD…


Yes, I am back. Almost 28 hours later than expected, but I am back in Urbana. The trip started with a very tiny little delay of 8 hours in Barcelona. Yes, COA121 was the flight to Newark. Yes, COA121 was the first leg toward Urbana. Yes, the plane had technical problems on the way to Barcelona and had to land in the Azores to get “fixed”. Yes, we did not dare to ask. Yes, we were tired of waiting for a plane that seemed never to come; but you should have seen the faces of the passengers deplaning at Barcelona: exhausted is not a descriptive enough word. Yes, we all filled out a complaint form asking for our money back according to the European Union bill of rights for air passengers. Yes, we know that it was a bit of wishful thinking and the lack of anything better to do while waiting.

I was surprised how civilized everybody was. Nobody shouted. Everybody was polite and orderly. It was also surprising how little the company told the ground personnel. The poor guys did an outstanding job of keeping everybody relaxed and informed, given the circumstances. Having almost zero information (yes, even with almost zero information there were two competing versions of what was going on), they finally managed to do the job. The poor guys also had a never-ending day. I bet they spent more than 14 hours there. But of course, if you get delayed 8 hours you arrive in Newark at 10pm and, hence, you leave the airport at 11pm. Yes, you leave the airport because there are no more flights that late at night. So they give you a hotel and meal voucher, and they book you on the next flight to O’Hare. I was “lucky” that the next available flight was at 1:50pm. Some folks had only 3 hours of sleep at the hotel before heading back to the airport to catch 5am flights.

And a new day starts. And you get to the airport refreshed and optimistic. And you board the plane on time. And you are happy that you will finally be able to leave on time. And it is summer. And O’Hare has slowed down because of the weather (some storm going on). And you sit for another hour on a plane waiting for O’Hare to bless your departure. And you get there late; luckily you called the rental car company the day before. And you do not see your luggage. And when you walk toward Continental’s lost luggage office to complain, you find your luggage among the lost luggage of the earlier flight you could not get on. And you go to the rental. And finally you find yourself in a traffic jam driving back to Urbana, one day later than planned, a bit exhausted, and having missed some things at work, but having met a lot of new people you would never have talked to.


A while ago I wrote about tools for managing your LaTeX bibliography. Although the tools I described help manage your LaTeX bibliography collection, they still did not help much with managing the tons of PDF files you end up piling up when doing research on a particular topic. BibDesk now has the ability to attach files to entries, but Zotero, with its ability to store snapshots, is still the closest thing I have found so far.

However, a friend just pointed me to Papers, a Mac tool (yes, it is only available for Mac) for managing your digital library of papers. Very much like iTunes, it lets you streamline your searching, reading, organizing, and writing; there is a very interesting webcast by the creators of the software. If you have a Mac, it is worth giving it a spin.


If you are looking for a list of the related blogging done during ICEIS 2008, just follow this link. On Sunday morning I ran into Angel A. Juan, an assistant professor at the Open University of Catalonia (UOC), who is interested in analyzing online teaching efforts and in tools that help professors monitor student performance in online media. I visited him yesterday at his office and we had an interesting exchange of ideas. Most of them revolved around the work we have conducted under the DISCUS project, and how similar our efforts on marketing focus groups are to their online teaching environment. His group, Distributed, Parallel and Collaborative Systems, was also interested in the work done under the SEASR project, mostly focusing on the Meandre infrastructure for data-intensive flow computing that we are getting close to releasing.

 


This is the final sprint for ICEIS. I have been mostly focusing on posters this morning. It is hard to pick just one. I would just say that there was some interesting work on personalized recommender systems (paper 219). But as I said, there were a bunch of interesting ones and quite a few interesting by-the-poster conversations. Actually, I am having a very interesting time thanks to the mix of attendees’ profiles.

The morning finally meandered into Ricardo Baeza-Yates’s keynote talk. After the initial technical problems (the presentation mode of OpenOffice running on Ubuntu 8.04 showed just 75% of the slide area), they finally succeeded in getting something up and the talk started. This was a pretty technical talk about Yahoo! Research’s efforts on caching to improve performance, help scalability, and contain costs in the coming years. Besides several caching techniques, he also presented a bunch of possible parallelization models based on document/term partitions. One thing he breezed over was the machine learning model for classifying queries. It surfaced in several places, from predicting common and rare content to frequent, infrequent, and rare queries. I was glad that the technical problems were solved and we could enjoy it.

And the conference is finally closed. Next year, Milan.


The morning started with Jean-Marie Favre and his invited speech about “Software languages through the ages”. Quite an eclectic presentation, and quite a thought-provoking one. Some excerpts out of his amalgam of concepts:

  • Human kind is defined by language
  • Civilization is defined by writing
  • Languages span a vast period of history; computer science covers just a few millimeters of that timeline

You may agree with or dispute his claims, but you cannot dispute that it was a thought-provoking talk. His bottom line: the next research frontier is software language engineering and software linguistics. You can find more information at planet-sl.org.

Then I ran to attend the presentations of papers 263 and 344. I would mention Texas Tech’s SORCER effort on a service-oriented infrastructure for programming large-scale networked systems; their approach also takes issues like availability and fault tolerance into account from the get-go. Unfortunately I could not stay for the end of the presentation because I had to run: it was my turn.

The afternoon started with papers 331, 346, and 723. The last one presented IBM’s work on modeling life cycles, and their effort to make it a generic model. The last round of papers, 87, 614, and 655, was rather eclectic: RDF, RFID tags, and Wikipedia. Quite an interesting combination. The afternoon was also crowded with interesting hallway conversations, despite the gray day outside :)


This afternoon has been quite slow. My path: 340, 348, 495, 612, 193, and 467. The main drag of the afternoon has been that in each of the sessions there has always been a missing presenter. Maybe it is because it is a Saturday afternoon in Barcelona, but then I checked the rest of the parallel tracks and it seems quite endemic. Anyway, I ran into an interesting talk about how to map workflows onto multicore architectures (paper 340). It was a talk that led straight to scheduling problems on directed acyclic graphs. I asked about the cyclic cases, and their approach was to break the cycle and use the acyclic schedule. Mmh. I also found myself wondering how they plan to make those schedulers a reality without messing with the underlying OS one. Another paper that caught my attention was paper 193, which focused on fraud risk reduction. The interesting twists there were that (1) it is, by nature, an unsupervised problem, and (2) they were trying to do both detection and prevention. Unfortunately the only evolutionary computation related paper (paper 306) had no presenter…

A totally different story has been Moira Norrie’s keynote speech. She presented their research efforts on paper and the digital world. A very interesting presentation covering everything from basic technologies (Anoto) and the latest digital paper products (Livescribe) to their research on interactive paper. The basic idea: paper that allows you to interact with content, for instance multimedia. Their research ranges from clicking on paper, to gestures, to the infrastructure (iServer, their cross-media information server) and its plugins, such as iPaper. Active components (no, not the Microsoft ones) are the bridges they build to control media from paper and a digital pen. Of course, they can also record annotations, whiteboards, and audio, or support collaboration via multiple pens. Other interesting uses of digital paper were information search and proofreading, to mention a few. You can find more information on Moira’s home page.


Jun 14

ICEIS 2008: Saturday morning jam

My morning jam involved papers 102, 363, 395, 450, 709, 234, 392, and 499, which included a poster session too. There were some puzzling questions running around my head. Just one example out of paper 450: there are always many terminologies that mean the same thing and revolve around the same problem. Where is the difference between multicriteria optimization and multiobjective optimization? As I said, puzzling.

I also ran into a poster during the jam by Intel folks (paper 234) worth mentioning. It was not proposing new technology but building on the creation of distributed data centers via virtualization technologies. They coined the term SVGs to describe virtualization + data isolation + service provision. It was more an exercise to show how people can move to virtual data centers hosted in the cloud than a proposal of new technology; they said they run VMware for all the virtualization. The question still buzzes in my head unanswered: How is this new or different from Amazon’s EC2? The answer was that they focus on the infrastructure, not on hosting, which puzzled me more after they claimed to rely on VMware. I guess that I will need to dig deeper to get a better picture of their work.

Just before the lunch break, paper 195 presented some work on an engineering approach to determine the emotional signature of a clothing website. Built around the Kansei method, the paper resonated in my head with Gladwell’s book Blink.


Jorge Cardoso (SAP Research @ Dresden) presented their efforts on Theseus/Texo. They are using the SOA paradigm as a way to reengineer business processes, pushing good old functional design decomposition to the next level. The cornerstone of their work is the ISE methodology, which proposes to streamline the process as follows.

  1. Innovation
  2. Requirements
  3. Design
  4. Implementation
  5. Preparation to market
  6. Market launch

The interesting twist of the methodology is that it also tries to treat each of these steps as a service, raising again the buzzword of the conference: “business services”.