2nd “t.m.i.” Meetup: Going In-Depth on Aggregating Data

We are glad to announce that the “t.m.i.” group will meet for the 2nd time on Thursday February 12 at 6:15pm at the Workshop Cafe, 180 Montgomery Street, San Francisco.

We will hear and then discuss the following presentations:

Joe Nelson: Coordinating Web Scrapers

Joe Nelson will demo a computer cluster architecture for massive parallel web scraping built out of free, open-source components. Learn how to provision, coordinate, and monitor scrapers in real-time. After reviewing the pieces of the architecture we’ll see it in action scraping a real site. Finally we’ll see how the data is consolidated in S3 storage and touch on the next steps for data transformation.

David Massart: A Data Model, Workflow, and Architecture for Integrating Data

The presentation proposes an approach for integrating data from different data sources. It starts by introducing “actions” and “facts”, the two core concepts of the data model upon which the proposed approach is based. Then it looks at the workflow that leads from the acquisition of raw data from various sources to its storage and integration as action-fact data. Finally, it proposes an architecture for supporting this workflow.

Feel free to join us if you are in SF next Thursday!

Posted in Uncategorized | Leave a comment

Inside MongoDB SF 2014

Earlier this month, I attended MongoDB SF, the last stop of the MongoDB Days 2014 tour. Usually, I am reluctant to attend vendor-organized conferences because they tend to be more marketing opportunities than venues for gaining in depth understanding of the technology. However I was pleasantly surprised at the MongoDB event. Most of the presentations I heard were obviously delivered by the actual technology creators.

When using MongoDB I’ve found that its MapReduce facility is hard to use (and not just because of having to write your mapper and reducer in JavaScript) and does not really work well. In practice it is necessary to externalize the map/reduce jobs to Hadoop. So, I learned that the MongoDB team have confronted this problem and in response they developed an Hadoop connector. And of course they already offer an Elasticsearch connector because MongoDB is not great at facetted searches. All of these developments put MongoDB at the center of an eco-system. In this eco-system, tasks (that MongoDB is not good at) are passed to specialized third-party solutions. I need to think more about this (and to test it) but I am not 100% convinced by this architecture. I would prefer an architecture where all these tools (e.g., MongoDB, Hadoop, Eleasticsearch) are equal rather than depending on MongoDB.

Another interesting insight I got from attending the event is that the MongoDB team is looking at providing some type of schema validation. Although the way they want to achieve this is not clear yet, I think this is a step in the right direction. After all, data is used by applications that assume a certain data structure. In this context, having the database enforcing this data structure can only help because it allows you to write simpler code (by avoiding defensive programming) which speeds up processing (something important for big data) and cuts down on the possibility of coding errors. This is one of the objectives behind our facetted approach where a JSON schema is used to define the data structure of each facet and validate it before it is stored in the database.

Posted in Database | Tagged , , , | Leave a comment

Reviewing Silk (http://silk.co/)

I attended a meetup presentation of Silk two weeks ago and decided to give it a try. For the last couple of months, I’ve been working on a data project consisting of setting up a workflow for collecting and integrating a large quantity of open data from heterogeneous sources and I was hoping that Silk could help me to visualize and explore this newly integrated data. As I quickly realized, this is not Silk’s purpose. Silk is not a data visualization tool, not even a data presentation tool as such but rather an online publication/presentation tool that helps narrate a story and illustrate it with data.

The Silk data model is very simple and consists of ‘collections’, ‘pages’, and ‘facts’. If you consider a Silk collection as a single table in a relational database, a page is an instance (i.e., a row) and a fact is an attribute (i.e., a column) for which various data types are possible: text, numeric value, URL.

A page consists of at least a title (which must be unique for each page within the collection) and one or more facts listed as a two-column table: The first column for the fact names, the second for the fact values. This basic page structure can be improved by manually adding content (e.g., texts, images, audio or video recordings), visualizations (e.g., maps, pies, bar charts, tables), and overviews (e.g., recent pages, tables of content).

A collection can be created either by encoding each page manually or by uploading an Excel spreadsheet or a CSV file where, after the first row will be used as fact names, one page will be created for each of the following rows, the first column being used as page title. Each collection must have a unique title.

A Silk project consists of at least one collection that is referenced from a homepage. As can be seen with these examples, in addition to references to its collections, a homepage can be completed with texts, images, and visualizations of its collections that allow people to go directly to (instance) pages whereas going to an individual collection page allows you to interact with this collection by visualizing it (as a table, a list, a grid, a pie chart, a map) and filtering its (instance) pages in various ways.

In my opinion, it is important to remember that Silk is a presentation tool to tell stories and illustrate them with data. Using Silk assumes that you must first have a good story to tell, which takes time (the same way preparing a good PowerPoint presentation takes time). It also means that the data you use need to be prepared to serve your story, which also takes time. Chances are that you will not be able to use the data sets (even curated) you already have as-is. So, depending on what your goal is, expect to spend ‘a certain amount of time’ on OpenRefine or whatever your favorite data preparation tool might be. Also, keep in mind that:

  • A Silk project can contain a maximum of 3000 pages, so if you deal with large data sets you probably want to carefully select the data instances you want to publish in each collection. (During the MeetUp I attended, Alex Salkever from Silk announced that this limit will be raised in 2015 as Silk is improving the graph database engine that powers the service.)
  • A Silk collection is a simple table. This means that nested data structures, like those typically expressed in JSON, need to be normalized before they can be uploaded to Silk.

To conclude, I would say that Silk is a promising communication tool as demonstrated by the quality of some of the presentations already available on the platform and I look forward to seeing how it is going to develop in 2015. In the meantime, I keep looking for the tool that will allow me to easily and visually explore data.

Posted in Data Visualization | Tagged , , , , | Leave a comment

Information Management in a Big Data World

Big Data technology consists of a set of tools and techniques used to handle data when the amount of this data, its (lack of) structure and/or the speed at which it needs to be processed exceed the (scaling-up) capacity of conventional database management systems.  These technologies are usually based on multi-node architectures designed to easily scale out.

When talking about Big Data, there is a tendency to presume that the amount of data at hand is so ‘big’, so messy, and in such flux that it can only be handled by (big) data scientists, i.e., hackers who rely on Big Data technology to apply statistics, machine learning, and other techniques to extract meaningful information from Big Data sets which are regarded as ‘black boxes’.
But does it really need to be a black box? 
There is no reason why Big Data technology cannot be used to collect, analyze, classify, manipulate, store, and retrieve Big Data in the way information scientists typically manage information. 
Considering that before any useful information can be extracted from Big Data, it needs to be obtained and scrubbed.  Two necessary evils that, despite being considered as secondary, are generally reported by data practitioners as counting for up to 80% of the effort.  It is imperative to tackle this in a more systematic way by adapting information management techniques to this new ‘big’ environment.
What constitutes information management in a Big Data world, how can Big Data be managed?  These are issues we are interested in and would like to discuss with others during the Meet-up we organized for Thursday November 20th at 7pm Pacific.  So, if you are in San Francisco at that time feel free to join the discussion.

Posted in Meetup | Tagged , , , , , | Leave a comment

Reviewing OpenRefine

I have been using OpenRefine extensively in a few projects during the last 12 months and I am now in one of these typical hate-love relationships with it. OpenRefine (formerly Google Refine) is an “open source tool for cleaning and transforming data” although I found it also extremely useful for exploring data (or at least a sample of it) and getting a better understanding of the data before automating the processing of the full dataset using python or java on hadoop.

I mostly used version 2.5 of OpenRefine, which is the last stable version. There is also a version 2.6 that has been under development forever but I have only played with it and haven’t been using it on real projects. OpenRefine is written in Java and runs as a local web application that opens in a web browser. It is available as source code or precompiled packages for Linux, Mac OS X and Windows. My experience (and therefore this review) is limited to version 2.5 on Mac OS X and Windows.

A typical OpenRefine project consists of:

  1. Importing data into OpenRefine;
  2. Transforming it; and
  3. Exporting the result.

Importing Data

Importing data in OpenRefine is relatively easy and leads to the creation of a table of records where each column has a title and each row, which corresponds to a record, is numbered. The application supports the importation of common data formats:

  • CSV-like formats: Personally, I like to have values between double quote and separated by a pipe instead of a comma (e.g., “value1″|”value2″|”value3”). OpenRefine handles it correctly. It shows you how it has interpreted the first records and allows you to interactively modify this interpretation including by selecting the right encodings. The values contained in the first line of the imported file are used as column titles.
  • Excel spreadsheets: This works assuming that your documents only contains data and that you have removed all fancy headings, comments, and formatting. Here also the values contained in the first row of the imported spreadsheet are used as column titles.
  • JSON: When importing a JSON document, an interactive mode lets you select the JSON node (i.e., {object}) that corresponds to the first record to load, then data is imported using the JSON keys found in the imported records as column titles. There are as many columns as keys found in the entire document (not just in the first record). For a reason that I don’t really understand (and that is rather annoying) in each column title, the key name is preceded by the string “__anonymous__ – ” (e.g., “__anonymous__ – TITLE”). Despite this, importing JSON data works relatively well as long as the JSON structure to import stays simple. With complex nested JSON structures, it was necessary to first convert the JSON document into my pipe-separated value format in order to get the result I wanted (and without the annoying “__anonymous__” prefix column name).
  • XML and XHTML: Similarly to what happens for JSON, an interactive mode lets you select the XML element which corresponds to the first record to load. In practice, this only works for very simple XML structures and, in most of the cases, I had to convert the XML documents I wanted to import into CSV format.

The data to import can exist as a file on a file system, as an online file or a web service, or can be pasted from the clipboard. OpenRefine also permits directly importing google data but I haven’t had an opportunity to work with that yet.

Transforming Data

Once your data is imported, OpenRefine offers a large range of tools and features to work on it. All the actions performed on the data are recorded which makes it possible to browse them and undo them at anytime and reestablish the data in a previous state, which often proves extremely useful.

Operations on data can roughly be grouped into three categories depending on the fact that they are based on rows, columns, or cells. I am not going to review all of them but rather pick a few examples in each category demonstrate, in my opinion, the strengths and weaknesses of the tool.

Row-based operations

Row-based operations are limited to marking (with flags or stars) and deleting selected rows.
There is no way to add rows, which is sometimes frustrating.

Rows can be selected by combining filters (simple searches) and facets (facetted searches) on columns. By listing all the distinct values in a column and the number of instances of each of these values in this column, OpenRefine facets are a powerful means to explore, search, and edit data. And sometimes, the facet is the information you need. However, there is no easy way to get and reuse this information. You can try to cut and paste the facet into a text editor but the absence of blank space between a value and the number of instances of this value makes the information hard to reuse, especially if the value is numeric or ends with a digit.

Column-based operations

OpenRefine operations on columns range from simple operations such as renaming/deleting a column or sorting (permanently or not) rows based on the values found in this column; to much more complex ones such as adding a column based on the values found in an existing column or by fetching a URL. Thanks to the Google Refine Expression Language (GREL), these functionalities allow OpenRefine to perform sophisticated operations such as:

  • Geo-localizing IP addresses: For example by submitting the IPs found in a column to a geolocalization web service, parsing the result to extract the corresponding city and adding it to a new column or
  • Merging datasets with common columns: This is the equivalent of performing a “join” with a relational database with the only (and somewhat frustrating) limitation that you can only add one column at a time.

Cell-based operations

Cell-based operations mostly consist of transforming/manipulating the values contained in the cells of a column. This is done by applying GREL statements to the cells of the selected rows. Because writing GREL statements, although a very powerful language that, for example, supports regular expressions, can be cumbersome, OpenRefine also offers a menu of predefined common transformations such as trimming the leading and trailing whitespaces of a string or replacing HTML entities by their corresponding characters.

Exporting Data

Once the dataset is ready, it can be exported. OpenRefine supports the common data formats: Comma-Separated Values (CSV), Tab-Separated Values (TSV), HTML table, Excel spreadsheet. It also offers a power full templating mechanism that allows users to interactively define the format of the data they want to export. By default, the data is presented as a JSON array of record objects (i.e., each OpenRefine row is presented as a JSON object). However, this default format can easily be modified to produce any other format (e.g., a different JSON or an XML).

In addition, OpenRefine has its own format that permits saving an OpenRefine project as a project, which allows, for example, for importing the project in another instance of OpenRefine and continuing to work on the data.


To summarize, OpenRefine has a lot of well-thought out features that make it an excellent tool to clean, transform and explore data such as:

  • Its powerful Undo/Redo functionality;
  • Its excellent support for UTF-8 and other character sets;
  • GREL and, for example, the possibility to “join” columns from different datasets;
  • Its interactive templating export tool.

However, the current version of OpenRefine has some flaws that, in my opinion, make its use in production problematic:

  1. Some frequent operations on data are more complicated than necessary. For example, 5 steps are required to remove duplicate rows when exact values are found in a column.
  2. Much more annoying is the lack of stability of the tool which degrades after a while introducing inconsistencies into data (for example, facets return wrong terms and omit some relevant ones which potentially introduce inconsistencies). The only solution in this case is to restart OpenRefine and in the worse case, when this is not enough, to start the project over.

Let’s hope these bugs and limitations will be fixed in the next release.

Some Pointers

Posted in Data Cleaning | Tagged , , , , | Leave a comment