Reviewing OpenRefine

I have been using OpenRefine extensively in a few projects during the last 12 months and I am now in one of these typical hate-love relationships with it. OpenRefine (formerly Google Refine) is an “open source tool for cleaning and transforming data” although I found it also extremely useful for exploring data (or at least a sample of it) and getting a better understanding of the data before automating the processing of the full dataset using python or java on hadoop.

I mostly used version 2.5 of OpenRefine, which is the last stable version. There is also a version 2.6 that has been under development forever but I have only played with it and haven’t been using it on real projects. OpenRefine is written in Java and runs as a local web application that opens in a web browser. It is available as source code or precompiled packages for Linux, Mac OS X and Windows. My experience (and therefore this review) is limited to version 2.5 on Mac OS X and Windows.

A typical OpenRefine project consists of:

  1. Importing data into OpenRefine;
  2. Transforming it; and
  3. Exporting the result.

Importing Data

Importing data in OpenRefine is relatively easy and leads to the creation of a table of records where each column has a title and each row, which corresponds to a record, is numbered. The application supports the importation of common data formats:

  • CSV-like formats: Personally, I like to have values between double quote and separated by a pipe instead of a comma (e.g., “value1″|”value2″|”value3”). OpenRefine handles it correctly. It shows you how it has interpreted the first records and allows you to interactively modify this interpretation including by selecting the right encodings. The values contained in the first line of the imported file are used as column titles.
  • Excel spreadsheets: This works assuming that your documents only contains data and that you have removed all fancy headings, comments, and formatting. Here also the values contained in the first row of the imported spreadsheet are used as column titles.
  • JSON: When importing a JSON document, an interactive mode lets you select the JSON node (i.e., {object}) that corresponds to the first record to load, then data is imported using the JSON keys found in the imported records as column titles. There are as many columns as keys found in the entire document (not just in the first record). For a reason that I don’t really understand (and that is rather annoying) in each column title, the key name is preceded by the string “__anonymous__ – ” (e.g., “__anonymous__ – TITLE”). Despite this, importing JSON data works relatively well as long as the JSON structure to import stays simple. With complex nested JSON structures, it was necessary to first convert the JSON document into my pipe-separated value format in order to get the result I wanted (and without the annoying “__anonymous__” prefix column name).
  • XML and XHTML: Similarly to what happens for JSON, an interactive mode lets you select the XML element which corresponds to the first record to load. In practice, this only works for very simple XML structures and, in most of the cases, I had to convert the XML documents I wanted to import into CSV format.

The data to import can exist as a file on a file system, as an online file or a web service, or can be pasted from the clipboard. OpenRefine also permits directly importing google data but I haven’t had an opportunity to work with that yet.

Transforming Data

Once your data is imported, OpenRefine offers a large range of tools and features to work on it. All the actions performed on the data are recorded which makes it possible to browse them and undo them at anytime and reestablish the data in a previous state, which often proves extremely useful.

Operations on data can roughly be grouped into three categories depending on the fact that they are based on rows, columns, or cells. I am not going to review all of them but rather pick a few examples in each category demonstrate, in my opinion, the strengths and weaknesses of the tool.

Row-based operations

Row-based operations are limited to marking (with flags or stars) and deleting selected rows.
There is no way to add rows, which is sometimes frustrating.

Rows can be selected by combining filters (simple searches) and facets (facetted searches) on columns. By listing all the distinct values in a column and the number of instances of each of these values in this column, OpenRefine facets are a powerful means to explore, search, and edit data. And sometimes, the facet is the information you need. However, there is no easy way to get and reuse this information. You can try to cut and paste the facet into a text editor but the absence of blank space between a value and the number of instances of this value makes the information hard to reuse, especially if the value is numeric or ends with a digit.

Column-based operations

OpenRefine operations on columns range from simple operations such as renaming/deleting a column or sorting (permanently or not) rows based on the values found in this column; to much more complex ones such as adding a column based on the values found in an existing column or by fetching a URL. Thanks to the Google Refine Expression Language (GREL), these functionalities allow OpenRefine to perform sophisticated operations such as:

  • Geo-localizing IP addresses: For example by submitting the IPs found in a column to a geolocalization web service, parsing the result to extract the corresponding city and adding it to a new column or
  • Merging datasets with common columns: This is the equivalent of performing a “join” with a relational database with the only (and somewhat frustrating) limitation that you can only add one column at a time.

Cell-based operations

Cell-based operations mostly consist of transforming/manipulating the values contained in the cells of a column. This is done by applying GREL statements to the cells of the selected rows. Because writing GREL statements, although a very powerful language that, for example, supports regular expressions, can be cumbersome, OpenRefine also offers a menu of predefined common transformations such as trimming the leading and trailing whitespaces of a string or replacing HTML entities by their corresponding characters.

Exporting Data

Once the dataset is ready, it can be exported. OpenRefine supports the common data formats: Comma-Separated Values (CSV), Tab-Separated Values (TSV), HTML table, Excel spreadsheet. It also offers a power full templating mechanism that allows users to interactively define the format of the data they want to export. By default, the data is presented as a JSON array of record objects (i.e., each OpenRefine row is presented as a JSON object). However, this default format can easily be modified to produce any other format (e.g., a different JSON or an XML).

In addition, OpenRefine has its own format that permits saving an OpenRefine project as a project, which allows, for example, for importing the project in another instance of OpenRefine and continuing to work on the data.

Conclusion

To summarize, OpenRefine has a lot of well-thought out features that make it an excellent tool to clean, transform and explore data such as:

  • Its powerful Undo/Redo functionality;
  • Its excellent support for UTF-8 and other character sets;
  • GREL and, for example, the possibility to “join” columns from different datasets;
  • Its interactive templating export tool.

However, the current version of OpenRefine has some flaws that, in my opinion, make its use in production problematic:

  1. Some frequent operations on data are more complicated than necessary. For example, 5 steps are required to remove duplicate rows when exact values are found in a column.
  2. Much more annoying is the lack of stability of the tool which degrades after a while introducing inconsistencies into data (for example, facets return wrong terms and omit some relevant ones which potentially introduce inconsistencies). The only solution in this case is to restart OpenRefine and in the worse case, when this is not enough, to start the project over.

Let’s hope these bugs and limitations will be fixed in the next release.

Some Pointers

Advertisements
This entry was posted in Data Cleaning and tagged , , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s