
· 4 min read

Currently, most of the communication around OpenRefine happens through the mailing list and our Twitter account, where information is quickly buried for anyone not following the project on a day-to-day basis. This monthly summary highlights key events and contributions in the community and will hopefully help circulate key information more widely.

Feedback on the format and content is welcome. Ping us on Twitter @OpenRefine if we are missing information.

New Tutorials and Articles

If you are new to OpenRefine, Alvin Chang published an excellent introduction to clustering, perhaps OpenRefine's most appreciated feature among new users. For more advanced users, @UMBHLCuration published a tutorial on Normalizing Dates with OpenRefine.

At the Toronto OpenRefine UnConference, I presented on Iterative data discovery and transformation with OpenRefine, explaining why OpenRefine is an essential tool for non-technical subject matter experts working with data. To dig further into the topic, see the article Agile Data Process with OpenRefine.

If you are working in a library, check out the recording of this month's NCompass Live: Metadata Manipulations: Using MarcEdit and OpenRefine. @silviaegt also published the slides of their DHBenelux talk with links to VIAF reconciliation for OpenRefine and CartoDB.

Finally, a great use case presented by @hpiedcoq and @jvilledieu: How to visualize your Facebook network with OpenRefine, OutWit and Neo4j.

We also have a new resource published in French and Italian:

Development Update

OpenRefine 2.6 RC1

Over the last three months we received updates to translate the OpenRefine interface into Spanish and French. These will complement the current English, Italian and Chinese versions.

We are currently testing the 2.6 RC1 version as a quick checkpoint, allowing us to verify all the fixes made since beta 1 and figure out what remaining loose ends need to be cleaned up. More information is available on the developer mailing list.

Reconciliation and Matching Framework (RMF)

Matthew Blissett, with the support of the Royal Botanic Gardens in London, is releasing the Reconciliation and Matching Framework (RMF), a framework for matching string entities using customised sets of transformations and matchers, plus a tool to produce the necessary configurations and another to expose them as OpenRefine reconciliation services.

GOKb Announcement

GOKb, the Global Open Knowledgebase, is a community-managed project that aims to describe electronic journals and books, publisher packages, and the platforms which host these resources. GOKb uses OpenRefine (with a specially designed extension) as its major mechanism for getting data into GOKb, exploiting the ability to clean up the data (which tends to come from publishers and can be of variable quality) and to re-apply changes to future data from the same publisher or supplier.

GOKb opened to ‘public preview’ in January 2015, and you can sign up for an account and access the service at https://gokb.kuali.org/gokb/

Several hundred e-journal packages, along with associated information about the e-journal titles, platforms and organisations, have been added to the knowledgebase over the past few months. OpenRefine is used to do much of the work of getting data ready for loading into GOKb.

Alongside this work of adding content, GOKb has also opened up APIs to interact with the service, which could be useful to others using OpenRefine to work with journal-related data. In particular, the ‘Coreference service’ allows you to look up identifiers (such as ISSNs) and get back journal title information and other IDs associated with that title (as JSON or XML).
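
As a rough illustration, such a lookup could be scripted in a few lines. Note that the endpoint path and parameter names below are assumptions made for the sake of the sketch, not the documented GOKb API; check the GOKb documentation before relying on them.

```python
import requests

# Hypothetical coreference lookup: the URL and parameter names are
# placeholders, not the documented GOKb API.
COREFERENCE_URL = "https://gokb.kuali.org/gokb/api/coreference"

def lookup_issn(issn):
    """Ask the (assumed) coreference service what it knows about an ISSN."""
    response = requests.get(COREFERENCE_URL, params={"id": issn, "format": "json"})
    response.raise_for_status()
    return response.json()  # expected: journal title plus associated identifiers

if __name__ == "__main__":
    print(lookup_issn("0028-0836"))
```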

They are interested in:

  • Talking to people who use OpenRefine and would like to make use of such a service
  • Hearing, if there is interest, what support and documentation people would like to see
  • Understanding whether they can offer different or better services based on the GOKb data for OpenRefine (e.g. would other data GOKb holds be of interest? Would a reconciliation service for journal titles be useful? etc.)

More details are available on the user mailing list, where you can join the discussion.

Workshops and Events

The OpenRefine Twitter feed has been busy over the last month, with over 12 presentations of OpenRefine! Thank you to our evangelists who introduced OpenRefine to librarians, journalists, government and open data professionals, among other groups. Top hashtags are:

Want to connect with fellow Refiners? The following events have been announced so far. Ping us on Twitter @OpenRefine to announce your event:

· 4 min read

Since OpenRefine moved to GitHub two years ago, the project has reached a mature stage, and only maintenance work has been done, with the release of OpenRefine 2.6-beta. At the same time, the project has kept gaining traction, with over a thousand weekly downloads and usage by various audiences in particular fields of application. One out of four OpenRefine users identifies as a librarian, with researchers and open data enthusiasts representing the second largest user group, and data journalists and Semantic Web professionals completing the picture (see the full 2014 survey results).

OpenRefine offers an innovative workflow from data ingestion to consumption, with the capacity to reconcile information for consistency and to work with remote data processing services. It integrates with over 16 reconciliation services and has 8 community-contributed plugins that extend its capabilities. You can also interact directly with the APIs of four other platforms from within OpenRefine: AlchemyAPI, Zemanta, dataTXT and CrowdFlower. The following map lists the different services and plugins working with OpenRefine, as well as projects that have done heavy customization to embed OpenRefine in their data manipulation processes.

Map of OpenRefine Ecosystem

The full list of reconciliation services is kept up to date on the OpenRefine wiki, and links to each extension are available on the download page. Integrations and extensions can be broken down by user community:

General Usage

Several extensions are available to add new functionality to OpenRefine:

  • Diff
  • Stats calculate
  • BITS VIB
  • NER extension
  • extraCTU and geoXtension

Custom reconciliation services can be built using the following projects developed by the Open Knowledge Foundation (a minimal sketch of the protocol they implement follows the list):

  • Reconcile-csv
  • nomenklatura
  • SPARQL endpoints
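
For readers curious about what building such a service involves, here is a minimal sketch of the reconciliation protocol OpenRefine speaks, written with Flask. This is not the implementation of any of the projects above; the record data, URIs and scoring are invented for illustration.

```python
import json

from flask import Flask, jsonify, request

app = Flask(__name__)

# Toy lookup table standing in for a real data source (a CSV file, a SPARQL
# endpoint, ...); the IDs and names are invented for illustration.
RECORDS = {"org/1": "Open Knowledge Foundation", "org/2": "OpenRefine"}

SERVICE_METADATA = {
    "name": "Toy reconciliation service",
    "identifierSpace": "http://example.org/ids",   # placeholder URIs
    "schemaSpace": "http://example.org/schema",
}

def match(query_text):
    """Score each record with a crude exact / substring comparison."""
    results = []
    for record_id, name in RECORDS.items():
        if query_text.lower() in name.lower():
            exact = query_text.lower() == name.lower()
            results.append({
                "id": record_id,
                "name": name,
                "type": [{"id": "/organization", "name": "Organization"}],
                "score": 100 if exact else 50,
                "match": exact,
            })
    return results

@app.route("/reconcile", methods=["GET", "POST"])
def reconcile():
    queries = request.values.get("queries")
    if not queries:
        # Without a queries parameter, OpenRefine expects the service metadata.
        return jsonify(SERVICE_METADATA)
    batch = json.loads(queries)  # e.g. {"q0": {"query": "openrefine"}, ...}
    return jsonify({key: {"result": match(q["query"])} for key, q in batch.items()})

if __name__ == "__main__":
    app.run(port=8000)
```

Pointing OpenRefine's reconciliation dialog at http://localhost:8000/reconcile would then let you reconcile a column against the toy records.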

RefinePro offers hosted OpenRefine services (cloud and on-premise), providing extra compute power, access from multiple devices and multi-user management.

Finally, a Python library and a Ruby library enable batch processing of the history of an OpenRefine project.
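
Independently of either library, the underlying idea is that OpenRefine records every operation applied to a project as JSON, and that history can be extracted and re-applied to fresh data through OpenRefine's HTTP commands. The sketch below assumes a local OpenRefine instance on its default port (3333); the exact field names of the responses are assumptions to verify against your OpenRefine version.

```python
import json

import requests

REFINE = "http://127.0.0.1:3333"  # assumed local instance on the default port

def get_history(project_id):
    """Fetch the operation history of an existing project."""
    resp = requests.get(REFINE + "/command/core/get-operations",
                        params={"project": project_id})
    resp.raise_for_status()
    # The "entries" / "operation" field names are assumptions; inspect the
    # actual JSON returned by your OpenRefine version.
    return [entry["operation"] for entry in resp.json()["entries"]]

def apply_history(project_id, operations):
    """Re-apply a saved operation history to another project."""
    resp = requests.post(REFINE + "/command/core/apply-operations",
                         data={"project": project_id,
                               "operations": json.dumps(operations)})
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    ops = get_history("1234567890123")         # project IDs are placeholders
    print(apply_history("9876543210987", ops))
```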

Librarians

Directly from OpenRefine, six different reconciliation services are available to librarians:

  • Vivo
  • FundRef
  • JournalTOCs
  • VIAF
  • FAST (Faceted Application of Subject Terminology) Reconciliation
  • Library of Congress Subject Headings

Librarians with a Global Open Knowledgebase (GOKb) account can export and import data directly to and from this repository of electronic journals, books and publisher packages. The GOKb extension tightly integrates OpenRefine with the GOKb workflow.

Open Data

In the open data world, users of CKAN, the open source data portal software maintained by the Open Knowledge Foundation, can prepare (normalize, clean up, etc.) their data before publishing it via OpenDataRise, a distribution of OpenRefine. OpenDataRise also allows insertion and update of entities in a target knowledge base; for now the only reconciliation service supported is Entitypedia (which is yet to go public), but other services might eventually be supported.

Directly from OpenRefine, several reconciliation services are available to the open data community:

  • Influencer Explorer from the Sunlight Foundation
  • dbpedia
  • OpenCorporates
  • Ordnance Survey

Biodiversity Researchers

Biodiversity researchers benefit from direct access to seven reconciliation services including:

  • EOL
  • NCBI taxonomy
  • Ubio FindIT
  • WORMS
  • GBIF
  • Global Names Index
  • IPNI

Semantic Web

Semantic Web professionals have access to the RDF and Linked Media Framework (LMF) extensions to produce and consume linked data. The LODRefine distribution integrates extensions that make the transition from tabular data to Linked Data a bit easier.

Projects Under Development

SparkRefine is currently just a prototype, developed by Andrey Bratus, a bachelor's student at the University of Trento in Italy, with SpazioDati in the context of the Fusepool P3 project. The initial results are available in Andrey’s thesis, but the codebase is not in particularly good shape.

BatchRefine is a collection of wrappers for running OpenRefine in batch mode, developed by SpazioDati for Fusepool P3.

Redfine is Redlink's internal bundle of OpenRefine. The project is currently in development as they integrate some internal services into Redlink's public API.

Deprecated

The Freebase extension is partly broken, and Freebase has announced the sunset of its services in 2015. The CrowdFlower extension needs to be updated to the new version of CrowdFlower's API.

· 6 min read

Following the 2012 survey, which gathered 99 answers, I wanted a fresh picture of who OpenRefine users are. The 2014 survey received 129 answers over the span of two weeks. The goal of this second survey was to understand who OpenRefine's audience is and what their relationship is with the official community tools (the mailing list and the GitHub issue tracker).

Community you identify with

One out of four OpenRefine users identified as a librarian, making this group the largest in the OpenRefine user base. Researchers and open data enthusiasts are the next two largest groups, each representing over 15% of the user base. Finally, data journalists and Semantic Web professionals each represent around 10%.

We cannot directly compare these results with those from 2012 because, in the 2012 survey, users were able to select multiple answers. However, we can note that in 2012 librarians were not identified as an individual group, and they are now the largest one.

How often do you use Refine

Taking a slightly higher-level view, the split is 41% using it weekly, 30% monthly and 29% less than monthly. Usage frequency remains broadly the same between 2012 and 2014.

2014 Results

2012 Results

For how long have you been using OpenRefine

The split of users remains constant between 2012 and 2014 with

  • a third of them using Refine for over two years,
  • a third between one and two years and
  • a third using OpenRefine for less than one year.

Skills

In both 2012 and 2014 we asked respondents to rate their skills from one to five, one being a novice in Refine and five being a master. When comparing the results, we can see a sharp increase in 2014 in the percentage of users who self-evaluated their skills at 3; they now represent 43% of the OpenRefine user base. It is interesting that the number of users rating their skills 5 (6 users) and 4 (28 in 2012 and 24 in 2014) remained roughly the same over time.

Skills vs time

At a high level, the more experienced users are, the higher they rate their skills. It takes about six months for users to take full advantage of OpenRefine; only after this period do users move their skills from 2/5 to 3/5. After two years of usage, no user rates their skills 1/5.

On the other side, it takes time to master OpenRefine, as the proportion of users rating their skills 4 or 5 only really increases after two years of usage, with still over 50% rating their level between 2 and 3.

The 2012 picture is a bit harder to read because the 3/5 rating is not as prominent as in 2014. However, we can see the same trend of skilled users increasing over time.

Skills vs frequency

Very briefly, the more often people use it, the better their skills are.

Frequency vs time

Users with over two years of experience tend to use it more often, but there is no strong correlation between how long a user has been using OpenRefine and their frequency of usage. As we have seen previously, about 47% of them use it weekly, a third monthly, and between 20% and a third less than monthly.

Seeking support

General overview

In order of importance, when people need support:

  • they first learned how to use OpenRefine by themselves (91%), using online tutorials or by exploring the interface
  • 54% reach out to online communities (but not the OpenRefine mailing list)
  • 34% asked someone they know
  • Only 19% use the mailing list
  • and 7% have attended a formal training.

It would be interesting to know which other online communities provide support for the 54% of OpenRefine users who reach out to them.

Usage of the mailing list

We can see that only 19% of users reach out through the mailing list, which is supposed to be the main communication channel for the community. When we drill down into who is using the mailing list, we find that it is mainly users with over a year of experience. The graphic below shows the percentage of users using the mailing list, broken down by how long they have been using OpenRefine.

Reporting bugs and requesting new features

Because the results are similar for bug reporting and new feature requests, the following analysis focuses only on the bug reporting answers.

Overview

Close to 63% of users are happy with OpenRefine and have nothing to report. However, 11% of them want to report something but don't know how. Two hypotheses can be explored:

  • GitHub is too complex for some users to report issues
  • The project is missing links and instructions on how to report issues and request new features.

Breakdown by experience with Refine

The following graphic shows that only users with over one year of experience with OpenRefine have reported a bug or requested a new feature. Surprisingly, over 60% of the users who don't know how to report something have also been using Refine for over a year.

Breakdown by skill level

The following graphic shows that the ability to report is not linked to skill level, since some users with a skill level of 4 don't know how to report. It is interesting that nearly all users self-evaluated at 5 did report a bug or request a feature.

Even experienced Refine users have difficulty reporting bugs and requesting new features. Better guidance is needed, and this can be done by improving either:

  • the OpenRefine interface, with a direct link to the GitHub issue list
  • the OpenRefine wiki or website, with a page describing the process.

Perception of Refine

Why did you choose OpenRefine

In both 2012 and 2014, OpenRefine is chosen for its easy and powerful interface for cleaning large datasets and for its reconciliation options. The fact that the project is free and open source is also important for a number of users.

2012

2014

Alternative tools

As in 2012, the tools OpenRefine is benchmarked against split into programming/scripting languages (Python, R, MySQL) and spreadsheet-based interfaces like Excel or LibreOffice.

2012

2014

Words used to describe OpenRefine

OpenRefine is described as a data cleaning tool.

2012

2014

· 2 min read

From its inception until October 2012, Refine development was driven mainly by corporations. Metaweb and Google committed resources to support and grow Refine for more than two years, until the Google Refine 2.5 release. Since the end of Google support 18 months ago, OpenRefine has operated as an independent community, relying only on volunteers to maintain the code and support users and contributors.

In fall 2012, I wrote an article on the history of OpenRefine and the need to build a framework for the community. At that time, the discussion was focused more on the technical aspects of structuring the community (i.e. moving the code base and documentation to GitHub) and less on defining a way to work together.

Today, I feel that we are missing a piece to make this community work: a reference document that will guide our decisions and provide visibility, for existing and new users, on how this community works. Having a governance document will help us to:

  • define responsibilities within the community,
  • have a way to work together and build consensus,
  • provide visibility for new users joining the project.

I drafted a governance document on the openrefine.github.com project. The document is not available through the main website.

I've built the first section based on the "Meritocratic Governance Model" by the University of Oxford and the OWIN Project Governance model. The section regarding consensus building remains to be defined by the community. Your comments and pull requests are welcome (and needed).

Once the governance model is defined, I would like to see new committers joining the project to help with:

  • Releasing the final 2.6 version (we have been in beta since October 2013)
  • Clearing the pull request queue (10 pull requests are currently pending)
  • Developing the road map for version 3.0

Martin Magdinier

· 2 min read

“How do I get started?” is the question we received most during our hands-on workshops on data cleaning and enhancing. OpenRefine is a very powerful tool in the hands of a skilled user, but how do you become one? There is a wiki, several screencasts, and a list of helpful resources. However, until recently, no complete OpenRefine manual existed, so you had to collect documentation from different sources if you wanted to master OpenRefine.

This is why we've written an OpenRefine manual called [_Using OpenRefine_](http://www.packtpub.com/openrefine-guide-for-data-analysis-and-linking-dataset-to-the-web/book) that leads you from your first steps to all advanced OpenRefine topics. Using the [entire dataset of the Powerhouse Museum](http://www.powerhousemuseum.com/collection/database/download.php), it lets you experience OpenRefine techniques in a hands-on way, starting from creating a project and inspecting data and gradually evolving towards complex operations. Rather than being a one-directional text, this book offers detailed recipes you can pick whenever you need them.

In particular, you'll learn about these topics in Using OpenRefine:

  • importing data in various formats
  • exploring datasets in a matter of seconds
  • applying basic and advanced cell transformations
  • dealing with cells that contain multiple values
  • creating instantaneous links between datasets
  • filtering and partitioning your data easily with regular expressions
  • using named-entity extraction on full-text fields to automatically identify topics
  • performing advanced data operations with the General Refine Expression Language

Get started with OpenRefine right away, for free

Download the entire second chapter of the book for free, so you can already learn about sorting, facets, filters, duplicates, and more. It's the fastest way to get up to speed with OpenRefine. If you also want to learn about advanced transformations and about connecting your data to the Linked Data cloud, buy the paperback or e-book today!

—Ruben and Max
authors of Using OpenRefine

· 5 min read

Yesterday, David Huynh announced that Google will soon stop its active support of Google Refine and is counting on the community to get more involved in growing Refine.

Refine is already a mature data cleaning tool, but this change in leadership will be a major challenge for the tool's continuity. First, though, I'd like to clarify something I read on Twitter last night: Google Refine has always been an open source tool, and anyone can commit changes, develop an extension or update the wiki.

Through this post, I'd like to give my insight into the reasons for this decision and its short-term consequences.

Google Refine Background

First, let's review a brief history of Google Refine. Google Refine finds its roots in the Freebase Gridworks solution developed by Metaweb Technologies, Inc. in May 2010. From its first version, Freebase Gridworks was an open source project. Initially, it was a tool designed to support the Freebase database and community with data cleaning, reconciliation and upload. This historical link with Freebase is still present in Google Refine, as the solution supports reconciliation against the Freebase database.

In July 2010, Google acquired Metaweb and, by extension, Freebase and Gridworks. Freebase Gridworks was renamed Google Refine, and the code and documentation moved to a code.google.com instance. The freshly renamed Google Refine continued to be an open source project for data cleaning. During the 2010-2012 period, with the support of Google engineers and the community, three upgrades of Google Refine were released (2.0, 2.1 and 2.5), with the 2.6 version on its way (see this discussion for more details).

Over the last 16 months of editing this blog, I've seen tremendous interest in Google Refine from various communities. Librarians, journalists and data analysts have been using Google Refine to clean and reconcile their data. Reconciliation services for more databases have been built, and an extension to support RDF has been written. A vibrant community has emerged, opening new horizons for Refine's capabilities.

The user-friendly interface has helped thousands of non-technical users take control of their data. We are just at the doorstep of the big data world, and Google Refine lowers the technical barrier to jump in, empowering more people to do data analysis and processing. As Tom Hirst published today, Google Refine is a great entry-level glue logic tool to create bridges between various applications or systems.

Why is Google stopping its support of Google Refine?

Google Refine is a desktop-based application that can work both offline and online (for reconciliation and web scraping). During the two years Refine was with Google, I didn't witness any specific integration with other Google services (like Docs, Drive or Fusion Tables). Google didn't develop Refine as a cloud-based solution to take advantage of its computing capability; I guess the current desktop architecture of Google Refine prevents such a migration. My bet is that Google bought Metaweb Technologies for the Freebase data and not for the Gridworks/Refine functionality, and today they don't see the business case to continue supporting Refine.

So, now what?

In the short term, nothing changes. Google Refine is a standalone desktop application, so like most standalone software, as long as it is installed on your machine, it will run. In a second phase, I see two main impacts: the end of Google branding and the development of a community-supported tool.

The End of Google Branding

Thanks to Google's support, Refine is now one of the most mature data cleaning and wrangling tools available. The Google branding also helped a lot with Google Refine's marketing and community building. I suppose the Google name was a kind of guarantee of the product's capability and maturity for some users, and this naming helped democratize the tool.

However, I think that losing Google's name will help Refine in two ways:

  1. The Google branding made many new Refine users think that Refine was a cloud application and that data was uploaded to Google servers. That has never been the case: Refine is a local application. However, some users might have been reluctant to use Refine for this reason. Maybe a different branding will make them more confident about their data privacy.

  2. Reading last night's Twitter feed, I realized that most people didn't know that Google Refine IS already an open tool. I reckon the Google branding might have confused most of us, myself included (I created this blog because I didn't know how to edit the wiki). By naming it OpenRefine (or something else), let's hope that more people will engage with the community and help improve it.

OpenRefine - a Community-Supported Application

Please note that the OpenRefine branding is not definitive and is open to discussion on the current mailing list.

This new step for Google Refine is also a test of how strong its community is. Will there be enough momentum to support, maintain and grow the product from a code, documentation and strategy perspective? Organizing the current community and welcoming new users and contributors will be OpenRefine's first challenge.

Right now, we are facing a blank sheet of paper, and many things need to be defined from a community management and organization perspective. While the code will be released on GitHub, little has been defined yet to support the community and articulate the current ecosystem around Google Refine.

I will be part of this adventure. If you are interested in jumping in, this is the right time. Join the conversation on the mailing list and share your thoughts!