· 4 min read

Update (2021-08-06): both positions have been filled.

OpenRefine is seeking a Junior Developer - Wikimedia Commons reconciliation and batch upload functionalities (paid contractor position).

OpenRefine is a power tool to clean messy data, popular in a diverse range of communities. It has been serving the needs of journalists, librarians, Wikipedians, and scientists for more than 10 years, and is taught in many curricula and workshops around the world.

OpenRefine is quite actively used on Wikidata, the structured data ‘sister’ of Wikipedia. In addition, thanks to a grant from the Wikimedia Foundation, OpenRefine will, between September 2021 and August 2022, be extended with structured data functionalities for Wikimedia Commons, the media repository of the Wikimedia ecosystem. This code extension will make it possible to batch edit structured data of existing files on Wikimedia Commons, and to batch upload new Wikimedia Commons files with structured data from the start.

OpenRefine is a fiscally sponsored project of Code for Science & Society Inc, a 501(c)(3) charitable organization in the US.

The OpenRefine team is seeking a junior developer who will build the Wikimedia-specific features for the above-mentioned functionalities as web services hosted on Wikimedia Toolforge or Wikimedia Cloud VPS.

  • This is a part-time contract, over a period of 6 months.
  • Active work will be around 20 weeks, roughly from September 2021 until the end of February 2022.
  • The workload will average 20 hours per week.
  • Fully remote. We encourage developers from outside of the USA and EU to apply.
  • We have between 14,000 USD and 16,000 USD available to complete this assignment, depending on experience. The payment details will be negotiated with the contractor, who will invoice Code for Science & Society for their work towards the corresponding goals.

Responsibilities

The Junior Developer:

  • Develops a new OpenRefine reconciliation service specifically for Wikimedia Commons, inspired by the existing Wikidata Reconciliation Service and following the Reconciliation Service API Protocol.
  • Develops a batch upload tool for structured data on Wikimedia Commons. Depending on circumstances, this batch upload functionality will be developed to be compatible with already existing upload tools in the Wikimedia ecosystem, such as QuickStatements, or as a new tool.
  • Works in close collaboration with their colleague (OpenRefine developer), and will regularly coordinate with the product manager and the rest of the OpenRefine development team.

You can read more about this project, the planned tasks and the various roles, in the public grant proposal on meta.wikimedia.org.

Qualifications

Please do not self-censor if you do not meet all of these criteria, as you will develop your skills during the project.

  • Experience developing web services in a language of your choice.
  • Enthusiasm for writing good documentation and tests alongside your code.
  • Ability to work independently in a fully remote project.
  • Experience with open source development workflows on GitHub.
  • Familiarity with tool deployment on the Wikimedia Toolforge.
  • Familiarity or experience with datasets in non-Western languages, non-Latin scripts, right to left writing systems, non-Western calendars, etc. is a plus.

How to respond

Please send your resume or CV, sample of your relevant previous work, and a short letter of interest to advisory.committee@openrefine.org. We will schedule an interview with short-listed candidates. Applications will be reviewed on a rolling basis, with an aim to fill the position by July 30.

OpenRefine is fiscally sponsored by Code for Science and Society (CS&S). CS&S is an equal opportunity employer committed to hiring a diverse workforce at all levels of the organization thereby creating a culture that allows us to better serve our clientele, our employees and our communities. We value and encourage the contributions of our colleagues and strive to create an environment where everyone can reach their full potential and drive outstanding results. All qualified applicants will receive consideration for employment without regard to race, national origin, age, sex, religion, disability, sexual orientation, marital status, veteran status, gender identity or expression, or any other basis protected by local, state, or federal law. This policy applies with regard to all aspects of one’s employment, including hiring, transfer, promotion, compensation, eligibility for benefits, and termination.

· 2 min read

The OpenRefine team is delighted to share our new user manual!

This reference covers every aspect of the tool, from installation to exporting a cleaned dataset.

A screenshot of the new user manual.

At this point we would like to hear from you: Does this new user manual help you? Did we cover everything you'd expect? Any suggestions to make it better?

We're looking for your feedback on every aspect: the structure, the text, the design, the images, and anything we've missed. We'd especially love to hear from you if you're a new user and are using this manual to install and run OpenRefine for the first time.

We recommend that you start by following along with the user manual to install or upgrade to the latest version of OpenRefine: version 3.4.1. Then, try out some tasks alongside the relevant sections of the user manual. Don't try to read the whole thing - just focus on sections that are most important to you.

If you'd like to share your thoughts:

  • Please fill out our feedback form. This helps us be sure we're hearing from a broad range of participants. Please fill out the form even if you don't have much to say - we'd love to know if you've taken a look regardless. There is space within the form to report specific requested changes and line edits, as well as give us your general impression.
  • You can also suggest edits directly on GitHub (use the "Edit this page" link at the end of each page of the new docs). This will create a pull request with your suggestions; this requires a GitHub account.

We're also asking people who have written tutorials, who give workshops, and who share links on their websites to update those links to point to docs.openrefine.org. While the documentation will always be open to improvements, this is now the authoritative source for information about the tool's functions and features. We will eventually be deleting old pages from the wiki on GitHub.

If you're the author of a tutorial, we will try to get in touch soon to talk to you about updating your documentation online. If you don't hear from us, please reach out! We'd love to chat about the new docs.

· 5 min read

OpenRefine is a power tool to clean messy data, popular in a diverse range of communities. It has been serving the needs of journalists, librarians, Wikipedians, and scientists for more than 10 years, and is taught in many curricula and workshops around the world.

The OpenRefine advisory committee solicits proposals to improve contributor onboarding and retention in the project, funded by a grant from the Silicon Valley Community Foundation via the Chan Zuckerberg Initiative under their Essential Open Source Software for Science programme.

OpenRefine is a fiscally sponsored project of Code for Science & Society Inc, a 501(c)(3) charitable organization in the US.

Scope

We solicit proposals to improve the day-to-day experience of OpenRefine contributors. This covers the following areas:

  • Documentation for contributors. This can cover a wide range of topics, such as IDE setup instructions, testing guidelines, code style, pull request process, documentation of the overall architecture of the code base, guide to debugging, guide to extension development, translation workflow, release process, or other similar areas.
  • Testing improvements. We currently do not have any UI testing in place, our code coverage for the backend is very sparse, and we do not have wide-ranging integration tests either. Proposals to improve our testing are therefore in scope as well.
  • Tackling technical debt. This can cover migration out of unmaintained or obsolete libraries, dependency management, continuous integration and other housekeeping tasks.

Timeline and Process

We will use a community-centered process to ensure that decisions on technical direction are agreed upon by the community.

  • 28 August 2020: Call for proposals announced and mailing list is open for project submissions and discussion
  • 21 September 2020: Proposals due, open discussion period ends, and advisory committee discusses proposals and makes decisions on priorities and budgets
  • 30 September 2020: Selected projects announced and contracts established with CS&S to complete the work
  • 15 October 2020: Earliest work start date
  • 30 April 2021: Latest work end date

Budget

We have 50,000 USD available to fund these projects. Selected proposals will invoice Code for Science & Society for their work towards the corresponding goals.

How to apply

You do not need to have a fully formed idea to submit to the mailing list. The community and OpenRefine Advisory Committee will help you develop and scope your work. We have added a template so you can see the parts of a finished scope of work, but you do not need to know what the timeline and budget will be to propose an idea.

  • Propose your project on the openrefine-dev@googlegroups.com mailing list, to build consensus around it and refine the scope of your work with input from the community. We invite open discussion on all submitted proposals on the mailing list.
  • Submit a proposal to advisory.committee@openrefine.org for the work you intend to carry out, including a timeline and budget (see template below). If you have not yet contributed to OpenRefine, please also include a short portfolio demonstrating your work in other projects.
  • Authors of approved projects will sign a contract for the work with CS&S.
  • The contributor will invoice CS&S according to the payment schedule agreed to with OpenRefine.

Projects will be selected on the basis of their estimated benefit to the community of contributors, their cost and the contribution record of the proposer (in OpenRefine or similar projects).

Code for Science & Society is an equal opportunity employer committed to hiring a diverse workforce at all levels of the organization, creating a culture that allows us to better serve our projects, our employees, and our communities. We value and encourage the contributions of our employees and strive to create an environment where everyone can reach their full potential and drive outstanding results. All qualified applicants will receive consideration for employment without regard to race, national origin, age, sex, religion, disability, sexual orientation, marital status, veteran status, gender identity or expression, or any other basis protected by local, state, or federal law. This policy applies with regard to all aspects of one’s employment, including hiring, transfer, promotion, compensation, eligibility for benefits, and termination.

Application template

Project title: Improving OpenRefine development with Foobar

Project description (about 250 words): "I feel that the lack of Foobar has really impeded my work on the project so far. Currently we just have some Barfoo integration, but this tool is abandoned and does not work well on Windows. There seems to be consensus in the community for using Foobar 4, which would be added in the repository. We would make sure its integration with IntelliJ and Eclipse works well. The bulk of the task is to convert the existing Barfoo files to Foobar's format. (...)" Please describe why this work is important and give an overview of how you would approach solving this problem.

Mailing list thread: Give a link to the openrefine-dev thread where your proposal was discussed.

Key milestones: Please break down the work you want to do into small chunks with defined checkpoints (e.g., "I will do X, and we will know when it is done because an X will appear on the screen").

Timeline: Is this a 4 week sprint? A 3 month project? Break down how long the work will take.

Budget: Please estimate the cost of this work and include the cost of any additional services you will need to use.

· One min read

We are happy to announce that three interns will join the development team this summer:

  • Lisa Chandra (@lisa761), Google Summer of Code intern, will improve our grid view by adding support for infinite scrolling (displaying the entire grid without the need to browse it by small pages). She is supervised by Owen Stephens and Antonin Delpeuch.

  • Lu Liu (@afkbrb), Google Summer of Code intern, will add support for OAuth and other Wikibase instances in the Wikidata extension. He will be supervised by Tom Morris and Antonin Delpeuch.

  • Ekta Mishra (@darecoder1999), our Outreachy intern, will work on improving the quality assurance features of the Wikidata integration, by adding support for new Wikidata constraints and improving the existing functionality. She will be supervised by Antonin Delpeuch.

We are very excited to welcome them to the team and are looking forward to working with them. The Google Summer of Code internships will run from June 1 to August 24 and the Outreachy internship from May 19 to August 18.

We thank all applicants for the quality of their proposals and their contributions to the project earlier this year.

· 4 min read

The OpenRefine team is seeking a technical writer to help write a reference manual for the tool. This is a 6-month contract, funded by a grant from the Silicon Valley Community Foundation via the Chan Zuckerberg Initiative under their Essential Open Source Software for Science programme.

OpenRefine is a fiscally sponsored project of Code for Science & Society Inc, a 501(c)(3) charitable organization in the US.

Experience

We are looking for one or more enthusiastic contributors to join our fully remote team, to help us write a reference manual for OpenRefine. The following skills are key:

  • Ability to write documentation fluently in English. Experience with writing software documentation is ideal but not necessary;
  • Experience with OpenRefine as a user and ideally as a trainer (writing tutorials, running workshops or developing any other training material);
  • Familiarity with GitHub, as documentation changes will be reviewed there, and much of our project planning also happens there. Our documentation is written in Markdown, for which many visual editors exist (for instance, plugins to export from Google Docs to Markdown, or the HackMD online editor).

Overview of the work

As part of our milestones for the EOSS grant, we are in the process of migrating our existing documentation from our GitHub Wiki to a dedicated documentation platform. This effort is described in our root planning document and coordinated by Owen Stephens and Antonin Delpeuch. We have identified multiple documentation areas:

  • Product reference, documenting for users all the features offered by the tool in a systematic fashion;
  • Technical reference, aimed at developers, explaining the architecture of the tool;
  • Project documentation, aimed at anyone who wants to contribute code, translations, documentation, user support around OpenRefine;
  • Tutorials and how-to guides, showing by example how the tool can be used for specific data cleaning problems;
  • Discussions, to support users with their specific issues.

You will be working on the product reference only. The goal for this documentation is to provide a thorough and systematic description of the behaviour of all user-facing features of the tool, such as:

  • Operations
  • Importers
  • Exporters
  • GREL functions
  • Facets
  • Bundled extensions such as Jython and the Wikidata integration

This reference documentation is meant to support users in their exploration of the tool. Users will typically discover the tool through existing tutorials or workshops: these introductory training materials can refer to the product documentation to help users dive deeper into the concepts demonstrated by the course.

The product reference will be written in Markdown using Docusaurus. Documentation for a sample operation is provided as an example of what we are aiming for:

Your task will be to write similar documentation for other functions of the tool.

Compensation and timeline

We have 25,000 USD available to complete these milestones. The payment details will be negotiated with the contractor(s).

The work could be done by a single individual or by a team, depending on the availability of applicants.

We are aiming to hire one or more technical writers for this task by the end of May 2020.

How to respond

Update (2020-05-18): the application period for this position has lapsed, thank you to everyone who got in touch. We will announce the results in the coming weeks.

Please send your resume or CV, sample of your relevant previous work and a short letter of interest to advisory.committee@openrefine.org. We will schedule an interview with short-listed candidates.

Code for Science & Society is an equal opportunity employer committed to hiring a diverse workforce at all levels of the organization, creating a culture that allows us to better serve our projects, our employees, and our communities. We value and encourage the contributions of our employees and strive to create an environment where everyone can reach their full potential and drive outstanding results. All qualified applicants will receive consideration for employment without regard to race, national origin, age, sex, religion, disability, sexual orientation, marital status, veteran status, gender identity or expression, or any other basis protected by local, state, or federal law. This policy applies with regard to all aspects of one’s employment, including hiring, transfer, promotion, compensation, eligibility for benefits, and termination.

· One min read

OpenRefine will offer internships via the Outreachy and Google Summer of Code programs this year.

For Outreachy, pre-applications are still open and our proposed projects will be revealed on March 5th.

For the Google Summer of Code, check out our organization page and our project ideas.

We are short of mentors - if you want to help with mentoring students for the Google Summer of Code, do add yourself to the project ideas (on existing projects or on additional ones).

· 7 min read

The results of our fourth OpenRefine user survey are out! We received 178 responses in less than four weeks (vs. 122 in 2018 over seven months)! The goal of the study is to keep an accurate and up-to-date picture of the OpenRefine community. When possible, we compared the 2020 results with our previous survey. You can view the full details of the 2012, 2014, and 2018 surveys.

Community you identify with

While librarians remain the largest group (37.64%), Wikidata contributors (and semantic web users), at 12.92%, entered as the second-largest user group in 2020! Researchers now rank third with 10.7% of the user base. Note that the Archivist community was not offered as an answer option but still gathered 4.49% of the answers!

How often do you use OpenRefine

Usage frequency has remained broadly the same since 2012, with about a third using OpenRefine weekly, a third using it monthly, and a third using it less than once per month. Note that we did not offer the option Less than once a month in the 2012 survey and removed the First time user option after 2012.

For how long have you been using OpenRefine

The share of users with over two years of experience keeps growing year over year and now represents 57% of the user base (vs. 51% in 2018). We take this as a positive sign that we are retaining our user base while still attracting new users.

Note that in the analysis below, we included respondents who have never used OpenRefine in the less than six months group (five respondents in 2014, nine in 2018, three in 2020).

How will you rate your skills using OpenRefine

In all our previous surveys, we asked respondents to rate their skills from one to five, one being a novice in Refine and five being a master. The number of novice respondents dropped in 2020 from 16% to 7%. The skill level is related to the frequency of usage and not to the years of experience. No matter how long you have been using GREL, you have to practice it every day to improve!

Version of OpenRefine

The majority are using the latest stable version at the time of the survey (OpenRefine 3.2).

High-level tasks you do with OpenRefine

The usage breakdown remains stable compared to the 2012 and 2018 surveys, with the following notable differences:

  • Preparing data to load into another system now ranks second with over 73% of the respondents (vs. 66% in 2018).
  • More than half of OpenRefine users now use the reconciliation feature (up from 44% in 2018 to 54% in 2020).
  • We see a net decrease in the use of OpenRefine for data discovery (understanding data you don't own) or preparing data for visualization.

Note: Respondents can select multiple answers. Click on the image to enlarge it.

Do you use plugin or extension?

72% of the respondents do not use plugins or extensions; 11% installed only one plugin, and 17% installed more than one. Overall we see a drop in usage for many plugins.

The top three plugins are:

  • RDF extension - by DERI - 19.7%
  • Named-Entity Recognition - by Ruben Verborgh (Free Your Metadata) - 13.9%
  • History tools, cross cell tools, pivot tool and scatterplot tool using D3 - by VIB-BITS - 10.9%

The DBpedia extension has not been updated for seven years, but its usage grew from 5.7% two years ago to 10.9% in 2020. The usage of DBpedia is not linked to any particular community (6 librarians, 3 for-profits, 2 archivists, and one each for cultural heritage, data scientist, researcher, and Wikidata contributor). Any explanation is welcome!

Note: Respondents can select multiple answers. We consolidated the 41 blank answers with I don't use plugin or extension. Click on the image to enlarge it.

Do you use a reconciliation service?

In 2020, close to 65% of OpenRefine users connect to a reconciliation service! This is massive progress from 48% two years ago! 26.52% of the respondents use more than one service versus 18.03% in 2018.

While Wikidata (45%) and VIAF (22%) dominate the list, we see new services not listed in 2018:

  • GND (six respondents)
  • Getty Vocabularies (three respondents)
  • in-house reconciliation service (three respondents)
  • Open Library (one respondent)
  • ORCID (one respondent)
  • Organized Crime and Corruption Reporting Project (one respondent)
  • Planning to use with Linked Data Finland's datasets (one respondent)
  • Sharedshelf Built Work Registry Reconciliation Service (one respondent)
  • SNAC (one respondent)
  • Nomisma (one respondent)

Note: Respondents can select multiple answers. Click on the image to enlarge it.

Perception of Refine:

The following word clouds are meant to be eye-pleasing rather than an in-depth analysis.

Why did you choose OpenRefine (Which features)

Wikidata makes a nice entry on the list!

Alternative tools

Excel, R, Python, and Google Sheets remain the main alternatives to OpenRefine.

Words used to describe OpenRefine

The most common expressions to describe OpenRefine are "Excel on steroids" and "a data cleaning tool."

From the Feature Request and Anything to add

Below, we have compiled the suggestions and feature requests submitted via the survey.

Feature requests

  • Improve performance and support larger datasets (7 requests) - this is on the 2020 roadmap and under development on the 4.x branch!
  • Better UX (6 requests)
  • Online support with user login and permissions (4 requests)
  • Make regex expressions easier to build (4 requests)
  • An easier installation process, potentially with a standalone Java (4 requests)
  • List all available plugins in OpenRefine (4 requests)
  • List all available reconciliation services in OpenRefine (3 requests)
  • Better support for hierarchical data (2 requests)
  • Allow memory to be configured from the front end (2 requests)
  • Let users choose where to store their files
  • Have real Title Case transforms that leave articles like A, An, And, The, as well as prepositions, lowercased instead of capitalizing each word
  • Add rows to a project
  • Support POST requests
  • Ability to save facets
  • Allow extensions to be written in Python
  • Support editing and uploading to Structured Data on Commons: see ticket #2144, and external Wikibases (2 requests)
  • Better error handling when uploading to Wikidata (2 requests)
  • Improved Wikidata integration, particularly more control over modifying existing statements, including qualifiers and references
  • Support wikicode
  • Build automated Wikidata ingest pipelines

Improve History

We received three separate requests to improve the history in the way described in our roadmap.

Respondents asked for an easier way to edit/tweak operation history (i.e. adjust an edit made multiple steps ago without losing subsequent steps). It would be nice if cell provenance could be accessed programmatically or via the interface. So, for instance, in terms of rewinding to a certain state, it would be nice to know at what step a cell/row/column last changed, or what the previous state of the cell/row/column was, or what the state was at a particular step.

Plugin to support in core

Training

  • Make OpenRefine better known (5 requests)
  • Make GREL more accessible. Possibly with more contextual help or a GREL-formula builder (3 requests)
  • Provide more documentation on how to use reconciliation services and the different plugins (3 requests)
  • Improve the onboarding.
  • Provide more workshops and training on OpenRefine

· One min read

As announced in the previous post, the OpenRefine project has joined Code for Science and Society as a Sponsored Project. We are thrilled to join this new home, which will help the project grow in the long term and set its openness in stone.

This is the occasion for us to improve our governance model, with the creation of two committees. The advisory committee runs the project on a day-to-day basis and is composed of Martin Magdinier, Thad Guidry and Antonin Delpeuch. The steering committee oversees the general direction of the project and initiates links with other organizations and projects. We are proud of our all-star lineup for this steering committee, in alphabetical order:

Thank you all for joining us in this exciting adventure!

· One min read

We are pleased to announce that the Chan Zuckerberg Initiative (CZI) has awarded OpenRefine a $200,000 grant to fund its development in 2020. This award is part of CZI’s Essential Open Source Software for Science program which awarded $5 million to over 40 projects.

This grant will be used towards two main objectives:

  • grow the community of OpenRefine contributors by reaching out to seasoned users and helping them get involved more closely in the project.
  • revamp the core architecture of the tool to handle larger datasets and improve workflows.

This is the occasion for the project to join Code for Science and Society's Sponsored Project Program. We will follow up with another post about that when this is finalized.

Owen Stephens and Antonin Delpeuch will be hired to work towards these goals, with the help of additional contractors for specific subtasks. For more details about our plans, see our grant proposal.