Journey to the open data jungle with OpenRefine, CartoDB, Leaflet and Javascript

open data4

Playing around with data is not an easy task.

There’s no secret formula for successfully finding your way through a bunch of pdf, csv, excel, APIs or whatever else you might have to deal with. There’s not even a single tool that could help you through the whole process, on the contrary, you’ll have to spend some time learning and trying out different techniques.

The good news, is that the tools available are pretty awesome, and they are getting even more awesome day by day.

In this series of articles we will take a look at a couple of really mind-blowing tools.

First we will take our google contacts and export them as a csv file.

Then we will process them with OpenRefine to clean all the messy data, and to access an google map web service to geolocate each contact.

Next we will import it into a CartoDB column, to expose it as a web service API.

Finally, with some Javascript magic we will use leaflet to browse our contacts on a map.

So let’s get started.

Finding the dataset

Usually, finding the right dataset, and understanding how the information is organized and what can we do with it is a whole challenge by itself. Just to make things easy, and to let you play with your own info, we will use our list of gmail contacts.

Just go to your gmail account, choose contacts and then click on more, export.

Choose Outlook csv format (don’t ask me why, but choosing Google csv just didn’t work. You better get used to it, working with data involves a lot of trial and error)

export gogle contacts

Now we will have to process our dataset to find messy data and to try to fix it. And ideal tool for this kind of stuff is openRefine (not so long ago known as google refine).

Just download it uncompress it and start the server:

tar -xvf google-refine-2.5-r2407.tar.gz
cd google-refine-2.5
./refine

Starting Google Refine at 'http://127.0.0.1:3333/'

00:02:22.831 [refine_server] Starting Server bound to '127.0.0.1:3333' (0ms)
[...]

OpenRefine is a java application that comes bundled with it’s own jetty based http server. As you can see from the command output, the server will be listening on port 3333 of the localhost. So just open a web browser and go to http://localhost:3333/

open refine create project

Just click on choose files and the select the file where you exported your contacts from google. OpenRefine will immediately show you his best attempt to parse the file. You have several options to set here.

But if you pay attention to the data, you’ll see you have an unfortunately far too common encoding problem.

openrefine encoding troubles

Just click on character ‘Character encoding’ and try with different settings until you find the most appropriate. In this case it was ISO-8859-1.

I usually also uncheck Store blank rows and Store blank cells as nulls.

Then just choose a name for your projects and click on Create project.

Getting rid of columns

We have a lot of columns that we don’t really need, so we will just get rid of them. The easiest way is to just click on the column arrow, Edit Column and Remove this column. There’s no confirmation dialog, but don’t worry, you’ll see a nice undo message on the top of the screen.

There’s a way to remove several columns at once. Click on the first column arrow, Edit Columns, Re-order / remove columns.

reorder-remove columns

Unfortunately there’s no add all button, so we’ll have tell which columns to remove one by one. But we’ll see there’s a better way to do it. Just select a couple of columns click ok, and click the Undo / Redo link.

This is one of the greatest features of open refine. It has an unlimited undo-redo. That means that every operation you issue on the dataset can be reversed. This allows you to try different things, see how it goes, and if you don’t like it you just undo it. It allows you to explore transformations in a very interactive way, feeling confident that you can always take a step back.

undo-redo

But that’s not the whole story. Every operation you issue is saved in a human readable json notation. To see the operations just click on the ‘Extract’ button. For example, the two columns we just removed from our dataset, generated the following operation:

[
  {
    "op": "core/column-reorder",
    "description": "Reorder columns",
    "columnNames": [
      "First Name",
      "Middle Name",
      "Last Name",
      "Web Page",
      "Gender",
      "Birthday",
      [...]
      ...a whole bunch of more fields, you get the idea, right?
      [...]
      "Column 89"
    ]
  }
]

So, to just keep the columns we want, we copy that json, and modify it like this:

[
  {
    "op": "core/column-reorder",
    "description": "Reorder columns",
    "columnNames": [
      "First Name",
      "Middle Name",
      "Last Name",
      "Notes",
      "E-mail Address",
      "Mobile Phone",
      "Home Address"
    ]
  }
]

To execute it, click on apply, paste it and then click on the Perform Operations button.

perform operation

Creating a column from other columns

We would like to concatenate the First, Middle and Last name in a single column. Just click on any column and select Edit column, Add column based on this column

create calculated column

We have to specify the name of the new column, what to do in case there’s an error processing the cell, and the language to use. Google Refine expression Language (GREL for short) is the default language, and it’s very similar to JavaScript.

this it the formula we entered:

cells['First Name'].value + ' ' +
cells['Middle Name'].value + ' ' +
cells['Last Name'].value + ' '

As in JavaScript, we could have used the dot notation (cells.columns), but because our column name has a space in it we had to use the bracket notation.

You also get a nice preview pane, so that you can interactively test your formula while you are entering it.

It is worth noting that this is not a calculated column like the one you would enter in a spreadsheet. A new column is created and the result of evaluating the formula is copied. If any of the source column is updated, the new column will not be re-evaluated.

One more thing, as you might have already noticed, for those who didn’t have a Middle Name we created two adjacent spaces. There are a couple of common transformations that might become pretty handy. Just click on the name column, Edit cells, Common transforms, and then Trim Leading and trailing spaces, and then also Collapse consecutive spaces.

trim

Now that we have our new column, we can just get rid of the other columns. Just click on each column and select Edit column, Remove this column.

And then we will rename each column to have a shorter name.

This is how our dataset now looks like:

our dataset

Advanced data cleaning, getting rid of empty cells

Now we will want to get rid of contacts that have no address. This is the easiest way I found to do it. You can click on the address column, facet, text facet, and then click include on the blank facet. You will only see the contacts with an empty address. Another way is to click on facet, customized facet, facet by blank and then include the true facet.

Once you have all you contacts with no address in sight, you can click on All, Edit rows, remove all matching rows. Then remove the address facet.

Normalizing field values

Sometimes times you have certain fields with some kind of categorization, but because of typos or of people just entering different terms for the same category you have inconsistent data.

In our case, we used the notes field to save the nationality of our contact. But we may see that sometimes we entered us, usa, eeuu, etc.

OpenRefine has a couple of options to help us deal with those cases.

Let’s first change the column name from notes to nation. Then click on facet, text facet. You’ll see something like the following:

facets

You can then just edit the facet, and it will update all the underlying rows. For example, click on the edit link to the right of english, and enter england. You will see that the facet will be updated.

But there’s a more powerful tool. OpenRefine will apply different algorithms to try to guess duplicates.

For example, OpenRefine here recognizes that german and germany are the same categories, and also italian and Italian. Just try with different methods and see which works best for you.

cluster

This is how it looks after fixing the facets

facet fixed

Eliminating duplicate rows

We also have a couple of duplicate contacts. First we will have to normalize them use the method we mentioned above. This are the clusters that OpenRefine suggests us:

cluster name

Pretty good job, indeed.

Then we will sort the dataset by name. Just click on name, sort, sort and click ok. Then we will blank each duplicated cell clicking on name, Edit cells, blank down. You will see that all the duplicated cells will have it’s name column blanked.

Now just filter by blank cells (remember, facet by blank, include true facet) and remove all matching rows.

You can also have a quick view at duplicate rows by selecting Facet, Customized facets, Duplicates facet.

Next steps

We pretty much covered many of the most common things we can do with OpenRefine. In the next article, we will see how to access an external web service to enrich our information. In this case we will use google maps api to georeferece our contacts.

open data3

Tip: I advice you to have a good look at this wiki page. It has lots of useful recipes for handling several common transformation.

About these ads
Tagged with: , , , ,
Posted in open data, tutorial
2 comments on “Journey to the open data jungle with OpenRefine, CartoDB, Leaflet and Javascript
  1. […] the previous article we saw how to take our comma-separated list of google contacts, and play with them using google […]

  2. […] the first article we saw how to take our comma-separated list of google contacts, and mangle them using […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: