Nowadays, if you want to start an open data project, you are better off checking what information is available first and only then imagining something useful, because so little useful raw data is available. That’s why, for the foreseeable future, at least in countries such as Germany, collecting documents is the way forward.
Frankfurt-Gestalten.de (OpenStreetMap, Creative Commons CC-by-SA 2.0 license. Rendering © 2010 Cloudmade)
Although there are millions of documents available on the Internet, the most interesting ones are hidden in databases, locked in PDF files or only partially offered on websites. It often takes hours to get figures out of a PDF file so that they can be used for analysis. For example, the budget of the city of Frankfurt is offered as a PDF file with more than one thousand pages.
How can you possibly draw your own conclusions or spot problems and even public misspending in such a document? Even a local newspaper no longer has the resources to untangle the puzzle of long columns of figures. So I am glad to see that my friends at Tactic Tools started a project called Open Budget (Offener Haushalt) to shed some more light on the public budget of Germany’s government. Other great examples come from David McCandless, who presented them in a great TED talk called “The beauty of data visualization”. However, the remaining problem is that it takes a lot of time to extract data, not to mention presenting it.
There are six ways to get at data:
1. To use an API (Application Programming Interface), such as the one from the World Bank
2. To download openly available data
3. To copy and paste data from documents
4. To scrape content from websites with software
5. To automatically collect data from different sources
6. Or to crowdsource your data
Numbers one and two are the ideal case. Number three can mean an incredible amount of work: imagine copying and pasting a PDF document of one thousand pages that was probably printed from an Excel sheet in the first place. Number four is even possible with PDF files thanks to OCR (optical character recognition), but scraping can leave you with a load of information.
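As a rough illustration of option one, here is a minimal Python sketch of reading an indicator from the World Bank API. It assumes the v2 endpoint returns JSON as a two-element list (paging metadata first, then the records), and it uses total population (SP.POP.TOTL) purely as an example indicator. The function names are my own, and the sample payload contains made-up numbers so that the parsing can be tried without a network connection.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.worldbank.org/v2"


def build_indicator_url(country, indicator, per_page=100):
    """Build the request URL for one country and one indicator."""
    query = urlencode({"format": "json", "per_page": per_page})
    return f"{BASE_URL}/country/{country}/indicator/{indicator}?{query}"


def parse_indicator_response(payload):
    """Turn the assumed [metadata, records] pair into {year: value}.

    Records without a value are skipped. Annual data is assumed,
    so the "date" field is expected to be a plain year.
    """
    if not isinstance(payload, list) or len(payload) < 2 or payload[1] is None:
        return {}
    return {
        int(row["date"]): row["value"]
        for row in payload[1]
        if row["value"] is not None
    }


def fetch_indicator(country, indicator):
    """Download and parse one indicator (needs network access)."""
    with urlopen(build_indicator_url(country, indicator), timeout=30) as response:
        return parse_indicator_response(json.load(response))


# Illustrative payload in the assumed shape; the numbers are made up.
sample = [
    {"page": 1, "pages": 1, "per_page": 100, "total": 3},
    [
        {"date": "2010", "value": 300},
        {"date": "2009", "value": None},
        {"date": "2008", "value": 100},
    ],
]
series = parse_indicator_response(sample)
```

With a network connection, `fetch_indicator("DE", "SP.POP.TOTL")` would be the live equivalent of the offline `sample` above. Compared with copying figures out of a PDF, the whole job is a handful of lines.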
Number six is a very different, collective approach. For example, the widely cited Ushahidi is a tool that opens new channels for data collection, provided one reaches a critical mass of contributors. The Guardian is using it these days to track the effects of budget cuts in the city of Leeds.
If you do not have the network or the public relations budget to run such a crowdsourcing initiative, then you should think about collecting data from existing sources. For example, a lot of data is offered in RSS or XML format. One advantage is that the data already comes with metadata such as dates, keywords and, if you are lucky, even locations. A nice tool in this regard is the Drupal-driven Managing News, which I also use for Create Frankfurt to geo-reference all incoming information.
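To show what "already referenced" means in practice, here is a small Python sketch that reads dates, keywords and locations out of an RSS feed using only the standard library. The feed below is invented for illustration, and it assumes that locations are published as GeoRSS points, which not every feed does. This is not how Managing News works internally, just a hand-rolled equivalent of the idea.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Namespace used by GeoRSS Simple elements such as <georss:point>.
GEORSS = "{http://www.georss.org/georss}"

# Hypothetical feed, invented for this example.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0" xmlns:georss="http://www.georss.org/georss">
  <channel>
    <title>Hypothetical city council feed</title>
    <item>
      <title>Council motion on cycle lanes</title>
      <link>http://example.org/docs/1</link>
      <pubDate>Mon, 01 Nov 2010 09:30:00 +0100</pubDate>
      <category>transport</category>
      <category>budget</category>
      <georss:point>50.1109 8.6821</georss:point>
    </item>
    <item>
      <title>Minutes of the district meeting</title>
      <link>http://example.org/docs/2</link>
      <pubDate>Tue, 02 Nov 2010 18:00:00 +0100</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(feed_xml):
    """Return one dict per <item> with its date, keywords and location."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        point = item.findtext(GEORSS + "point")
        pub_date = item.findtext("pubDate")
        items.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            # RSS 2.0 dates use the RFC 822 format, which email.utils can parse.
            "published": parsedate_to_datetime(pub_date) if pub_date else None,
            "keywords": [c.text.strip() for c in item.findall("category") if c.text],
            # GeoRSS points are "latitude longitude"; None if the item has none.
            "location": tuple(float(p) for p in point.split()) if point else None,
        })
    return items


items = parse_items(SAMPLE_FEED)
```

The second item has no keywords and no location, which is the common case: you only get geo-references "if you are lucky", so the code has to cope with their absence.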
Such aggregation tools allow data to be collected automatically. You can identify sources of such information, subscribe to them and look at the results from time to time. Two examples: public transport congestion alerts and municipal political documents. A year of such data can give you some insights and might lead to interesting conclusions.
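Once a few months of such alerts have piled up, even very simple counting can hint at patterns. The following Python sketch assumes each collected alert has been stored with a day and a transport line; the records and field names are hypothetical and only meant to show the idea.

```python
from collections import Counter
from datetime import date

# Hypothetical alerts, as an aggregator might have stored them over time.
alerts = [
    {"day": date(2010, 1, 12), "line": "U4"},
    {"day": date(2010, 1, 28), "line": "S8"},
    {"day": date(2010, 2, 3), "line": "U4"},
    {"day": date(2010, 2, 17), "line": "U4"},
    {"day": date(2010, 2, 25), "line": "S8"},
    {"day": date(2010, 3, 9), "line": "U4"},
]


def alerts_per_month(records):
    """Count alerts per calendar month, keyed as 'YYYY-MM'."""
    return Counter(r["day"].strftime("%Y-%m") for r in records)


def alerts_per_line(records):
    """Count alerts per transport line."""
    return Counter(r["line"] for r in records)


per_month = alerts_per_month(alerts)
per_line = alerts_per_line(alerts)

# Print the months in chronological order, then the line with the most alerts.
for month, count in sorted(per_month.items()):
    print(month, count)
print("Most alerts:", per_line.most_common(1)[0][0])
```

Counts like these are only a starting point, but they are often enough to see which months or lines deserve a closer look.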
Comments:
Getting the data open is a big part of the challenge — and I like the way you broke down the potential solutions. But using the data — to find conclusions or influence outcomes — is even more complex, I think, than just availability. There are all kinds of issues — from technical capacity to analytic skills to advocacy skills — that play into really making that happen. I think, as we address the actual availability of data, we also have to address the skills that make that data usable.
Yes, Marnie, I agree with you. It is a great challenge to make something useful out of all the data, in particular to prepare it so that citizens can understand it quickly and interpret it from different angles. And then it is still not guaranteed that someone (e.g. a journalist) picks it up and makes a compelling story of it. This post was only meant to give an overview of how to get data. I will write on other points in later blog posts.