“Every two days now we create as much information as we did from the dawn of civilization up until 2003,” said Eric Schmidt of Google in 2010. Behind most of this information is big data. And although we Internet users contribute most of that vast data, we get little or nothing back. So far, big data is a closed shop holding a wealth of information, and that urgently needs to change. But what is big data?
In the past, most data, such as statistics, could easily be analyzed with a simple spreadsheet (e.g. Excel) or a database. Such data had hundreds or thousands of rows of information. It often had to be collected with a survey form by a person walking from door to door. Nowadays, data is collected 24/7 as we all browse and interact on the Internet. Take log files, for example: when you browse a website, all your clicks are saved in a log file together with your IP address, your browser and a lot more. Now imagine a website with millions of visitors, all of whom are logged. With mobile phones we now easily leave digital footprints wherever we are; just look at Twitter, which produces about 6 terabytes of tweets every day. These millions of rows are big data.
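To make the log file example more concrete, here is a minimal Python sketch of how such entries can be read. It assumes the web server writes the widely used "combined" access log format; the three sample lines, the IP addresses and the helper names `parse_line` and `clicks_per_visitor` are made up for illustration and do not come from any real site.

```python
import re
from collections import Counter

# Assumption: entries follow the common "combined" access log layout:
# ip, two ignored fields, timestamp, request, status, size, referrer, browser.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<browser>[^"]*)"'
)


def parse_line(line):
    """Return the fields of one log entry as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def clicks_per_visitor(lines):
    """Count logged requests per IP address, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is not None:
            counts[entry["ip"]] += 1
    return counts


# Invented sample entries (the IP addresses are from reserved documentation ranges).
SAMPLE_LOG = [
    '203.0.113.7 - - [10/Oct/2012:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"',
    '203.0.113.7 - - [10/Oct/2012:13:55:41 +0000] "GET /about.html HTTP/1.1" 200 1045 "http://example.org/index.html" "Mozilla/5.0"',
    '198.51.100.23 - - [10/Oct/2012:13:56:02 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Opera/9.80"',
]
```

Calling `clicks_per_visitor(SAMPLE_LOG)` counts how many requests each address made. On a real site the same loop would run over millions of such lines, which is where spreadsheets stop being practical.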
But here lies a problem. Big data is a little bit like open data. Many talk about it, but only a tiny fraction is offered publicly; even worse, half of government data is not usable. Governments sit on huge chunks of data but, like the corporate sector, share almost none of it. If such data were open, it would be possible to develop better information services or to hold governments accountable more easily. But should all big data be opened?
Some argue that the collection of data is a privacy nightmare. With a smart meter, for example, your utility company can tell from your electricity consumption when you are on holiday and when you normally go to bed at night. Initially, this was meant to encourage more responsible energy consumption. A smart meter sends consumption data every few minutes, whereas in the old days the meter was read only once a year. No doubt big data can contain a lot of sensitive information, and there is good reason why such data is not offered online.
A lot of data from consumers and citizens is collected, but they cannot get access to it. Imagine if consumer rights organizations had access to anonymized credit card transactions: they would gain insight into the consumption habits of millions of people and see what the money is really spent on. Many companies have such data; they either buy it or collect it themselves. The same goes for mobile providers, who have data about the movements of their clients. But hardly any company shares such data, although science, media and civil society could benefit from it.
A study shows that tweets can be used to detect where an outbreak of cholera starts, long before the media or the government realizes it. Michael J. Paul and Mark Dredze used Twitter to analyze the state of public health in the USA. They were able to map the extent of various illnesses and diseases across the country. And the Indiana University Center for Complex Networks and Systems Research has produced some interesting network analyses of how worldwide protests are reflected on Twitter. It says a lot that most of this analysis is done with data from a single company.
Another interesting project is UN Global Pulse, which is “Exploring innovative methods and frameworks for combining new types of real-time data with traditional development indicators to detect early impacts of global shocks”. Imagine you had access to mobile banking account data by location in Africa and could monitor sudden, widespread changes in spending behavior. Are these early signs of an economic crisis or a drought? The hope is that with such data you can react to crises and intervene much faster. Global Pulse has partnered with organizations and companies to gain helpful insights from big data, but unfortunately, apart from nice reports, it does not offer a single data set on its website.
Robert Kirkpatrick talked about the importance of data philanthropy back in September 2011 but, like all the other players in this field, offers no data. I sent a tweet to the UN Global Pulse team and here is their reply: “This year, Pulse Labs projects will make every effort to share datasets for cross-verification and open access when possible” and “we’re talking to a mobile company about hosting a data mining competition with anonymized & aggregated call detail records.” Both points sound nice to me, but what I miss is a real attempt to share data, concepts and approaches to analyzing it with a community around the world. A good example is the Guardian data blog, which publishes each data set; readers started to play with the data and came up with more results.
The problem is that data critical for transparency is rarely available. Whereas UN Global Pulse can partner with companies and institutions to get data, other initiatives have to work hard to get any data at all. For example, the International Land Coalition has released, with the support of the Tactical Technology Collective, a great overview of land acquisitions worldwide. These one thousand rows of data about land acquisitions were difficult to collect, but they are an important step towards transparency in the massive field of land acquisition worldwide. That’s why innovative solutions, such as the Forestwatcher project to crowdsource deforestation monitoring worldwide, are needed.
If one looks at publicly available big data sets, the view is terribly disappointing. Ridiculously little is available, and more often than not the data sets are very old. Amazon hosts 54 large public data sets. Compare that to the estimate made by Eric Schmidt at the top of this post. Or check the list on Quora, which is quite comprehensive but tiny compared to the data hidden behind corporate and government walls. The open data exceptions are projects such as OpenStreetMap, Wikipedia, DBpedia, Datahub, OpenCorporates or the World Bank. Please point me to further resources that I have not mentioned.
The release of data can only be a first step; open collaboration around data and solutions has to follow in order to develop helpful services. There is great promise in big data, but it has to be treated as carefully as any other data set or statistic, and it has many pitfalls, as Danah Boyd and Kate Crawford summarized nicely in their six provocations for big data.
“Big Data offers the humanistic disciplines a new way to claim the status of quantitative science and objective method. It makes many more social spaces quantifiable. In reality, working with Big Data is still subjective, and what it quantifies does not necessarily have a closer claim on objective truth – particularly when considering messages from social media sites.”