Rob Kitchin, The Data Revolution: Big Data, Open Data, Data Infrastructures and Their Consequences
We can get a sense of why this digital data explosion is occurring by considering some specific examples. TechAmerica estimates that each day 114 billion e-mails and 24 billion text messages are sent, and 12 billion phone calls made globally (Strohm and Homan 2013). According to CISCO, in 2013 there were estimated to be 10 billion objects (devices and sensors) making up the Internet of things, each of which is producing data in varying quantities, with this figure set to rise to 50 billion by 2020 (Farber 2013). With respect to online activity, in 2012 Google was processing 3 billion search queries daily, each of which it stored (Mayer-Schonberger and Cukier 2013), and handling about 24 petabytes of data every day (Davenport et al. 2012). In 2011, Facebook’s active users spent more than 9.3 billion hours a month on the site (Manyika et al. 2011), and by 2012 Facebook reported that it was processing 2.5 billion pieces of content (links, stories, photos, news, etc.), 2.7 billion ‘Like’ actions and 300 million photo uploads per day (Constine 2012). In 2012, over 400 million tweets a day were produced, growing at a rate of 200 per cent a year, each tweet having 33 discrete items of metadata (Mayer-Schonberger and Cukier 2013). Much of these data are unstructured in nature. A similar explosion in structured data has taken place. For example, with respect to retail data concerning stock and sales, collected through logistics chains and checkouts, Walmart was generating more than 2.5 petabytes of data relating to more than 1 million customer transactions every hour in 2012 (‘equivalent to 167 times the information contained in all the books in the Library of Congress’; Open Data Center Alliance 2012: 6), and the UK supermarket Tesco was generating more than 1.5 billion new items of data every month in 2011 (Manyika et al. 2011).
Likewise, governments and public bodies are generating vast quantities of data about their own citizens and other nations. For example, transit bodies have started to monitor the constant flow of people through transport systems, collating the time and location of the use of pre-paid travel cards such as the Oyster Card in London. Many forms of tax payment, or applications for government services, are now conducted online. In 2009, the US Government produced 848 petabytes of data (TechAmerica Foundation 2012). The 16 intelligence agencies that make up US security, along with the branches of the US military, screen, store and analyse massive amounts of data hourly, with thousands of analysts employed to sift and interpret the results. To get a sense of the scale of some military intelligence projects, the ARGUS-IS project, unveiled by DARPA and the US Army in 2013, is ‘a 1.8-gigapixel video surveillance platform that can resolve details as small as six inches from an altitude of 20,000 feet (6km)’ (Anthony 2013). It collects ‘1.8 billion pixels, at 12 fps [frames per second], generat[ing] on the order of 600 gigabits per second. This equates to around 6 petabytes ... of video data per day.’ Using a supercomputer, analysis is undertaken in near real-time and the system can simultaneously track up to 65 moving objects within its field of vision. This is just one project in an arsenal of similar and related intelligence projects. Similarly, with respect to scientific projects, a personal human genome sequence consists of about 100 gigabytes of data (Vanacek 2012): multiply that across thousands of individuals and the database soon scales into terabytes and petabytes of data. When the Sloan Digital Sky Survey began operation in 2000, its telescope in New Mexico generated more observational data in the first couple of months than had previously been collected in the history of astronomy up to that point (Cukier 2010).
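The ARGUS-IS figures quoted above are internally consistent, as a back-of-the-envelope check shows: 600 gigabits per second sustained over a day does indeed come to roughly 6 petabytes. A minimal sketch of the arithmetic:

```python
# Back-of-the-envelope check of the ARGUS-IS data rate quoted above:
# ~600 gigabits per second of video, sustained over a full day.
bits_per_second = 600e9                  # 600 gigabits/s, as quoted
bytes_per_second = bits_per_second / 8   # 8 bits per byte
seconds_per_day = 24 * 60 * 60
bytes_per_day = bytes_per_second * seconds_per_day
petabytes_per_day = bytes_per_day / 1e15  # decimal petabytes
print(round(petabytes_per_day, 2))        # ~6.5 PB/day, i.e. 'around 6 petabytes'
```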
In 2010, its archive was 140 TB of data, an amount soon to be collected every five days by the Large Synoptic Survey Telescope due to become operational in Chile in 2016 (Cukier 2010). Even more voluminous, the Large Hadron Collider at CERN, Europe’s particle-physics laboratory, generates 40 terabytes every second (The Economist 2010). In this, and other cases, the data generated are so vast that they neither get analysed nor stored, consisting instead of transient data. Indeed, the capacity to store all these data does not exist because, although storage is expanding rapidly, it is not keeping pace with data generation (Gantz et al. 2007; Manyika et al. 2011).
In open systems like large scientific projects, such as measuring climatic data for weather reporting and meteorological modelling, or collecting astronomical data using a powerful telescope, the drive is towards much larger sets of data, with increased sample sizes across as many variables as possible. For example, in astronomy this means not just collecting light data, but data from across the electromagnetic spectrum, in as high a resolution as possible, for as much of the sky as possible. In the case of closed systems, such as Facebook or buying goods from an online store such as Amazon or sending e-mails, it is possible to record all the interactions and transactions that occur, as well as levels of inaction; and in these cases all such activity is indeed recorded. Every posting, ‘like’, uploaded photo, link to another website, direct message, game played and period of absence is recorded by Facebook for all of its billion or so users. Similarly, Amazon records not only every purchase and purchaser details, but also all the links clicked on and all the goods viewed on its site, as well as items placed in the shopping basket but not purchased. All e-mails are recorded by the servers on which a client e-mail is hosted, storing the whole e-mail and all associated metadata (e.g., who the e-mail was sent to or received from, the time/date, subject, attachments). Even if the e-mail is downloaded locally and deleted, it is still retained on the server, with most institutions and companies keeping such data for a number of years.
Like other forms of data, spatial data has grown enormously in recent years, from real-time remote sensing and radar imagery, to large crowdsourced projects such as OpenStreetMap, to digital spatial trails created by GPS receivers being embedded in devices. The first two seek to be spatially exhaustive, capturing the terrain of the entire planet, mapping the infrastructure of whole countries and providing a creative commons licensed mapping dataset. The third provides the ability to track and trace movement across space over time; to construct individual time–space trails that can be aggregated to provide time–space models of behaviour across whole cities and regions. Together they enable detailed modelling of places and mobility, comparison across space, marketing to be targeted at particular communities, new location-based services, and data that share spatial referents to be mashed-up to create new datasets and applications that can be searched spatially (e.g., combining data about an area to create neighbourhood profiles).
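The aggregation step described above, turning individual time–space trails into models of behaviour across a city, can be illustrated with a minimal sketch. This is not from the book: the coordinates, the grid-cell size and the `cell` helper are all hypothetical, chosen only to show how raw GPS fixes collapse into counts per grid cell per hour.

```python
from collections import Counter

# Illustrative sketch (not from the book): aggregating individual GPS fixes
# into time-space counts, the kind of step behind city-scale mobility models.
# Each fix is (latitude, longitude, hour_of_day); all values are hypothetical.
fixes = [
    (53.349, -6.260, 8), (53.350, -6.261, 8),
    (53.349, -6.279, 9), (53.341, -6.265, 8),
]

def cell(lat, lon, size=0.01):
    """Snap a coordinate to a coarse grid cell (roughly 1 km at this latitude)."""
    return (round(lat / size) * size, round(lon / size) * size)

# Count fixes per (cell, hour): an aggregated time-space profile in which
# individual trails disappear into counts.
profile = Counter((cell(lat, lon), hour) for lat, lon, hour in fixes)
for (c, hour), n in sorted(profile.items()):
    print(c, hour, n)
```

The same grouping logic, scaled up, underpins neighbourhood profiling and the mash-ups of spatially referenced datasets mentioned above.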
A fundamental difference between small and big data is the dynamic nature of data generation. Small data usually consist of studies that are freeze-framed at a particular time and space. Even in longitudinal studies, the data are captured at discrete times (e.g., every few months or years). For example, censuses are generally conducted every five or ten years. In contrast, big data are generated on a much more continuous basis, in many cases in real-time or near to real-time. Rather than a sporadic trickle of data, laboriously harvested or processed, data are flowing at speed. Therefore, there is a move from dealing with batch processing to streaming data (Zikopoulos et al. 2012). On the one hand, this contributes to the issue of data volume by producing data more quickly; on the other, it makes the entire data cycle much more dynamic, raising issues of how to manage a data system that is always in flux.
Velocity occurs because repeated observations are continuously made over time and/or space (Jacobs 2009) with many systems operating in perpetual, always-on mode (Dodge and Kitchin 2005). For example, websites continuously record logs that track all visits and the activity undertaken on the site; medical equipment constantly monitors vital signs, recording how a body is responding to treatment and triggering an alarm if a threshold is crossed; mobile phone companies track the location, use and identity of devices accessing their networks every few seconds; weather sensor networks monitor atmospheric indicators every few minutes and transmit their findings to a central database for incorporation into weather forecasts; transponders along a city’s road and rail routes record the identity of buses and trains as they pass, enabling the public transit authority to know where all of its vehicles are at any time and to calculate the estimated arrival time at different stops; a retailer monitors the sales of thousands of different products by thousands of customers, using the data to know when to restock shelves and order from suppliers; people communicate with each other through social media sites in a never-ending flow of exchanges and interconnections; a telescope continually monitors the heavens measuring fluctuations in radio waves in order to understand the nature of the universe. In all these cases, there is a persistent stream of data requiring continual management and analysis.
Transferring and managing large volumes of dynamically produced data is a technical challenge as capacity issues can quickly create bottlenecks. For example, just as YouTube videos might freeze because the bandwidth is not sufficient to keep up with the data streaming speed required, the same effect can operate with respect to capturing and processing data, with systems unable to keep up with the flow. Solutions to the problem include increasing bandwidth capacity, data sorting and compression techniques that reduce the volume of data to be processed, and efficiency improvements in processing algorithms and data-management techniques. Analysing such streaming data is also a challenge because at no point does the system rest, and in cases such as the financial markets microsecond analysis of trades can be extremely valuable. Here, sophisticated algorithms, alongside visualisations that display dynamic data in flux, are employed to track and evaluate the system.
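The contrast between batch processing and streaming described above can be sketched in a few lines. The following is an illustrative example only, loosely modelled on the medical-monitoring case mentioned earlier; the `monitor` function, its window size and its threshold are all hypothetical. Instead of storing every reading for later analysis, a streaming monitor keeps only a fixed-size window of recent values and reacts as each one arrives.

```python
from collections import deque

# Illustrative sketch (not from the book): a streaming monitor that keeps
# only a rolling window of recent readings and raises an alert the moment
# the rolling mean crosses a threshold, rather than batch-analysing later.
def monitor(readings, window=5, threshold=100):
    """Yield (time, value) whenever the rolling mean exceeds the threshold."""
    recent = deque(maxlen=window)  # oldest values are discarded automatically
    for t, value in enumerate(readings):
        recent.append(value)
        if sum(recent) / len(recent) > threshold:
            yield t, value

# Hypothetical stream of vital-sign readings arriving one at a time.
vitals = [90, 95, 98, 120, 130, 140, 99, 97]
alerts = list(monitor(vitals))
print(alerts)
```

The design point is that the system never rests and never holds the full history: analysis happens against a bounded window of the flow, which is what distinguishes streaming from batch work on a stored dataset.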
Both small and big data can be varied in their nature, being structured, unstructured or semi-structured, consisting of numbers, text, images, video, audio and other kinds of data. In big data these different kinds of data are more likely to be combined and linked, conjoining structured and unstructured data. For example, Facebook posts consist of text that is often linked to photos, or video files, or other websites, and they attract comments by other Facebook users; or a company could combine its financial data concerning sales with customer surveys that express product sentiment. Small data, in contrast, are more discrete and linked, if at all, through key identifiers and common fields. A key advance with regard to big data is how they differ from earlier forms of digital data management, which were extremely proficient at processing and storing numeric data using relational databases, and which enabled various kinds of statistical analysis. They were, however, much weaker at handling non-numeric data formats, other than to store them as flat or compressed files. As the Open Data Center Alliance (2012: 7) notes, ‘[p]reviously, unstructured data was either ignored or, at best, used inefficiently’. However, advances in distributed computing and database design using NoSQL structures (see Chapter 5), and in data mining and knowledge discovery techniques (see Chapter 6), have hugely increased the capacity to manage, process and extract information from unstructured data. Indeed, it is widely suggested that approximately 80 per cent of all big data is unstructured in nature, though as Grimes (2011) details, this figure has become a truism with little evidential support.
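The structured versus semi-structured contrast drawn above can be made concrete with a small sketch. This example is not from the book: the transaction, its field names and the review text are all hypothetical. It shows the same sale held as a fixed-schema relational row and as a semi-structured document carrying extras, such as free-text sentiment, that a relational schema would traditionally have ignored or stored as an opaque blob.

```python
import json

# Illustrative sketch (not from the book): one sale as a structured
# relational row versus a semi-structured document. All names hypothetical.
structured_row = ("TX1001", "2012-03-01", "widget", 2, 9.99)  # fixed schema

semi_structured = json.loads("""
{
  "id": "TX1001",
  "date": "2012-03-01",
  "items": [{"sku": "widget", "qty": 2, "price": 9.99}],
  "review": {"text": "Arrived quickly, works well", "photos": 1}
}
""")

# The row answers a fixed question cheaply...
tx_id, date, sku, qty, price = structured_row
revenue = qty * price

# ...while the document also carries variably shaped content (free text,
# nested lists) of the kind NoSQL stores and text mining are built to handle.
sentiment_words = semi_structured["review"]["text"].split()
print(revenue, len(sentiment_words))
```

Conjoining the two, linking sales figures to the sentiment expressed in reviews, is exactly the kind of structured–unstructured combination the passage above describes.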