Diverse Data (BD2015)
Rob Kitchin: The Data Revolution: Big Data, Open Data, Data Infrastructures and Their Consequences
Data are commonly understood to be the raw material produced by abstracting the world into categories, measures and other representational forms – numbers, characters, symbols, images, sounds, electromagnetic waves, bits – that constitute the building blocks from which information and knowledge are created. Data are usually representative in nature (e.g., measurements of a phenomenon, such as a person’s age, height, weight, colour, blood pressure, opinion, habits, location, etc.), but can also be implied (e.g., through an absence rather than presence) or derived (e.g., data that are produced from other data, such as percentage change over time calculated by comparing data from two time periods), and can be either recorded and stored in analogue form or encoded in digital form as bits (binary digits).
Data then are a key resource in the modern world. Yet, given their utility and value, and the amount of effort and resources devoted to producing and analysing them, it is remarkable how little conceptual attention has been paid to data in and of themselves. In contrast, there are thousands of articles and books devoted to the philosophy of information and knowledge. Just as we tend to focus on buildings and neighbourhoods when considering cities, rather than the bricks and mortar used to build them, so it is the case with data. Moreover, just as we think of bricks and mortar as simple building blocks rather than elements that are made within factories by companies bound within logistical, financial, legal and market concerns, and are distributed, stored and traded, so we largely do with data.
What are data?
Etymologically the word data is derived from the Latin dare, meaning ‘to give’. In this sense, data are raw elements that can be abstracted from (given by) phenomena – measured and recorded in various ways. However, in general use, data refer to those elements that are taken; extracted through observations, computations, experiments, and record keeping (Borgman 2007). Technically, then, what we understand as data are actually capta (derived from the Latin capere, meaning ‘to take’); those units of data that have been selected and harvested from the sum of all potential data (Kitchin and Dodge 2011). As Jensen (1950: ix, cited in Becker 1952: 278) states:
- it is an unfortunate accident of history that the term datum... rather than captum... should have come to symbolize the unit-phenomenon in science. For science deals, not with ‘that which has been given’ by nature to the scientist, but with ‘that which has been taken’ or selected from nature by the scientist in accordance with his purpose.
Other scholars have noted that what has been understood as data has changed over time with the development of science. Rosenberg (2013) details that the term ‘data’ was first used in the English language in the seventeenth century. As a concept then it is very much tied to that of modernity and the growth and evolution of science and new modes of producing, presenting and debating knowledge in the seventeenth and eighteenth century that shifted information and argument away from theology, exhortation and sentiment to facts, evidence and the testing of theory through experiment (Poovey 1998; Garvey 2013; Rosenberg 2013). Over time, data came to be understood as being pre-analytical and pre-factual, different in nature to facts, evidence, information and knowledge, but a key element in the constitution of these elements (though often the terms and definitions of data, facts, evidence, information and knowledge are conflated). As Rosenberg (2013: 18) notes,
- facts are ontological, evidence is epistemological, data is rhetorical. A datum may also be a fact, just as a fact may be evidence... [T]he existence of a datum has been independent of any consideration of corresponding ontological truth. When a fact is proven false, it ceases to be a fact. False data is data nonetheless.
In rhetorical terms, data are that which exists prior to the argument or interpretation that converts them into facts. They are abstract, discrete, aggregative (they can be added together) (Rosenberg 2013), and are meaningful independent of format, medium, language, producer and context (i.e., data hold their meaning whether stored as analogue or digital, viewed on paper or screen or expressed in any language, and ‘adhere to certain non-varying patterns, such as the number of tree rings always being equal to the age of the tree’) (Floridi 2010).
Floridi (2008) explains that from an epistemic position data are collections of facts, from an informational position data are information, from a computational position data are collections of binary elements that can be processed and transmitted electronically, and from a diaphoric position data are abstract elements that are distinct and intelligible from other data. In the first case, data provide the basis for further reasoning or constitute empirical evidence. In the second, data constitute representative information that can be stored, processed and analysed, but do not necessarily constitute facts. In the third, data constitute the inputs and outputs of computation but have to be processed to be turned into facts and information (for example, a DVD contains gigabytes of data but no facts or information per se) (Floridi 2005). In the fourth, data are meaningful because they capture and denote variability (e.g., patterns of dots, alphabet letters and numbers, wavelengths) that provides a signal that can be interpreted. As discussed below, other positions include understanding data as being socially constructed, as having materiality, as being ideologically loaded, as a commodity to be traded, as constituting a public good, and so on. The point is, data are never simply just data; how data are conceived and used varies between those who capture, analyse and draw conclusions from them.
Kinds of data
Captured, exhaust, transient and derived data
There are two primary ways in which data can be generated. The first is that data can be captured directly through some form of measurement such as observation, surveys, lab and field experiments, record keeping (e.g., filling out forms or writing a diary), cameras, scanners and sensors. In these cases, data are usually the deliberate product of measurement; that is, the intention was to generate useful data. The second is exhaust data, which are inherently produced by a device or system but are a by-product of its main function rather than its primary output (Manyika et al. 2011). For example, an electronic checkout till is designed to total the goods being purchased and to process payment, but it also produces data that can be used to monitor stock, worker performance and customer purchasing.
Many software-enabled systems produce such exhaust data, much of which have become valuable sources of information. In other cases, exhaust data are transient in nature; that is, they are never examined or processed and are simply discarded, either because they are too voluminous or unstructured in nature, or costly to process and store, or there is a lack of techniques to derive value from them, or they are of little strategic or tactical use (Zikopoulos et al. 2012; Franks 2012). For example, Manyika et al. (2011: 3) report that ‘health care providers... discard 90 percent of the data that they generate (e.g., almost all real-time video feeds created during surgery)’.
Captured and exhaust data are considered ‘raw’ in the sense that they have not been converted or combined with other data. In contrast, derived data are produced through additional processing or analysis of captured data. For example, captured data might be individual traffic counts through an intersection and derived data the total number of counts or counts per hour. The latter have been derived from the former. Captured data are often the input into a model, with derived data the output. For example, traffic count data might be an input into a transportation model with the output being predicted or simulated data (such as projected traffic counts at different times or under different conditions). In the case of a model, the traffic count data are likely to have been combined with other captured or derived data (such as type of vehicle, number of passengers, etc.) to create new derived data for input into the model.
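The traffic-count example above can be sketched in a few lines of Python. The timestamps here are hypothetical: the per-vehicle records stand in for captured data, and the totals and counts per hour computed from them are derived data.

```python
# A minimal sketch of deriving data from captured data (hypothetical values):
# raw per-vehicle timestamps at an intersection are aggregated into a total
# count and counts per hour.
from collections import Counter
from datetime import datetime

# Captured data: one timestamp per vehicle passing the intersection.
captured = [
    "2015-06-01 08:05", "2015-06-01 08:40", "2015-06-01 08:59",
    "2015-06-01 09:10", "2015-06-01 09:12",
]

# Derived data: the total count and the count per hour.
timestamps = [datetime.strptime(t, "%Y-%m-%d %H:%M") for t in captured]
total = len(timestamps)
per_hour = Counter(t.hour for t in timestamps)

print(total)           # 5
print(dict(per_hour))  # {8: 3, 9: 2}
```

The derived values could in turn feed a transport model, whose simulated outputs would themselves be further derived data.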
Primary, secondary and tertiary data
Primary data are generated by a researcher and their instruments within a research design of their making. Secondary data are data generated by someone else that are made available to others to reuse and analyse. So one person’s primary data can be another person’s secondary data. Tertiary data are a form of derived data, such as counts, categories, and statistical results. Statistical agencies often release tertiary data rather than secondary data to ensure the confidentiality of those to whom the data refer. For example, the primary data of the Irish census are precluded from being released as secondary data for 100 years after generation; instead the data are released as summary counts and categorical tertiary data.
Indexical and attribute data and metadata
Data also vary in kind. Indexical data are those that enable identification and linking, and include unique identifiers, such as passport and social security numbers, credit card numbers, manufacturer serial numbers, digital object identifiers, IP and MAC addresses, order and shipping numbers, as well as names, addresses, and zip codes. Indexical data are important because they enable large amounts of non-indexical data to be bound together and tracked through shared identifiers, and enable discrimination, combination, disaggregation and re-aggregation, searching and other forms of processing and analysis. As discussed in Chapter 4, indexical data are becoming increasingly common and granular, escalating the relationality of datasets.
Attribute data are data that represent aspects of a phenomenon, but are not indexical in nature. For example, with respect to a person the indexical data might be a fingerprint or DNA sequence, with associated attribute data being age, sex, height, weight, eye colour, blood group, and so on. The vast bulk of data that are generated and stored within systems are attribute data.
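A minimal Python sketch (with hypothetical identifiers and records) of what indexical data make possible: two datasets that share no attributes can still be bound together and re-aggregated through a common unique identifier.

```python
# Two hypothetical datasets holding attribute data about the same people,
# keyed by a shared indexical identifier.
medical = {"ID-042": {"blood_group": "O+", "height_cm": 172}}
census = {"ID-042": {"age": 34, "eye_colour": "brown"}}

# Joining on the indexical key re-aggregates the attribute data into a
# single record per person.
linked = {
    pid: {**medical.get(pid, {}), **census.get(pid, {})}
    for pid in medical.keys() | census.keys()
}

print(linked["ID-042"])
# {'blood_group': 'O+', 'height_cm': 172, 'age': 34, 'eye_colour': 'brown'}
```

It is precisely this kind of joining across datasets that makes increasingly common and granular identifiers escalate the relationality of data, as the chapter notes.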
Metadata are data about data. Metadata can refer either to the data content or to the whole dataset. Metadata about the content include the names and descriptions of specific fields (e.g., the column headers in a spreadsheet) and data definitions. These metadata help a user of a dataset to understand its composition, how it should be used and interpreted, and its provenance and lineage, and they facilitate the conjoining of datasets, interoperability and discoverability. Metadata that refer to a dataset as a whole take three different forms (NISO 2004). Descriptive metadata concern identification and discovery and include elements such as title, author, publisher, subject, and description.
Structural metadata refer to the organisation and coverage of the dataset. Administrative metadata concern when and how the dataset was created, the technical aspects of the data, such as file format, and who owns and can use the data. A common metadata standard for datasets that combines these three types of metadata is the Dublin Core (http://dublincore.org/). This standard requires datasets to have 15 accompanying metadata fields: title, creator, subject, description, publisher, contributor, date, type, format, identifier, source, language, relation, coverage, and rights. Metadata are essential components of all datasets, though they are often a neglected element of data curation, especially amongst researchers who are compiling primary data for their own use rather than for sharing with others.
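As a rough illustration, a dataset description following the 15 Dublin Core elements listed above might be recorded and checked for completeness like this (all field values are hypothetical):

```python
# The 15 elements of the Dublin Core metadata standard.
DUBLIN_CORE_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# A hypothetical metadata record for a dataset.
record = {
    "title": "Traffic counts, example intersection",
    "creator": "City traffic department",
    "subject": "transportation",
    "description": "Hourly vehicle counts from roadside sensors",
    "publisher": "City open data portal",
    "contributor": "Sensor network team",
    "date": "2015-06-01",
    "type": "Dataset",
    "format": "text/csv",
    "identifier": "example-dataset-001",
    "source": "roadside sensors",
    "language": "en",
    "relation": "city transport model",
    "coverage": "one intersection, June 2015",
    "rights": "open licence",
}

# Check that the record supplies every required element.
missing = DUBLIN_CORE_ELEMENTS - record.keys()
print(sorted(missing))  # []
```

A simple completeness check like this is one way curation tools guard against the neglect of metadata the text describes.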
Data, information, knowledge, wisdom
What unites these various kinds of data is that they form the base or bedrock of a knowledge pyramid: data precede information, which precedes knowledge, which precedes understanding and wisdom (Adler 1986; Weinberger 2011). Each layer of the pyramid is distinguished by a process of distillation (reducing, abstracting, processing, organising, analysing, interpreting, applying) that adds organisation, meaning and value by revealing relationships and truths about the world.
While the order of the concepts within the pyramid is generally uncontested, the nature and difference between concepts often varies between schools of thought. Information, for example, is a concept that is variously understood across scholars. For some, information is an accumulation of associated data, for others it is data plus meaning, or the signal in the noise of data, or a multifaceted construct, or tertiary data wherein primary data has been reworked into analytical form. To a physicist, data are simply zeros and ones, raw bits; they are noise. Information is when these zeros and ones are organised into distinct patterns; it is the signal (von Baeyer 2003). Airwaves and communication cables then are full of flowing information – radio and television signals, telephone conversations, internet packets – meaningful patterns of data within the wider spectrum of noise. For others, information is a broader concept.
Regardless of how it is conceived, Floridi (2010) notes that given that information adds meaning to data, it gains currency as a commodity. It is, however, a particular kind of commodity, possessing three main properties (which data also share):
- Non-rivalrous: more than one entity can possess the same information (unlike material goods)
- Non-excludable: it is easily shared and it takes effort to seek to limit such sharing (such as enforcing intellectual property rights agreements or inserting pay walls)
- Zero marginal cost: once information is available, the cost of reproduction is often negligible.
While holding the properties of being non-rivalrous and non-excludable, because information is valuable many entities seek to limit and control its circulation, thus increasing its value. Much of this value is added through the processes enacted in the information life cycle (Floridi 2010):
- Occurrence: discovering, designing, authoring
- Transmission: networking, distributing, accessing, retrieving, transmitting
- Processing and management: collecting, validating, modifying, organising, indexing, classifying, filtering, updating, sorting, storing
- Usage: monitoring, modelling, analysing, explaining, planning, forecasting, decision-making, instructing, educating, learning.
It is through processing, management and usage that information is converted into the even more valuable knowledge.
As with all the concepts in the pyramid, knowledge is similarly a diversely understood concept. For some, knowledge is the ‘know-how that transforms information into instructions’ (Weinberger 2011: 3). For example, semantic information can be linked into recipes (first do this, then do that...) or a conditional form of inferential procedures (if such and such is the case do this, otherwise do this) (Floridi 2010). In this framing, information is structured data and knowledge is actionable information (Weinberger 2011). In other words, ‘knowledge is like the recipe that turns information into bread, while data are like the atoms that make up the flour and the yeast’ (Zelany 1987, cited in Weinberger 2011). For others, knowledge is more than a set of instructions; it can be a practical skill, a way of knowing how to undertake or achieve a task, or a system of thought that coherently links together information to reveal a wider picture about a phenomenon. Creating knowledge involves applying complex cognitive processes such as perception, synthesis, extraction, association, reasoning and communication to information. Knowledge has more value than information because it provides the basis for understanding, explaining and drawing insights about the world, which can be used to formulate policy and actions. Wisdom, the pinnacle of the knowledge pyramid, is being able to sagely apply knowledge.