Diverse Daten (BD2015)

Version of 7 January 2016, 13:51

Rob Kitchin: The Data Revolution. Big Data, Open Data, Data Infrastructures and Their Consequences [1]

Extracts

Conceptualising Data

Data are commonly understood to be the raw material produced by abstracting the world into categories, measures and other representational forms – numbers, characters, symbols, images, sounds, electromagnetic waves, bits – that constitute the building blocks from which information and knowledge are created. Data are usually representative in nature (e.g., measurements of a phenomenon, such as a person’s age, height, weight, colour, blood pressure, opinion, habits, location, etc.), but can also be implied (e.g., through an absence rather than presence) or derived (e.g., data that are produced from other data, such as percentage change over time calculated by comparing data from two time periods), and can be either recorded and stored in analogue form or encoded in digital form as bits (binary digits).
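
The derived-data case mentioned here – a percentage change computed by comparing data from two time periods – can be sketched in a few lines. The population figures below are invented for illustration:

```python
def percentage_change(earlier: float, later: float) -> float:
    """Derive the percentage change between two captured measurements."""
    return (later - earlier) / earlier * 100

# Hypothetical captured data: population totals from two census periods.
pop_earlier = 4_000_000
pop_later = 4_400_000

# The derived datum: a new value produced from other data, not measured directly.
change = percentage_change(pop_earlier, pop_later)
print(f"{change:.1f}% change")  # 10.0% change
```

The derived value exists only by virtue of the two captured values and the chosen transformation; a different abstraction (absolute difference, annualised rate) would yield a different datum from the same inputs.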

...

Data then are a key resource in the modern world. Yet, given their utility and value, and the amount of effort and resources devoted to producing and analysing them, it is remarkable how little conceptual attention has been paid to data in and of themselves. In contrast, there are thousands of articles and books devoted to the philosophy of information and knowledge. Just as we tend to focus on buildings and neighbourhoods when considering cities, rather than the bricks and mortar used to build them, so it is the case with data. Moreover, just as we think of bricks and mortar as simple building blocks rather than elements that are made within factories by companies bound within logistical, financial, legal and market concerns, and are distributed, stored and traded, so we largely do with data.

What are data?

Etymologically the word data is derived from the Latin dare, meaning ‘to give’. In this sense, data are raw elements that can be abstracted from (given by) phenomena – measured and recorded in various ways. However, in general use, data refer to those elements that are taken; extracted through observations, computations, experiments, and record keeping (Borgman 2007). Technically, then, what we understand as data are actually capta (derived from the Latin capere, meaning ‘to take’); those units of data that have been selected and harvested from the sum of all potential data (Kitchin and Dodge 2011). As Jensen (1950: ix, cited in Becker 1952: 278) states:

it is an unfortunate accident of history that the term datum... rather than captum... should have come to symbolize the unit-phenomenon in science. For science deals, not with ‘that which has been given’ by nature to the scientist, but with ‘that which has been taken’ or selected from nature by the scientist in accordance with his purpose.

Strictly speaking, then, this book should be entitled The Capta Revolution. However, since the term data has become so thoroughly ingrained in the language of the academy and business to mean capta, rather than confuse the matter further it makes sense to continue to use the term data where capta would be more appropriate. Beyond highlighting the etymological roots of the term, what this brief discussion starts to highlight is that data harvested through measurement are always a selection from the total sum of all possible data available – what we have chosen to take from all that could potentially be given. As such, data are inherently partial, selective and representative, and the distinguishing criteria used in their capture has consequence.

Other scholars have noted that what has been understood as data has changed over time with the development of science. Rosenberg (2013) details that the term ‘data’ was first used in the English language in the seventeenth century. As a concept then it is very much tied to that of modernity and the growth and evolution of science and new modes of producing, presenting and debating knowledge in the seventeenth and eighteenth century that shifted information and argument away from theology, exhortation and sentiment to facts, evidence and the testing of theory through experiment (Poovey 1998; Garvey 2013; Rosenberg 2013). Over time, data came to be understood as being pre-analytical and pre-factual, different in nature to facts, evidence, information and knowledge, but a key element in the constitution of these elements (though often the terms and definitions of data, facts, evidence, information and knowledge are conflated). As Rosenberg (2013: 18) notes,

facts are ontological, evidence is epistemological, data is rhetorical. A datum may also be a fact, just as a fact may be evidence... [T]he existence of a datum has been independent of any consideration of corresponding ontological truth. When a fact is proven false, it ceases to be a fact. False data is data nonetheless.

In rhetorical terms, data are that which exists prior to argument or interpretation that converts them to facts, evidence and information. They are abstract, discrete, aggregative (they can be added together) (Rosenberg 2013), and are meaningful independent of format, medium, language, producer and context (i.e., data hold their meaning whether stored as analogue or digital, viewed on paper or screen or expressed in any language, and ‘adhere to certain non-varying patterns, such as the number of tree rings always being equal to the age of the tree’) (Floridi 2010).

Floridi (2008) contends that the support-independence of data is reliant on three types of neutrality: taxonomic (data are relational entities defined with respect to other specific data); typological (data can take a number of different non-mutually exclusive forms, e.g., primary, secondary, metadata, operational, derived); and genetic (data can have a semantics independent of their comprehension; e.g., the Rosetta Stone hieroglyphics constitute data regardless of the fact that when they were discovered nobody could interpret them).

...

Not everyone who thinks about or works with data holds such a narrow rhetorical view. How data are understood has not just evolved over time, it varies with respect to perspective. For example, Floridi (2008) explains that from an epistemic position data are collections of facts, from an informational position data are information, from a computational position data are collections of binary elements that can be processed and transmitted electronically, and from a diaphoric position data are abstract elements that are distinct and intelligible from other data. In the first case, data provide the basis for further reasoning or constitute empirical evidence. In the second, data constitute representative information that can be stored, processed and analysed, but do not necessarily constitute facts. In the third, data constitute the inputs and outputs of computation but have to be processed to be turned into facts and information (for example, a DVD contains gigabytes of data but no facts or information per se) (Floridi 2005). In the fourth, data are meaningful because they capture and denote variability (e.g., patterns of dots, alphabet letters and numbers, wavelengths) that provides a signal that can be interpreted. As discussed below, other positions include understanding data as being socially constructed, as having materiality, as being ideologically loaded, as a commodity to be traded, as constituting a public good, and so on. The point is, data are never simply just data; how data are conceived and used varies between those who capture, analyse and draw conclusions from them.


Kinds of data

Whether data are pre-factual and rhetorical in nature or not, it is clear that data are diverse in their characteristics, which shape in explicit terms how they are handled and what can be done with them. In broad terms, data vary by form (qualitative or quantitative), structure (structured, semi-structured or unstructured), source (captured, derived, exhaust, transient), producer (primary, secondary, tertiary) and type (indexical, attribute, metadata).

Captured, exhaust, transient and derived data

There are two primary ways in which data can be generated. The first is that data can be captured directly through some form of measurement such as observation, surveys, lab and field experiments, record keeping (e.g., filling out forms or writing a diary), cameras, scanners and sensors. In these cases, data are usually the deliberate product of measurement; that is, the intention was to generate useful data. In contrast, exhaust data are inherently produced by a device or system, but are a by-product of the main function rather than the primary output (Manyika et al. 2011). For example, an electronic checkout till is designed to total the goods being purchased and to process payment, but it also produces data that can be used to monitor stock, worker performance and customer purchasing.

Many software-enabled systems produce such exhaust data, much of which have become valuable sources of information. In other cases, exhaust data are transient in nature; that is, they are never examined or processed and are simply discarded, either because they are too voluminous or unstructured in nature, or costly to process and store, or there is a lack of techniques to derive value from them, or they are of little strategic or tactical use (Zikopoulos et al. 2012; Franks 2012). For example, Manyika et al. (2011: 3) report that ‘health care providers... discard 90 percent of the data that they generate (e.g., almost all real-time video feeds created during surgery)’.

Captured and exhaust data are considered ‘raw’ in the sense that they have not been converted or combined with other data. In contrast, derived data are produced through additional processing or analysis of captured data. For example, captured data might be individual traffic counts through an intersection and derived data the total number of counts or counts per hour. The latter have been derived from the former. Captured data are often the input into a model, with derived data the output. For example, traffic count data might be an input into a transportation model with the output being predicted or simulated data (such as projected traffic counts at different times or under different conditions). In the case of a model, the traffic count data are likely to have been combined with other captured or derived data (such as type of vehicle, number of passengers, etc.) to create new derived data for input into the model. Derived data are generated for a number of reasons, including to reduce the volume of data to a manageable amount and to produce more useful and meaningful measures. Sometimes the original captured data might be processed to varying levels of derivation depending on its intended use. For example, the NASA Earth Observing System organises its data into six levels that run from unprocessed captured data, through increasing degrees of processing and analysis, to model outputs based on analyses of lower-level data (Borgman 2007; see Table 1.2).
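
The captured-to-derived step in the traffic-count example can be sketched as follows; the timestamps are invented observations standing in for sensor output:

```python
from collections import Counter

# Captured data: one timestamped record per vehicle passing through an
# intersection (hypothetical observations standing in for a traffic sensor).
captured = [
    "2015-06-01T08:05", "2015-06-01T08:17", "2015-06-01T08:44",
    "2015-06-01T09:02", "2015-06-01T09:31",
]

# Derived data: a total count and counts per hour, produced by processing
# the captured data rather than by measurement.
total = len(captured)
per_hour = Counter(ts[:13] for ts in captured)  # truncate to YYYY-MM-DDTHH

print(total)                       # 5
print(per_hour["2015-06-01T08"])   # 3
```

The derived counts are smaller, more interpretable, and ready for use as model inputs, but the individual observations can no longer be recovered from them – which is precisely the reduction described above.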

Primary, secondary and tertiary data

Primary data are generated by a researcher and their instruments within a research design of their making. Secondary data are data made available to others to reuse and analyse that are generated by someone else. So one person’s primary data can be another person’s secondary data. Tertiary data are a form of derived data, such as counts, categories, and statistical results. Tertiary data are often released by statistical agencies rather than secondary data to ensure confidentiality with respect to whom the data refer. For example, the primary data of the Irish census are precluded from being released as secondary data for 100 years after generation; instead the data are released as summary counts and categorical tertiary data. Many researchers and institutions seek to generate primary data because they are tailored to their specific needs and foci, whereas these design choices are not available to those analysing secondary or tertiary data. Moreover, those using secondary and tertiary data as inputs for their own studies have to trust that the original research is valid.

In many cases researchers will combine primary data with secondary and tertiary data to produce more valuable derived data. For example, a retailer might seek to create a derived dataset that merges their primary sales data with tertiary geodemographics data (data about what kind of people live in different areas, which are derived from census and other public and commercial data) in order to determine which places to target with marketing material. Secondary and tertiary data are valuable because they enable replication studies and the building of larger, richer and more sophisticated datasets. The latter produce what Crampton et al. (2012) term ‘data amplification’; that is, data when combined enable far greater insights by revealing associations, relationships and patterns which remain hidden if the data remain isolated. As a consequence, the secondary and tertiary data market is a multi-billion dollar industry (see Chapter 2).
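
The step from confidential individual-level records to releasable tertiary summary counts can be illustrated as follows. The records are invented, and real statistical agencies apply far more elaborate disclosure controls than this sketch suggests:

```python
from collections import Counter

# Hypothetical individual-level records (primary data, never released as-is).
records = [
    {"age_band": "25-44", "county": "Dublin"},
    {"age_band": "25-44", "county": "Dublin"},
    {"age_band": "45-64", "county": "Cork"},
    {"age_band": "25-44", "county": "Cork"},
]

# Tertiary data: categorical summary counts. Confidentiality is preserved
# because individual rows cannot be recovered from the aggregates.
counts = Counter((r["county"], r["age_band"]) for r in records)
for (county, band), n in sorted(counts.items()):
    print(county, band, n)
```

Only the aggregated `counts` would be published; the `records` themselves stay embargoed, as with the Irish census example above.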

Indexical and attribute data and metadata

Data also vary in kind. Indexical data are those that enable identification and linking, and include unique identifiers, such as passport and social security numbers, credit card numbers, manufacturer serial numbers, digital object identifiers, IP and MAC addresses, order and shipping numbers, as well as names, addresses, and zip codes. Indexical data are important because they enable large amounts of non-indexical data to be bound together and tracked through shared identifiers, and enable discrimination, combination, disaggregation and re-aggregation, searching and other forms of processing and analysis. As discussed in Chapter 4, indexical data are becoming increasingly common and granular, escalating the relationality of datasets.

Attribute data are data that represent aspects of a phenomenon, but are not indexical in nature. For example, with respect to a person the indexical data might be a fingerprint or DNA sequence, with associated attribute data being age, sex, height, weight, eye colour, blood group, and so on. The vast bulk of data that are generated and stored within systems are attribute data.
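
A minimal sketch of how indexical data bind attribute data together across datasets; the identifiers and attribute values below are invented:

```python
# Two hypothetical datasets that share an indexical field (a person identifier).
health = {"P001": {"blood_group": "O+"}, "P002": {"blood_group": "A-"}}
census = {"P001": {"age": 34, "sex": "F"}, "P002": {"age": 58, "sex": "M"}}

# The shared identifier lets otherwise unrelated attribute data be conjoined,
# escalating the relationality of the datasets.
linked = {
    pid: {**census.get(pid, {}), **health.get(pid, {})}
    for pid in census.keys() | health.keys()
}

print(linked["P001"])  # {'age': 34, 'sex': 'F', 'blood_group': 'O+'}
```

Without the indexical field the two sets of attribute data would remain isolated; with it, combination, disaggregation and re-aggregation all become possible.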

Metadata are data about data. Metadata can either refer to the data content or the whole dataset. Metadata about the content includes the names and descriptions of specific fields (e.g., the column headers in a spreadsheet) and data definitions. These metadata help a user of a dataset to understand its composition, how it should be used and interpreted, and how to judge its provenance and lineage, and they facilitate the conjoining of datasets, interoperability and discoverability. Metadata that refer to a dataset as a whole have three different forms (NISO 2004). Descriptive metadata concerns identification and discovery and includes elements such as title, author, publisher, subject, and description.

Structural metadata refer to the organisation and coverage of the dataset. Administrative metadata concern when and how the dataset was created, details of the technical aspects of the data, such as file format, and who owns and can use the data. A common metadata standard for datasets that combines these three types of metadata is the Dublin Core (http://dublincore.org/). This standard requires datasets to have 15 accompanying metadata fields: title, creator, subject, description, publisher, contributor, date, type, format, identifier, source, language, relation, coverage, and rights. Metadata are essential components of all datasets, though they are often a neglected element of data curation, especially amongst researchers who are compiling primary data for their own use rather than preparing data for sharing and reuse by others.
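As a minimal sketch, the 15 Dublin Core elements named above can be represented as a simple record, with the kind of completeness check a data curator might run before publication; the example dataset and its values are invented:

```python
# The 15 Dublin Core elements listed in the text.
DUBLIN_CORE_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]

# Start with placeholders, then fill in what is known.
metadata = {element: "TODO" for element in DUBLIN_CORE_ELEMENTS}
metadata.update({
    "title": "Example survey dataset",
    "format": "text/csv",
    "language": "en",
})

# Completeness check: which elements are still unfilled?
missing = [e for e in DUBLIN_CORE_ELEMENTS if metadata[e] == "TODO"]
print(f"{len(DUBLIN_CORE_ELEMENTS) - len(missing)} of 15 elements filled")
```

A check like this is exactly the curation step the text notes researchers tend to neglect when compiling data only for their own use.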


Data, information, knowledge, wisdom

What unites these various kinds of data is that they form the base or bedrock of a knowledge pyramid: data precedes information, which precedes knowledge, which precedes understanding and wisdom (Adler 1986; Weinberger 2011). Each layer of the pyramid is distinguished by a process of distillation (reducing, abstracting, processing, organising, analysing, interpreting, applying) that adds organisation, meaning and value by revealing relationships and truths about the world (see Figure 1.1). While the order of the concepts within the pyramid is generally uncontested, the nature and difference between concepts often varies between schools of thought. Information, for example, is a concept that is variously understood across scholars. For some, information is an accumulation of associated data; for others it is data plus meaning, or the signal in the noise of data, or a multifaceted construct, or tertiary data wherein primary data has been reworked into analytical form. To a physicist, data are simply zeros and ones, raw bits; they are noise. Information is when these zeros and ones are organised into distinct patterns; it is the signal (von Baeyer 2003). Airwaves and communication cables then are full of flowing information – radio and television signals, telephone conversations, internet packets – meaningful patterns of data within the wider spectrum of noise.

For others, information is a broader concept. Floridi (2010: 74), for example, identifies three types of information:

* Factual: information as reality (e.g., patterns, fingerprints, tree rings)
* Instructional: information for reality (e.g., commands, algorithms, recipes)
* Semantic: information about reality (e.g., train timetables, maps, biographies).

Figure 1.1 Knowledge pyramid (adapted from Adler 1986 and McCandless 2010)

The first is essentially meaningful data, what are usually termed facts. These are data that are organised and framed within a system of measurement or an external referent that inherently provide a basis to establish an initial meaning that holds some truth. Information also extends beyond data and facts through adding value that aids interpretation. Weinberger (2011: 2) thus declares: ‘Information is to data what wine is to the vineyard: the delicious extract and distillate.’ Such value could be gained through sorting, classifying, linking, or adding semantic content through some form of text or visualisation that informs about something and/or instructs what to do (for example, a warning light on a car’s dashboard indicating that the battery is flat and needs recharging; Floridi 2010).

Case (2002; summarised in Borgman 2007: 40) argues that differences in the definition of information hinge on five issues:

* uncertainty, or whether something has to reduce uncertainty to qualify as information;
* physicality, or whether something has to take on a physical form such as a book, an object, or the sound waves of speech to qualify as information;
* structure/process, or whether some set of order or relationships is required;
* intentionality, or whether someone must intend that something be communicated to qualify as information;
* truth, or whether something must be true to qualify as information.

Regardless of how it is conceived, Floridi (2010) notes that given that information adds meaning to data, it gains currency as a commodity. It is, however, a particular kind of commodity, possessing three main properties (which data also share):

* Non-rivalrous: more than one entity can possess the same information (unlike material goods)
* Non-excludable: it is easily shared and it takes effort to seek to limit such sharing (such as enforcing intellectual property rights agreements or inserting pay walls)
* Zero marginal cost: once information is available, the cost of reproduction is often negligible.

While holding the properties of being non-rivalrous and non-excludable, because information is valuable many entities seek to limit and control its circulation, thus increasing its value. Much of this value is added through the processes enacted in the information life cycle (Floridi 2010):

* Occurrence: discovering, designing, authoring
* Transmission: networking, distributing, accessing, retrieving, transmitting
* Processing and management: collecting, validating, modifying, organising, indexing, classifying, filtering, updating, sorting, storing
* Usage: monitoring, modelling, analysing, explaining, planning, forecasting, decision-making, instructing, educating, learning.

It is through processing, management and usage that information is converted into the even more valuable knowledge. As with all the concepts in the pyramid, knowledge is similarly a diversely understood concept. For some, knowledge is the ‘know-how that transforms information into instructions’ (Weinberger 2011: 3). For example, semantic information can be linked into recipes (first do this, then do that...) or a conditional form of inferential procedures (if such and such is the case do this, otherwise do this) (Floridi 2010). In this framing, information is structured data and knowledge is actionable information (Weinberger 2011). In other words, ‘knowledge is like the recipe that turns information into bread, while data are like the atoms that make up the flour and the yeast’ (Zelany 1987, cited in Weinberger 2011). For others, knowledge is more than a set of instructions; it can be a practical skill, a way of knowing how to undertake or achieve a task, or a system of thought that coherently links together information to reveal a wider picture about a phenomenon. Creating knowledge involves applying complex cognitive processes such as perception, synthesis, extraction, association, reasoning and communication to information. Knowledge has more value than information because it provides the basis for understanding, explaining and drawing insights about the world, which can be used to formulate policy and actions. Wisdom, the pinnacle of the knowledge pyramid, is being able to sagely apply knowledge. While not all forms of knowledge are firmly rooted in data – for example, conjecture, opinions, beliefs – data are clearly a key base material for how we make sense of the world. Data provide the basic inputs into processes such as collating, sorting, categorising, matching, profiling, and modelling that seek to create information and knowledge in order to understand, predict, regulate and control phenomena. 
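Floridi’s notion of knowledge as conditional inferential procedures (‘if such and such is the case do this’) can be toy-modelled in code, using the chapter’s dashboard warning-light example; the voltage thresholds and function name are illustrative assumptions, not part of the original text:

```python
def battery_action(voltage: float) -> str:
    """Turn a datum (a voltage reading) into an actionable instruction."""
    if voltage < 11.8:      # information: the battery is flat
        return "recharge battery"
    elif voltage < 12.4:    # information: the charge is low
        return "check charging system"
    else:
        return "no action needed"

print(battery_action(11.5))  # recharge battery
```

The raw reading is data; framed against the thresholds it becomes information; the conditional rules that convert it into instructions are, in this framing, a sliver of knowledge.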
And generating data over time and in different locales enables us to track, evaluate and compare phenomena across time, space and scale. Thus, although information and knowledge are rightly viewed as being higher order and more valuable concepts, data are nonetheless a key ingredient with significant latent value that is realised when converted to information and knowledge. Whoever then has access to high-quality and extensive data has a competitive advantage over those excluded in being able to generate understanding and wisdom. A key rationale for the open data movement, examined in Chapter 3, is gaining access to the latent value of administrative and public data.