Diverse Daten (BD2015): Difference between revisions

Revision as of 7 January 2016, 13:40

Rob Kitchin: The Data Revolution Big Data, Open Data, Data Infrastructures and Their Consequences [1]

Extracts

Conceptualising Data

Data are commonly understood to be the raw material produced by abstracting the world into categories, measures and other representational forms – numbers, characters, symbols, images, sounds, electromagnetic waves, bits – that constitute the building blocks from which information and knowledge are created. Data are usually representative in nature (e.g., measurements of a phenomenon, such as a person’s age, height, weight, colour, blood pressure, opinion, habits, location, etc.), but can also be implied (e.g., through an absence rather than presence) or derived (e.g., data that are produced from other data, such as percentage change over time calculated by comparing data from two time periods), and can be either recorded and stored in analogue form or encoded in digital form as bits (binary digits).

....

Data then are a key resource in the modern world. Yet, given their utility and value, and the amount of effort and resources devoted to producing and analysing them, it is remarkable how little conceptual attention has been paid to data in and of themselves. In contrast, there are thousands of articles and books devoted to the philosophy of information and knowledge. Just as we tend to focus on buildings and neighbourhoods when considering cities, rather than the bricks and mortar used to build them, so it is the case with data. Moreover, just as we think of bricks and mortar as simple building blocks rather than elements that are made within factories by companies bound within logistical, financial, legal and market concerns, and are distributed, stored and traded, so we largely do with data.

What are data?

Etymologically the word data is derived from the Latin dare, meaning ‘to give’. In this sense, data are raw elements that can be abstracted from (given by) phenomena – measured and recorded in various ways. However, in general use, data refer to those elements that are taken; extracted through observations, computations, experiments, and record keeping (Borgman 2007). Technically, then, what we understand as data are actually capta (derived from the Latin capere, meaning ‘to take’); those units of data that have been selected and harvested from the sum of all potential data (Kitchin and Dodge 2011). As Jensen (1950: ix, cited in Becker 1952: 278) states:

it is an unfortunate accident of history that the term datum... rather than captum... should have come to symbolize the unit-phenomenon in science. For science deals, not with ‘that which has been given’ by nature to the scientist, but with ‘that which has been taken’ or selected from nature by the scientist in accordance with his purpose.

Strictly speaking, then, this book should be entitled The Capta Revolution. However, since the term data has become so thoroughly ingrained in the language of the academy and business to mean capta, rather than confuse the matter further it makes sense to continue to use the term data where capta would be more appropriate. Beyond highlighting the etymological roots of the term, what this brief discussion starts to highlight is that data harvested through measurement are always a selection from the total sum of all possible data available – what we have chosen to take from all that could potentially be given. As such, data are inherently partial, selective and representative, and the distinguishing criteria used in their capture has consequence.

Other scholars have noted that what has been understood as data has changed over time with the development of science. Rosenberg (2013) details that the term ‘data’ was first used in the English language in the seventeenth century. As a concept then it is very much tied to that of modernity and the growth and evolution of science and new modes of producing, presenting and debating knowledge in the seventeenth and eighteenth century that shifted information and argument away from theology, exhortation and sentiment to facts, evidence and the testing of theory through experiment (Poovey 1998; Garvey 2013; Rosenberg 2013). Over time, data came to be understood as being pre-analytical and pre-factual, different in nature to facts, evidence, information and knowledge, but a key element in the constitution of these elements (though often the terms and definitions of data, facts, evidence, information and knowledge are conflated). As Rosenberg (2013: 18) notes:

facts are ontological, evidence is epistemological, data is rhetorical. A datum may also be a fact, just as a fact may be evidence... [T]he existence of a datum has been independent of any consideration of corresponding ontological truth. When a fact is proven false, it ceases to be a fact. False data is data nonetheless.

In rhetorical terms, data are that which exists prior to argument or interpretation that converts them to

they are abstract, discrete, aggregative (they can be added together) (Rosenberg 2013), and are meaningful independent of format, medium, language, producer and context (i.e., data hold their meaning whether stored as analogue or digital, viewed on paper or screen or expressed in any language, and ‘adhere to certain non-varying patterns, such as the number of tree rings always being equal to the age of the tree’) (Floridi 2010). Floridi (2008) contends that the support-independence of data is reliant on three types of neutrality: taxonomic (data are relational entities defined with respect to other specific data); typological (data can take a number of different non-mutually exclusive forms, e.g., primary, secondary, metadata, operational, derived); and genetic (data can have a semantics independent of their comprehension; e.g., the Rosetta Stone hieroglyphics constitute data regardless of the fact that when they were discovered nobody could interpret them).

Not everyone who thinks about or works with data holds such a narrow rhetorical view. How data are understood has not just evolved over time, it varies with respect to perspective. For example, Floridi (2008) explains that from an epistemic position data are collections of facts, from an informational position data are information, from a computational position data are collections of binary elements that can be processed and transmitted electronically, and from a diaphoric position data are abstract elements that are distinct and intelligible from other data. In the first case, data provide the basis for further reasoning or constitute empirical evidence. In the second, data constitute representative information that can be stored, processed and analysed, but do not necessarily constitute facts. In the third, data constitute the inputs and outputs of computation but have to be processed to be turned into facts and information (for example, a DVD contains gigabytes of data but no facts or information per se) (Floridi 2005). In the fourth, data are meaningful because they capture and denote variability (e.g., patterns of dots, alphabet letters and numbers, wavelengths) that provides a signal that can be interpreted. As discussed below, other positions include understanding data as being socially constructed, as having materiality, as being ideologically loaded, as a commodity to be traded, as constituting a public good, and so on. The point is, data are never simply just data; how data are conceived and used varies between those who capture, analyse and draw conclusions from

Kinds of data

Whether data are pre-factual and rhetorical in nature or not, it is clear that data are diverse in their characteristics, which shape in explicit terms how they are handled and what can be done with them. In broad terms, data vary by form (qualitative or quantitative), structure (structured, semi-structured or unstructured), source (captured, derived, exhaust, transient), producer (primary, secondary,

Captured, exhaust, transient and derived data

There are two primary ways in which data can be generated. The first is that data can be captured directly through some form of measurement such as observation, surveys, lab and field experiments, record keeping (e.g., filling out forms or writing a diary), cameras, scanners and sensors. In these cases, data are usually the deliberate product of measurement; that is, the intention was to generate useful data. In contrast, exhaust data are inherently produced by a device or system, but are a by-product of the main function rather than the primary output (Manyika et al. 2011). For example, an electronic checkout till is designed to total the goods being purchased and to process payment, but it also produces data that can be used to monitor stock, worker performance and customer purchasing. Many software-enabled systems produce such exhaust data, much of which have become valuable sources of information.

In other cases, exhaust data are transient in nature; that is, they are never examined or processed and are simply discarded, either because they are too voluminous or unstructured in nature, or costly to process and store, or there is a lack of techniques to derive value from them, or they are of little strategic or tactical use (Zikopoulos et al. 2012; Franks 2012). For example, Manyika et al. (2011: 3) report that ‘health care providers... discard 90 percent of the data that they generate (e.g., almost all real-time video feeds created during surgery)’.

Captured and exhaust data are considered ‘raw’ in the sense that they have not been converted or combined with other data. In contrast, derived data are produced through additional processing or analysis of captured data. For example, captured data might be individual traffic counts through an intersection and derived data the total number of counts or counts per hour. The latter have been derived from the former. Captured data are often the input into a model, with derived data the output. For example, traffic count data might be an input into a transportation model with the output being predicted or simulated data (such as projected traffic counts at different times or under different conditions). In the case of a model, the traffic count data are likely to have been combined with other captured or derived data (such as type of vehicle, number of passengers, etc.) to create new derived data for input into the model.

Derived data are generated for a number of reasons, including to reduce the volume of data to a manageable amount and to produce more useful and meaningful measures. Sometimes the original captured data might be processed to varying levels of derivation depending on its intended use. For example, the NASA Earth Observing System organises its data into six levels that run from unprocessed captured data, through increasing degrees of processing and analysis, to model outputs based on analyses of lower-level data (Borgman 2007; see Table 1.2).
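The step from captured to derived data in the traffic example can be illustrated with a short Python sketch. The timestamps and the function name `counts_per_hour` are hypothetical and chosen only for illustration; they do not come from the book.

```python
from collections import Counter
from datetime import datetime

# Hypothetical captured data: one timestamp per vehicle observed
# passing through an intersection.
captured = [
    "2016-01-07 08:05",
    "2016-01-07 08:40",
    "2016-01-07 08:59",
    "2016-01-07 09:10",
    "2016-01-07 09:45",
]


def counts_per_hour(timestamps):
    """Derive hourly vehicle counts from individual captured observations."""
    hours = (datetime.strptime(t, "%Y-%m-%d %H:%M").hour for t in timestamps)
    return dict(Counter(hours))


# Two derived measures: the total count and the counts per hour.
total = len(captured)
per_hour = counts_per_hour(captured)
```

The derived values are smaller and easier to compare than the raw observations, but the individual timestamps cannot be recovered from them, which is one sense in which derivation reduces volume.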

Primary, secondary and tertiary data

Primary data are generated by a researcher and their instruments within a research design of their making. Secondary data are data made available to others to reuse and analyse that are generated by someone else. So one person’s primary data can be another person’s secondary data. Tertiary data are a form of derived data, such as counts, categories, and statistical results. Tertiary data are often released by statistical agencies rather than secondary data to ensure confidentiality with respect to whom the data refer. For example, the primary data of the Irish census are precluded from being released as secondary data for 100 years after generation; instead the data are released as summary counts and categorical tertiary data.

Many researchers and institutions seek to generate primary data because they are tailored to their specific needs and foci, whereas these design choices are not available to those analysing secondary or tertiary data. Moreover, those using secondary and tertiary data as inputs for their own studies have to trust that the original research is valid.

In many cases researchers will combine primary data with secondary and tertiary data to produce more valuable derived data. For example, a retailer might seek to create a derived dataset that merges their primary sales data with tertiary geodemographics data (data about what kind of people live in different areas, which are derived from census and other public and commercial data) in order to determine which places to target with marketing material. Secondary and tertiary data are valuable because they enable replication studies and the building of larger, richer and more sophisticated datasets. They later produce what Crampton et al. (2012) term ‘data amplification’; that is, data when combined enables far greater insights by revealing associations, relationships and patterns which remain hidden if the data remain isolated. As a consequence, the secondary and tertiary data market is a multi-billion dollar industry (see Chapter 2).
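The retailer example can be sketched in Python as a join between two small datasets. Everything here is assumed for illustration: the area codes, segment labels, figures and the function name `revenue_per_household` are hypothetical, and a real geodemographic product would be far richer.

```python
# Hypothetical primary data: the retailer's own sales records, keyed by area.
sales = [
    {"area": "A1", "revenue": 1200.0},
    {"area": "A1", "revenue": 800.0},
    {"area": "B2", "revenue": 500.0},
]

# Hypothetical tertiary data: published summary categories per area.
geodemographics = {
    "A1": {"segment": "young families", "households": 400},
    "B2": {"segment": "retirees", "households": 250},
}


def revenue_per_household(sales_rows, area_profiles):
    """Merge sales with area profiles and derive revenue per household."""
    totals = {}
    for row in sales_rows:
        totals[row["area"]] = totals.get(row["area"], 0.0) + row["revenue"]

    derived = {}
    for area, total in totals.items():
        profile = area_profiles.get(area)
        if profile is None:
            continue  # no tertiary data for this area, so nothing to merge
        derived[area] = {
            "segment": profile["segment"],
            "revenue_per_household": total / profile["households"],
        }
    return derived


merged = revenue_per_household(sales, geodemographics)
```

Neither input dataset contains revenue per household by segment; the measure exists only once the two are combined through the shared area code.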

Indexical and attribute data and metadata

Data also vary in kind. Indexical data are those that enable identification and linking, and include unique identifiers, such as passport and social security numbers, credit card numbers, manufacturer serial numbers, digital object identifiers, IP and MAC addresses, order and shipping numbers, as well as names, addresses, and zip codes. Indexical data are important because they enable large amounts of non-indexical data to be bound together and tracked through shared identifiers, and enable discrimination, combination, disaggregation and re-aggregation, searching and other forms of processing and analysis. As discussed in Chapter 4, indexical data are becoming increasingly common and granular, escalating the relationality of datasets.

Attribute data are data that represent aspects of a phenomenon, but are not indexical in nature. For example, with respect to a person the indexical data might be a fingerprint or DNA sequence, with associated attribute data being age, sex, height, weight, eye colour, blood group, and so on. The vast bulk of data that are generated and stored within systems are attribute data.

Metadata are data about data. Metadata can either refer to the data content or the whole dataset. Metadata about the content includes the names and descriptions of specific fields (e.g., the column headers in a spreadsheet) and data definitions. These metadata help a user of a dataset to understand its composition and how it should be used and interpreted, facilitate the conjoining of datasets, interoperability and discoverability, and help in judging their provenance and lineage. Metadata that refers to a dataset as a whole has three different forms (NISO 2004). Descriptive metadata concerns identification and discovery and includes elements such as title, author, publisher, subject, and description. Structural metadata refers to the organisation and coverage of the dataset. Administrative metadata concerns when and how the dataset was created, details of the technical aspects of the data, such as file format, and who owns and can use the data. A common metadata standard for datasets that combines these three types of metadata is the Dublin Core (http://dublincore.org/). This standard requires datasets to have 15 accompanying metadata fields: title, creator, subject, description, publisher, contributor, date, type, format, identifier, source, language, relation, coverage, and rights. Metadata are essential components of all datasets, though they are often a neglected element of data curation, especially amongst researchers who are compiling primary data for their own use rather
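The fifteen fields listed above can be written down as a simple completeness check. This is a minimal sketch, not a Dublin Core validator: the record contents and the function name `missing_fields` are hypothetical, and the check follows the passage in treating every field as expected.

```python
# The fifteen field names as listed in the passage above.
DUBLIN_CORE_FIELDS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def missing_fields(record):
    """Return the listed fields that are absent or empty in a metadata record."""
    return [name for name in DUBLIN_CORE_FIELDS if not record.get(name)]


# Hypothetical, partially documented dataset record.
record = {
    "title": "Intersection traffic counts",
    "creator": "Example Transport Lab",
    "date": "2016-01-07",
    "format": "text/csv",
    "language": "en",
}

gaps = missing_fields(record)
```

A check of this kind makes neglected metadata visible at the point of curation, for instance before a primary dataset is shared as secondary data.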

Data, information, knowledge, wisdom

What unites these various kinds of data is that they form the base or bedrock of a knowledge pyramid: data precedes information, which precedes knowledge, which precedes understanding and wisdom (Adler 1986; Weinberger 2011). Each layer of the pyramid is distinguished by a process of distillation (reducing, abstracting, processing, organising, analysing, interpreting, applying) that adds organisation, meaning and value by revealing relationships and truths about the world (see Figure 1.1).

Figure 1.1 Knowledge pyramid (adapted from Adler 1986 and McCandless 2010)

While the order of the concepts within the pyramid is generally uncontested, the nature and difference between concepts often varies between schools of thought. Information, for example, is a concept that is variously understood across scholars. For some, information is an accumulation of associated data, for others it is data plus meaning, or the signal in the noise of data, or a multifaceted construct, or tertiary data wherein primary data has been reworked into analytical form. To a physicist, data are simply zeros and ones, raw bits; they are noise. Information is when these zeros and ones are organised into distinct patterns; it is the signal (von Baeyer 2003). Airwaves and communication cables then are full of flowing information – radio and television signals, telephone conversations, internet packets – meaningful patterns of data within the wider spectrum of noise. For others, information is a broader concept. Floridi (2010: 74), for example, identifies three types of information:

Factual: information as reality (e.g., patterns, fingerprints, tree rings)
Instructional: information for reality (e.g., commands, algorithms, recipes)
Semantic: information about reality (e.g., train timetables, maps, biographies).

The first is essentially meaningful data, what are usually termed facts. These are data that are organised and framed within a system of measurement or an external referent that inherently provide a basis to establish an initial meaning that holds some truth. Information also extends beyond data and facts through adding value that aids interpretation. Weinberger (2011: 2) thus declares: ‘Information is to data what wine is to the vineyard: the delicious extract and distillate.’ Such value could be gained through sorting, classifying, linking, or adding semantic content through some form of text or visualisation that informs about something and/or instructs what to do (for example, a warning light on a car’s dashboard indicating that the battery is flat and needs recharging, Floridi, 2010). Case (2002; summarised in Borgman 2007: 40) argues that differences in the definition of information hinge on five issues: uncertainty, or whether something has to reduce uncertainty to qualify as information; physicality, or whether something has to take on a physical form such as a book, an object, or the sound waves of speech to qualify as information; structure/process, or whether some set of order or relationships is required; intentionality, or whether someone must intend that something be communicated to qualify as information; and truth, or whether something must be true to qualify as information.

Regardless of how it is conceived, Floridi (2010) notes that given that information adds meaning to data, it gains currency as a commodity. It is, however, a particular kind of commodity, possessing three main properties (which data also share):

Non-rivalrous: more than one entity can possess the same information (unlike material goods)
Non-excludable: it is easily shared and it takes effort to seek to limit such sharing (such as enforcing intellectual property rights agreements or inserting pay walls)
Zero marginal cost: once information is available, the cost of reproduction is often negligible.

While holding the properties of being non-rivalrous and non-excludable, because information is valuable many entities seek to limit and control its circulation, thus increasing its value. Much of this value is added through the processes enacted in the information life cycle (Floridi 2010):

Occurrence: discovering, designing, authoring
Transmission: networking, distributing, accessing, retrieving, transmitting
Processing and management: collecting, validating, modifying, organising, indexing, classifying, filtering, updating, sorting, storing
Usage: monitoring, modelling, analysing, explaining, planning, forecasting, decision-making, instructing, educating, learning.

It is through processing, management and usage that information is converted into the even more valuable knowledge. As with all the concepts in the pyramid, knowledge is similarly a diversely understood concept. For some, knowledge is the ‘know-how that transforms information into instructions’ (Weinberger 2011: 3). For example, semantic information can be linked into recipes (first do this, then do that...) or a conditional form of inferential procedures (if such and such is the case do this, otherwise do this) (Floridi 2010). In this framing, information is structured data and knowledge is actionable information (Weinberger 2011). In other words, ‘knowledge is like the recipe that turns information into bread, while data are like the atoms that make up the flour and the yeast’ (Zelany 1987, cited in Weinberger 2011).

For others, knowledge is more than a set of instructions; it can be a practical skill, a way of knowing how to undertake or achieve a task, or a system of thought that coherently links together information to reveal a wider picture about a phenomenon. Creating knowledge involves applying complex cognitive processes such as perception, synthesis, extraction, association, reasoning and communication to information. Knowledge has more value than information because it provides the basis for understanding, explaining and drawing insights about the world, which can be used to formulate policy and actions. Wisdom, the pinnacle of the knowledge pyramid, is being able to sagely apply knowledge.

While not all forms of knowledge are firmly rooted in data – for example, conjecture, opinions, beliefs – data are clearly a key base material for how we make sense of the world. Data provide the basic inputs into processes such as collating, sorting, categorising, matching, profiling, and modelling that seek to create information and knowledge in order to understand, predict, regulate and control phenomena. And generating data over time and in different locales enables us to track, evaluate and compare phenomena across time, space and scale. Thus, although information and knowledge are rightly viewed as being higher order and more valuable concepts, data are nonetheless a key ingredient with significant latent value that is realised when converted to information and knowledge. Whoever then has access to high-quality and extensive data has a competitive advantage over those excluded in being able to generate understanding and wisdom. A key rationale for the open data movement, examined in Chapter 3, is gaining access to the latent value of administrative and public

@@ Line 1: / Line 1: @@
-==Rob Kitchin: The Data Revolution Big Data, Open Data, Data Infrastructures and Their Consequences==
+==Rob Kitchin: The Data Revolution Big Data, Open Data, Data Infrastructures and Their Consequences [https://uk.sagepub.com/en-gb/eur/the-data-revolution/book242780]==
-Conceptualising Data
+'''Extrakte'''
-Data are commonly understood to be the raw material produced by abstracting the world into
-categories, measures and other representational forms – numbers, characters, symbols, images,
+=== Conceptualising Data ===
-sounds, electromagnetic waves, bits – that constitute the building blocks from which information and
-knowledge are created. Data are usually representative in nature (e.g., measurements of a phenomena,
+Data are commonly understood to be the raw material produced by abstracting the world into categories, measures and other representational forms – numbers, characters, symbols, images, sounds, electromagnetic waves, bits – that constitute the building blocks from which information and knowledge are created. Data are usually representative in nature (e.g., measurements of a phenomena, such as a person’s age, height, weight, colour, blood pressure, opinion, habits, location, etc.), but can also be implied (e.g., through an absence rather than presence) or derived (e.g., data that are produced from other data, such as percentage change over time calculated by comparing data from two time periods), and can be either recorded and stored in analogue form or encoded in digital form as bits (binary digits).
-such as a person’s age, height, weight, colour, blood pressure, opinion, habits, location, etc.), but can
-also be implied (e.g., through an absence rather than presence) or derived (e.g., data that are
+....
-produced from other data, such as percentage change over time calculated by comparing data from
-two time periods), and can be either recorded and stored in analogue form or encoded in digital form
+Data then are a key resource in the modern world. Yet, given their utility and value, and the amount effort and resources devoted to producing and analysing them, it is remarkable how little conceptual attention has been paid to data in and of themselves. In contrast, there are thousands of articles and books devoted to the philosophy of information and knowledge. Just as we tend to focus on buildings
-as bits (binary digits). Good-quality data are discrete and intelligible (each datum is individual,
+and neighbourhoods when considering cities, rather than the bricks and mortar used to build them, so it is the case with data. Moreover, just as we think of bricks and mortar as simple building blocks rather than elements that are made within factories by companies bound within logistical, financial legal and market concerns, and are distributed, stored and traded, so we largely do with data.
-separate and separable, and clearly defined), aggregative (can be built into sets), have associated
-metadata (data about data), and can be linked to other datasets to provide insights not available from
+=== What are data? ===
-a single dataset (Rosenberg 2013). Data have strong utility and high value because they provide the
-key inputs to the various modes of analysis that individuals, institutions, businesses and science
+Etymologically the word data is derived from the Latin dare, meaning ‘to give’. In this sense, data are raw elements that can be abstracted from (given by) phenomena – measured and recorded in various ways. However, in general use, data refer to those elements that are taken; extracted through observations, computations, experiments, and record keeping (Borgman 2007). Technically, then, what we understand as data are actually capta (derived from the Latin capere, meaning ‘to take’); those units of data that have been selected and harvested from the sum of all potential data (Kitchinand Dodge 2011). As Jensen (1950: ix, cited in Becker 1952: 278) states:
-employ in order to understand and explain the world we live in, which in turn are used to create
-innovations, products, policies and knowledge that shape how people live their lives.
+::it is an unfortunate accident of history that the term datum... rather than captum... should have come to symbolize the unit-phenomenon in science. For science deals, not with ‘that which has been given’ by nature to the scientist, but with ‘that which has been taken’ or selected from natureby the scientist in accordance with his purpose.
-Data then are a key resource in the modern world. Yet, given their utility and value, and the amounteffort and resources devoted to producing and analysing them, it is remarkable how little conceptual
-attention has been paid to data in and of themselves. In contrast, there are thousands of articles and
-books devoted to the philosophy of information and knowledge. Just as we tend to focus on buildings
-and neighbourhoods when considering cities, rather than the bricks and mortar used to build them, so
-it is the case with data. Moreover, just as we think of bricks and mortar as simple building blocks
-rather than elements that are made within factories by companies bound within logistical, financial,
-legal and market concerns, and are distributed, stored and traded, so we largely do with data.
-Consequently, when data are the focus of enquiry it is usually to consider, in a largely technical sense,
-how they should be generated and analysed, or how they can be leveraged into insights and value,
-rather than to consider the nature of data from a more conceptual and philosophical perspective.
-of
-With this observation in mind, the principal aim of this book is threefold: to provide a detailed
-reflection on the nature of data and their wider assemblages; to chart how these assemblages are
-shifting and mutating with the development of new data infrastructures, open data and big data; and to
-think through the implications of these new data assemblages with respect to how we make sense of
-and act in the world. To supply an initial conceptual platform, in this chapter the forms, nature and
-philosophical bases of data are examined in detail. Far from being simple building blocks, the
-discussion will reveal that data are a lot more complex. While many analysts may accept data at face
-value, and treat them as if they are neutral, objective, and pre-analytic in nature, data are in fact
-framed technically, economically, ethically, temporally, spatially and philosophically. Data do not
-exist independently of the ideas, instruments, practices, contexts and knowledges used to generate,
-process and analyse them (Bowker 2005; Gitelman and Jackson 2013). Thus, the argument developed
-is that understanding data and the unfolding data revolution requires a more nuanced analysis than
-much of the open and big data literature presently demonstrates.
-What are data?
-Etymologically the word data is derived from the Latin dare, meaning ‘to give’. In this sense, data are
-raw elements that can be abstracted from (given by) phenomena – measured and recorded in various
-ways. However, in general use, data refer to those elements that are taken; extracted through
-observations, computations, experiments, and record keeping (Borgman 2007). Technically, then,
-what we understand as data are actually capta (derived from the Latin capere, meaning ‘to take’);
-those units of data that have been selected and harvested from the sum of all potential data (Kitchin
-and Dodge 2011). As Jensen (1950: ix, cited in Becker 1952: 278) states:
-it is an unfortunate accident of history that the term datum... rather than captum... should have
-come to symbolize the unit-phenomenon in science. For science deals, not with ‘that which has
-been given’ by nature to the scientist, but with ‘that which has been taken’ or selected from nature
-by the scientist in accordance with his purpose.
 Strictly speaking, then, this book should be entitled The Capta Revolution. However, since the term
 data has become so thoroughly ingrained in the language of the academy and business to mean capta,
