When we first introduced our data fragment model, we left out some details for the sake of simplicity, because they were not essential to understanding the principles behind the data integration approach. These details are nevertheless necessary to make the approach work in practice.
Data fragments correspond to facts about resources that are collected during data acquisition actions. Each data fragment is used to store both the data itself (i.e., an atomic piece of information about a resource) and metadata about it (e.g., where, when, by whom, using what tool this data was collected, but also what it represents and how it is represented). This metadata simplifies data integration and improves data quality by making it possible to centralize and automate the detection of inconsistencies and missing information.
This post takes a closer look at the data types of the different data elements that compose data fragments.
The “Data Acquisition Action” Part
Data acquisition actions occur when tools are used to acquire data from data sources, a process that is supported by a Data Acquisition Architecture. In this architecture, each data acquisition tool has a unique identifier, data sources are described in a Data Source Registry, and data acquisition tools register each of their data acquisition actions in a Data Acquisition Action Registry.
In the fragment data model, data acquisition actions are characterized by:
- An action identifier (e.g., action #1): This is a unique identifier generated by the Data Acquisition Action Registry when a tool registers an action. These identifiers are sortable, so actions can be ordered from the first to the last. The current version of the Data Acquisition Action Registry implements action identifiers as long integers;
- A timestamp (e.g., 2015-04-02T08:53:02Z): This should correspond to the moment at which an action is successfully completed. In practice, this is the time at which the action is registered. Timestamps are expressed according to the ISO-8601 standard;
- A tool (e.g., Web Scraper #24): This is the unique identifier of the tool in the architecture. It is a character string; and
- A data source (e.g., a given data set from http://census.gov): This is the unique identifier of the source in the Data Source Registry. It is expressed as a character string.
As shown above, the types of the four data acquisition action fields are well defined, and their domains (i.e., the authorized values for these fields) consist strictly of lists managed in other systems (e.g., the Data Acquisition Action Registry and the Data Source Registry).
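The four action fields described above can be sketched as a small record type. This is only an illustration: the class name `AcquisitionAction`, the helper names, and the identifier strings `web-scraper-24` and `census-gov-dataset` are assumptions made for the example, not names taken from the actual architecture.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AcquisitionAction:
    """Hypothetical record holding the four data acquisition action fields."""
    action_id: int       # long integer issued by the action registry; sortable
    timestamp: datetime  # registration time, parsed from an ISO-8601 string
    tool_id: str         # unique identifier of the tool in the architecture
    source_id: str       # unique identifier of the source in the source registry


def parse_iso8601_utc(text: str) -> datetime:
    """Parse an ISO-8601 timestamp such as 2015-04-02T08:53:02Z."""
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11 on,
    # so the suffix is rewritten as an explicit UTC offset for portability.
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def in_registration_order(actions):
    """Order actions from the first registered to the last, using their identifiers."""
    return sorted(actions, key=lambda action: action.action_id)
```

Because the identifiers are sortable, ordering a batch of actions needs nothing more than a sort on `action_id`; the timestamp is kept as a timezone-aware value so that actions registered by tools in different time zones remain comparable.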
The “Resource” Part
Resources (i.e., the ‘things’ of interest that are described by the facts collected during the data acquisition actions) are characterized by:
- A resource type (e.g., product, customer, geographical area): Supported resource types are defined in a dedicated resource type controlled vocabulary managed in a vocabulary bank. In practice, this field contains a term identifier from the resource type vocabulary. It is expressed as a character string.
- A unique identifier: Resource unique identifiers are generated and managed by the Identifier Service taking into account the type of the resource to identify. In the current implementation, it consists of a long integer.
Here too, the types of the two resource fields are well defined, and their domains consist strictly of lists managed in two other systems (i.e., the Vocabulary Bank and the Identifier Service).
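Since the domain of the resource type field is a controlled vocabulary, a fragment can be checked against it when it is built. The sketch below assumes a hard-coded set, `RESOURCE_TYPE_VOCABULARY`, standing in for a lookup in the vocabulary bank; the term identifiers and the `Resource` class are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the resource type vocabulary held in the vocabulary bank.
RESOURCE_TYPE_VOCABULARY = {"product", "customer", "geographical-area"}


@dataclass(frozen=True)
class Resource:
    """Hypothetical record holding the two resource fields."""
    resource_type: str  # term identifier from the resource type vocabulary
    resource_id: int    # long integer issued by the Identifier Service

    def __post_init__(self):
        # Reject resource types that are not terms of the controlled vocabulary.
        if self.resource_type not in RESOURCE_TYPE_VOCABULARY:
            raise ValueError(f"unknown resource type: {self.resource_type!r}")
        # Reject identifiers that are not integers.
        if not isinstance(self.resource_id, int):
            raise ValueError(f"resource identifier must be an integer: {self.resource_id!r}")
```

For example, `Resource("geographical-area", 94114)` is accepted, while `Resource("spaceship", 1)` raises a `ValueError` because the term is not in the assumed vocabulary.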
The “Fact” Part
When initially introduced, facts (about resources) were described with 3 fields: a fact property, a fact value, and a fact timestamp. For example, a building (the resource) had a value (the fact property) of 1,000,000 (the fact value) on Jan. 1, 2015 (the fact timestamp). In this simplified representation, information such as the type of the value (a money amount) or its unit (US$) was omitted. It was considered implicit and dependent on the property itself.
This is problematic for several reasons:
- When collecting information from several sources, it is the responsibility of each data acquisition tool to ensure that the correct type and unit are used. This is complex and ambiguous. For example, if I collect an amount in euros, what exchange rate does the tool use to do the conversion, how does it get it, and more importantly, what will this amount mean in 6 months?
- Some values are free texts for which it is valuable to mention the language. Others are controlled vocabulary terms for which it is important to identify the source vocabulary.
This is why, in the full version of the fragment data model, facts are characterized by:
- A fact property: Supported fact property names are defined in a dedicated fact property name controlled vocabulary managed in a vocabulary bank. In practice, this field contains a term identifier from the fact property name vocabulary. It is expressed as a character string.
- A fact type: Supported fact types are defined in a dedicated fact type controlled vocabulary managed in a vocabulary bank. In practice, this field contains a term identifier from the fact type vocabulary. It is expressed as a character string. Examples of fact types include: vocabulary term, language string, money amount, area, etc.
- A fact value context: Fact value contexts provide additional information about fact values and depend on the type of the fact. For example, the context of a vocabulary term will be the vocabulary the term comes from, the context of a language string will be the language of the string, the context of a money amount will be the currency of the amount, the context of an area will be the unit in which its surface is expressed, etc. For most fact types, supported fact value contexts are defined in a controlled vocabulary specific to this type. In practice, this field contains a term identifier from the fact value context vocabulary specific to the fact type considered. It is expressed as a character string.
- A fact value: This is the actual value of the fact. As was just described, its type and domain depend on the fact type field.
- A fact timestamp: This corresponds to the time at which the fact value applied (e.g., when the value of a building was appraised at $1,000,000, or when a person was 35 years of age). Fact timestamps are expressed according to the ISO-8601 standard.
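The dependency between the fact type, the fact value context, and the fact value can be sketched as follows. The two lookup tables are hypothetical stand-ins for the per-type context vocabularies; the term identifiers (`money-amount`, `USD`, `m2`, and so on) and the Python types chosen for each fact type are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import Any

# Hypothetical context vocabularies, one per fact type.
FACT_VALUE_CONTEXTS = {
    "money-amount": {"USD", "EUR"},   # currencies
    "language-string": {"en", "fr"},  # languages
    "area": {"m2", "sqft"},           # surface units
}

# Hypothetical Python type expected for the value of each fact type.
FACT_VALUE_TYPES = {
    "money-amount": Decimal,
    "language-string": str,
    "area": Decimal,
}


@dataclass(frozen=True)
class Fact:
    """Hypothetical record holding the five fact fields."""
    fact_property: str
    fact_type: str
    value_context: str
    value: Any
    timestamp: datetime


def validate_fact(fact: Fact) -> list:
    """Return a list of problems found in the fact; an empty list means none."""
    contexts = FACT_VALUE_CONTEXTS.get(fact.fact_type)
    if contexts is None:
        return [f"unknown fact type: {fact.fact_type}"]
    problems = []
    # The context must come from the vocabulary specific to the fact type.
    if fact.value_context not in contexts:
        problems.append(f"context {fact.value_context!r} is not valid for {fact.fact_type}")
    # The value must have the Python type assumed for the fact type.
    if not isinstance(fact.value, FACT_VALUE_TYPES[fact.fact_type]):
        problems.append(f"value {fact.value!r} has the wrong type for {fact.fact_type}")
    return problems
```

With these assumed vocabularies, a building value of `Decimal("1000000")` with context `USD` passes, whereas the same amount given as a string with context `en` is reported twice: once for the context and once for the value type.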
The table below provides a commented example of an action-fact data fragment. It contains 11 rows, one for each fragment field. The first column provides the field name. The second column indicates the type/domain of the field. The third column provides the actual value of the field for the fragment example considered. Finally, the fourth column provides additional comments or explanations.
Of the 11 fields of a data fragment, the “resource identifier”, “fact property”, and “fact value” fields are the actual data. They correspond to the information found in the subject-predicate-object triples used by the semantic web to make statements about resources (e.g., area 94114 has buildings for a total value of $8,508,810,400). The 8 other fields are metadata. They make explicit what the described resource is, how information (i.e., facts) about this resource was collected, what these facts are, and what they represent. This metadata simplifies data integration by making it possible to automatically detect inconsistencies (e.g., values expressed using incompatible units) and missing information (e.g., a string in an unknown language). Making the metadata explicit allows for simpler, more efficient code and better data quality.
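The split between data and metadata, and the kind of automatic check that the metadata enables, can be sketched with a fragment held as a flat dictionary. The field names, the identifier strings, and the sample values below are assumptions chosen for illustration (the amount and the area number come from the example above); they are not the values of the commented example.

```python
# The three fields that carry the actual data (the subject-predicate-object triple).
DATA_FIELDS = ("resource_id", "fact_property", "fact_value")

# Illustrative fragment with the 11 fields; names and values are hypothetical.
EXAMPLE_FRAGMENT = {
    "action_id": 1,
    "action_timestamp": "2015-04-02T08:53:02Z",
    "tool_id": "web-scraper-24",
    "source_id": "census-gov-dataset",
    "resource_type": "geographical-area",
    "resource_id": 94114,
    "fact_property": "total-building-value",
    "fact_type": "money-amount",
    "fact_value_context": "USD",
    "fact_value": 8508810400,
    "fact_timestamp": "2015-01-01T00:00:00Z",
}


def split_fragment(fragment):
    """Separate the data triple from the eight metadata fields."""
    triple = tuple(fragment[name] for name in DATA_FIELDS)
    metadata = {k: v for k, v in fragment.items() if k not in DATA_FIELDS}
    return triple, metadata


def mixed_contexts(fragments):
    """Find (resource type, fact property) pairs reported with more than one context."""
    seen = {}
    for fragment in fragments:
        key = (fragment["resource_type"], fragment["fact_property"])
        seen.setdefault(key, set()).add(fragment["fact_value_context"])
    return {key: contexts for key, contexts in seen.items() if len(contexts) > 1}
```

Calling `split_fragment(EXAMPLE_FRAGMENT)` yields the triple `(94114, "total-building-value", 8508810400)` and a dictionary of the eight remaining fields. `mixed_contexts` flags a property whose values arrive in both `USD` and `EUR`, which is one simple way to surface values expressed in incompatible units without any tool having to convert them at acquisition time.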