In recent posts, we have introduced a data acquisition workflow during which data from multiple sources are collected and used to uniquely identify resources of interest and fragment their descriptions into normalized action-fact data fragments. Once this is done, data integration and curation can begin.
In our approach, both activities (i.e., data integration and data curation) consist of data acquisition actions. In fact, data acquisitions (i.e., the creation of data via the addition of new action-fact fragments) are the only actions allowed. Existing data fragments are never updated or deleted, unless there is a legal obligation to delete information, for example under data privacy laws.
Data integration occurs at the action-fact fragment level. It is required when two or more fragments are inconsistent. The table below shows an example of such an inconsistency. In this example, two different actions (actions #1 and #2) have led to the acquisition, from two different sources (sources #13 and #6), of two incompatible values (32100 and 30100) for the population of area 94114 on July 1, 2012.
Once such an inconsistency is detected, there are several ways to resolve it:
- Do nothing (and let each application handle the inconsistency);
- Select one of the possible values, for example by deciding that data source #13 is more reliable than data source #6 or that the data acquired most recently is more likely to be correct;
- Derive a new value from existing values, for example by deciding that the real value of the population is the average of the values provided by sources #13 and #6;
- Acquire the correct value from a third source.
Unless one decides to do nothing, data integration results in a new acquisition action, and the existing fragments are left untouched.
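The resolution strategies above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Fragment` fields, the function names, and the new action number (#3) are hypothetical and not taken from an actual implementation. The point it shows is that resolving a conflict appends a new fragment and leaves the conflicting ones in place.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass(frozen=True)  # frozen: a fragment is never modified once created
class Fragment:
    """One action-fact fragment (field names are assumptions)."""
    action: int            # acquisition action that produced the fragment
    source: Optional[int]  # data source, or None when the value is derived
    resource: str          # resource the fact is about
    attribute: str         # attribute being described
    valid_on: str          # date the fact refers to (ISO 8601)
    value: float


def prefer_source(conflicting: List[Fragment], trusted: int,
                  new_action: int) -> Fragment:
    """Select the value reported by the source deemed more reliable."""
    chosen = next(f for f in conflicting if f.source == trusted)
    return Fragment(new_action, chosen.source, chosen.resource,
                    chosen.attribute, chosen.valid_on, chosen.value)


def average(conflicting: List[Fragment], new_action: int) -> Fragment:
    """Derive a new value as the average of the conflicting values."""
    first = conflicting[0]
    return Fragment(new_action, None, first.resource, first.attribute,
                    first.valid_on, mean(f.value for f in conflicting))


# The two inconsistent fragments from the example.
store = [
    Fragment(1, 13, "area 94114", "population", "2012-07-01", 32100),
    Fragment(2, 6, "area 94114", "population", "2012-07-01", 30100),
]
conflicting = list(store)

# Integration is itself an acquisition: a third fragment is appended.
store.append(average(conflicting, new_action=3))
print(store[-1].value)  # 31100
```

Choosing `prefer_source(conflicting, trusted=13, new_action=3)` instead would append a fragment carrying the value 32100. Either way, fragments #1 and #2 remain in the store.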
Like data integration, data curation occurs at the action-fact fragment level and consists of creating, removing, or updating facts about resources by adding new fragments.
In the example above, a new fact was added on Jan. 24, 2015: “the population of area 94113 was 15,000 on Jul. 1, 2012.” It was then modified on Jan. 25, 2015 to: “the population of area 94113 was 17,000 on Jul. 1, 2012.” Finally, this fact was removed on Jan. 26, 2015.
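As a rough sketch of how these three curation steps could be recorded, the Python below appends one fragment per step and never edits an earlier one. The field names, the action numbers, and the convention that a value of `None` marks a removal are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Fragment:
    """One action-fact fragment (field names are assumptions)."""
    action: int            # placeholder action number
    acquired_on: date      # when the fragment was added
    resource: str
    attribute: str
    valid_on: date         # date the fact refers to
    value: Optional[int]   # None marks a removal (assumed convention)


def known_value(log: List[Fragment], resource: str, attribute: str,
                valid_on: date, as_of: date) -> Optional[int]:
    """Return the value of a fact as it was known on `as_of`."""
    matching = [f for f in log
                if f.resource == resource and f.attribute == attribute
                and f.valid_on == valid_on and f.acquired_on <= as_of]
    if not matching:
        return None  # the fact had not been acquired yet
    latest = max(matching, key=lambda f: (f.acquired_on, f.action))
    return latest.value  # None if the latest fragment is a removal


july_2012 = date(2012, 7, 1)
log = [
    # Creation, modification, and removal are all plain additions.
    Fragment(1, date(2015, 1, 24), "area 94113", "population", july_2012, 15000),
    Fragment(2, date(2015, 1, 25), "area 94113", "population", july_2012, 17000),
    Fragment(3, date(2015, 1, 26), "area 94113", "population", july_2012, None),
]

for day in (24, 25, 26):
    as_of = date(2015, 1, day)
    print(as_of, known_value(log, "area 94113", "population", july_2012, as_of))
```

Because nothing is overwritten, the value known on any past date can still be recovered after the fact has been modified or removed.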
Although data fragments are a powerful way to store, integrate and manage data, it is generally more efficient for applications to consume data presented in a more optimized form, such as the faceted descriptions we introduced in a previous post. These descriptions are obtained by building views.
Views are read-only constructions built from data fragments. They are expendable ways of presenting data, optimized for the specific applications they are built for.
In addition to application-specific views, this flexible approach allows one to easily build special views such as:
- State: All the facts at a given timestamp;
- Fact trending: The evolution of a given fact over time; or
- Action visualization: All the facts acquired by a given action.
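The three special views listed above could be built along the following lines. This is a minimal sketch, not a definitive implementation: the `Fragment` fields, the function names, the action numbers, and the use of `None` to mark a removed fact are hypothetical. Each view is a pure function over the fragment log, so it can be discarded and rebuilt at will.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Fragment:
    """One action-fact fragment (field names are assumptions)."""
    action: int
    acquired_on: date
    resource: str
    attribute: str
    valid_on: date
    value: Optional[int]   # None marks a removal (assumed convention)


FactKey = Tuple[str, str, date]


def state_view(log: List[Fragment], as_of: date) -> Dict[FactKey, int]:
    """State: all the facts as known at a given timestamp."""
    state: Dict[FactKey, int] = {}
    for f in sorted(log, key=lambda f: (f.acquired_on, f.action)):
        if f.acquired_on > as_of:
            continue
        key = (f.resource, f.attribute, f.valid_on)
        if f.value is None:
            state.pop(key, None)   # a removal hides the fact
        else:
            state[key] = f.value   # a later fragment supersedes earlier ones
    return state


def fact_trend(log: List[Fragment], resource: str, attribute: str,
               valid_on: date) -> List[Tuple[date, Optional[int]]]:
    """Fact trending: the evolution of one fact over time."""
    history = [f for f in log
               if (f.resource, f.attribute, f.valid_on)
               == (resource, attribute, valid_on)]
    history.sort(key=lambda f: (f.acquired_on, f.action))
    return [(f.acquired_on, f.value) for f in history]


def action_view(log: List[Fragment], action: int) -> List[Fragment]:
    """Action visualization: all the facts acquired by one action."""
    return [f for f in log if f.action == action]


july_2012 = date(2012, 7, 1)
log = [
    Fragment(1, date(2015, 1, 24), "area 94113", "population", july_2012, 15000),
    Fragment(2, date(2015, 1, 25), "area 94113", "population", july_2012, 17000),
    Fragment(3, date(2015, 1, 26), "area 94113", "population", july_2012, None),
]

print(state_view(log, date(2015, 1, 25)))
print(fact_trend(log, "area 94113", "population", july_2012))
print(action_view(log, 2))
```

None of these functions writes to the log, which keeps the views read-only by construction.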
Note that, because views have access to all the fragments, an application can be set to use either a view built from data that has already been integrated and curated, or a view built from the data as it was before integration and curation. The latter leaves the responsibility for integration and curation to the application.