This post concludes our series on data integration by reviewing the architectural components necessary to support the data acquisition and integration approach presented earlier.
Data Acquisition Action Registry
Each data fragment in the fragment store was added as the result of a data acquisition action (see part 1). In order to keep track of these actions, each application that brings new fragments into the store (e.g., web scrapers, data extractors, data curation tools) has to register every data acquisition action in a Data Acquisition Action Registry. The registry ensures that each registered action is uniquely identified by providing the action identifiers required by the Action-Fact Fragment data model.
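The registry's role can be sketched as a small service that mints a unique identifier per acquisition action. This is an illustrative in-memory sketch (class and field names are assumptions, not part of the original design); a production registry would be a shared service.

```python
import uuid
from dataclasses import dataclass

@dataclass
class AcquisitionAction:
    action_id: str
    agent: str       # the application that performed the action, e.g. a scraper
    source: str      # the data source that was queried
    timestamp: str

class DataAcquisitionActionRegistry:
    """Hypothetical in-memory registry of data acquisition actions."""
    def __init__(self):
        self._actions = {}

    def register(self, agent: str, source: str, timestamp: str) -> str:
        # The registry's main job: mint the unique action identifier
        # required by the Action-Fact Fragment data model.
        action_id = str(uuid.uuid4())
        self._actions[action_id] = AcquisitionAction(action_id, agent, source, timestamp)
        return action_id

    def lookup(self, action_id: str) -> AcquisitionAction:
        return self._actions[action_id]
```

Every application that writes fragments (scraper, extractor, curation tool) would call `register` once per acquisition action and stamp the returned identifier on all fragments produced by that action.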
Data Source Registry
The Data Source Registry is a catalog service that provides up-to-date information about the data sources being integrated. This information is machine-readable and describes, for each data source:
- Its location;
- The application program interfaces (APIs) it supports;
- The data available: the type of resources described, the record fields provided for each resource, and how each field maps to a fact property name and value.
Data acquisition systems (e.g., data scrapers, data extractors) use the Data Source Registry to discover, connect to and query data sources in an automatic way.
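A registry entry might look like the following sketch. The structure, field names, and example source are illustrative assumptions; the point is that the entry is machine-readable and carries both the connection details and the field-to-property mappings.

```python
# A hypothetical Data Source Registry entry (all names are illustrative).
source_entry = {
    "id": "city-open-data",
    "location": "https://data.example.org/api",   # where to reach the source
    "apis": ["rest-json"],                        # access methods it supports
    "resources": {
        "school": {
            # How each record field maps to a fact property name, so the
            # fragmentation step can produce normalized facts downstream.
            "field_mappings": {
                "school_name": {"property": "name"},
                "postal_code": {"property": "located-in", "value_type": "zip-code"},
            },
        }
    },
}

def fields_for(registry_entry: dict, resource_type: str) -> list:
    """Return the record fields a source provides for a given resource type."""
    return sorted(registry_entry["resources"][resource_type]["field_mappings"])
```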
Identifier Service
All the resources described in the fragment store are uniquely identified. The Identifier Service is responsible for identifying (i.e., providing a unique identifier to) the resources described in incoming data records. Identification is specific to the resource type, and its complexity depends on whether a "natural" identifier exists for that type and whether this identifier is present in the data collected. For example, it is easy to identify geographic areas with zip codes, especially when these codes are part of the incoming records. For resource types with no natural identifier, the Identifier Service has to maintain its own identifier scheme. In that case, each new incoming record must be compared to existing records in order to decide whether it corresponds to a new resource that requires a new identifier or to an existing resource whose identifier can be reused.
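The two identification cases described above (natural identifier present vs. record matching) can be sketched as follows. The matching step here is a deliberately trivial exact-match on a name field; a real Identifier Service would use proper record-linkage logic. All names are assumptions for illustration.

```python
class IdentifierService:
    """Sketch of resource identification; the matching logic is simplistic."""
    def __init__(self):
        self._known = {}    # match key -> previously issued identifier
        self._counter = 0

    def identify(self, record: dict, resource_type: str) -> str:
        # Easy case: the resource type has a natural identifier (e.g. a zip
        # code for geographic areas) and the incoming record carries it.
        if resource_type == "geographic-area" and "zip_code" in record:
            return f"geo:{record['zip_code']}"
        # Hard case: no natural identifier. Compare against known records
        # (here: normalized exact match on a name, standing in for real
        # record linkage) and mint a new identifier only when nothing matches.
        key = (resource_type, record.get("name", "").strip().lower())
        if key not in self._known:
            self._counter += 1
            self._known[key] = f"{resource_type}:{self._counter}"
        return self._known[key]
```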
Vocabulary Bank
As seen previously, facts are basic properties of resources, each characterized by a property name and a property value.
To avoid ambiguities and deal more efficiently with multilingualism, property names, and as many property values as possible, come from controlled vocabularies managed in a vocabulary bank.
We explained the rationale for using controlled vocabularies and managing them in a vocabulary bank in a previous post.
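As a rough illustration of how a vocabulary bank supports normalization and multilingualism, the sketch below maps surface forms in several languages to a single controlled term identifier and back to language-specific labels. The term identifier and labels are invented for the example.

```python
class VocabularyBank:
    """Hypothetical vocabulary bank mapping surface forms to term identifiers."""
    def __init__(self):
        # term id -> {language: preferred label}
        self._labels = {
            "term:primary-school": {"en": "primary school", "fr": "école primaire"},
        }
        # lowercase surface form -> term id, built from all language labels
        self._index = {
            label.lower(): term_id
            for term_id, labels in self._labels.items()
            for label in labels.values()
        }

    def normalize(self, value: str):
        """Return the controlled term id for a raw value, or None if unknown."""
        return self._index.get(value.strip().lower())

    def label(self, term_id: str, lang: str = "en") -> str:
        """Return the preferred label of a term in the requested language."""
        return self._labels[term_id][lang]
```

Because fragments store the term identifier rather than any particular label, the same fact can be presented in English or French without ambiguity.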
Data Acquisition Architecture
The diagram below shows an architecture that supports the data acquisition workflow introduced in part 2.
Data is pulled into the workflow by a Data Extractor and/or a Web Scraper. Both rely on the Data Source Registry to find and query their targets, and on the Data Acquisition Action Registry to register their data acquisition actions and obtain action identifiers. Once this is done, they push the newly collected data into a Data Receiver that queues it until it is consumed by a Data Ingestor. The Data Ingestor caches the data, splits it into records, and calls the Identifier Service to identify the resources described in the records. Once the records have been identified, the Data Fragmentor looks up:
- The Data Source Registry to know how to map record fields into fact properties and
- The Vocabulary Bank to normalize the fact values.
Finally, the Data Fragmentor breaks the records into action-fact fragments and stores them in a Fragment Store.
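The final fragmentation step described above can be sketched as a function that turns one identified record into action-fact fragments. In a real pipeline the field mappings would come from the Data Source Registry and the normalization function from the Vocabulary Bank; here both are passed in as plain parameters, and the fragment shape is an assumption for illustration.

```python
def fragment_record(record: dict, resource_id: str, action_id: str,
                    field_mappings: dict, normalize) -> list:
    """Break one record into action-fact fragments.

    field_mappings: record field -> fact property name (from the source registry)
    normalize:      raw value -> controlled term id, or None (from the vocab bank)
    """
    fragments = []
    for field, value in record.items():
        if field not in field_mappings:
            continue  # fields with no mapping are simply ignored
        fragments.append({
            "action": action_id,                  # who/when, from the action registry
            "resource": resource_id,              # what, from the Identifier Service
            "property": field_mappings[field],    # controlled fact property name
            "value": normalize(value) or value,   # controlled term id when available
        })
    return fragments
```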
Note that this architecture can easily be extended to support push scenarios with data sources publishing their data directly to the Data Receiver.
Data Integration and Curation Architecture
The diagram below shows how applications can use the data contained in the Fragment Store.
Except for the Data Integrator, most applications do not use data fragments directly. Rather, they access the data via optimized views produced by a View Builder. Because the data in the Fragment Store is normalized (i.e., it only contains term identifiers), these applications may also have to look up these term identifiers in the Vocabulary Bank before the data can be presented to end users.
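The view-building and label-resolution steps might look like the following sketch, assuming fragments shaped as simple dictionaries with `resource`, `property`, and `value` keys (an assumption for this example, not a specification of the View Builder).

```python
def build_view(fragments: list, resource_id: str) -> dict:
    """Collapse one resource's fragments into a property -> values view."""
    view = {}
    for f in fragments:
        if f["resource"] == resource_id:
            view.setdefault(f["property"], []).append(f["value"])
    return view

def present(view: dict, labels: dict) -> dict:
    """Replace term ids with human-readable labels before display.

    labels: term id -> preferred label, as served by the Vocabulary Bank.
    Values that are not term ids (plain literals) pass through unchanged.
    """
    return {prop: [labels.get(v, v) for v in values]
            for prop, values in view.items()}
```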
In addition to reading data from the Fragment Store via the View Builder, applications that also write data to the Fragment Store, such as the Data Curation Application in the diagram, have to register their data acquisition actions in the Data Acquisition Action Registry before adding data directly to the Fragment Store as action-fact fragments.
Because data integration occurs at the fragment level, the Data Integrator works directly on fragments, reading and writing them without intermediaries. It only needs to register its data acquisition actions in the Data Acquisition Action Registry; it never has to access the Vocabulary Bank or the View Builder.
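One way a Data Integrator might operate purely on fragments is sketched below: it looks for facts asserted by several independent acquisition actions and writes a consolidation fragment stamped with its own action identifier. The agreement rule and fragment shape are illustrative assumptions, not the original design.

```python
def consolidate(fragments: list, integration_action_id: str,
                min_sources: int = 2) -> list:
    """Emit a new fragment for each fact supported by several distinct actions.

    Works entirely on already-normalized fragments, so no Vocabulary Bank
    or View Builder access is needed; the only external dependency is the
    action identifier obtained from the Data Acquisition Action Registry.
    """
    support = {}
    for f in fragments:
        key = (f["resource"], f["property"], f["value"])
        support.setdefault(key, set()).add(f["action"])
    return [
        {"action": integration_action_id, "resource": r, "property": p, "value": v}
        for (r, p, v), actions in support.items()
        if len(actions) >= min_sources
    ]
```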