Data Integration, Part 2: Data Acquisition Workflow

In this post, we will look at the workflow that leads from the acquisition of raw data about resources to its storage as the action-fact data fragments introduced in part 1.

[Figure: data acquisition workflow diagram]

As depicted in the diagram above, this workflow consists of four main steps:

  1. Data acquisition and caching,
  2. Identification,
  3. Normalization, and
  4. Fragmentation and storage.

Data Acquisition

The data acquisition step consists of acquiring data from a data source. This can involve calling an API, scraping a website, querying a database, running OCR on digitized documents, or any other action considered a relevant way to bring new data into the workflow. Once this raw data is acquired, it is immediately time-stamped and stored, either as-is or after transformation into JSON records, in a cache repository. This caching mechanism makes it possible to easily adapt and replay the workflow at any time without repeating the data acquisition step (and without having to worry about possible state changes that may have occurred at the source).
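To make this concrete, here is a minimal Python sketch of the acquisition-and-caching step. The endpoint, the cache layout, and the `acquire_and_cache` name are illustrative assumptions, not part of the workflow described above.

```python
# A minimal sketch of acquisition and caching, assuming a hypothetical
# HTTP endpoint and a local directory as the cache repository.
import json
import time
import hashlib
from pathlib import Path

import requests  # third-party HTTP library

CACHE_DIR = Path("cache")  # assumed cache repository location

def acquire_and_cache(url: str) -> Path:
    """Fetch raw data from a source and store it, time-stamped, in the cache."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    record = {
        "acquired_at": time.time(),  # time-stamped immediately after acquisition
        "source": url,
        "raw": response.text,        # stored as-is; could be parsed into JSON instead
    }

    # Derive a stable file name so later replays of the workflow can find the record.
    key = hashlib.sha256(f"{url}:{record['acquired_at']}".encode()).hexdigest()[:16]
    path = CACHE_DIR / f"{key}.json"
    CACHE_DIR.mkdir(exist_ok=True)
    path.write_text(json.dumps(record))
    return path
```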

Identification

The identification step consists of identifying the resources described in the incoming data records. Depending on the resource type, the data available, and the identifier, the complexity of this task can vary from rather simple to highly complex and demanding. For example, it is relatively easy to identify geographic areas by their zip codes, even if it might involve looking up incomplete addresses or GPS coordinates. Identifying resources described in bibliographic records from different sources, on the other hand, can prove tedious, as it potentially requires assigning and maintaining an identifier for each resource and involves comparing each new incoming record to the ones already acquired.
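As a rough illustration of the bibliographic case, the sketch below matches incoming records against a toy in-memory registry and mints a UUID for each new resource. The `title`/`author` match key is a deliberately crude assumption; a real system would need fuzzier comparison and persistent storage.

```python
# A minimal sketch of the identification step for bibliographic records.
import uuid

registry: dict[tuple, str] = {}  # maps a match key to the assigned identifier

def match_key(record: dict) -> tuple:
    """Build a crude match key; assumes 'title' and 'author' fields exist."""
    return (
        record.get("title", "").strip().lower(),
        record.get("author", "").strip().lower(),
    )

def identify(record: dict) -> str:
    """Return the identifier of the described resource, assigning one if new."""
    key = match_key(record)
    if key not in registry:
        registry[key] = str(uuid.uuid4())  # mint and maintain a new identifier
    return registry[key]
```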

Normalization

As explained in a previous post, controlled vocabularies are a valuable alternative to free text because their meaning is explicit, which makes it possible to avoid the ambiguities inherent in natural language. The normalization step consists of replacing the properties and values found in incoming records with the corresponding identifiers from data dictionaries and controlled vocabularies, for example, replacing country names (e.g., Belgium) with ISO 3166-1 codes (e.g., “be”).
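A minimal sketch of this step, using the country-code example above; the hand-written mapping stands in for a controlled vocabulary, which in practice would be far larger and externally maintained.

```python
# A minimal sketch of the normalization step: free-text country names are
# replaced with ISO 3166-1 alpha-2 codes from a small controlled vocabulary.
ISO_3166_1 = {
    "belgium": "be",
    "france": "fr",
    "germany": "de",
}

def normalize(record: dict) -> dict:
    """Replace free-text values with controlled-vocabulary identifiers."""
    normalized = dict(record)
    country = record.get("country", "").strip().lower()
    if country in ISO_3166_1:
        normalized["country"] = ISO_3166_1[country]
    return normalized
```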

Fragmentation and Storage

Finally, the fragmentation step consists of breaking each normalized record into actions and facts and storing the resulting action-fact fragments. A new fragment is created for each field in the record. This requires identifying the facts corresponding to the record fields. It also requires that each fragment be completed with information about the data acquisition action (i.e., action id, data source, and tool id).
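Here is a minimal sketch of how fragmentation might look. The dictionary-based fragment layout and its field names (`resource`, `fact`, `action`) are illustrative assumptions, not the exact structure introduced in part 1.

```python
# A minimal sketch of the fragmentation step: one fragment per record field,
# each completed with metadata about the data acquisition action.
def fragment(record: dict, resource_id: str, action_id: str,
             data_source: str, tool_id: str) -> list[dict]:
    """Break a normalized record into action-fact fragments."""
    fragments = []
    for field, value in record.items():
        fragments.append({
            "resource": resource_id,  # the identified resource the fact is about
            "fact": {field: value},   # the fact extracted from the record field
            "action": {               # provenance of the acquisition action
                "id": action_id,
                "source": data_source,
                "tool": tool_id,
            },
        })
    return fragments
```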

When the fragments are ready, they are stored in a fragment store, where they can be further integrated, curated, and used, as we'll see in an upcoming post.
