We will hear and then discuss the following presentations:
Joe Nelson: Coordinating Web Scrapers
Joe Nelson will demo a computer cluster architecture for massively parallel web scraping built out of free, open-source components. Learn how to provision, coordinate, and monitor scrapers in real time. After reviewing the pieces of the architecture, we’ll see it in action scraping a real site. Finally, we’ll see how the data is consolidated in S3 storage and touch on the next steps for data transformation.
David Massart: A Data Model, Workflow, and Architecture for Integrating Data
This presentation proposes an approach for integrating data from different sources. It starts by introducing “actions” and “facts”, the two core concepts of the data model on which the approach is based. It then looks at the workflow that leads from the acquisition of raw data from various sources to its storage and integration as action-fact data. Finally, it proposes an architecture for supporting this workflow.