Workshop: Introduction to web crawling with StormCrawler

JULIEN NIOCHE

CameraForensics

Introduction to web crawling with StormCrawler (and Elasticsearch)

In this workshop, we will explore StormCrawler, a collection of resources for building low-latency, large-scale web crawlers on Apache Storm. After a short introduction to Apache Storm and an overview of what StormCrawler provides, we’ll put it to use for a simple crawl before moving on to the deployed mode of Storm.

In the second part of the session, we will introduce metrics, index documents with Elasticsearch and Kibana, and dive into data extraction. Finally, we’ll cover recursive crawls and scalability. This course will be hands-on: attendees will run the code on their own machines.

Agenda

We will cover the following topics:

Introduction to web crawling
Apache Storm: architecture and concepts
StormCrawler: basic building blocks
How to use the archetype
Building & configuring
URLFilters, ParseFilters
Simple recursive crawls
How to debug?
Distributed mode: UI, logs, metrics
Elasticsearch resources
WARC module
Q&As

Audience

This course will suit Java developers with an interest in big data, stream processing, web crawling and archiving. It will provide a practical introduction to Apache Storm, Elasticsearch and, of course, StormCrawler, and will not require advanced programming skills.

Prerequisites

Attendees should bring their own laptop with Apache Maven and Java 8 or above installed. The examples and instructions will be run on a Linux distribution using the Eclipse IDE. Ideally, attendees should look at the Apache Storm and StormCrawler documentation beforehand and think about particular websites or crawl scenarios that they might be interested in.
