Asqatasun v4 - Heritrix configuration

The crawler component of Asqatasun is based on Heritrix. Because of Asqatasun's specific needs in terms of content filtering and performance, the default Spring configuration file provided by Heritrix to define crawl properties has to be adapted. The changes are described below, chain by chain.

Candidate chain configuration

The Candidate chain is composed of two components: the Scoper and the Preparer.

The Scoper

The Scoper's job is to define, through a sequence of decide rules, the set of URIs that may be captured (for more information about the rules, please refer to https://webarchive.jira.com/wiki/display/Heritrix/Configuring+Crawl+Scope+Using+DecideRules).

Page Crawl

The decide rule sequence adapted to Asqatasun's page crawl is defined as follows:

<value>.*(?i)(\.(avi|wmv|mpe?g))$</value>
<value>.*(?i)(\.(rar|zip|tar))$</value>
<value>.*(?i)(\.(doc|xls|odd))$</value>
<value>.*(?i)(\.(xml))$</value>
<value>.*(?i)(\.(txt|conf|pdf))$</value>
<value>.*(?i)(\.(swf))$</value>
<value>.*(?i)(\.(js))$</value>
<value>.*(?i)(\.(bmp|gif|jpe?g|png|svg|tiff?))$</value>
<value>.*(?i)(\.css(\?.*)?)$</value>
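These reject patterns would typically sit inside a MatchesListRegexDecideRule within the scope's DecideRuleSequence. The sketch below shows one plausible wiring following Heritrix 3 conventions; the bean id and the surrounding rules (initial reject, SURT-prefix accept) are assumptions, not the exact Asqatasun file.

```xml
<!-- Illustrative sketch only: bean id and neighbouring rules are assumptions. -->
<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <property name="rules">
    <list>
      <!-- reject everything by default, then accept the seed's SURT prefix -->
      <bean class="org.archive.modules.deciderules.RejectDecideRule" />
      <bean class="org.archive.modules.deciderules.SurtPrefixedDecideRule" />
      <!-- finally reject URIs matching any of the extension patterns above -->
      <bean class="org.archive.modules.deciderules.MatchesListRegexDecideRule">
        <property name="decision" value="REJECT" />
        <property name="listLogicalOr" value="true" />
        <property name="regexList">
          <list>
            <value>.*(?i)(\.(avi|wmv|mpe?g))$</value>
            <value>.*(?i)(\.(bmp|gif|jpe?g|png|svg|tiff?))$</value>
            <!-- ... remaining patterns from the list above ... -->
          </list>
        </property>
      </bean>
    </list>
  </property>
</bean>
```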

Site Crawl

The decide rule sequence adapted to Asqatasun's site crawl is defined as follows:

<value>.*(?i)(\.(avi|wmv|mpe?g))$</value>
<value>.*(?i)(\.(rar|zip|tar))$</value>
<value>.*(?i)(\.(doc|xls|odd))$</value>
<value>.*(?i)(\.(xml))$</value>
<value>.*(?i)(\.(txt|conf|pdf))$</value>
<value>.*(?i)(\.(swf))$</value>
<value>.*(?i)(\.(js))$</value>
<value>.*(?i)(\.(bmp|gif|jpe?g|png|svg|tiff?))$</value>
<value>.*(?i)(\.css(\?.*)?)$</value>

The Preparer

This component is used with default settings.

Fetch chain configuration

The fetch chain is composed of five components (FetchDNS, FetchHttp, ExtractorHttp, ExtractorHtml, ExtractorCss) for a page crawl, and of six components (PreconditionEnforcer, FetchDNS, FetchHttp, ExtractorHttp, ExtractorHtml, ExtractorCss) for a site crawl.
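Using the stock Heritrix 3 bean ids, the site-crawl fetch chain could be declared roughly as below. This is a sketch based on Heritrix's default crawler-beans.cxml, not the actual Asqatasun file; the referenced bean ids are assumptions.

```xml
<!-- Sketch of the site-crawl fetch chain; bean ids follow Heritrix 3 defaults. -->
<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
  <property name="processors">
    <list>
      <ref bean="preconditions" /> <!-- PreconditionEnforcer: site crawl only -->
      <ref bean="fetchDns" />
      <ref bean="fetchHttp" />
      <ref bean="extractorHttp" />
      <ref bean="extractorHtml" />
      <ref bean="extractorCss" />
    </list>
  </property>
</bean>
```

For a page crawl, the `preconditions` reference is simply omitted from the list.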

The preselector, extractorJs and extractorSwf components have been removed because they do not match Asqatasun's crawl needs (for more information about the processors, please refer to https://webarchive.jira.com/wiki/display/Heritrix/Processor+Settings).

PreconditionEnforcer (in case of Site Crawl)

In this component, one property has been overridden:

FetchDNS

This component performs the DNS lookups. One property has been overridden:

FetchHttp

This component fetches the content. Three properties have been overridden:

ExtractorHttp

This component is used with default settings.

ExtractorHTML

The Heritrix HTML extractor has been extended so that a listener can be registered. Otherwise, this extractor behaves exactly like the default one.

Two properties have been overridden:

ExtractorCSS

The Heritrix CSS extractor has been extended so that a listener can be registered. Otherwise, this extractor behaves exactly like the default one.

This component is used with default settings.

Disposition chain configuration

The disposition chain is composed of three components: the WriterProcessor, the CandidateProcessor and the DispositionProcessor.

The WriterProcessor

The original "WarcWriterProcessor" has been replaced by the "TanaguruWriterProcessor". This module is specific to Asqatasun's crawl needs: it converts the results of successful fetches (raw data) into Tanaguru-like web resources and contents.

Two properties are needed to define this processor:

<value>.*(?i)(/|\.htm|\.html|\.php|\.asp|\.aspx|\.jsp|\.do)$</value>
<value>.*(?i)(\.css(\?.*)?)$</value>
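The first pattern matches HTML-like pages and the second matches CSS resources. A bean definition could look like the sketch below; note that the class package and both property names are hypothetical placeholders chosen for illustration, since the source does not name them.

```xml
<!-- Sketch only: class package and property names are hypothetical. -->
<bean id="writerProcessor" class="org.asqatasun.crawler.TanaguruWriterProcessor">
  <!-- hypothetical name: regex identifying HTML-like pages -->
  <property name="htmlRegexp">
    <value>.*(?i)(/|\.htm|\.html|\.php|\.asp|\.aspx|\.jsp|\.do)$</value>
  </property>
  <!-- hypothetical name: regex identifying CSS resources -->
  <property name="cssRegexp">
    <value>.*(?i)(\.css(\?.*)?)$</value>
  </property>
</bean>
```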

The CandidateProcessor

This component is used with default settings.

The DispositionProcessor

This component's job is to mark up the crawl, late in the processing, with values and updated information.

Three properties have been overridden:

Crawl Controller configuration

This controller manages the whole crawl context; it collects all the processors that cooperate to perform a crawl and provides a high-level interface to the running crawl.

Two properties have been overridden for this component:
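The source does not name the two overridden properties. As an illustration only, a commonly tuned CrawlController setting in Heritrix 3 is the number of worker (toe) threads; the value shown is an arbitrary example.

```xml
<!-- Illustration only: the actual overridden properties are not listed here. -->
<bean id="crawlController" class="org.archive.crawler.framework.CrawlController">
  <!-- number of crawler worker threads; value is an arbitrary example -->
  <property name="maxToeThreads" value="10" />
</bean>
```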

The Frontier configuration

This component is used to manage the known hosts (queues) and pending URIs.

Two properties have been overridden for this component:
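Again the two properties are not named here. For illustration, Heritrix 3's default frontier is the BdbFrontier, and frequently tuned frontier settings include retry behaviour; the values shown are arbitrary examples, not Asqatasun's actual configuration.

```xml
<!-- Illustration only: frontier settings commonly tuned in Heritrix 3. -->
<bean id="frontier" class="org.archive.crawler.frontier.BdbFrontier">
  <!-- how many times a failed URI is retried; example value -->
  <property name="maxRetries" value="3" />
  <!-- delay before retrying a failed URI, in seconds; example value -->
  <property name="retryDelaySeconds" value="60" />
</bean>
```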

Crawl Limiter configuration (in case of Site Crawl)

This module stops the crawl when configured limits are reached.

Three properties can be set:
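The three properties are not listed here. A plausible shape for such a limiter bean is sketched below; the class name, property names and values are all hypothetical placeholders standing in for the real settings.

```xml
<!-- Sketch only: class and property names are hypothetical placeholders. -->
<bean id="crawlLimiter" class="org.asqatasun.crawler.CrawlLimiter">
  <property name="maxDocuments" value="1000" />    <!-- hypothetical: max URIs fetched -->
  <property name="maxDuration" value="3600" />     <!-- hypothetical: max crawl time, seconds -->
  <property name="maxSize" value="100000000" />    <!-- hypothetical: max downloaded bytes -->
</bean>
```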