Asqatasun v4 - Heritrix configuration
The Crawler Component of Asqatasun is based on Heritrix. Considering the specific needs of Asqatasun in terms of contents filtering and performance considerations, the default spring configuration file provided by heritrix to define crawl properties has to be adapted. The changes are described below, chain by chain.
Candidate chain configuration
The Candidate chain is composed of 2 components : the scoper and the preparer
The Scoper
The scoper's work is to define, through a decision rule sequence, the set of possible URIs that can be captured (for more informations about the rules, please refer to https://webarchive.jira.com/wiki/display/Heritrix/Configuring+Crawl+Scope+Using+DecideRules).
Page Crawl
The Decision Rule Sequence adapted to asqatasun's page crawl needs is defined as follows :
RejectDecideRule
with default settingsTooManyHopsDecideRule
with "maxHops" property set to 0 (to limit the crawl of the specified url)TransclusionDecideRule
with "maxTransHops" property set to 1 (to deal with redirected pages) and "maxSpeculativeHops" property set to 0.NotOnDomainsDecideRule
with the "decision" property set to REJECTPathologicalPathDecideRule
with default value ("maxRepetition" property set to 2)TooManyPathSegmentsDecideRule
with default value ("maxPathDepth" property set to 15)MatchesListRegexDecideRule
with the "decision" property set to REJECT, the "listLogicalOr" property set to true and the "regexList" property defined as follow :
<value>.*(?i)(\.(avi|wmv|mpe?g))$</value>
<value>.*(?i)(\.(rar|zip|tar))$</value>
<value>.*(?i)(\.(doc|xls|odd))$</value>
<value>.*(?i)(\.(xml))$</value>
<value>.*(?i)(\.(txt|conf|pdf))$</value>
<value>.*(?i)(\.(swf))$</value>
<value>.*(?i)(\.(js))$</value>
<value>.*(?i)(\.(bmp|gif|jpe?g|png|svg|tiff?))$</value>
MatchesListRegexDecideRule
with the "decision" property set to ACCEPT, the "listLogicalOr" property set to true and the "regexList" property defined as follow to include all CSS contents despite their domain:
<value>.*(?i)(\.css(\?.*)?)$</value>
PrerequisiteAcceptDecideRule
with default settings
Site Crawl
The Decision Rule Sequence adapted to asqatasun's page crawl needs is defined as follows
RejectDecideRule
with default settingsSurtPrefixedDecideRule
with "seedsAsSurtPrefixes" property set to trueTooManyHopsDecideRule
with "maxHops" property set to 300 (arbitrary value that correspond to infinite)TransclusionDecideRule
with "maxTransHops" property set to 1 (to deal with redirected pages) and "maxSpeculativeHops" property set to 0.NotOnDomainsDecideRule
with the "decision" property set to REJECTPathologicalPathDecideRule
with default value ("maxRepetition" property set to 2)TooManyPathSegmentsDecideRule
with default value ("maxPathDepth" property set to 15)MatchesListRegexDecideRule
with the "decision" property set to REJECT, the "listLogicalOr" property set to true and the "regexList" property defined as follow :
<value>.*(?i)(\.(avi|wmv|mpe?g))$</value>
<value>.*(?i)(\.(rar|zip|tar))$</value>
<value>.*(?i)(\.(doc|xls|odd))$</value>
<value>.*(?i)(\.(xml))$</value>
<value>.*(?i)(\.(txt|conf|pdf))$</value>
<value>.*(?i)(\.(swf))$</value>
<value>.*(?i)(\.(js))$</value>
<value>.*(?i)(\.(bmp|gif|jpe?g|png|svg|tiff?))$</value>
MatchesListRegexDecideRule
with the "decision" property set to ACCEPT, the "listLogicalOr" property set to true and the "regexList" property defined as follow to include all css contents despite their domain:
<value>.*(?i)(\.css(\?.*)?)$</value>
The preparer
This component is used with default settings.
Fetch chain configuration
The fetch chain is composed of 5 components (FetchDNS, FetchHttp, ExtractorHttp, ExtractorHtml, ExtractorCss) in case of page crawl and 6 components (PreconditionEnforcer, FetchDNS, FetchHttp, ExtractorHttp, ExtractorHtml, ExtractorCss) in case of site crawl.
The preselector, extractorJs and extractorSwf components have been removed because they are not adapted to asqatasun's crawl needs. (for more informations about the processors, please refer to https://webarchive.jira.com/wiki/display/Heritrix/Processor+Settings
PreconditionEnforcer (in case of Site Crawl)
In this component, one property has been overriden :
- The "calculateRobotsOnly" is set to true to exclude Uri's that are excluded by the robots.txt file.
FetchDNS
This component's work is to realize a dns lookup. One property has been overidden:
- The "digestContent" property set to false (to avoid useless treatments)
FetchHttp
This component's work is to fetch the content. Three properties have been overriden :
- The "TimeoutSeconds" property set to 10.
- The "defaultEncoding' property set to UTF-8
- The "SoTimeoutMs" property set to 10000
- The "digestContent" property set to false (to avoid useless treatments)
ExtractorHttp
This component is used with default settings.
ExtractorHTML
The heritrix HTML extractor has been extended to enable to register a listener. Otherwise, this extractor behaves exactly the same way.
Two properties have been been overriden :
- The
extractJavascript
property is set to false - The
treatFramesAsEmbedLinks
property is set to false to make sure frames are filtered in case of page crawl.
ExtractorCSS
The heritrix CSS extractor has been extended to enable to register a listener. Otherwise, this extractor behaves exactly the same way.
This component is used with default settings.
Disposation chain configuration
The disposation chain is composed of 3 components : The writerProcessor, the candidateProcessor and the disposationProcessor
The WriterProcessor
The original "WarcWriterProcessor" has been replaced by the "TanaguruWriterProcessor". This module is specific and corresponds to asqatasun's crawl needs. It converts the results of successful fetches (raw data) to tanaguru-like Web-resources and Contents.
Two properties are needed to define this processor
- The
htmlRegexp
property defined as follows :
<value>.*(?i)(/|\.htm|\.html|\.php|\.asp|\.aspx|\.jsp|\.do)$</value>
- The
cssRegexp
property defined as follows
<value>.*(?i)(\.css(\?.*)?)$</value>
The CandidateProcessor
This component is used with default settings.
The DisposationProcessor
This component's work is to mark-up, late in the processing, the crawl with values and updating informations.
Three properties have been overriden :
- The
minDelayMs
property is set to 0 to avoid wasting time before recontacting same server (by default, this property is set to 3000 ms) - The
maxDelayMs
property is set to 1000(ms) (by default, this property is set to 30000) - The
respectCrawlDelayUpToSeconds
property set to 5
Crawl Controller configuration
This controller manages allt the crawl context ; it collects all the processors which cooperate to perform a crawl and provides a high-level interface to the running crawl.
Two properties have been overriden for this component:
- The
PauseAtStart
property set to false - The
PauseAtFinish
property set to false - The
maxToeThread
property set to 10 in case of site crawl and set to 3 in case of page crawl.
The Frontier configuration
This component is used to manage the known hosts (queues) and pending URIs.
Two properties have been overriden for this component:
- The
snoozeLongMs
property set to 0 - The
retryDelaySeconds
property set to 1 - The
maxRetries
set to 3
Crawl Limiter configuration (in case of Site Crawl)
This module enables to stop the crawl when some configured limits are reached.
Three properties can be set :
maxBytesDownload
that limits the number of downloaded bytesmaxDocumentsDownload
that limits the number of fetched resources. This property has been set to 10000 in the version1.0.0 of Asqatasun.maxTimeSeconds
that limits the duration of a crawl.