** <tt>filters</tt> ''(opt.)'' filter conditions to include or exclude files and folders from the import (see the example after this parameter list)

*** <tt>maxFileSize</tt>: maximum file size; larger files are filtered out

*** <tt>maxFolderDepth</tt>: starting from the root folder, this is the maximum depth to crawl into subdirectories. ''(Hint: folder structures inside compounds are not taken into account here.)''

*** <tt>followSymbolicLinks</tt>: whether to follow symbolic links to files/folders or not

*** <tt>filePatterns</tt>: regex patterns for filtering crawled files on the basis of their file name

**** <tt>include</tt>: if include patterns are specified, at least one of them must match the file name. If no include patterns are specified, this is handled as if all file names are included.

**** <tt>exclude</tt>: if at least one exclude pattern matches the file name, the crawled file is filtered out.

**** '''(Hint: the patterns must use forward slashes as directory separators, even if your file system uses backslashes as folder delimiters.)'''

*** <tt>folderPatterns</tt>: regex patterns for filtering crawled folders and files on the basis of their complete folder path. '''(Hint: in contrast to the file patterns, a folder pattern must match the complete path; matching just the folder name is not sufficient!)'''

**** <tt>include</tt>: only relevant for crawled files: if include patterns are specified, at least one of them must match the file path. If no include patterns are specified, this is handled as if all file paths are included.

**** <tt>exclude</tt>: only relevant for crawled folders: if at least one exclude pattern matches the folder path, the folder (and its subdirectories) will not be imported.

**** '''(Hint: the patterns must use forward slashes as directory separators, even if your file system uses backslashes as folder delimiters.)'''

** <tt>mapping</tt> ''(req.)'' specifies how to map file properties to record attributes (see the example below)
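For illustration, the filter and mapping parameters of a crawl job might look like the following sketch. The parameter names are the ones described above; all values, and the attribute names on the right-hand side of the <tt>mapping</tt>, are hypothetical and need to be adapted to your data source:

<pre>
"parameters": {
  "dataSource": "file",
  "rootFolder": "/data/documents",
  "filters": {
    "maxFileSize": 10485760,
    "maxFolderDepth": 5,
    "followSymbolicLinks": false,
    "filePatterns": {
      "include": [ ".*\\.pdf", ".*\\.docx?" ],
      "exclude": [ "~.*" ]
    },
    "folderPatterns": {
      "exclude": [ ".*/temp(/.*)?" ]
    }
  },
  "mapping": {
    "filePath": "Path",
    "fileName": "Name",
    "fileSize": "Size",
    "fileLastModified": "LastModifiedDate"
  }
}
</pre>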

===== Processing =====

The File Crawler starts crawling in the <tt>rootFolder</tt>. It produces one record for each subdirectory in the bucket connected to <tt>directoriesToCrawl</tt> and one record per file in the bucket connected to <tt>filesToCrawl</tt>. The bucket in slot <tt>directoriesToCrawl</tt> should be connected to the input slot of the File Crawler so that the subdirectories are crawled in follow-up tasks. The resulting records do not yet contain the file content, but only the metadata attributes configured in the <tt>mapping</tt>.
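The slot wiring can be sketched roughly as follows. Note that this is an assumption about the workflow definition syntax (worker actions with named input/output buckets), not a verbatim example, and the bucket names are made up:

<pre>
"startAction": {
  "worker": "fileCrawler",
  "input":  { "directoriesToCrawl": "dirsToCrawlBucket" },
  "output": {
    "directoriesToCrawl": "dirsToCrawlBucket",
    "filesToCrawl": "filesToCrawlBucket"
  }
}
</pre>

The important point is that the same bucket serves as the <tt>directoriesToCrawl</tt> output and as the crawler's input, which is what creates the follow-up tasks for the subdirectories.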

The directory and file records are collected in bulks, whose size can be configured via the parameters <tt>maxFilesPerBulk</tt>, <tt>minFilesPerBulk</tt> and <tt>directoriesPerBulk</tt>:

* <tt>maxFilesPerBulk</tt> has the same effect in any of the following cases:

** ''not configured:'' a new <tt>filesToCrawl</tt> bulk is started every 1000 files.

** ''configured:'' a new <tt>filesToCrawl</tt> bulk is started when the configured value is reached.

* On the top level (root folder), each file of the folder is written to the <tt>filesToCrawl</tt> bulks and each subdirectory record goes to a separate <tt>directoriesToCrawl</tt> bulk. The parameters <tt>minFilesPerBulk</tt> and <tt>directoriesPerBulk</tt> are not considered (-> distribution for immediate scale-up).

* In follow-up tasks the handling is as follows:

** <tt>minFilesPerBulk</tt>

*** ''not configured:'' only files in the crawled directory are added to <tt>filesToCrawl</tt> bulks; all subdirectories are written to <tt>directoriesToCrawl</tt> bulks.

*** ''configured:'' if <tt>minFilesPerBulk</tt> is not reached with all files of the current folder, the crawler steps into the immediate subfolder(s) to reach the configured minimum size. Once the minimum size is reached, all remaining files of the current subfolder are also written to the <tt>filesToCrawl</tt> bulk(s). Remaining subfolders of the current folder and subfolders of already crawled subfolders are written to <tt>directoriesToCrawl</tt> bulks.

** <tt>directoriesPerBulk</tt>

*** ''not configured:'' each subdirectory that is not read directly is written to a separate <tt>directoriesToCrawl</tt> bulk.

*** ''configured:'' the given number of subdirectories is written to the same <tt>directoriesToCrawl</tt> bulk before a new one is started.

Please note that these parameters must be >= 0 and that <tt>minFilesPerBulk</tt> must be < <tt>maxFilesPerBulk</tt>. Otherwise your job will fail.
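For example, with the following (hypothetical) values, a follow-up task fills <tt>filesToCrawl</tt> bulks with between 100 and 500 files, and writes ten subdirectory records into each <tt>directoriesToCrawl</tt> bulk before starting a new one:

<pre>
"parameters": {
  "minFilesPerBulk": 100,
  "maxFilesPerBulk": 500,
  "directoriesPerBulk": 10
}
</pre>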

'''Source:'''

The attribute <tt>_source</tt> is set from the task parameter <tt>dataSource</tt>. It currently has no further meaning, but it is needed by the delta service.

'''Compounds:'''

If the running CompoundExtractor service identifies an object as an extractable compound, it is marked with the attribute <tt>_isCompound</tt> set to <tt>true</tt>.

=== File Fetcher ===

For each input record, the File Fetcher reads the file referenced in the attribute <tt>filePath</tt>, adds the content as attachment <tt>fileContent</tt>, and optionally adds further file properties.

The File Fetcher can be used in combination with the File Crawler, where the Crawler extracts the file metadata and the Fetcher adds the file content, or it can be used on its own to get both the file content and the metadata properties.

===== Configuration =====

* Worker name: <tt>fileFetcher</tt>

* Parameters:

** <tt>mapping</tt> ''(req.)'' needed to get the file path and to add the fetched file content (see the sketch below)

*** <tt>filePath</tt> ''(req.)'' to read the attribute that contains the file path

*** <tt>fileContent</tt> ''(req.)'' attachment name where the file content is written to



*** <tt>fileLastModified</tt> ''(opt.)'' mapping attribute for the file's last modified date

* Input slots:

** <tt>filesToFetch</tt>

* Output slots:

** <tt>files</tt>
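A minimal parameter sketch for the File Fetcher, assuming the crawler mapped the file path to an attribute named <tt>Path</tt> (the attribute and attachment names are illustrative):

<pre>
"parameters": {
  "mapping": {
    "filePath": "Path",
    "fileContent": "Content",
    "fileLastModified": "LastModifiedDate"
  }
}
</pre>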

=== File Extractor Worker ===

The File Extractor worker is used for extracting compounds (zip, tgz, etc.) in file crawling.

===== Configuration =====

* Worker name: <tt>fileExtractor</tt>

* Parameters:

** <tt>filters</tt> ''(opt., see File Crawler)''

*** <tt>maxFileSize</tt> ''(opt., see File Crawler)''

*** <tt>filePatterns</tt> ''(opt., see File Crawler)''

**** <tt>include</tt> ''(opt., see File Crawler)''

**** <tt>exclude</tt> ''(opt., see File Crawler)''

*** <tt>folderPatterns</tt> ''(opt., see File Crawler)''

**** <tt>include</tt> ''(opt., see File Crawler)''

**** <tt>exclude</tt> ''(opt.)'' the behaviour here differs slightly from the File Crawler: if an exclude pattern matches the folder path of an extracted file, the file is filtered out. Depending on the pattern, files from subdirectories may still be imported!

** <tt>mapping</tt> ''(req.)''

*** <tt>filePath</tt> ''(req., see File Crawler)'' needed to get the file path of the compound file to extract

*** <tt>fileFolder</tt> ''(opt., see File Crawler)''

*** <tt>fileName</tt> ''(opt., see File Crawler)''

*** <tt>fileExtension</tt> ''(opt., see File Crawler)''

*** <tt>fileSize</tt> ''(opt., see File Crawler)''

*** <tt>fileLastModified</tt> ''(opt., see File Crawler)''

*** <tt>fileContent</tt> ''(req., see File Fetcher)''

* Input slots:

** <tt>compounds</tt>

* Output slots:

** <tt>files</tt>

===== Processing =====

For each input record, an input stream to the referenced file is created and fed into the CompoundExtractor service to extract the compound elements. If an element is a compound itself, it is extracted as well; if it is not a compound, a new record is created. The produced records are converted to look like records produced by the File Crawler or File Fetcher, with the attributes and the attachment set as specified in the mapping configuration. Additionally, the following attributes are set (an illustrative record is sketched after the list):

* <tt>_deltaHash</tt>: computed in the same way as by the File Crawler worker

* <tt>_compoundRecordId</tt>: record ID of the top-level compound this element was extracted from

* <tt>_isCompound</tt>: set to <tt>true</tt> for elements that are compounds themselves

* <tt>_compoundPath</tt>: sequence of <tt>filePath</tt> attribute values of the compound objects needed to navigate to the compound element
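For illustration, a record produced for a file inside a nested archive might carry attributes along these lines (all paths, names and values are made up, and the <tt>Path</tt> attribute assumes a <tt>filePath</tt> mapping as in the earlier sketches):

<pre>
{
  "_source": "file",
  "Path": "/data/outer.zip/inner.zip/docs/readme.txt",
  "_deltaHash": "<hash computed as by the File Crawler>",
  "_compoundRecordId": "<record ID of the record for /data/outer.zip>",
  "_compoundPath": [ "/data/outer.zip", "/data/outer.zip/inner.zip" ],
  "_isCompound": false
}
</pre>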