BoxOfDocs is an online ‘one-stop’ curated document-sharing platform that lets peer members in any industry share documented information. An AI-driven, industry-specific search engine ensures that members can quickly and easily find the most recent and relevant documents available from their industry peers.

Project Description:

The ECOS project requires the engineering of an intelligent agent that will, with no direct human control, collect key unstructured data from 3500 distinct websites.

Using Deep Learning, develop an algorithm that will determine how and when the websites must be searched, identify which data and documents to collect, categorize each, extract key information from the documents, and ensure all the data collected is kept up to date and secure.

Ensure the algorithm’s performance is efficient and accurate.

Research Objectives/Sub-Objectives:

OBJECTIVE 1 – Collect

Determine the best methodology for accessing and collecting key unstructured data and documents from 3500 distinct websites on a regularly scheduled basis.
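As a rough illustration of what "regularly scheduled" collection could mean in practice, the sketch below picks out the sites that are due for a visit. The `Site` record, its per-site `interval`, and the `due_sites` helper are hypothetical names introduced here; the brief does not prescribe a scheduling design.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Site:
    url: str
    interval: timedelta                      # assumed per-site revisit interval
    last_fetched: Optional[datetime] = None  # None means never collected


def due_sites(sites: List[Site], now: datetime) -> List[Site]:
    """Return the sites whose next scheduled visit is at or before `now`."""
    return [
        s for s in sites
        if s.last_fetched is None or s.last_fetched + s.interval <= now
    ]
```

A real scheduler would probably also adapt each interval to how often a site actually changes, which this sketch leaves out.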

Develop, most likely through Natural Language Processing and Natural Language Generation, an algorithm that identifies documents with nondescript titles and generates a new reader-friendly title for each of them.
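Before any NLP model is chosen, a simple rule-based baseline could flag the obvious cases. The sketch below is only such a baseline, not the intended algorithm: the `GENERIC_WORDS` list, the extension pattern, and the `is_nondescript` function are assumptions made for illustration, and title generation is out of its scope.

```python
import re

# Hypothetical list of words that carry no information about a document's content.
GENERIC_WORDS = {
    "document", "doc", "scan", "untitled", "file", "final",
    "copy", "new", "draft", "img", "image",
}

# Common file extensions to strip before inspecting the title (assumed set).
EXTENSION_RE = re.compile(r"\.(pdf|docx?|xlsx?|pptx?|txt)$", re.IGNORECASE)


def is_nondescript(title: str) -> bool:
    """Return True when the title contains no meaningful word."""
    stem = EXTENSION_RE.sub("", title.strip())
    words = re.findall(r"[a-z]+", stem.lower())
    meaningful = [w for w in words if len(w) > 2 and w not in GENERIC_WORDS]
    return not meaningful
```

Titles flagged this way would then be passed to whichever title-generation approach the project selects.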

OBJECTIVE 5 – Correlate

Develop an algorithm that identifies documents that are related.

Design a structure that maintains those document relations within a database and makes them accessible to web applications.
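One possible shape for both parts of this objective is sketched below, under stated assumptions: relatedness is approximated by Jaccard similarity over word sets (a stand-in for whatever model the project adopts), and relations are kept in a single SQLite table. The table name `document_relation`, the `threshold` default, and all function names are hypothetical.

```python
import re
import sqlite3
from itertools import combinations


def tokens(text):
    """Lower-cased set of alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def store_relations(conn, docs, threshold=0.3):
    """Store every document pair whose similarity reaches the threshold.

    `docs` maps an integer document id to its text. Each pair is stored once,
    with the smaller id first.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS document_relation ("
        " doc_a INTEGER NOT NULL, doc_b INTEGER NOT NULL, score REAL NOT NULL,"
        " PRIMARY KEY (doc_a, doc_b), CHECK (doc_a < doc_b))"
    )
    token_sets = {doc_id: tokens(text) for doc_id, text in docs.items()}
    for a, b in combinations(sorted(token_sets), 2):
        score = jaccard(token_sets[a], token_sets[b])
        if score >= threshold:
            conn.execute(
                "INSERT OR REPLACE INTO document_relation VALUES (?, ?, ?)",
                (a, b, score),
            )
    conn.commit()


def related_to(conn, doc_id):
    """Return (other_doc_id, score) rows for a document, best match first."""
    rows = conn.execute(
        "SELECT CASE WHEN doc_a = ? THEN doc_b ELSE doc_a END, score"
        " FROM document_relation WHERE doc_a = ? OR doc_b = ?"
        " ORDER BY score DESC",
        (doc_id, doc_id, doc_id),
    )
    return rows.fetchall()
```

A web application would call something like `related_to` to list related documents. Comparing every pair is quadratic in the number of documents, so a production design would likely need an index or embedding-based lookup instead.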

Methodology:

Complete a Build vs. Buy Assessment for each objective to ascertain whether a solution currently exists and determine the best approach for BoxOfDocs. Key considerations are:

cost of solution,

availability,

stability,

maintenance and support,

scalability and

ease of integration between components.
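To compare candidate solutions consistently across these considerations, a weighted scoring matrix is one option. The sketch below assumes each candidate is scored per criterion on a common scale (for example 1 to 5); the weights shown are placeholders chosen for illustration, not values from this brief.

```python
# Hypothetical relative weights for the six considerations listed above.
CRITERIA_WEIGHTS = {
    "cost": 3,
    "availability": 2,
    "stability": 2,
    "maintenance_support": 2,
    "scalability": 2,
    "integration": 2,
}


def weighted_score(scores, weights=CRITERIA_WEIGHTS):
    """Weighted average of per-criterion scores for one candidate solution."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[c] * scores[c] for c in weights) / total_weight
```

Scoring a "build" option and each "buy" option with the same function gives one comparable number per objective, alongside the qualitative assessment.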

For Objective 1, consideration should be given to tools such as Scrapy (scrapy.org), Python, and other similarly available web data extraction tools.
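As a baseline for judging such tools, the following sketch uses only the Python standard library to pull document links out of a page that has already been downloaded. The `DOCUMENT_EXTENSIONS` tuple and the names `DocumentLinkParser` and `extract_document_links` are assumptions for illustration; a dedicated extraction framework would add crawling, retries, and politeness controls on top of this.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed set of file types that count as "documents" for collection.
DOCUMENT_EXTENSIONS = (".pdf", ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx")


class DocumentLinkParser(HTMLParser):
    """Collects absolute URLs of <a href> links that point at document files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        absolute = urljoin(self.base_url, href)  # resolve relative links
        # Check only the path so query strings do not hide the extension.
        if urlparse(absolute).path.lower().endswith(DOCUMENT_EXTENSIONS):
            self.links.append(absolute)


def extract_document_links(html, base_url):
    """Return document URLs found in `html`, in page order."""
    parser = DocumentLinkParser(base_url)
    parser.feed(html)
    return parser.links
```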

When assessing existing AI tools, Amazon Comprehend, Amazon SageMaker, and other comparable tools must be explored.

The use of existing and/or open source solutions or modules shall be maximized, with custom integration as required.