Details

Description

This patch addresses a number of issues in the MailEntityProcessor contrib-extras module.

The changes are outlined here:

Added an 'includeContent' entity attribute to allow specifying content to be included independently of processing attachments
e.g. <entity includeContent="true" processAttachments="false" . . . /> would include message content, but not attachment content

Added a synonym called 'processAttachments', which is synonymous to the mis-spelled (and singular) 'processAttachement' property. This property functions the same as processAttachement. Default= 'true' - if either is false, then attachments are not processed. Note that only one of these should really be specified in a given <entity> tag.

Added a FLAGS.NONE value, so that if an email has no flags (i.e. it is unread, not deleted etc.), there is still a property value stored in the 'flags' field (the value is the string "none")
Note: there is a potential backward compat issue with FLAGS.NONE for clients that expect the absence of the 'flags' field to mean 'Not read'. I'm calculating this would be extremely rare, and is inadviasable in any case as user flags can be arbitrarily set, so fixing it up now will ensure future client access will be consistent.

The folder name of an email is now included as a field called 'folder' (e.g. folder=INBOX.Sent). This is quite handy in search/post-indexing processing

The addPartToDocument() method that processes attachments is significantly re-written, as there looked to be no real way the existing code would ever actually process attachment content and add it to the row data

Activity

This latest version of the updated MailEntityProcessor adds a few new features:

1. Incorporated SOLR-1958 (exception if fetchMailsSince isn't specified) into this patch
2. Added a hacky version of delta mail retrieval for scheduled import runs:
The new property is called 'deltaFetch'. If 'true', the first time the import is run, it will read the 'fetchMailsSince' property and import as normal
On subsequent runs (within the same process session), the import will only fetch mail since the last run.
Because it uses a runtime system property to hold the last_index_time, and there is currently no persistence, if/when the server is restarted, the last_index_time is not saved and the original fetchMailsSince value is used.
I couldn't find exposed APIs for the dataimport.properties file (all the methods are private or pkg protected), persistence is not included in this patch version
3. Added support for including shared folders in the import
4. Added support for including personal folders (other folders) in the import

Peter Sturge
added a comment - 23/Nov/10 10:54 - edited This latest version of the updated MailEntityProcessor adds a few new features:
1. Incorporated SOLR-1958 (exception if fetchMailsSince isn't specified) into this patch
2. Added a hacky version of delta mail retrieval for scheduled import runs:
The new property is called 'deltaFetch'. If 'true', the first time the import is run, it will read the 'fetchMailsSince' property and import as normal
On subsequent runs (within the same process session), the import will only fetch mail since the last run.
Because it uses a runtime system property to hold the last_index_time, and there is currently no persistence, if/when the server is restarted, the last_index_time is not saved and the original fetchMailsSince value is used.
I couldn't find exposed APIs for the dataimport.properties file (all the methods are private or pkg protected), persistence is not included in this patch version
3. Added support for including shared folders in the import
4. Added support for including personal folders (other folders) in the import
A typical <entity> element in data-config.xml might look something like this:
<entity name= "email"
user= "user@mydomain.com"
password= "userpwd"
host= "imap.mydomain.com"
fetchMailsSince= "2010-08-01 00:00:00"
deltaFetch= "true"
include=""
exclude=""
recurse= "false"
folders= "INBOX,Inbox,inbox"
includeContent= "true"
processAttachments= "true"
includeOtherUserFolders= "true"
includeSharedFolders= "true"
batchSize= "100"
processor= "MailEntityProcessor"
protocol= "imap" />

This patch update does a more proper delta-import implementation, rather than the kludge used in the previous version.
MailEntityProcessor with this patch is useful for importing emails 'en-masse' the first time 'round, then only new mails after that.

Behaviour:

If you send a full-import command, then the 'fetchMailsSince' property specified in data-config.xml will always be used.

If you send a delta-import command, the 'fetchMailsSince' property specified in data-config.xml is used for the first call only.
Subsequent delta-import commands will use the time since the last index update.

There are significant code changes in this version. So much so, that I've included the complete MailEntityProcessor source as well as a PATCH file.

This version doesn't use the persistent last_index_time functionality of dataimport.properties (i.e. it's delta only for the life of the solr process). If I get some free cycles, I'll try to put this in.

Peter Sturge
added a comment - 25/Nov/10 22:36 This patch update does a more proper delta-import implementation, rather than the kludge used in the previous version.
MailEntityProcessor with this patch is useful for importing emails 'en-masse' the first time 'round, then only new mails after that.
Behaviour:
If you send a full-import command, then the 'fetchMailsSince' property specified in data-config.xml will always be used.
If you send a delta-import command, the 'fetchMailsSince' property specified in data-config.xml is used for the first call only.
Subsequent delta-import commands will use the time since the last index update.
There are significant code changes in this version. So much so, that I've included the complete MailEntityProcessor source as well as a PATCH file.
This version doesn't use the persistent last_index_time functionality of dataimport.properties (i.e. it's delta only for the life of the solr process). If I get some free cycles, I'll try to put this in.

Peter Sturge
added a comment - 15/Feb/11 15:51 I've been meaning to get back to this, as I have made some local updates to this that help performance.
Could you give me some feedback on these 2 questions please - it would be really useful:
Is there a "committer's standard" or similar spec that describes what tests should be included, and if so, could you point me to it please?
I can then make sure I include appropriate tests
Is there a time-frame for committing for this or next release?
I have a product release of my own coming fup or beg-March, so if I know the time-scales, I can plan accordingly.
Thanks!
Peter

Hoss Man
added a comment - 21/Mar/12 18:08 Bulk of fixVersion=3.6 -> fixVersion=4.0 for issues that have no assignee and have not been updated recently.
email notification suppressed to prevent mass-spam
psuedo-unique token identifying these issues: hoss20120321nofix36

2) added a new dependency on the Sun Gmail Java mail extensions, which support true server-side filtering; the performance gains in processing large folders is significant, especially for delta processing. Currently, the only server-side gmail filter is the after: filter but more can be added as needed.

3) I also fixed an issue with the ClassLoader and Java Activation API where some messages were not being processed correctly unless the Thread's context class loader is the one that loaded the activation classes.

Timothy Potter
added a comment - 19/Mar/14 02:24 Updated patch built against trunk that builds upon Peter's work. Specifically, this patch does:
1) revived unit test that uses GreenMail as an embedded imap server during unit testing.
2) added a new dependency on the Sun Gmail Java mail extensions, which support true server-side filtering; the performance gains in processing large folders is significant, especially for delta processing. Currently, the only server-side gmail filter is the after: filter but more can be added as needed.
3) I also fixed an issue with the ClassLoader and Java Activation API where some messages were not being processed correctly unless the Thread's context class loader is the one that loaded the activation classes.

Here's an updated patch that's close to being ready for commit. However, I've changed a few things in the implementation but I believe it still meets the spirit of Peter's original work. Mainly, this patch removes support for the delta-import command and instead only does full-import with support for using the last_index_time from the previous run as the value for the fetchMailsSince filter.

The delta-import stuff is really for importing updates to existing rows and the MailEntityProcessor was sort of hijacking that behavior. More to the point, I couldn't get the DocBuilder#collectDelta code to work with the rows generated by the MailEntityProcessor#nextModifiedRowKey. Put simply, nextModifiedRowKey was returning new mails that occurred after the fetchMailsSince date filter and the DocBuilder was processing them like they were updates to pre-existing rows.

Thus, I felt is better to just support full-import and then have the code set the fetchMailsSince filter based on the last_index_time set by the DIH framework, which gets persisted in dataimport.properties. Of course if that property is not set, then the code falls back to fetchMailsSince from the config.

Timothy Potter
added a comment - 30/Jun/14 22:44 Here's an updated patch that's close to being ready for commit. However, I've changed a few things in the implementation but I believe it still meets the spirit of Peter's original work. Mainly, this patch removes support for the delta-import command and instead only does full-import with support for using the last_index_time from the previous run as the value for the fetchMailsSince filter.
The delta-import stuff is really for importing updates to existing rows and the MailEntityProcessor was sort of hijacking that behavior. More to the point, I couldn't get the DocBuilder#collectDelta code to work with the rows generated by the MailEntityProcessor#nextModifiedRowKey. Put simply, nextModifiedRowKey was returning new mails that occurred after the fetchMailsSince date filter and the DocBuilder was processing them like they were updates to pre-existing rows.
Thus, I felt is better to just support full-import and then have the code set the fetchMailsSince filter based on the last_index_time set by the DIH framework, which gets persisted in dataimport.properties. Of course if that property is not set, then the code falls back to fetchMailsSince from the config.

Tim, I have reverted your commit because the licensing terms for greenmail aren't clear. Their website says ASL 2.0 but I peeked into some of their source files and all of them have a header saying that they are licensed according to LGPL. This is a red flag and we need to tread carefully. There are plenty of ASL projects using greenmail and maybe I am just being paranoid but after consulting with Steve Rowe, I thought it safer to just revert the commit and get more clarity on the licensing issue.

Shalin Shekhar Mangar
added a comment - 02/Jul/14 22:06 Tim, I have reverted your commit because the licensing terms for greenmail aren't clear. Their website says ASL 2.0 but I peeked into some of their source files and all of them have a header saying that they are licensed according to LGPL. This is a red flag and we need to tread carefully. There are plenty of ASL projects using greenmail and maybe I am just being paranoid but after consulting with Steve Rowe, I thought it safer to just revert the commit and get more clarity on the licensing issue.
Example:
http://grepcode.com/file/repo1.maven.org/maven2/com.icegreen/greenmail/1.3.1b/com/icegreen/greenmail/store/MailMessageAttributes.java

Steve Rowe
added a comment - 02/Jul/14 22:28 Redhat apparently had the same licensing realization as you, Shalin, though from a different perspective (AFAIK they aren't bound by ASL compatibility):
the bug they filed with the greenmail project (no response): https://sourceforge.net/p/greenmail/bugs/8/
the Redhat bugzilla issue: https://bugzilla.redhat.com/show_bug.cgi?id=1059805
the "Fedora hosted" tracker issue: https://fedorahosted.org/fpc/ticket/392
The third link has lots of info, though mostly related to issues around whether it was okay to accept an effective partial fork of Apache James.

from what i can tell, even files copied verbatim from Apache James, w/o any modifications, have the exact same LGPL copyright header - which smells like a straight up IDE generated header mistake to me.

Hoss Man
added a comment - 02/Jul/14 23:45 FWIW:
everything but the source headers seems to agree that greenmail is ASL licensed...
SF.net license label added by project maintainer: http://sourceforge.net/projects/greenmail/ -> http://sourceforge.net/directory/license:apache2/
project website: http://www.icegreen.com/greenmail/ -> http://www.apache.org/licenses/LICENSE-2.0.txt
maven pom.xml: http://sourceforge.net/p/greenmail/code/HEAD/tree/trunk/pom.xml#l18
license.txt shpped with project: http://sourceforge.net/p/greenmail/code/HEAD/tree/trunk/license.txt
from what i can tell, even files copied verbatim from Apache James, w/o any modifications, have the exact same LGPL copyright header - which smells like a straight up IDE generated header mistake to me.
there is a feedback page on the icegreen.com domain that links to contact details on another site – i suppose you could try reaching out to the dev that way: http://www.icegreen.com/greenmail/feedback.html -> http://waelchatila.com/pages/consulting.html
greenmail appears to be the epitome of a completely dead project – any resolution of this issue that involves on waiting for developer response / action is probably a bad idea...
Latest code commit: 2010-06-03 http://sourceforge.net/p/greenmail/code/HEAD/tree/
issue tracker contains 8 bug reports going back to 2009-03-15, none of which have ever recieved a comment from any project developer: http://sourceforge.net/p/greenmail/bugs/
most recent mailing list postings from project dev:
latest reply to user list from a dev: Jan 2010 - http://sourceforge.net/p/greenmail/mailman/message/24321407/
latest release announcement list message: Dec 2007 - http://sourceforge.net/p/greenmail/mailman/greenmail-announcement/
project website has a "blog" link that redirects to another domain (same as contact details) where most recent "greenmail" blog is 1.3 release announcement from 2007: http://www.icegreen.com/articles -> http://waelchatila.com/ -> http://waelchatila.com/tags/greenmail/

Thanks for digging into this guys! I have no affinity towards GreenMail, other than it was very easy to use from JUnit. Is CDDL 1.0 suitable? If so, this MockJavaMail project looks promising too: https://java.net/projects/mock-javamail

Timothy Potter
added a comment - 03/Jul/14 13:55 Thanks for digging into this guys! I have no affinity towards GreenMail, other than it was very easy to use from JUnit. Is CDDL 1.0 suitable? If so, this MockJavaMail project looks promising too: https://java.net/projects/mock-javamail

It's suitable, but in the second tier in terms of desirability. Really only matters if license is the deciding factor for some reason though. It's suitable if it's correctly handled in NOTICES and such.

Mark Miller
added a comment - 03/Jul/14 14:01 Is CDDL 1.0 suitable?
It's suitable, but in the second tier in terms of desirability. Really only matters if license is the deciding factor for some reason though. It's suitable if it's correctly handled in NOTICES and such.

CDDL 1.0 is fine. The limitation here is that it must be listed correctly in the NOTICE.txt file. We have other JAR files with this license, the most common one is servlet-api.jar, but there are also others. See the NOTICE.txt:

JavaMail API 1.4.1: https://glassfish.dev.java.net/javaee5/mail/
License: Common Development and Distribution License (CDDL) v1.0 (https://glassfish.dev.java.net/public/CDDLv1.0.html)
JavaBeans Activation Framework (JAF): http://java.sun.com/products/javabeans/jaf/index.jsp
License: Common Development and Distribution License (CDDL) v1.0 (https://glassfish.dev.java.net/public/CDDLv1.0.html)
Jersey Core: https://jersey.java.net/
License: Common Development and Distribution License (CDDL) v1.0 (https://glassfish.dev.java.net/public/CDDLv1.0.html)
Servlet-api.jar and javax.servlet-*.jar are under the CDDL license, the original
source code for this can be found at http://www.eclipse.org/jetty/downloads.php

Uwe Schindler
added a comment - 03/Jul/14 14:07 - edited Is CDDL 1.0 suitable?
CDDL 1.0 is fine. The limitation here is that it must be listed correctly in the NOTICE.txt file. We have other JAR files with this license, the most common one is servlet-api.jar, but there are also others. See the NOTICE.txt:
JavaMail API 1.4.1: https://glassfish.dev.java.net/javaee5/mail/
License: Common Development and Distribution License (CDDL) v1.0 (https://glassfish.dev.java.net/public/CDDLv1.0.html)
JavaBeans Activation Framework (JAF): http://java.sun.com/products/javabeans/jaf/index.jsp
License: Common Development and Distribution License (CDDL) v1.0 (https://glassfish.dev.java.net/public/CDDLv1.0.html)
Jersey Core: https://jersey.java.net/
License: Common Development and Distribution License (CDDL) v1.0 (https://glassfish.dev.java.net/public/CDDLv1.0.html)
Servlet-api.jar and javax.servlet-*.jar are under the CDDL license, the original
source code for this can be found at http://www.eclipse.org/jetty/downloads.php
See also: http://www.apache.org/legal/3party.html (Category B: Reciprocal Licenses)

Here's a patch for trunk that removes the use of GreenMail, so unfortunately, at this point we're no better off from a unit testing perspective.

I tried the aforementioned mock JavaMail stuff and it is too forgiving / assumes too much to be useful for testing a lot of the MailEntityProcessor logic.

Also, I found the hupa mock sub-project in Apache James and that looked promising, see: http://james.apache.org/hupa/apidocs/org/apache/hupa/server/mock/package-summary.html. However, after spending several hours trying to get that to work, I had quite a mess trying to work around issues. For instance, the MailEntityProcessor depends on imapFolder.getType to return a valid value, but the hupa MockIMAPFolder only returns 0 (just throws the given type during create away). GreenMail was very clean and allowed for testing all functionality needed by MailEntityProcessor, hupa seemed buggy / half implemented.

At this point, I have bigger issues to concentrate on. Let me know if we think we want to commit this patch as-is with @Ignores in the unit test. The improvements to the MailEntityProcessor would be nice to have.

Timothy Potter
added a comment - 07/Jul/14 22:12 Here's a patch for trunk that removes the use of GreenMail, so unfortunately, at this point we're no better off from a unit testing perspective.
I tried the aforementioned mock JavaMail stuff and it is too forgiving / assumes too much to be useful for testing a lot of the MailEntityProcessor logic.
Also, I found the hupa mock sub-project in Apache James and that looked promising, see: http://james.apache.org/hupa/apidocs/org/apache/hupa/server/mock/package-summary.html . However, after spending several hours trying to get that to work, I had quite a mess trying to work around issues. For instance, the MailEntityProcessor depends on imapFolder.getType to return a valid value, but the hupa MockIMAPFolder only returns 0 (just throws the given type during create away). GreenMail was very clean and allowed for testing all functionality needed by MailEntityProcessor, hupa seemed buggy / half implemented.
At this point, I have bigger issues to concentrate on. Let me know if we think we want to commit this patch as-is with @Ignores in the unit test. The improvements to the MailEntityProcessor would be nice to have.

At this point, I have bigger issues to concentrate on. Let me know if we think we want to commit this patch as-is with @Ignores in the unit test. The improvements to the MailEntityProcessor would be nice to have.

+1 to commit. I have looked through the patch and these are nice improvements. Let's commit this and open another issue to add a test.

Shalin Shekhar Mangar
added a comment - 13/Jul/14 06:55 At this point, I have bigger issues to concentrate on. Let me know if we think we want to commit this patch as-is with @Ignores in the unit test. The improvements to the MailEntityProcessor would be nice to have.
+1 to commit. I have looked through the patch and these are nice improvements. Let's commit this and open another issue to add a test.

Steve Rowe
added a comment - 14/Jul/14 19:49 Let's commit this and open another issue to add a test.
+1
Alternatives to greenmail mentioned on the dev mailing list thread I started with the five ASF projects mentioned on LEGAL-206 :
javamail-mock2: https://github.com/salyh/javamail-mock2
SubEtha SMTP: https://code.google.com/p/subethasmtp/

Timothy Potter
added a comment - 14/Jul/14 19:59 javamail-mock2 actually looks very promising! I'll kick the tires on it tomorrow AM and regardless of outcome, will commit to trunk (with or without unit test)

jmlucjav
added a comment - 14/Jul/14 21:04
SubEtha SMTP: https://code.google.com/p/subethasmtp/
SubEthaSmtp works really well, but, I guess here it is of no use, as he needs an Imap server, and subetha only does smtp, right?