Plain text for fields

While trying to close what is, IIRC, one of the oldest open issues on the xapian module, "Index uploaded files", I ended up figuring out that it would be useful to have a way to generate a plain text representation of any field.

So I have started a sandbox project, Plain, that has enough code to produce a plain text version of file fields.

I have seen other modules that do this kind of conversion, but they are either not available for D7 (PDF To Text, HTML2Text) or take a completely different approach (File Framework), so I started one from scratch.

It would be great if someone could point me to any other efforts I may have missed, or comment on this one.

This could help search engines index content better. For example, on xapian I would only need to implement hook_node_update_index() and call this module from there in a single line.

Comments

Cool initiative! I am actually working on something similar called the Converter module, which has a side effect of being able to convert various documents to plain text for search indexing. My overall goal is to provide a solution that is backend agnostic so that core Search, Apache Solr, Search API, etc can use it. In addition, it can convert documents to various formats in an effort to compete with platforms such as SharePoint. I would love to collaborate if you are interested.

My overall goal is to provide a solution that is backend agnostic so that core Search, Apache Solr, Search API, etc can use it

That was exactly my motivation for starting the Plain project instead of embedding that code in the xapian module.

I would like to know your opinion on this:

Converter module description states:

This module provides an API for converting files to and from various formats. The method of conversion is pluggable, so different backends such as unoconv, Apache Tika, and others can be used with a consistent interface.

That means it is only about files.

Plain module description states:

This module provides a way to represent Drupal information (fields) as plain text.
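
To make the difference in scope concrete, here is a minimal sketch of the idea, written in Python rather than Drupal's PHP purely for illustration. It is not the API of Plain or of Converter: every name in it (field_to_plain, file_to_plain, html_to_plain, the converters mapping) is hypothetical. The assumption is that each field type gets its own plain text formatter, and that only file fields need a pluggable file converter behind them.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_plain(markup):
    """Strip tags and collapse whitespace (illustrative only)."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def file_to_plain(field_value, converters):
    """Delegate to a pluggable file converter keyed by MIME type."""
    convert = converters.get(field_value["mime"])
    return convert(field_value["data"]) if convert else ""


def field_to_plain(field_type, value, converters):
    """Return a plain text representation of a single field value."""
    if field_type == "text":
        return value
    if field_type == "html":
        return html_to_plain(value)
    if field_type == "file":
        return file_to_plain(value, converters)
    return ""  # unknown field types contribute nothing to the index


# Hypothetical converter backend; a real one might call an external tool.
converters = {"text/plain": lambda data: data.decode("utf-8")}

fields = [
    ("text", "Release notes"),
    ("html", "<p>Hello <b>world</b></p>"),
    ("file", {"mime": "text/plain", "data": b"attached file body"}),
]
plain = "\n".join(field_to_plain(t, v, converters) for t, v in fields)
print(plain)
```

In this sketch the file converter is just one of several formatters, which is how I picture the two modules fitting together: a file conversion backend could be plugged in for file fields while the other field types are handled separately.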

Awesome! Look forward to speaking more about this. Although Converter is geared towards files, its underlying parsers do accept raw input, so you can even convert HTML to plain text, an MS Word document, etc. I don't think I have coded anything to allow that, though.

I will make sure I download the Plain module and try it out, look under the hood, etc.