It is possible by ingenuity and at the expense of clarity... [to do almost anything in any language]. However, the fact that it is possible to push a pea up a mountain with your nose does not mean that this is a sensible way of getting it there. - Christopher Strachey (NATO Summer School in Programming)

Well, it may be interesting if different syntaxes are supported, as in crawlers/mashups/web services for example, where one can of course use regexps, but also wildcard expressions. I do not know if I have already mentioned these tools:

- For regexps: "Regex Coach" (http://www.weitz.de/regex-coach/), made in Germany, it rocks!! And especially for Steffen: it is written in Lisp! It could be integrated into RM as a standalone tool, a bit like the "ANOVA calculator"...
- For large-scale analysis: Hadoop (http://hadoop.apache.org/core/). I have read that this project is looking into machine learning environments: "pig" (http://incubator.apache.org/pig/) and "mahout" (http://lucene.apache.org/mahout/). It may be a licensing issue...
- For efficient crawling: Nutch (http://lucene.apache.org/nutch/). Fetches, links and contents are stored separately (thus allowing link analysis and web mining). Massive crawls are split into segments; the architecture relies on Hadoop, and there is a "Nutch scripting language" that could be used to specify a particular crawl from within RM. I switched to that tool after using Websphinx, which was much too slow!! It uses the JDK and Tomcat, but be careful to use compatible versions...
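As a side note, the wildcard expressions mentioned above can be reduced to regexps mechanically. Here is a minimal Java sketch of such a translation; the rules (`*` matches any run of characters, `?` any single character) are assumptions based on the usual conventions, not on Websphinx's actual semantics:

```java
// Hedged sketch: translate a wildcard expression into an equivalent
// java.util.regex pattern. The "*"/"?" rules are assumptions, not
// Websphinx's actual implementation.
import java.util.regex.Pattern;

public class WildcardToRegex {

    public static String toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            switch (c) {
                case '*': sb.append(".*"); break;  // any run of characters
                case '?': sb.append('.');  break;  // any single character
                default:  sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile(toRegex("*.html"));
        System.out.println(p.matcher("index.html").matches()); // true
        System.out.println(p.matcher("index.txt").matches());  // false
    }
}
```

This is the sense in which wildcards are just a restricted, friendlier syntax over the regexp grammar.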

The last two tools are written in Java. I wonder if Webharvest can be used together with Nutch; it could give many different combinations for many types of crawls (scrapbooking, search & index, subject crawling, dead-link analysis, sitemap design, etc.).

If you are thinking of implementing a scripting language, you may as well implement a meta scripting language. Then you can have Lisp or regexps or whatever scripting language of the day for very little additional effort.

Cordially,

-Digital Dude-

"We have very few inferior people in the world. We have lots of inferior environments. Try to enrich your environment." -Frank Lloyd Wright-

I definitely recognize the value of OMeta; it is exactly the path I am taking at the moment: practicing pattern matching to understand formal grammars. The point is that I do not know all the scripting languages synthesized with OMeta, but judging from other posts with Steffen, is there a choice to be made between genericity, "visual simplicity" and the runtime cost of a scripting language?

By "visual simplicity" I mean that RapidMiner suggests a way to "work and see", which is its main value, along with a data format and data management. Is it aimed at matrix or statistics computing? I mean, in that case, are there "ready to use" grammar files for matching "R" scripts, for instance?

Genericity is necessary for the preprocessing phases, to cope with data heterogeneity; there, the "meta-matcher" would no doubt be wonderful, as a Swiss-army knife. But referring to the current poll, the feature desired for the next release is "speed in computing". What is to be said about a "formal grammar-based" and "object-oriented" scripting language? Is there any benchmark on this aspect?

I know that the FreeMind open source project has a scripting feature; the point is that they have stability/file security/speed issues... And users who participate in writing new script files are not that frequent!

My suggestion about Hadoop or Nutch was a subtle answer to other posts, where a user complained that the crawler was a bit too tedious and too slow, which I can confirm. Since the RM team is preparing a "chain analysis" plugin, the "linkDB" of Nutch should be very interesting. The MapReduce CPU cluster management and the specific data storage format of Hadoop seemed interesting to me for the "computing speed/load" requirement...

Two more operators related to OMeta (in category "/meta/"):

- MashupGrammars: takes a list of grammars to import and a set of OMeta rules to mix them (see the previous post for the link to moserware's blog). For instance, mixing the regexp and XPath grammars, or regexps and wildcard expressions, etc.
- SpecializeGrammar: here is the "object-oriented" flavour of OMeta, for instance a regexp grammar restricted to matching only numbers and rejecting letters, or the other way round.

If such operators were to be written, examples would probably be needed, as well as coverage in the tutorial...?

In any case, I am convinced that the same formula is working tremendously well: visualize what you have just written to get a 'closed loop' for verifying and tuning the lines of code. In RM, you can switch from code to results; in Regex Coach, you can match a pattern against a sample while visualizing the grammar tree, all on the fly. In operators such as those above, a "show preview" button, and perhaps a wizard, would be needed, wouldn't it?

Of course, from the developer's point of view, this may require creating another type of object (like ExampleSet or ClusterModel) which would be "SyntaxScript", in which a specialized or mashed-up grammar is named, stored or loaded. Then, for each RapidMiner operator using regexps, there should be a parameter "load grammar file".
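To make the idea concrete, here is a minimal Java sketch of what such a "SyntaxScript" object might look like; the interface and class names are pure assumptions on my part, not existing RapidMiner API:

```java
// Hedged sketch: a hypothetical "SyntaxScript" IOObject that names, stores
// and applies a grammar. Interface and class names are assumptions, not
// RapidMiner API.
import java.util.regex.Pattern;

interface SyntaxScript {
    String getGrammarName();        // e.g. "regexp/numbers-only"
    boolean matches(String input);  // apply the stored grammar
}

// A specialized grammar in the SpecializeGrammar spirit: a regexp
// restricted to numbers, rejecting anything containing letters.
class NumbersOnlyScript implements SyntaxScript {
    private static final Pattern NUMBERS = Pattern.compile("\\d+");
    public String getGrammarName() { return "regexp/numbers-only"; }
    public boolean matches(String input) { return NUMBERS.matcher(input).matches(); }
}

public class SyntaxScriptDemo {
    public static void main(String[] args) {
        SyntaxScript script = new NumbersOnlyScript();
        System.out.println(script.matches("12345")); // true
        System.out.println(script.matches("12a45")); // false
    }
}
```

An operator with a "load grammar file" parameter would then just hand the input strings to whichever SyntaxScript was loaded.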

Thus, the three operators suggested so far in this thread, MetaScripting, MashupGrammars and SpecializeGrammar, should live in /meta/ (why not in "/meta/IO/"?) and be used exclusively for pattern matching, i.e. either web scraping or general data preprocessing, but neither for matrix computations nor for "branch programming"... Said differently, it should not be OCaml, for instance, but rather a powerful set of I/O and preprocessing operators...

What do you think of that? Tell me if it is desperately useless. Jean-Charles.

Hi all, interesting topic. Metalanguages... fascinating. But where is the connection to data mining? I have the slight feeling we are losing sight of the original target... I'm already curious about your next ideas.

As written in the wiki for "Regular Expressions", these are the two use cases for such grammars as regexp or XPath. But there are other pattern matching grammars: XQuery in Webharvest, wildcard expressions in the Websphinx crawler, etc. These grammars are not equivalent: regexps are complicated but powerful and generic; XPath lets you walk along logical trees and, in my view, handles multiple matches better than regexps; and so on. Hence the idea of customizing "pattern matching" grammars at the two "connection points" above, to fit specific I/O and preprocessing issues.
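To illustrate the multiple-match point, here is a small Java sketch using the standard javax.xml.xpath API: XPath returns every matching node of the logical tree in a single evaluation, instead of looping a regexp over raw markup. The sample XML is invented for the demonstration:

```java
// Sketch: XPath walks the document's logical tree and collects all matches
// in one evaluation. The sample XML below is made up.
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XPathMultiMatch {

    public static List<String> extractAll(String xml, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < hits.getLength(); i++) {
            out.add(hits.item(i).getTextContent());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<links><a>one</a><a>two</a><a>three</a></links>";
        System.out.println(extractAll(xml, "//a")); // [one, two, three]
    }
}
```

Doing the same with a regexp means hand-writing a loop over `Matcher.find()` and hoping the markup is regular enough.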

Thus this type of scripting language would not be dedicated to the core data mining computations, as expected of "R" for instance. I was talking about Regex Coach being written in Lisp, but this may be a bit confusing, since there is no Lisp to write at all; the focus is just on the regexps to be designed.

For AttributeSubsetProcessing, there should be a "show preview" button next to the "regexp" field. Whenever there are zillions of attributes to filter, such pattern matching functions may become critical and useful...
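Here is a minimal Java sketch of what such a preview could compute, assuming the filter is applied to attribute names before the process is run; the attribute names below are invented:

```java
// Hedged sketch: a "show preview" for a regexp attribute filter, applied
// to the attribute names before the process actually runs.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class RegexPreview {

    // Return the attribute names that the regexp would keep.
    public static List<String> preview(List<String> attributeNames, String regex) {
        Pattern p = Pattern.compile(regex);
        List<String> matched = new ArrayList<>();
        for (String name : attributeNames) {
            if (p.matcher(name).matches()) {
                matched.add(name);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<String> names = List.of("sensor_01", "sensor_02", "label", "id");
        System.out.println(preview(names, "sensor_\\d+")); // [sensor_01, sensor_02]
    }
}
```

With zillions of attributes, seeing this list before running the process is exactly the 'closed loop' discussed earlier.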

Hello guys, I think that a useful feature would be to integrate the RapidMiner API with Maven, like Weka is integrated: http://wwmm.ch.cam.ac.uk/maven2/weka/ I know it is a transversal feature, but nowadays Maven is a standard in J2EE projects.

Hi, I'm quite unfamiliar with Maven, and the link does not describe what I can do with it at all. As far as I know, Maven is a build tool like Ant. So I'm a bit confused: what would I gain by integrating a data mining API into a build tool?

I want to add: it eases dependency management (in my view the most important feature); the build itself can be performed with Maven or with Ant (there is, as far as I know, also an Ant plugin for Maven).
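For illustration, declaring RapidMiner as a dependency would then look like the snippet below; note that the groupId/artifactId/version coordinates are hypothetical, since RapidMiner is not published to a Maven repository today:

```xml
<!-- Hypothetical coordinates; RapidMiner is not currently published to a
     Maven repository, so groupId/artifactId/version are made up. -->
<dependency>
    <groupId>com.rapidminer</groupId>
    <artifactId>rapidminer</artifactId>
    <version>4.4</version>
</dependency>
```

With that in a project's pom.xml, Maven fetches the jar and its declared dependencies automatically, which is the dependency-management gain mentioned above.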

If you have 1-2 hours, I recommend the first two chapters of Better Builds with Maven. The first explains in detail what Maven is for; the second contains sufficient information for daily usage.

I've skimmed the thread and didn't see my issue raised. I am concerned about using the update process to search for plug-ins as it has some security implications for my company.

I was prepared to talk about it in some detail when I realized that a post that I made on one of your competitors' sites concerning forced registration covered all of the same issues. With that in mind, here is the relevant text from the two posts:

Quote

I am a senior IT architect for a large financial services company. We have roughly 65,000 employees. My conservative estimate is that this tool looks like it could be quite useful for 5% to 10% of our staff.

Unfortunately, I cannot recommend this kind of tool if it requires external authentication to run. Further, I cannot recommend such a tool for use here if it requires the establishment of a connection through our firewalls. The security folks would have my head on a platter, and rightly so.

If I cannot run your software without registering, can someone please explain the reasoning behind requiring registration?

Quote

I am not willing to publish the name of my company on a forum visible to all. If you're interested in contacting me for more information, you have my email address from my registration information.

However, I will note that the financial services company that I work for is based in the U.S. and thus falls under the regulatory and auditing scrutiny of a whole host of Federal government and other agencies. Off the top of my head, I can think of:

OCC
SEC
FINRA
FDIC
Federal Reserve Board
PCI
BSA

Each one of these organizations (and several others!) sends its auditors crawling through our financial records, computer systems, and business practices every single year. Last year's financial meltdown has motivated these auditors to become far more aggressive in how thoroughly they scrutinize everything that we do. (Rightfully so, in my opinion.)

Now that I've explained the regulatory environment that we face, let me address why registration is such an issue. With respect, requiring online registration of individual copies of software for simple use places it outside the realm of software that we can use because it violates our security policies. Not even Microsoft is allowed to sell us software that "phones home."

The reason behind this statement in our security policies is quite simple. There is no feasible, cost effective way for us to determine which connections are supplying just registration information and which ones are supplying far more detail about our computing environment. This information is regarded as confidential because knowing someone's hardware and software mix can be leveraged to reduce the time necessary to hack targeted systems.

Worse, the existence of such a connection could theoretically be used to delve into any information that may be stored locally. Since the services that we provide require that we have a deep and intimate view of our customers' confidential financial information, we are ethically, morally, and legally required to make every effort to avoid even the merest suggestion of a possible leak.

There is no way that we could implement software with this kind of "phone home" requirement without drawing the ire of an auditor from one of the regulatory agencies. That is what makes your software, regardless of how attractive I personally think it might be for my company, off limits for us and every other financial services company in the U.S.

I know that the health care industry faces even more scrutiny than financial services does, so my guess is that they have similar requirements for protecting confidential patient information. That locks you out of two very large pools of potential customers for your services.

With all that said, I think that an optional registration similar to that used by OpenOffice would probably pass muster.

If you wish to discuss this off line, please don't hesitate to contact me via email. I am willing to continue the debate here if you would prefer. Frankly, I think this conversation is a healthy one and should stay public as long as we can keep this relatively anonymous.

=== end quotes ===

The need to control our computing environment in order to meet our regulatory obligations requires us to maintain full control over what is deployed on our end users' PCs. Our IT department must be the sole source of all software management. We must be able to deploy software on our schedule /and/ be able to roll back if and when we choose to.

I can tell you that I have personal knowledge of two vendors that were rejected this year because they refused to give us this capability.

However, I recognize that plug-ins are a great way to introduce new functionality for a minimal cost. The good news is that in general we make a distinction between plug-ins that provide that additional functionality and the primary software executables. This is especially true when we can mitigate the risks in one or more ways.

The first would be for us to provide a "gold" repository of plug-ins that have been vetted by our Information Security department. This is by far our preferred method. Is there a simple configuration file change that would allow us to force our users to go to such a repository rather than back to yours?
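For illustration, such a change could be as small as one line in a configuration file; the property name below is a pure assumption on my part, not a documented RapidMiner setting:

```
# Hypothetical setting: point the update/plug-in mechanism at an internal,
# vetted "gold" mirror instead of the vendor's repository. The property
# name is an assumption, not a documented RapidMiner option.
rapidminer.update.url = https://gold-repo.example.internal/plugins
```

If the update URL could be locked down this way (and the setting made read-only for end users), the vetting burden would fall entirely on our side, which is where we want it.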

The second would be for us to put your company through a security and financial audit to verify that the security around your repository is such that the potential for malware to creep in is minimal. That's a path that I'm reluctant to take because as you might imagine, it can be time consuming and therefore expensive to complete.

The third path is not one that I think is of much value to our end users. That would be to block access to your plug-in repository and not allow any to be installed. I have a sneaking suspicion that at best, such a situation would create some unhappiness and friction between our IT department and the people actually using your software on a day to day basis.