I must say that I haven't seen any of those problems, but then I'm not running a business mail server. I've left my zmtrainsa script at the initial install frequency of once per day; I would have thought that five-minute intervals would cripple a system with a lot of users who get a ton of spam. Did you use the SpamAssassin rules in rspamd or just its default configuration? I haven't seen any DNS problems on my server despite the high number of lookups it's doing; I assume you use a caching name server on your network, or is this just using the ZCS dnscache? I haven't seen anything about IPv6 in any of my log files, and I'm just about to publish the IPv6 record for the ZCS server in DNS, so I'll keep an eye open for any problems on that front.

Something isn't right on my test server, so I need to dive deeper into the training script to begin to understand why. I followed your script verbatim and everything appears to be working out of the box. My configuration is vanilla with the exception of adjusting the scores required for spam and reject, so I would think our configurations are identical given that I followed your instructions, including not messing with the defaults. I did upgrade the software to the latest version on CentOS 6 given the comments by vstakhov about "bugfixes especially on the tokenization", and that did make a difference in our experience. As an aside, I am not a fan of rejecting spam back to the sender, which I think is the default action. That isn't a problem for a single zimbra/rspamd installation, but if you have front-ends that didn't block it, or you want to deliver it to another system to score later, then you are bouncing it back to yourself for re-delivery. So my belief is that in production we would run this on the front-ends and not on the Zimbra mailbox server. For testing it works, because I just want to learn to trust the software and its decision process. I like the ideas in rspamd but believe HTML5 obfuscation will eventually defeat both rspamd and SA. Our view is that rspamd feels like a super grep, so the rules are very simplistic at present. In fact, I don't even know how to get rule/score updates without updating the software itself. Hmm... that realization says I am not very far along. LOL

This might be the best explanation: Here is the entire message in the body.

But I digress. There is no reason that rspamd couldn't work better once we customize some of the rules for our needs.

Our load factor appears to be stable at 0.0 or 0.1 for the past few hours. Because it was not a problem for the first 4+ weeks, I will eventually need to track down the root cause, but we are in no rush to production here. Bind/named is running locally on the Zimbra server, so no, I am not running the Zimbra DNS stuff; I have my resolv.conf pointing to 127.0.0.1, and I haven't looked to see if rspamd even uses it. Changing the socket pool size doesn't appear to be involved here. I probably would have missed the high CPU over two hours, but got lucky that the test environment has some threshold reporting for this server. It doesn't happen very often, but the fact that it can happen means I need to understand why.

Note: this guess about bayes training has NOT been confirmed as the root cause of why rspamd shows Running and 99% CPU. That it has happened a few times (3-4) over the past 7-8 days might be an important clue. Normally, rspamd stays in a sleeping state, which is what I expect on such an idle machine.

Another idea: we haven't cleaned the junk folder. We observed our users not doing that in production, so we have left every message on the zimbra+rspamd instance. Is it safe to assume that this behavior, combined with the Zimbra training script, doesn't cause a slowdown over time?

Well, it is quite common that if you do strange things you get strange results. First of all, Rspamd is not SA and you should not expect it to work like SA. Why do you need to relearn the existing messages so often? It is trivial to check mtime and skip files that haven't been updated with a simple `find` pipeline.
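The mtime check mentioned here can be sketched as a small shell pipeline. The paths and the stand-in learn command below are illustrative assumptions, not anyone's actual script; a real setup would call `rspamc learn_spam` and point at the real spool:

```shell
#!/bin/sh
# Train only messages that changed since the last run, then move the
# stamp file forward. Paths are examples, not Zimbra's real locations.
STAMP=/tmp/last_learn.stamp
SPOOL=/tmp/spam_spool
LEARN="echo would-learn"            # stand-in for: rspamc learn_spam

mkdir -p "$SPOOL"
touch -d '2 hours ago' "$STAMP"     # pretend the last run was 2h ago
echo "sample spam" > "$SPOOL/msg1"  # simulate a newly delivered message

# Feed only files newer than the stamp to the learner ...
find "$SPOOL" -type f -newer "$STAMP" -print0 | xargs -0 -r -n1 $LEARN
# ... and record this run so the next run skips everything learned now.
touch "$STAMP"
```

Each cron run then touches only what arrived since the previous run, instead of relearning the whole folder every time.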

Secondly, if you use the default sqlite3 backend for statistics, then I have bad news: it is terrible and it won't ever be fixed. The only reason it is still the default is that I'm trying to be conservative with the default settings. For all new setups Redis is *strongly* recommended (and it is mentioned in the FAQ). As of the upcoming Rspamd 1.7, Redis will be the default backend and sqlite will eventually be deprecated.
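For anyone following along, switching the bayes statistics to Redis is a small local override; a minimal sketch, assuming a Redis instance on localhost (adjust the address for your setup):

```
# local.d/classifier-bayes.conf
backend = "redis";
servers = "127.0.0.1:6379";
```

Existing sqlite statistics are not picked up automatically after the switch; rspamd ships a statconvert helper under `rspamadm` for migrating them, if I recall correctly.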

Rspamd's HTML parser is NOT an advanced grep as you have claimed: it is a heuristic HTML *parser*, meaning that it is aware of HTML semantics, tags, encodings and so on and so forth. However, it does indeed lack CSS support, and I have some samples from the wild where CSS tricks were used to poison statistical methods. In the future, I plan to implement some sort of CSS parsing to stop that.

Finally, in your comparison of the message scan results, the only meaningful rule is BLACKLISTED_COUNTRY, which definitely involves some custom configuration. This is also possible to do with Rspamd via the multimap module, which can blacklist countries, specific ASNs and so on, with dynamic maps support and other features. I understand that this knowledge is a bit hidden inside the Rspamd documentation, but this could be improved in the future (and I would appreciate any help with this task).
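As a sketch of what such a multimap rule could look like (the symbol name, score, and map path here are illustrative assumptions, not an official recipe):

```
# local.d/multimap.conf
BLACKLISTED_COUNTRY {
  type = "country";
  map = "/etc/rspamd/local.d/blacklist_countries.map";
  score = 5.0;
  description = "Sender IP geo-locates to a blacklisted country";
}
```

The map file would contain one country code per line; the module also accepts an `asn` type for the ASN case mentioned above.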

First, thank you for your reply. We are following Bill's configuration at present because this is a Zimbra forum, and we haven't deviated from those instructions. We call a training script via cron, as is the Zimbra practice. I haven't looked at it closely enough to know whether it only submits new messages or resubmits everything; my guess and hope is that it submits only new training data, given users' practice of not emptying their junk folders. I believe in Zimbra we move it to a special spam account which is trained against to get around this problem. We have not claimed the script to be the root cause, but it uncovered an edge case that we needed to understand for production services. I also saw an outgoing DNS lookup spike via IPv6 during the same time window, which I am still looking into.

vstakhov wrote:Rspamd's HTML parser is NOT an advanced grep as you have claimed: it is a heuristic HTML *parser*, meaning that it is aware of HTML semantics, tags, encodings and so on and so forth. However, it does indeed lack CSS support, and I have some samples from the wild where CSS tricks were used to poison statistical methods. In the future, I plan to implement some sort of CSS parsing to stop that.

My concern is also with HTML obfuscation, and CSS is only part of the problem... but I would be more interested in your thoughts on HTML5, since any tag can have all the attributes, i.e. fonts, colors, etc. It's complicated to determine inheritance: if you don't build a proper parse tree, then how do you know which objects inherited which attributes? SA only handles this a little, so we have been updating the code base to understand this better. Note: we are not convinced a full HTML5 parser is the solution, given the performance implications, but we have seen some really difficult targeted business email, and IP reputation isn't helping as much as we would like. I am glad to hear that you are focusing on bayes poisoning methods. As an aside, some of our patches have been accepted into the next release of SA. It's a huge problem.

Sorry for my late reply. The changes I made to zmtrainsa were just to implement the same functionality using rspamc to train rspamd rather than SA or the now-defunct DSPAM. It's always been my understanding that the system Junk & Not Junk folders are automatically emptied during the two zmtrainsa cron jobs that run overnight. I don't believe anything I've changed would affect that, and I've also confirmed it by manually looking at those two accounts in the Admin UI.
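The substitution is roughly this shape; a minimal sketch assuming the standard `rspamc learn_spam` / `learn_ham` commands (the function and the `echo` wrapper are illustrative, not the actual zmtrainsa code):

```shell
#!/bin/sh
# Where zmtrainsa once invoked sa-learn (or DSPAM's trainer), the
# modified script builds the equivalent rspamc command instead.
learn_message() {
    mode="$1"; msg="$2"
    case "$mode" in
        spam) echo rspamc learn_spam "$msg" ;;  # echo shows the command
        ham)  echo rspamc learn_ham  "$msg" ;;  # drop echo to run it
    esac
}

learn_message spam /tmp/example.eml
```

In a multi-host setup, rspamc can also be pointed at a remote controller, which is relevant to the earlier question about MXs living on different machines from mailboxd.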

It always amazes me how stupid mistakes uncover operational knowledge and failure modes. I am not sure we have the root cause, because our alerts require 2 hours of pegged CPU before we get that warning email. When I ran your training script by hand it finished very quickly, which matched the normal behavior from our observation: the output I posted showed the training finishing immediately, even with a sqlite3 db and 7K-9K of training data. In production, one would train and clean up stepwise on whatever training size/frequency window works best for their hardware and user community... which for us is once per day LOL

Unfortunately I'm not able to hammer my server/rspamd as it's just a personal mail server and therefore doesn't see much volume, but I'd be surprised if it borked under any great load, as it seems to be used by some large sites. FWIW, I did implement Redis on my server, although that probably doesn't make much difference for me. Yes, I'd think that the training schedule would be something each ZCS user would have to determine for themselves. I'll look forward to the final analysis of rspamd in your environment; is your profile up to date, and are you still using CentOS 6? I tend to keep my servers on the most recent version of CentOS, so I'm on the latest CentOS 7 version, but again, I wouldn't have thought that would make any difference. I'm glad that my modifications to zmtrainsa are working for you.

Have a good week-end, what's left of it.

{EDIT}I guess I should have asked if your test machine is also on CentOS6 or CentOS7.

phoenix wrote: is your profile up to date, and are you still using CentOS 6? I tend to keep my servers on the most recent version of CentOS, so I'm on the latest CentOS 7 version, but again, I wouldn't have thought that would make any difference. I'm glad that my modifications to zmtrainsa are working for you.

{EDIT}I guess I should have asked if your test machine is also on CentOS6 or CentOS7.

Yes, my profile is accurate and I am still on CentOS 6. It's stock, which is what I am using for our rspamd trial, and it will probably stay CentOS 6 until 11/30/2020. I've had a few CentOS 7 machines in production for the past few years (DNS, OpenVPN access servers, ownCloud, etc.) but not for Zimbra yet; CentOS 6 has earned a level of trust here. I haven't noticed much difference between CentOS 6 and 7 for how we use them; both have been equally reliable for us. The fact that the first UNIX source code I ever modified was the Version 7 init.c tells you how much inertia I could have toward systemd.

How do you see rspamd being used in a multi-host Zimbra architecture where the MXs are not on the same machine as mailboxd? Do you see your zmtrainsa connecting remotely to the rspamd on the MXs, or some other method such as replicated Redis, etc.?

It would be kind of interesting if Zimbra sites could access remote/external bayes DBs like we currently do with blacklists. One could compare against different bayes DBs and score them individually and in aggregate for more accuracy alongside local training. I wonder how accurate this would be, and at what performance/latency cost? It would be an interesting market for Zimbra sites to profit from, or share in, the accuracy of other users' system training. One would probably want to weight different sites to handle accuracy variance. Would the training improve the poorer systems over time, with fewer false positives and false negatives? It doesn't need to be perfect, but there have been times when I wished we had a little more statistical help with some scoring. Now, if we add a blockchain to this somewhere, we can write our own ticket. LOL Hmmm... https://arxiv.org/abs/1512.09327

It seems rspamd is a copycat of DSPAM on steroids with a lot of addons, or at least does many things like DSPAM. Oh, and on that note: no, it *is* DSPAM. At least, they took most of the code, renamed the bins, and even left the training flags the same. Are you kidding me? Not even an official fork, wtf. Well, in that case it will not work on Zimbra, for the same reason DSPAM never did work on Zimbra (not without a lot of additional help).

Alright, what I'm about to write is mainly about the statistical module. The policy modules and most other stuff are unaffected and should work as expected.

See, those self-learning systems need two things: a lot of random data and a lot of accurate data. So you need to train them with clean hams and spicy spams. Another question is whether you want to filter server-wide, user-based, or a combo of both (baseline server-wide + user).

In any case, the more homogeneous your user base is, the better. If you're an email provider with hundreds or thousands of different users, well, autolearn won't work as well as you would anticipate.

The key factor in making it work is a working connection between learning and the spam folder. If Zimbra informed the training service about every change (moved into / moved out of spam), then you would have a chance; well, it doesn't, and never did properly. rspamd (like DSPAM) needs to be informed about messages moving into or out of the spam folder in order to properly unlearn wrongly learned stuff.

Second, about server load: bad news, but it can be hefty if it's trained properly. Even a small database easily reaches 400 megs, and it will take its toll on every message.

And yes, DO NOT EVER USE ANY SQL BACKEND, period. It won't work, can't work; the data load is simply too big. You need to use the hash database.

But here is the next quirk, one thing never resolved in DSPAM: there are a lot of issues with that hash database, so be prepared for a tear or two.

Now, if you want to get it working well, there is no "make a few adjustments in the config and it will run". First you should learn what it actually does; learn about the tokenizer and the classifiers. I recommend the OSB tokenizer, but you need a lot of data there.

Second, the classifiers are the problem described above. Mixing those never really worked well, and depending on your user base, it won't ever.

If you're setting up for a single entity (with similar types of mail), it will work if trained properly. Since Zimbra isn't going to help here, you need to do the training yourself.

Also, make use of the spamtraps. Make heavy use of them. They work.

Bottom line: it is in fact DSPAM we see here, in a new project, with the same flaws. So if you set up for one entity you can make it work, and if it does, it can perform absolutely well.

If you're setting up as an email provider, move on and never look back. I don't think it's worth the time you'll need and the steady config adjustments to justify it. You'll need an assload of classifiers and you'll need to adjust them for new customers; you'll need a lot more training in total, and at the end you'll have a broken giant hash database and a lot of false positives.

The idea of DSPAM is awesome; on a single server for one entity it even works. But in bigger setups it would need a whole different approach and a couple of years of development to really make it work.

PS: holy shit, they even stole the term "neural networks" from our good old Nuclear Elephant.