- open the keywords file and read it into a hash.
- read each email file in the emails directory.
- open each email file and split it into individual words, or use a regular expression.
- test if each individual word exists in the keywords hash.
- print the result.
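As a rough illustration of those steps, here is a minimal Perl sketch. The keyword list and email text are made up, and in-memory filehandles stand in for the real files so the snippet is self-contained:

```perl
use strict;
use warnings;

# Made-up data standing in for the keywords file and one email file.
my $keyword_data = "bank\naccount\nverify\n";
my $email_data   = "Please verify your account details today\n";

# Step 1: read the keywords into a hash for cheap lookups.
open my $kfh, '<', \$keyword_data or die "Cannot open keywords: $!";
my %keywords;
while ( my $word = <$kfh> ) {
    chomp $word;
    $keywords{ lc $word } = 1;
}

# Steps 2-3: read the email and split it into individual words.
open my $efh, '<', \$email_data or die "Cannot open email: $!";
my %found;
while ( my $line = <$efh> ) {
    # Step 4: test each word against the keywords hash.
    $found{$_}++ for grep { exists $keywords{$_} } map { lc } split ' ', $line;
}

# Step 5: print the result.
print "$_ => $found{$_}\n" for sort keys %found;
```

Swap the scalar references for real file paths to use it against files on disk.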

Here is an example of the approach I would take, though it is perhaps more complex than you require due to its heavy use of Path::Tiny, map and grep. Use it as a basis.

There are two possible approaches (with several variations of each). In the first, you search the email for each 'word' in the text file. In the second, you search the text file for each word in the email. I cannot recommend either one without knowing more about your problem.
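To illustrate the difference, here is a small sketch of both directions with invented data. Approach one loops over the keywords; approach two loops over the email's words:

```perl
use strict;
use warnings;

my @keywords = qw(prize winner urgent);
my $email    = "You are a winner! Claim your prize now.";

# Approach 1: search the email for each keyword.
# Cost grows with the number of keywords.
my @hits1 = grep { $email =~ /\b\Q$_\E\b/i } @keywords;

# Approach 2: search the keyword set for each word in the email.
# A hash makes each lookup cheap, so cost grows with the email size.
my %is_keyword = map { lc $_ => 1 } @keywords;
my @hits2 = grep { $is_keyword{$_} } map { lc } $email =~ /(\w+)/g;

print "approach 1 found: @hits1\n";
print "approach 2 found: @hits2\n";
```

Both find the same matches here; which is faster depends on the relative sizes of the keyword list and the emails.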

How many words are in your text file? Are you concerned about processing speed?

How do you define 'word' in the email? Is it any contiguous string of characters that matches an entry in the file? Is it a string of characters surrounded by whitespace? Something else? Must we be concerned with case? Word wrap? Plurals? Tense? ...

Do you plan to extract the text of the email before you start?

Does your email contain non-ASCII characters?

UPDATE:

Chris has provided an example of the first method. Here is an example of the second:

Code

C:\Users\Bill\forums\guru>type ewh006.pl
use strict;
use warnings;

my $email = \do{ my $mail = << 'END_EMAIL' };
Here is some arbitrary text. It really does
not matter much what it says - its just email.
END_EMAIL
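The listing above is truncated, so here is a hedged sketch of how the second method might continue: walk the words of the email, testing each one against the keyword set. The keyword list is made up for illustration:

```perl
use strict;
use warnings;

my $email = \do { my $mail = <<'END_EMAIL' };
Here is some arbitrary text. It really does
not matter much what it says - its just email.
END_EMAIL

# Made-up stand-in for the contents of the keywords file.
my %keywords = map { $_ => 1 } qw(text email arbitrary);

# Second method: iterate over the words of the email,
# testing each one against the keyword set.
my %count;
while ( $$email =~ /(\w+)/g ) {
    my $word = lc $1;
    $count{$word}++ if $keywords{$word};
}

print "$_: $count{$_}\n" for sort keys %count;
```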

I am a bit new to Perl; I have some prior experience with Python and Bash. However, your methods seem a bit complex for my level of knowledge.

I have gotten started based on reviewing your replies to my post, and have been trying a different, maybe simpler, route.

So far I have read the PHISHING TERMS (shown in the screenshot) into an array. I now need to figure out a method for using that array to find matches in the Sample_Email.txt file (also shown in the screenshot). As for the PHISHING TERMS document, I may shorten that list to just one instance of each word instead of the variations, because I'm not even sure I need them.

I see from your screenshot that you have each line from your keywords file in the @data array. Your array will contain data you don't need, so your next step is to extract what you do need. You don't need each case variation of each word; it is probably enough to later use the i modifier on a regular expression to match words regardless of case.

This is very similar to what you have done so far, with a couple of improvements:
- It uses the recommended three-argument form of open, with a lexically scoped filehandle variable and an error message.
- It filters each line with map to extract what you need (the first word that follows ") ").
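A minimal sketch of those two improvements, assuming the ") " prefix format inferred from the screenshot and using an in-memory filehandle in place of the real file:

```perl
use strict;
use warnings;

# Stand-in for the PHISHING TERMS file; the ") " prefix format
# and the case variations are assumed from the screenshot.
my $raw = <<'END_TERMS';
1) urgent Urgent URGENT
2) verify Verify VERIFY
3) account Account ACCOUNT
END_TERMS

# Three-argument open with a lexical filehandle and an error message.
open my $fh, '<', \$raw or die "Cannot open keywords: $!";

# Filter each line with map, keeping only the first word after ") ".
my @keywords = map { /\)\s+(\w+)/ ? lc $1 : () } <$fh>;

print "@keywords\n";
```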

Check screenshot for output.

I suggest you read through Bill's solution carefully and understand what it is doing at each step. It is a general solution to your overall task, and further requirements can easily be fitted in. Consider each of his questions too; they are important in deciding which approach is best for your needs, including any fine tuning necessary to make the code production ready.

The five steps you provided are the right train of thought. Reading the entire email in at once (slurping) is easier to process in your case, particularly with such small files. You'll find that splitting won't be as accurate as using a regular expression; take, for example, words that end in a full stop. This isn't even a necessary step with Bill's solution. Perl has an excellent loop system that avoids the need for explicit counters when iterating over data in typical cases.
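A quick demonstration of why split is less accurate than a regular expression: a trailing full stop stays attached to the word when you split on whitespace:

```perl
use strict;
use warnings;

my $text = "Click here now. Act fast.";

# Splitting on whitespace keeps trailing punctuation attached,
# so "now." would fail to match a keyword "now".
my @split_words = split ' ', $text;

# Capturing runs of word characters strips the punctuation for free.
my @regex_words = $text =~ /(\w+)/g;

print "split: @split_words\n";
print "regex: @regex_words\n";
```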

There are of course solutions that incorporate grep and/or awk, but Perl is just as capable, and this is a great little task to help you develop your Perl skills.

It is hard to add to Chris's excellent reply. However, I can point out that your example does not contain any of the special cases that I mentioned before. Both solutions should give the same result for this example. It is important to consider all of those cases and create examples which test all the ones which matter. You do not want the pointy-haired boss to be compromised because you ignored a possibility. This is especially bad if it is difficult or impossible to fix in the implementation you have chosen. Good Luck, Bill

What I have right now is counting instances but only within that one file (EMAIL).

Your code as it stands simply tallies each line in the email, effectively counting duplicate lines. That is obviously not what you want: you want to work with "words", not lines.
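To illustrate the difference, this sketch tallies the same made-up email both ways, by line and by word:

```perl
use strict;
use warnings;

my $email = "free money\nfree money\n";

# Tallying lines only counts duplicate lines.
open my $fh, '<', \$email or die "Cannot open email: $!";
my %line_count;
$line_count{$_}++ for <$fh>;

# Tallying words is what the task actually needs.
my %word_count;
$word_count{ lc $1 }++ while $email =~ /(\w+)/g;

print "distinct lines: ", scalar keys %line_count, "\n";
print "free: $word_count{free}, money: $word_count{money}\n";
```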

Quote

My main problem is comparing the two files and matching the KEYWORDS vs. instances of those keywords found in the EMAIL.

How do I make it take the email and compare it to matches within the KEYWORDS file?

We have provided a couple of solutions to this, and separately some code to help you parse your keywords file now that we have seen its format. I understand they aren't particularly beginner-friendly solutions, but I was hoping you might be able to apply some of their features in your own code and we could work from there.

You have your own style that you are set on using, and I will stick to it so as not to further confuse things. Using your current code as a basis, and trying to keep the process and notation straightforward, here is a working solution that you should be able to follow:

It's not without its limitations. It's not as efficient as it could be, and that regular expression could be improved to support word boundaries (assuming you don't want to match keywords inside other words), etc.
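The word-boundary improvement can be seen in a small example: without \b, a keyword like "win" also matches inside "window":

```perl
use strict;
use warnings;

my $email   = "Open a window to win big";
my $keyword = "win";

# Without \b the keyword matches inside other words.
my @loose = $email =~ /\Q$keyword\E/gi;

# With \b it matches whole words only.
my @strict = $email =~ /\b\Q$keyword\E\b/gi;

print "loose matches: ",  scalar @loose,  "\n";
print "strict matches: ", scalar @strict, "\n";
```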

For reference purposes, here is a solution on a par with the approach I might take: build a hash of keywords, then use that to build a second hash of matching keywords:
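As a rough sketch of that two-hash approach, with an invented keyword list and email:

```perl
use strict;
use warnings;

my @terms = qw(lottery password claim);
my $email = "Claim your lottery winnings by sending your password";

# First hash: the keyword set, lowercased for case-insensitive lookup.
my %keywords = map { lc $_ => 1 } @terms;

# Second hash: only the keywords that actually occur, with counts.
my %matched;
for my $word ( map { lc } $email =~ /(\w+)/g ) {
    $matched{$word}++ if $keywords{$word};
}

printf "%-10s %d\n", $_, $matched{$_} for sort keys %matched;
```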

That sounds correct. I'm guessing your keywords/phishing terms file has empty lines in it. With warnings enabled, I would have thought it would complain about an uninitialized variable. If this assumption is correct, you can skip empty lines using something along the lines of next if $line =~ /^\s*$/. Otherwise, attach your keywords and email files and we can test your actual raw data.

P.S. If and when you come to processing multiple emails, the code will definitely need a rework; it really needs to pre-process the keywords beforehand, as our other solutions do. Feel free to ask for further help if you get stuck.