Reviews

Download ATPM 12.05

FileMaking

Text Parsing With FileMaker

Pardon my extended absence from this column. Life just has a way of
getting away from one at times, and the work load piles up. But
hopefully I’m back to a normal schedule again.

As mentioned in the last column I wrote regarding the direction of this
tutorial series, we’ll now cover some basic text parsing functions. For
beginning FileMakers, parsing text is one of the most useful skills to
have as well as one of the first that questions are asked about.

What Is Text Parsing

Text parsing is the process of applying a calculation to a string to
pull out some other string within it that is useful or to determine if
some condition is true regarding the string.

This comes up many times in FileMaker, or for that matter, any database
application. I’m currently working on a project where data is received
from an outside vendor as a text file, where each line is a record and
each field in the record is a particular number of characters in the
line (such as, characters 13 through 20 store the creation date of the
record, 21 through 30 the product ID, etc.). This format is called a
fixed format file because each field is stored in a fixed number of
characters. Text parsing calculations allow me to extract the data into
separate fields in my database.

Name Extraction

For example, proper database design dictates that information about
people should have their different names (first, middle, last, etc.)
separated into different fields: FirstName, MiddleName, LastName.
However, perhaps you need to import data from another system where all
three of these pieces of information were placed in a single field and
you want to break them out to their individual components.

FileMaker includes a wide range of text functions that can be used to
help with such chores. You can view a list of these text functions by
creating a calculation field and choosing “Text Functions” from the
View menu.

However, a better way to become familiar with the available functions
is to look through the online help documents that come with FileMaker.
While in FileMaker, click Command-Shift-/ to bring up the help system, and
then click on Contents in the top left of the window, followed by
Function Reference Category List, and then Text Functions. Click the
link for each of the available text functions to become familiar with
them.

For text parsing, the most useful functions will be Left(),
Right(), Middle(), Length(), PatternCount, and
Position().

So let’s begin by establishing some rules on how one would extract
individual names from a full name field. My own name is a good example
of a simple full name, with a single word for each of the first, middle, and
last names: “Charles Edward Ross.” It is made up of a single word for
each name (a word in FileMaker is a string of characters separated by
whitespace, such as a space, return, or tab character). We will begin by
extracting just the first name. In this case, the first name can be
extracted by getting all of the characters before the first space. So
the first thing we need to figure out is where the first space occurs,
which is done with the Position() function.

Position( FullName; " "; 1; 1 )

Assuming we have a text field called FullName, the above will return
the number of the character that has the first occurrence of a space. In
the case of my name, this will be 8.

Now that we know where the first space occurs, we want to get all of
the characters before it. For this we use the Left()
function, which returns characters in a string beginning from the left
side for the number we specify. The number of characters we want is 1
less than the position of that first space, so our calculation becomes:

Left( FullName; Position( FullName; " "; 1; 1 ) - 1 )

If CharlesEdwardRoss is in FullName, the above will return
Charles. If FullName contains ChuckEdRoss, the above will return
Chuck.

Now, FileMaker does have more powerful functions available to us that
we could use. For instance, I could do the above with the following as
well:

LeftWords( FullName; 1 )

However, this tutorial is not only about how to do things in FileMaker,
but also how to think about what you’re trying to do. I want to show
you how to think about the result, and using both Left() and
Position() were more instructive toward this than using
LeftWords().

Debugging Our Name Extraction

As we have our FirstName calculation, it will work so long as the first
name is the first word. But what if it isn’t? What if the FullName
field contains Mr.CharlesEdwardRoss? Our function will return
Mr. instead of Charles. So let’s next account for the possible, but
not necessary, presence of a title before the first name.

First, let’s consider which titles we want to account for. Off the top
of my head, I can think of, Mr., Mrs., Ms., Miss,
Dr., and Rev.. We want to check if the first word of
FullName is any of these, and if so, get the second word. If it’s
not present, get the first word.

We’ll begin by creating a variable to store the list of titles.
Unfortunately, FileMaker doesn’t have a built-in list or array
construct, but we can fake one by using a return separated string.

We begin this calculation by setting a number of variables, including a
list of the possible titles that a full name could begin with. For ease
of readability, we also set the FirstWord and
SecondWord variables.

Then we use a Case() function to determine which word, the
first or the second, we should use. The Position()
function will return a 0 if the search string (the first word) is not
in our searched string (the list of titles). Take a close look at the
parameters of the Position() function by checking the
documentation. We are looking for the first word of the full name being
in our list of titles.

We can simplify this calculation a bit. First of all, our test
expression in the Case() function doesn’t have to check
for the value being greater than 0. FileMaker interprets any numerical
value that is not 0 as being True, so the first thing
to simplify the function is to remove the inequality expression:

The second thing we can do is realize that using Position
as a test for one string simply containing another string is probably
going to be a common need. That implies that we could create a custom
function for this purpose, called perhaps Contains().
We’ll create this custom function to take two parameters, a text
parameter and a searchString parameter (see the article on creating
custom functions).
Here’s the definition of our custom function:

PatternCount( text; searchString; 1; 1)

After defining this, we have the following for our current FirstName
calculation:

Parsing Out the Last Name

Now that we can extract the first name with some reliability, let’s
work on the last name. We’ll begin with a calculation similar to that
of the first name calculation, accounting for a possible suffix. In this
case, we’ll assume that any existing suffix appears after a comma, as
in CharlesEdwardRoss,Jr. or CharlesEdwardRoss,Ph.D.. For the
time being, we’ll also assume that there can only be one suffix.

If FullName contains a comma, then the above will return the second to
last word in FullName. If it does not have a comma, then the last word
is returned.

Note the use of WordCount() to determine which word is the
second-to-last word. As you would expect, WordCount() return the
number of words in a string, so WordCount("CharlesEdwardRoss,Jr.") would return 4. Subtracting 1 from this gives us word
3, and using MiddleWords() to begin at word 3 and get 1
word returns "Ross".

Refining Last Name Extraction

Again, there’s an exception to the rules we’ve covered for last names:
Not all last names are a single word. Some last names are two words, as
in McGowan (my wife’s maiden name) or vonNeumann. Other last
names might begin with de or O (without an apostrophe). Again, off
the top of my head, I can think of Mc, Mac, O, De, and
Von. I’m sure there are others, but that will suffice for now.

So here is our newly defined rule for last names. Assume that the full
name has n words in it. If word n is not a listed suffix, and word n–1
is not a listed last name word, the last name is word n. If word n–1 is a
listed last name word, the last name is words n–1 through n. If there is a
suffix, shift the last name (and the tests for a separate last name
word) to the left by one word.

Sounds complicated. Let’s see what it looks like. First let’s add a
variable to hold the possible words that a last name can begin with:

We’ve surrounded our last name prefixes with spaces because we only
want to find them if they are separate words. We don’t want to catch
the word before the final last name if it’s O’Grady, only if it’s OGrady.

Note that this isn’t a perfect extraction of first name and last name.
There are many other exceptions we could account for, a suffix without
a comma, or three-word last names (delaHoya). But this should give
you plenty to begin with for your own text parsing routines.

Email Verification

Text parsing isn’t always pulling out a piece of a string. Sometimes
you need to know if a string satisfies a condition. A classic example
is if the string is a valid e-mail address. You can’t verify that it’s
valid in the sense that e-mail sent to the address will arrive at an
e-mail server someplace, but you can verify that it looks like a valid
address.

A valid e-mail address looks something like xxx@xxx.xxx. So, it has
some characters, followed by an @ symbol, followed by some more
characters, a period, and two or three more (i.e., .com or .uk).
There are some more restrictions, but these will suffice for our
tutorial purposes.

First, an e-mail address is valid if it has the @ character in it
only once, and if it’s not the first character.

The first line checks that the first @ symbol is not the first
character. The second condition checks that there is only one @
symbol in the string. Parentheses enclose a compound third condition
made up of two separate conditions, one checking that the
fourth-to-last character is a period and the other checking that the
third-to-last character is a period. If either of these is true, the
third (compound) condition returns true. So an e-mail like
chuck@chivalrysoftware.com will validate, but
@chivalrysoftware.com, chuck@chivalry@software.com and
check@chivalrysoftware.c will not validate.

Again, this isn’t a complete validation of e-mail addresses, but it
should give you an idea of the thought process needed and the functions
required to do a more complex validation.

• • •

I think that will do it for us this month. As an exercise in using the
skills covered in this column, I would suggest writing the routine to
extract the middle name from a full name, including the possibility
that the middle name isn’t provided (i.e., return an empty string when
FullName contains CharlesRoss). Until next month, happy FileMaking!

I cannot thank you enough for putting up such a clear, concise and understandable article on how to separate out names from within a full name field. I have a database with over 1000 records and was thinking about doing it by hand.

I searched the net and most of the hits were "Buy a subscription and I'll tell you how to do it". I would have paid someone to help, but those sites looked a bit fishy.

Then I ran into yours. Thank you, once again, for empowering me. You have no idea how thrilled I am that I could do this -- though to many it would seem so simple.

Hello and a big "Thank You" to you for your assistance and wisdom, Chuck. I've got what my mom would have called a 'Mell-of-a-Hess' name parsing project.
The problem is the variability used to express the full name and street address information. Below is an example of the various ways Full Name is expressed in this data set from the public domain:

I'm pleased with this info. However, I have a different type of problem. Maybe you can help me solve this using Filemaker.

What I want to be able to do is to have Filemaker check if a text-field contains a specific string. For example, I want to check whether the text-field contains the word "trousers" just somewhere in the text.

I was looking for a function that does something like: IF FIELD CONTAINS "trousers" THEN ... etc.

However, I do not seem to find such a simple CONTAIN type of function.

PatternCount ( FieldName ; "trousers" ) will return the number of times 'trousers' appears in the field.

The drawback is that it only looks for the pattern as a string not a word. So if you were looking for the word 'File' you would get a result if the word 'FileMaker' was in the field. There are a number of techniques you could use if you need a word match. I'd consider adding a leading and trailing return, substituting all the spaces and punctation for returns, and then PatternCounting for the word bracketed by returns:

For more complex issues such as that brought up by Paul, where you want to find a word that is by itself and not within another word, I use the SmartPill PHP plugin that offers the availability of regular expressions. However, using it does require a knowledge, of course, of both PHP and regular expressions.

I'm wondering if there is a way to split a paragraph with several sentences into several different fields. Strange idea, perhaps, but here's why:

I am using FMP to download a weather forecast from the web... from there that information is exported into a markup language and displayed on digital signage. The problem is that sometimes the forecast is too long to fit on just one screen. Since this is all automated, I need a way to break up the information into more than one chunk through a script or calculation.

(Of course, I can do a calculation based on word count, but this breaks the information in the middle of a sentence.)

If the sentences don't contain abbreviations, this can be relatively easily done. Without abbreviations, you can assume that a period terminates the sentence. You'll then want to find the last period before the maximum length of your substring and get everything before that. The calculation would look something like this.

Let(
[
MaxString = Left( String; 100 ); // Get the first 100 characters of the original string.
PeriodCount = PatternCount( MaxString; "." ); // Find out how many periods are in the substring.
PeriodPos = Position( PeriodCount; "."; 1; PeriodCount ); // Get the position of the last period.
Substring = Left( MaxString; PeriodPos )
];
Substring
)

If your strings do have abbreviations, this will get a lot more complicated, but is still possible.

Is this thread still active? I am wondering if it is possible to trim all of the text in a field (and there will be varying amounts of text/words), everything UP TO a specific word.

I am trying to help various users around the world to define a path to a picture that will be inserted into their container. So a user inserts a picture (reference only) from a shared Dropbox folder, and now I want other users to be able to insert that same picture by changing the filepath to reflect their personal path to Dropbox. One user may have a filepath of imagewin://C:users/JohnSmith/Dropbox.... and another might be imagemac:/Y:/Dropbox....and many other variations on that theme.

I have a calculation field (Image Filepath Set) that uses: MiddleWords (Image Filepath Set; 5; 25 ) and that works for everyone who has the first filepath (the same number of folders), but not for anyone else. I just used 25 in the calculation because it was a big number, and I wanted to include all possibilities. Overkill, I know.

I just want something that finds "Dropbox" wherever it occurs, and then deletes everything in front of it. It could delete Dropbox, too, I could build that into the calculation.

I find calculations like this (using Let variables), with intermediate variables easier to understand, than a single long equation. The version provided by Paul does work, but also removes "Dropbox" from the resulting string.

Testing this in FileMaker, I got "Dropbox/image.jpg" for "imagewin://C:users/JohnSmith/Dropbox/image.jpg" and "Dropbox/otherimage.jpg" for "imagemac:/Y:/Dropbox/otherimage.jpg".

Add A Comment

Full NameE-mail Address (will not be published)Comment (You may use hyperlink, italic, and boldface tags. Use a blank line to start a new paragraph.)