SPLIT METHOD IN STRING CLASS

I could not understand the split method in String properly. I have looked through this forum and found this example, but I could not understand the explanation.

1)
String str = " apples";
String[] s = str.split("\\w*");
for (String i : s)
    System.out.println("Token" + i + "Token");

It's really worth having a browse of the source code of java.util.regex.Pattern to get a clear understanding of what's going on here (String.split() calls Pattern.split()). I'll try and explain what's going on in words, but looking at the code is probably more helpful at this point -- it's attached at the end of this post (limit = 0 for the default String.split() call).

In the case of " apples".split("\\w*"), the regular expression matches three times ("" at the start of the string, "apples", and "" at the end of the string -- Chandra Bhatt's analysis above isn't quite correct); so you get three fragments added to the output: "", " ", and "". The split() algorithm then adds an extra fragment to account for the remainder from the end of the last match to the end of the string: in our case, that's just the empty string again. Finally, the algorithm prunes those two empty strings ("") from the end of the array of results -- the pruning removes all the empty strings from the end of the results up to the first non-empty string. (Note: this pruning doesn't happen if you pass in a non-zero limit parameter to the split method).

In the case of "apples".split("\\w*"), the regular expression matches twice ("apples" and "" at the end) to give fragments "" and "". Another empty string is added to account for the remainder, but all three ""'s are pruned at the end, resulting in an empty array as output.

Finally, "apples ".split("\\w*"): the regular expression matches three times, "apples", "" and "", to give fragments "", "" and " ". The empty string is again added to the end of the outputs for the remainder, but is pruned off at the final stage (and that's the only one that's pruned).
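The steps described above can be sketched roughly as follows. This is not the JDK source: SplitModel and modelSplit are made-up names, the sketch handles only the default limit of 0, and it ignores the special case for input with no matches at all. Note also that from Java 8 onwards the real String.split() no longer produces a leading empty string for a zero-width match at the very start of the input, so for " apples" a current JDK gives a different result from this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a simplified model of the steps described in the post,
// NOT the JDK implementation. Default case (limit = 0) only.
class SplitModel {

    static String[] modelSplit(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> fragments = new ArrayList<>();
        int index = 0; // end of the previous match

        // Step 1: for each match, add the text between the previous match and this one.
        while (m.find()) {
            fragments.add(input.substring(index, m.start()));
            index = m.end();
        }

        // Step 2: add the remainder after the last match.
        fragments.add(input.substring(index));

        // Step 3: prune empty strings from the end, up to the first non-empty one.
        int size = fragments.size();
        while (size > 0 && fragments.get(size - 1).isEmpty()) {
            size--;
        }
        return fragments.subList(0, size).toArray(new String[0]);
    }

    public static void main(String[] args) {
        for (String s : new String[] {" apples", "apples", "apples "}) {
            String[] tokens = modelSplit(s, "\\w*");
            System.out.print("\"" + s + "\" -> " + tokens.length + " token(s):");
            for (String t : tokens) {
                System.out.print(" [" + t + "]");
            }
            System.out.println();
        }
    }
}
```

With this sketch, " apples" yields the two tokens "" and " ", "apples" yields an empty array, and "apples " yields "", "" and " ", which are the three cases walked through above.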

Matt,

--------------------------------------------------------------------------
("" at the start of the string, "apples", and "" at the end of the string -- Chandra Bhatt's analysis above isn't quite correct);
--------------------------------------------------------------------------

Can you elaborate on this, please? There is a space at the start of " apples", so where does this "" come from? I think it should be " ".

Thanks,
Anil Kumar

Chandra Bhatt
Ranch Hand

Joined: Feb 28, 2007
Posts: 1707

posted Apr 28, 2007 05:55:00


Matt says,

In the case of " apples".split("\\w*"), the regular expression matches three times ("" at the start of the string, "apples", and "" at the end of the string

I find the above lines missing something...

IMHO, in " apples".split("\\w*") the regular expression matches zero occurrences at the very beginning of " apples", and then comes the space. By default split() drops the trailing blank string "", as the API says.

The second argument of split() specifies the limit.

Thanks, cmbhatt

anil kumar
Ranch Hand

Joined: Feb 23, 2007
Posts: 447

posted Apr 28, 2007 05:58:00


Hi Chandra,

The trailing empty strings are supposed to be removed, but that is not what happens here. Why?

Anil, the following code shows where the regular expression matches. Remember: split() outputs the bits between the matches (and before and after the first and last matches respectively), but trims empty strings from the end of the output.
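A minimal sketch of such a program, using only java.util.regex (the class and method names are made up for illustration, and the "start-end:text" output format is just one way to show the positions):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: lists every place where the regular expression matches.
class MatchFinder {

    // Returns one entry per match, formatted as start-end:"matched text".
    static List<String> findMatches(String input, String regex) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.start() + "-" + m.end() + ":\"" + m.group() + "\"");
        }
        return matches;
    }

    public static void main(String[] args) {
        for (String s : new String[] {" apples", "apples", "apples "}) {
            System.out.println("\"" + s + "\": " + findMatches(s, "\\w*"));
        }
    }
}
```

For " apples" this reports three matches (an empty match at 0, "apples" at 1-7, and an empty match at 7); for "apples" two matches; and for "apples " three matches ("apples" at 0-6, then empty matches at 6 and 7).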

anil kumar
Ranch Hand

Joined: Feb 23, 2007
Posts: 447

posted Apr 28, 2007 06:28:00


Hi Matt, I tried your program and understood it. But when I tried the same thing I got different output (see the space between the two tokens). Why? See line 1 below. This is the only thing I have not been able to understand since this morning.

OK, step 1: where does the regular expression match? The program I pasted above shows you:

Let's call them matches 1, 2 & 3.

Step 2: What are all bits before, between and after the matches? Well, before match 1 (i.e. "apples"), we have nothing, so output 1 = "". Between match 1 & match 2 we also have nothing, so output 2 = "". Between match 2 & match 3 we have a space, so output 3 = " ". Finally, after match 3 we have nothing, so output 4 = "". OK, so far we have:

Outputs: 1 = "", 2 = "", 3 = " " and 4 = "".

Step 3: Pruning: when called with no limit argument, split() removes all the empty strings at the end of the output, so this becomes:

Outputs: 1 = "", 2 = "" and 3 = " ".

(If you'd used str.split("\\w*", -1) instead, you'd get all of the strings without any pruning.)
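A quick way to compare the pruned and unpruned results for "apples " is sketched below. SplitLimitDemo and bracket are made-up names; bracket simply wraps each token in > and < so that empty strings stay visible.

```java
// Hypothetical demo class comparing the default split with a limit of -1.
class SplitLimitDemo {

    // Wraps every token in ">" and "<", separated by single spaces.
    static String bracket(String[] tokens) {
        StringBuilder sb = new StringBuilder();
        for (String t : tokens) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append('>').append(t).append('<');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String str = "apples ";

        // Default (limit 0): the trailing empty string is pruned.
        String[] pruned = str.split("\\w*");
        System.out.println(pruned.length + " tokens: " + bracket(pruned));

        // Limit -1: nothing is pruned, so the fourth (empty) token survives.
        String[] all = str.split("\\w*", -1);
        System.out.println(all.length + " tokens: " + bracket(all));
    }
}
```

The first call gives three tokens and the second gives four, the extra one being the empty string after the last match.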

-- Matt

anil kumar
Ranch Hand

Joined: Feb 23, 2007
Posts: 447

posted Apr 28, 2007 07:02:00


Now I have understood. Thank you, Matt and Chandra, for your valuable time and responses.

And Chandra, the first week of May starts on Tuesday, so I don't know your exam date.

In the case of " apples".split("\\w*"), the regular expression matches three times ("" at the start of the string, "apples", and "" at the end of the string -- Chandra Bhatt's analysis above isn't quite correct); so you get three fragments added to the output: "", " ", and "".

I am unable to understand how the bolded part is matching. Please explain.

And in the case of "this is to test", the first match is "this" and the second match is ""; I don't understand how this comes about.

In the string literal "apple" there are six blank strings "": one before each of the five characters and one at the end.

The whole discussion above concluded that non-matching trailing blank strings are chopped off by the split method unless you pass a non-zero limit as the second argument.

The latest question was regarding "".split("x*"), which returns ><, I mean one blank string.

It is only the non-matching trailing blank string that is chopped off by the split method. What is returned by this is just leading blank string. What the pattern says is: find zero or more occurrences of x.
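The behaviour in question can be checked with a short sketch like this one (EmptyInputSplit and bracket are made-up names; the "xx" input is added here only as a contrast and is not from the thread):

```java
// Hypothetical demo: what split() returns when the input itself is empty.
class EmptyInputSplit {

    // Wraps every token in ">" and "<", separated by single spaces.
    static String bracket(String[] tokens) {
        StringBuilder sb = new StringBuilder();
        for (String t : tokens) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append('>').append(t).append('<');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Empty input: one blank string comes back, even with the default limit.
        String[] fromEmpty = "".split("x*");
        System.out.println(fromEmpty.length + ": " + bracket(fromEmpty));

        // Non-empty input made only of delimiters: every token is empty and pruned.
        String[] fromXs = "xx".split("x*");
        System.out.println(fromXs.length + ": " + bracket(fromXs));
    }
}
```

The empty input produces an array of length one holding "", whereas "xx" produces an array of length zero.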

I think, I may confirm this by this example:

Example #1:

Output: >< > <

Trailing blank is chopped off.

Example #2:

Output: >< > < ><

This is because of the second argument (Limit) we have passed to the split(...) method.

I think if you carefully read the post I have just made above, you will get it. The couple of examples I gave are for exactly that case.

Thanks,

sharan vasandani
Ranch Hand

Joined: Feb 22, 2007
Posts: 100

posted May 08, 2007 01:58:00


I am sorry, but it's not clear to me what you mean by these lines:

It is only the non-matching trailing blank string that is chopped off by the split method. What is returned by this is just leading blank string.

In a previous post Matt said that all empty strings are pruned until a non-empty string is encountered. In our case there is no non-empty string, so why is it still printing "><"?

Chandra Bhatt
Ranch Hand

Joined: Feb 28, 2007
Posts: 1707

posted May 08, 2007 02:13:00


We have the pattern "x*", which says zero or more occurrences of x. Remember that zero occurrences will do too. So split() has to return the tokens, using the pattern as a sort of delimiting sequence. I can see where the confusion comes from: there is only a blank string, but it can't be discarded by split(). What split() returns here can be called the leading string (although it is the trailing string too, which is the source of the confusion).

Only if that blank were followed by other characters in the string being split would split() have chopped the unmatched trailing blanks, as in the couple of examples in my previous post.

To keep all the unmatched trailing blanks, pass the second parameter (the limit): negative to keep everything, or positive to cap how many times the pattern is applied.
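The three kinds of limit can be sketched on a simpler comma-separated string (the input and the class name are made up for illustration). Per the String.split(String, int) documentation, a limit of zero applies the pattern as often as possible and discards trailing empty strings, a negative limit applies it as often as possible and keeps them, and a positive limit n applies the pattern at most n - 1 times.

```java
import java.util.Arrays;

// Hypothetical demo of zero, negative and positive limits.
class LimitSemantics {

    public static void main(String[] args) {
        String csv = "a,b,,c,,";

        // Limit 0 (the default): the two trailing empty strings are dropped,
        // but the empty string in the middle is kept. Length 4.
        String[] zero = csv.split(",");
        System.out.println(zero.length + " " + Arrays.toString(zero));

        // Negative limit: nothing is dropped. Length 6.
        String[] negative = csv.split(",", -1);
        System.out.println(negative.length + " " + Arrays.toString(negative));

        // Limit 3: the pattern is applied at most twice, so the third
        // element holds the rest of the input unsplit. Length 3.
        String[] three = csv.split(",", 3);
        System.out.println(three.length + " " + Arrays.toString(three));
    }
}
```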

Thanks,

sharan vasandani
Ranch Hand

Joined: Feb 22, 2007
Posts: 100

posted May 08, 2007 03:13:00


Still not clear.

According to Matt:

Finally, the algorithm prunes those two empty strings ("") from the end of the array of results -- the pruning removes all the empty strings from the end of the results up to the first non-empty string

So all empty strings are removed until a non-empty string is encountered.

Chandra Bhatt
Ranch Hand

Joined: Feb 28, 2007
Posts: 1707

posted May 08, 2007 04:23:00


Hi Sharan,

No need to worry.

Anyway, what do you think about this issue; how is it done? I think you should just play with the code and try several modifications: split() with a second argument, with some positive values, with -1, and so on. Then you tell me how things are happening there.

I think this is a far better way.

Keep it up!

Thanks,

sharan vasandani
Ranch Hand

Joined: Feb 22, 2007
Posts: 100

posted May 08, 2007 05:09:00


I know passing -1 will not prune any empty strings but will print them all.

But I am confused between these two:

In the case of "apples".split("\\w*"), the regular expression matches twice ("apples" and "" at the end) to give fragments "" and "". Another empty string is added to account for the remainder, but all three ""'s are pruned at the end, resulting in an empty array as output.

I thought trailing empty strings are discarded, so the output should be nothing... ?

It looks like you may have found a bug -- or at least, an undocumented exception condition. From the source code, it looks like if there are *no* matches for the delimiter, it will just return the original string as an array of size one.

It doesn't even bother to check the limit parameter, or call the part that removes the trailing blanks.

What I think is happening is as follows: the split() JavaDoc says that, "If this pattern does not match any subsequence of the input then the resulting array has just one element, namely the input sequence in string form." However, this is implemented in Sun's code (pasted a few messages back) by testing if the index variable == 0. That would normally indicate no matches had occurred, however, it's also the case where the string itself is empty and there is a zero-length match.

My suspicion is that this is a Sun bug, in that the spec states that trailing empty strings will be discarded.

-- Matt [ May 08, 2007: Message edited by: Matt Russell ]

Matt Russell
Ranch Hand

Joined: Aug 15, 2006
Posts: 165

posted May 08, 2007 06:42:00


Originally posted by Henry Wong:
Oops, I was wrong. This exception condition is documented in the JavaDoc... It looks like if there are no matches for the split delimiter, then the limit part of split (and any side effects) is not even applied.
Henry

So, thank you very much for the fruitful discussion. Finally I feel able to predict the output of the split() method in absolutely every case. :-) Unfortunately I am not able to say the same for all these parse methods around. Lots of work still to be done...

I was referring to the matching of the delimiter too -- * matches 0 or more: so x* matches even though there are no x's to match. It's quite possible I'm being dense and missing something, though ;-)

Interesting. You are absolutely correct.

From the source code, it does look like a bug. Apparently, it is checking to see if an internal variable (index) is not changed (to determine no matches). This variable starts off as zero, and ends up as the end of the last match -- which in this case is still zero.
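The bookkeeping described here can be imitated with a small sketch. This is an illustration of the idea only, not the actual Sun code; IndexTrace and trace are made-up names, and the sketch just records how many matches were found and where the last one ended.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: tracks an "index" variable the way the post describes.
class IndexTrace {

    // Returns {number of matches, end position of the last match (0 if none)}.
    static int[] trace(String input, String regex) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int matches = 0;
        int index = 0; // starts at zero, then follows the end of each match
        while (m.find()) {
            matches++;
            index = m.end();
        }
        return new int[] {matches, index};
    }

    public static void main(String[] args) {
        int[] emptyInput = trace("", "x*");      // one zero-length match, index stays 0
        int[] noMatch = trace("apples", "x");    // no match at all, index stays 0
        int[] normal = trace("apples", "\\w*");  // two matches, index ends at 6

        System.out.println(emptyInput[0] + " match(es), index " + emptyInput[1]);
        System.out.println(noMatch[0] + " match(es), index " + noMatch[1]);
        System.out.println(normal[0] + " match(es), index " + normal[1]);
    }
}
```

In the first two cases index is still zero at the end, so a check of the form "index == 0" cannot tell a zero-length match on an empty string apart from no match at all.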