First let’s have a look at those two Defined Names then, and we’ll start with the simpler of the two, i.e. Arry1, which is:

=ROW(INDIRECT("1:"&LEN("α"&$A1&"α0")-1))

This is a relatively straightforward construction used to generate an array of integers. The only feature which will require some explanation is the concatenation of the string in A1 with, at the start, a single non-numerical character (“α”) and, at the end, a string consisting of a non-numerical and a numerical (“α0”).

This explanation I will leave until the dissection of the main formula. For now, it is sufficient to note that the above resolves to:

The basic premise of this set-up is to test for numbers within the string in A1 by reference to the ASCII character codes. This is by no means the only way of testing for numericalness, which we would normally do using e.g. ISNUMBER, though it does have the advantage that it offers us a means by which we can simultaneously test for non-numbers within the string.

In order to identify each of our substrings of numbers we must effectively locate two different sets of points within the string: one set corresponding to each of the positions where our numbers begin, the other to each of the positions where our numbers end.

The first of these sets will thus give us our start_num parameters to pass to MID, and a simple subtraction between members of both sets will give us the corresponding num_chars parameters.

What’s more, recall that I declared that this set-up should also be able to cope with any decimal values within the string. As such, if we were to use some sort of construction which involved testing for Boolean FALSE returns to an ISNUMBER query in order to determine these start and end positions, then any decimal points within the string could lead to incorrect results.

Perhaps fortuitously, the ASCII codes for the digits 0-9 range from 48 to 57 and that for the full-stop/decimal point is 46. Hence, if we first eliminate from our string any characters corresponding to ASCII code 47 (which is the forward slash – “/”), then we will be in a position to state with confidence that any character within our string which has an ASCII code outside the range 46-57 is neither a number nor a decimal point.

Now, rather than perform two separate tests on each character within the string – i.e. one as to whether its ASCII code is greater than 45 and another as to whether its ASCII code is less than 58 – a nice trick to know is that this is precisely equivalent to performing a single test as to whether the absolute value of 51.5 minus its ASCII code is less than or equal to 6.

A further point to note is that, in order to detect the start and end positions of numbers, we are effectively looking for pairs of characters within the string, where either the first in that pair is a non-number and the second a number (which will give us the start of our numeric string) or the first in that pair is a number and the second a non-number (which will give us the end of our numeric string).

Of course, if the very first or very last characters in our string are equally to be tested according to these criteria (which indeed they need to be), then we need to ensure that there exists some character(s) either preceding or succeeding them in the string. We do this by first concatenating suitable strings to the start and end of the entry in A1; hence the “α”& and &”α0” that you can see.

With these thoughts in mind, let’s now return to the breakdown of Arry2, which, using our string in A1, will become:

The next step is a technique worth remembering. Since we wish to consider pairs of characters from within the string, we effectively need to compare the character in position 1 with that in position 2, the character in position 2 with that in position 3, and so on and so on.

In order to do this, we need to generate an array of values to be passed to MID for its start_num parameter, viz: {1,2;2,3;3,4;4,5;5,6;…}, etc.

Since Arry1 – {1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20;21;22;23;24} – is a 24-row-by-1-column array, then, by performing an addition of this array and a 1-row-by-2-column array – {0,1} – we know that the resulting array will be a 24-row-by-2-column array, the 48 entries of which being the results of adding each of 0 and 1 in turn to the 24 values in Arry1. Hence the above becomes:

we can see that the first highlighted pair of results {0,1}, corresponding to {“1″,”;”}, is due to the fact that “1” is a number and “;” is not.

The second highlighted pair of results {1,1} corresponds to {“>”,”@”} since both “>” and “@” are non-numbers.

The third pair {1,0} is generated by the fact that {“@”,”4″} is an array consisting of a non-number followed by a number.

And finally the fourth pair {0,0}, corresponding to {“4″,”5”}, is by virtue of both “4” and “5” being numerical.

With that in mind, we now need to have a means by which we can differentiate these four pair-types, which are equivalent to:

{0,0}: both values in the pair are numerical
{1,0}: the first value in the pair is non-numerical and the second numerical
{0,1}: the first value in the pair is numerical and the second non-numerical
{1,1}: both values in the pair are non-numerical

Obviously the two in which we are interested are the middle two of the above, since it is they that effectively tells us where our “change points” are in the string, from non-number to number and vice versa.

In order to achieve this differentiation, I chose to first multiply each element in this pair by a different value (the first by 2, the second by 1) and then use some matrix multiplication. to sum the results for each pair.

If we continue the evaluation of the formula, then, after multiplying by the array {2,1}, we have:

which produces, after we resolve the MMULT (I’ll highlight the results corresponding to our examples above):

{2;0;1;2;0;0;0;1;3;2;0;0;0;1;2;0;1;2;0;1;3;3;3;3}

The point being that each of these four values from 0-3 can only have come from one of the pair combinations I outlined above.

For example, the only way to get a value of 3 is by summing a 1 and a 2. But to have had an array consisting of a 1 and a 2 after multiplying by {2,1} means that our original pair must have been {1,1}, i.e. both values in the pair are non-numerical. Hence a 3 means precisely that.

A value of 2 can only have come from multiplying {2,1} by (1,0}, i.e. from a pair consisting of non-numerical followed by numerical.

And so on. The end result being that we now have an array of values – Arry2 – from which we can conclude with certainty at which points within our string we change from number to non-number and then back again to number. And so we are most of the way there.

We need to look at the two parameters being passed to MID here. I’ll start with that which defines the position within the string, i.e. the start_num parameter, which is given by:

1+SMALL(IF(Arry2=2,Arry1),COLUMNS($A:A))

Here you can see that we are creating an array based on values from Arry2 being equal to 2, which, as we have just demonstrated, corresponds to a pair of characters consisting of a non-numerical followed by a numerical.

and things should be slightly clearer now, since, if we look at the non-FALSE entries in this array, we have 1, 4, 10, 15 and 18, which readers may wish to manually verify are indeed the starting positions in our string for each of the desired number extractions.

Hence, for the formula in B1, in which the COLUMNS construction will be COLUMNS($A:A), which is simply 1, this will resolve to:

The num_chars parameter for MID, which gives the length of the string, is determined by a slightly more complex set-up, though in essence follows a similar logic to that outlined above for the start_num parameter. This is given by:

We see here that we are this time interested in values from Arry2 which are equal to either 1 or 2 – since, as we know, these two values correspond respectively to a numerical/non-numerical and a non-numerical/numerical pairing within our string.

Effectively, in order to determine the length of each of our extractions, we need to calculate the difference in positions between successive occurrences of each non-numerical/numerical and numerical/non-numerical pair, since clearly those very points represent the start and end of each grouping of consecutive numbers.

Again, readers may wish to verify that the non-FALSE entries in the above array correspond to the positions within our string at each point where there is a change from non-number to number and vice versa.

All that is left to do is to derive a construction for the k parameter of our SMALL function such that, when copied across to the right, we are able to extract pairs of values from this array, the first pair being the 1st and 2nd values (1 and 3), the 2nd pair being the 3rd and 4th values (4 and 8), and so on. We can then subtract the 1st in each pair from the 2nd to give our desired lengths.

Hence, for the formula in B1, for which the COLUMNS clause will be COLUMNS($A:A), which is simply 1, our formula will become:

and resolving the SMALL function, which note is being passed an array of, not just one, but two values for its k parameter:

=SUM({1,3}*{-1,1})

which of course is:

=SUM({-1,3})

i.e. 2, as required.

Finally, then, putting all this back into the original formula in B1, we have:

=IFERROR(0+MID("α"&$A1,2,2),"")

i.e.:

=IFERROR(0+MID("α81;8.75>@5279@4.=45>A?A;",2,2),"")

and readers can easily verify that this resolves to:

=IFERROR(0+"81","")

i.e. 81, as desired.

A useful set-up, then, which can be applied to strings containing all types of character. In practice, the task of extracting such groups of consecutive numbers from a mixed string is not that uncommon, so this technique is not of a purely theoretical nature.

Another instalment to follow shortly, in which I will look at extracting numbers from a string based on certain conditions.

Related

Post navigation

29 comments

I like the fact that 8.75 is considered as a number, and that 4. is translated into 4, I was however looking for a way to count the numbers within the string rather that to use the IFERROR, thus just to satisfy a personal curiosity. The concern I encounter is that how to exclude 4.4.4 and accept 4.4… And if we were to use this with numbers such as 1,000.00 how to make it understand that this is a whole number and not just a 1 and a 0…

An interesting task, and one which I haven’t yet put much thought into. As such, I think you’ve convinced me to make it the topic of a post here!

Can we first just clarify what should be counted? Is it just the two features you give, i.e. decimal points and thousand separators which need to be somehow resolved correctly? Or are there any other features which could be ambiguous?

For example, would you want to see a count related to a date string, e.g. 12/3? If so, what count? 2? 3? Or do we need to also exclude anything that could possibly be interpreted as a date from the count? So strings such as “September 2014” would not register a count for the “2014” part? This could be extremely problematic, more so since any comprehensive solution would have to take into account differences in regional language settings, an almost impossible task!

Also, re the thousand separators, is this just for the comma version? Or do we need to also consider cases where the thousand separator is “.”, and conversely the decimal “point” is “,”?

If you can let me have a definitive list of what should be included in the count, what shouldn’t be, and also what counts should be attributed to each number (are we counting individual numbers or “groups” of numbers? Is “1234” a count of 1 or 4?) then I can start investigating potential solutions.

I think that ambiguous strings composed of integers with either . , – / or : are the main concern.

1,000.00 and 1.000,00 are the same value depending on which side of the pacific you are located (to overly simplify the regional scheme), but maybe we could consider only values without separator for the thousands thus simplifying the issue and limiting furthermore to a full stop punctuation ‘.’ for the decimals.

The way I see it ‘accounting’ formatting, since it is within a complex string, will be the thing throwing off the model, as 1,234.050 could be considered as a number (1234.05) and not as a series of 3 numbers such as 1 then 234 then 5.

I would suggest the following as acceptable:
1
1.
1.1
.1
1000.0

Individual numbers in group are counted as one as long as logically built. 1234.050 is logical while 1,234.050 should be ignored, 1,23,4.050 isn’t and neither is 1234.05.0

Now for the dates, we have a slight concern. 12/4 could be 12 divided by 4 it could be the 4th of december of the present year or 41977 or 12 and 4… the same applies to 12-4, or 12:4 as it could mean 0.502777777777778. Maybe here we could apply a simple rule, either values entered as date are dates or not, but then and again what about 12/13 and 13/12? One is a date the other is two numbers. And 12:3.75 could the a time…

I am still unsure of how to handle time and dates or partial dates.

I have some protein codes in excel and those are complex string of date-time-origin-solvent-extraction_model but since the number of integers is fixed, it does not require a PhD in order to decipher them. Your sample 81;8.75>@5279@4.=45>A?A; is definitely more convoluted.
Here 12:4.75>@5,279@4.=11-4>A?A; could be a concern as the values may lead to ‘double entendre’.

I fear I might not have delimited parameters but rather just enumerated concerns.

Indeed! The issue is perhaps not so much formulating a solution, but defining our criteria to begin with!

Do you want me to have a go at forming a generalisable solution which counts all numbers in a string based on the conditions you outlined above, though ignoring any concerns about potential date strings? So that 14/9 would simply be a count of 2? Likewise for 12-4 and 12:4?

I believe it will be an interesting process, further more you’ll be able to get rid of IFERROR. I definitely am looking forward your ideas, and will do my best to bring it to the test. Have a nice weekend XOR LX.

Actually, before I begin this new investigation, it’s just struck me that I may already have the gist of the solution in this post. I mean, is there anything to prevent us from simply adapting the solution given here to generate a count, rather than a list of each number present?

Looking back at these comments, I don’t see any significant additions/amendments to the criteria I used: dates are excluded, decimals remain the same, etc.

“is there anything to prevent us from simply adapting the solution given here to generate a count, rather than a list of each number present?” No you are correct, the count as of this moment wil follow the same rules implied in the original post.

So ‘yes’… But I was thinking that we could try in two parts, first to indeed simplify the query search by excluding ambiguous values (complete dates, partial dates, time that may be entered with either of /.,:)

Then eventually to accept the possibility of more complex strings with a combination of dates, time value, fraction decimal and to be able to recognize them as such, to count them as such and extract them individually without the need of using IFERROR… Or am I pushing too far the issue?

“But I was thinking that we could try in two parts, first to indeed simplify the query search by excluding ambiguous values (complete dates, partial dates, time that may be entered with either of /.,:)“

Not sure what you mean – sorry. As it stands, the query is far simpler if we don’t first “exclude” such strings. I mean, to write something into the formula which first detects all the myriad combinations which could possibly be interpreted as a date just for the purpose of then excluding them from the count would be an enormously complicated task, if at all possible. As such, it would not at all be simplified!

“…and extract them individually without the need of using IFERROR.“

Again – not sure I understand. The IFERROR in the formula is simply so that blanks will be returned in cells where necessary as the formula is copied to the right. It does not play a part in the actual detection of numbers within the string.

hehehe we misunderstood each other. IF the formula could indeed identify and separate all possible variation of strings (date time…) then it would be great.

e.g 1JA41409290829SCH2CL2CHCL3R80-20%2542-0.82 here is composed of only six important data, a date time record, a ratio with the corresponding solvents a starting value and an ending value, the name of the solvent (an alphanumeric suite) the batch number.

It may also be entered as follows:
12MC1-141001-800SC2H6O3461-1.098 (different lab) But this one is not as much an issue as most values are in fact separated.

As for iferror, a count would render the use of iferror unecessary, right? Although this is purely to satisfy a thirst for knowledge…

Ah, but as I said, this is a near-impossible task, at least in general. Perhaps in your particular case a solution could be tailored to work, one in which the date strings are limited to a small subset of possibilities, i.e. of the form dmmy (is this correct?) or ddmmy.

But to design such a solution which takes into account all possible alphanumeric combinations which could possibly be interpreted as a date string…

Even attempting to delineate such a list of possibilities becomes, after a while, a mammoth task. You could first start with the obvious, such as dd/mm/yyyy and dd-mm-yyyy, but then you may want to incorporate US variations, i.e. mm/dd/yyyy and mm-dd-yyyy. After that you may consider simply ddmmyyyy, or ddmmyy, or yymmdd.

And then of course you start to consider cases in which the month is not a numeral, but the actual wording, e.g. January, or Jan. And then what if we want our solution to work irrespective of the language involved? Do we design a solution which can work for all month names in all possible language versions of Excel?

And then what about your particular case? Who would have thought that 12MC1 could mean 12th March 2001? If that’s a possibility, then we have to accept that so too are 12MRH01, 12MCH2001, 12MH1, etc., etc., etc., an almost uncountable number of permutations in itself.

I’m afraid that you may be confusing the particular with the general here. It seems that your work has led you to a specific problem regarding string extractions, where each of the extractions is of a form particular to your case. As such, and like I said, it may be possible to develop a solution which works for those strings. However, to extend that to a generalisable solution which works for all strings containing dates, ratios, fractions of all possible descriptions and varieties is, as you can see, a Herculean, if not impossible task.

As such, and if you are indeed seeking a solution to your particular problem, it may well be worth you posting on one of the Excel forums around, in which case if you let me have the link I’d be happy to have a go at solving it there.

However, as you can gather from your visits to this site, here we are more interested in developing those generalised solutions which you hinted at, and not so much in particular, esoteric cases. And since I feel that, in this case, such a generalised solution is simply not possible, I will have to pass on any attempts to develop one, unfortunately.

Yes exactly, differentiating all possible combination would prove quasi impossible as users may opt for illogical or let’s say non-conventional way of entered/ recording data. I not looking for a way to extract data provided in previous post, that was just to illustrate the difficulty to COUNT numerical strings within complex alphanumerical strings.

That is what I was interested in, how to count / identify them, and as you underlined several times, this is gargantuan work…

For the example provided we have no difficulty identifying what it refers too. Just to answer your query about the date value, the date is entered in descending order, hence ‘yearmonthdayhoursminutes’ so 1409290829 would be the 29th of September 2014 at 8:29 AM…

Of course. A UDF will always be the more efficient and flexible choice here, and I agree with you, in general, when you state:

“Most of these more complex cases can still be solved using formulas, but the required formulas are often long, complicated, hard to understand and do not adapt well to changes in the data.”

though I might argue that the solution I give here is neither overly long (creating a few Defined Names helps, and is not overly time-consuming), nor susceptible to error if the data being interrogated changes. True, I couldn’t so easily dispute the claims that my formula is “hard to understand” and “complicated”, though I hope that my attempts at a deconstruction at least go some way to remedying those drawbacks!

Besides, I’m well aware that a lot of the material I post here could – in comparison to the equivalent VBA solution – be described as being “complicated” (or perhaps “convoluted” is a more appropriate choice). In my opinion, however, that is not to say that the formula-based solutions here are of a purely theoretical interest, too complex to have any genuine practical applications. On the contrary, although in the vast majority of cases there may well exist a more efficient, more elegant and more concise solution using VBA, there is nevertheless much of practical value which can be achieved using worksheet formulas alone (more than most realise, in my opinion), and, once you learn a few techniques and, crucially, grasp the means by which Excel operates with regards to arrays (both of which I try to make part of my teachings here), some of the things which once were complicated suddenly become quite straightforward.

At least that’s what I hope!

Many thanks for your contribution. And I might add that I’ve found much of interest over at fastexcel, so many thanks also for sharing your knowledge.

So in the end I decided for simplicity’s sake to substitute out blanks, commas, and dollar signs from the raw string before passing them to my function. I could do it in the function, but that’s a job for another day.

Also note that for some reason I used OFFSET inside of ROW to generate my arrays. I don’t know why I did that…it’s a poor choice in that if someone inserts rows OR columns at the top or left of the sheet, this will break. I’ll have to replace these with INDIRECTS or preferably an INDEX version in due course.

Before I attempt to dissect that beast ( 🙂 ), my first question would be: how did you determine the limits of what you wish to be considered for return here?

Specifically, tackling dates seems to be setting yourself an enormous task: you say that your formula will extract “all numbers that the local version of Excel deems as a number”, but by that did you account for strings such as: