Binary search

Binary search
You are encouraged to solve this task according to the task description, using any language you may know.

A binary search divides a range of values into halves, and continues to narrow down the field of search until the unknown value is found. It is the classic example of a "divide and conquer" algorithm.

As an analogy, consider the children's game "guess a number." The scorer has a secret number, and will only tell the player if their guessed number is higher than, lower than, or equal to the secret number. The player then uses this information to guess a new number.

As the player, an optimal strategy for the general case is to start by choosing the range's midpoint as the guess, and then asking whether the guess was higher, lower, or equal to the secret number. If the guess was too high, one would select the point exactly between the range midpoint and the beginning of the range. If the original guess was too low, one would ask about the point exactly between the range midpoint and the end of the range. This process repeats until one has reached the secret number.

The Task

Given the starting point of a range, the ending point of a range, and the "secret value", implement a binary search through a sorted integer array for a certain number. Implementations can be recursive or iterative (both if you can). Print out whether or not the number was in the array afterwards. If it was, print the index also.

There are several binary search algorithms commonly seen. They differ by how they treat multiple values equal to the given value, and whether they indicate whether the element was found or not. For completeness we will present pseudocode for all of them.

All of the following code examples use an "inclusive" upper bound (i.e. high = N-1 initially). Any of the examples can be converted into an equivalent example using "exclusive" upper bound (i.e. high = N initially) by making the following simple changes (which simply increase high by 1):

The algorithms are as follows (from Wikipedia). The algorithms return the index of some element that equals the given value (if there are multiple such elements, it returns some arbitrary one). It is also possible, when the element is not found, to return the "insertion point" for it (the index that the value would have if it were inserted into the array).

The following algorithms return the leftmost place where the given element can be correctly inserted (and still maintain the sorted order). This is the lower (inclusive) bound of the range of elements that are equal to the given value (if any). Equivalently, this is the lowest index where the element is greater than or equal to the given value (since if it were any lower, it would violate the ordering), or 1 past the last index if such an element does not exist. This algorithm does not determine if the element is actually found. This algorithm only requires one comparison per level.

The following algorithms return the rightmost place where the given element can be correctly inserted (and still maintain the sorted order). This is the upper (exclusive) bound of the range of elements that are equal to the given value (if any). Equivalently, this is the lowest index where the element is greater than the given value, or 1 past the last index if such an element does not exist. This algorithm does not determine if the element is actually found. This algorithm only requires one comparison per level. Note that these algorithms are almost exactly the same as the leftmost-insertion-point algorithms, except for how the inequality treats equal values.

The line in the pseudocode above to calculate the mean of two integers:

mid = (low + high) / 2

could produce the wrong result in some programming languages when used with a bounded integer type, if the addition causes an overflow. (This can occur if the array size is greater than half the maximum integer value.) If signed integers are used, and low + high overflows, it becomes a negative number, and dividing by 2 will still result in a negative number. Indexing an array with a negative number could produce an out-of-bounds exception, or other undefined behavior. If unsigned integers are used, an overflow will result in losing the largest bit, which will produce the wrong result.

One way to fix it is to manually add half the range to the low number:

mid = low + (high - low) / 2

Even though this is mathematically equivalent to the above, it is not susceptible to overflow.

Another way for signed integers, possibly faster, is the following:

mid = (low + high) >>> 1

where >>> is the logical right shift operator. The reason why this works is that, for signed integers, even though it overflows, when viewed as an unsigned number, the value is still the correct sum. To divide an unsigned number by 2, simply do a logical right shift.

Both solutions are generic. The element can be of any comparable type (such that the operation < is visible in the instantiation scope of the function Search). Note that the completion condition is different from one given in the pseudocode example above. The example assumes that the array index type does not overflow when mid is incremented or decremented beyond the corresponding array bound. This is a wrong assumption for Ada, where array bounds can start or end at the very first or last value of the index type. To deal with this, the exit condition is rather directly expressed as crossing the corresponding array bound by the coming interval middle.

C++'s Standard Template Library has four functions for binary search, depending on what information you want to get. They all need

#include <algorithm>

The lower_bound() function returns an iterator to the first position where a value could be inserted without violating the order; i.e. the first element equal to the element you want, or the place where it would be inserted.

int*ptr = std::lower_bound(array, array+len, what);// a custom comparator can be given as fourth arg

The upper_bound() function returns an iterator to the last position where a value could be inserted without violating the order; i.e. one past the last element equal to the element you want, or the place where it would be inserted.

int*ptr = std::upper_bound(array, array+len, what);// a custom comparator can be given as fourth arg

The equal_range() function returns a pair of the results of lower_bound() and upper_bound().

std::pair<int*, int*> bounds = std::equal_range(array, array+len, what);// a custom comparator can be given as fourth arg

Note that the difference between the bounds is the number of elements equal to the element you want.

The binary_search() function returns true or false for whether an element equal to the one you want exists in the array. It does not give you any information as to where it is.

bool found = std::binary_search(array, array+len, what);// a custom comparator can be given as fourth arg

do (n = 12)-> odds = (it for it in[1..n]by2) result = (it for it in \(binarySearch odds, it for it in[0..n]) \whennot isNaN it) console.assert"#{result}"is"#{[0...odds.length]}" console.log"#{odds} are odd natural numbers" console.log"#{it} is ordinal of #{odds[it]}"for it in result

Output:

1,3,5,7,9,11 are odd natural numbers"
0 is ordinal of 1
1 is ordinal of 3
2 is ordinal of 5
3 is ordinal of 7
4 is ordinal of 9
5 is ordinal of 11

The following solution is based on the one described in: C. A. Furia, B. Meyer, and S. Velder. Loop Invariants: Analysis, Classification, and Examples. ACM Computing Surveys, 46(3), Article 34, January 2014. (Also available at http://arxiv.org/abs/1211.4470). It includes detailed loop invariants and pre- and postconditions, which make the running time linear (instead of logarithmic) when full contract checking is enabled.

is_sorted (a:ARRAY[INTEGER]; low, high:INTEGER):BOOLEAN-- Is `a[low..high]' sorted in nondecreasing order?require a.lower<= low high <= a.upperdoResult:= across low |..|(high -1)as i all a [i.item]<= a [i.item+1]endend

has (a:ARRAY[INTEGER]; key:INTEGER; low, high:INTEGER):BOOLEAN-- Is there an element `key' in `a[low..high]'?require a.lower<= low high <= a.upperdoResult:= across low |..| high as i some a [i.item]= key endend

Factor already includes a binary search in its standard library. The following code offers an interface compatible with the requirement of this task, and returns either the index of the element if it has been found or f otherwise.

This version is designed for maintaining a sorted array. If the item is not found, then then location returned is the proper insertion point for the item. This could be used in an optimized Insertion sort, for example.

This has the array indexed from 1 to N, and the "not found" return code is zero or negative. Changing the search to be for A(first:last) is trivial, but the "not-found" return protocol would require adjustment, as when starting the array indexing at zero. Aside from the "not found" report, The variables used in the search must be able to hold the values first - 1 and last + 1 so for example with sixteen-bit two's complement integers the maximum value for last is 32766, not 32767.

Depending on the version of Fortran the compiler supports, the specification of the array parameter may vary, as A(1) or A(*) or A(:), and in the latter case, parameter N could be omitted because the size of an array parameter may be ascertained via the SIZE function. For the more advanced fortrans, declaring the parameters to be INTENT(IN) may help, as despite passing arrays "by reference" being the norm, the newer compilers may generate copy-in, copy-out code, vitiating the whole point of using a fast binary search instead of a slow linear search. In this case, INTENT(IN) will at least prevent the copy-back. In such a situation however, preparing in-line code may be the better move: fortunately, there is not a lot of code involved. There is no point in using an explicitly recursive version even though the same actions may result during execution because of the overhead of parameter passing and procedure entry/exit.

Later compilers offer features allowing the development of "generic" functions so that the same function name may be used yet the actual routine invoked will be selected according to how the parameters are integers or floating-point, and of different precisions. There would still need to be a version of the function for each type combination, each with its own name. Unfortunately, there is no three-way comparison test for character data.

The use of "exclusive" bounds simplifies the adjustment of the bounds: the appropriate bound simply receives the value of P, there is no + 1 or - 1 adjustment at every step; similarly, the determination of an empty span is easy, and avoiding the risk of integer overflow via (L + R)/2 is achieved at the same time. The "inclusive" bounds version by contrast requires two manipulations of L and R at every step - once to see if the span is empty, and a second time to locate the index to test.

Imagine a test array containing the even numbers: 2,4,6,8. A count could be kept of the number of probes required to find each of those four values, and likewise with a search for the odd numbers 1,3,5,7,9 that would probe all the places where a value might be not found. Plot the average number of probes for the two cases, plus the maximum number of probes for any case, and then repeat for another number of elements to search. With only one element in the array to be searched, all values are the same: one probe.

The point of this is that the IF-test is going to initiate some jumps, so why not arrange that one of the bound adjustments needs no subsequent jump to the start of the next iteration - in the first version, both bound adjustments needed such a jump, the GO TO 1 statements. This was done by shifting the code for label 2 up to precede the code for label 1 - and removing its now pointless GO TO 1 (executed each time), but adding an initial GO TO 1, executed once only. This sort of change is routine when manipulating spaghetti code...

It is because the method involves such a small amount of effort per iteration that minor changes offer a significant benefit. A lot depends on the implementation of the three-way test: the hope is that after the comparison, the computer hardware has indicators set for various outcomes, so that the necessary conditional branches can be made through successive inspection of those indicators, rather than repeating the comparison. These branch tests may in turn be made in an order that notes which option (if any) involves "falling through" to the next statement, thus it may be better to swap the order of labels 3 and 4. Further, the compiler may itself choose to re-order the various code pieces. First Fortran (in 1958) had a FREQUENCY statement whereby the programmer could indicate which paths were the more likely - for the binary search, equality is the less likely discovery. An assembler version of this routine attended to all these details.

Some compilers do not produce machine code directly, but instead translate the source code into another language which is then compiled, and a common choice for that is C. This is all very well, but C is one of the many languages that do not have a three-way test option and so cannot represent Fortran's three-way IF statement directly. Before emitting asservations of faith that pseudocode such as

That is, does the compiler make any remark, and does the resulting machine code contain a redundant test? However, despite all the above, the three-way IF statement has been declared deprecated in later versions of Fortran, with no alternative to repeated testing offered.

Incidentally, the exclusive-bounds version leads to a good version of the interpolation search (whereby the probe position is interpolated, not just in the middle of the span), unlike the version based on inclusive-bounds. Further, the unsourced offering in Wikipedia contains a bug - try searching an array of two equal elements.

There are also functions sort.SearchFloat64s(), sort.SearchStrings(), and a very general sort.Search() function that allows you to binary search a range of numbers based on any condition (not necessarily just search for an index of an element in an array).

J already includes a binary search primitive (I.). The following code offers an interface compatible with the requirement of this task, and returns either the index of the element if it has been found or 'Not Found' otherwise:

bs=. i. 'Not Found'"_^:(-.@-:) I.

Examples:

235681011151920 bs 116235681011151920 bs 12Not Found

Direct tacit iterative and recursive versions to compare to other implementations follow:

If the input array is sorted, then binarySearch(value) as defined here will return an index (i.e. offset) of value in the array if the array contains the value, and otherwise (-1 - ix), where ix is the insertion point, if the value cannot be found. binarySearch will always terminate.

@interfaceNSArray(BinarySearch)// Requires all elements of this array to implement a -compare: method which// returns a NSComparisonResult for comparison.// Returns NSNotFound when not found-(NSInteger) binarySearch:(id)key;@end

@interfaceNSArray(BinarySearchRecursive)// Requires all elements of this array to implement a -compare: method which// returns a NSComparisonResult for comparison.// Returns NSNotFound when not found-(NSInteger) binarySearch:(id)key inRange:(NSRange)range;@end

recursive binary search
Key 0 not found
Key 1 found at postion 1
Key 4 not found
Key 7 found at postion 4
Key 11 found at postion 6
Key 12 not found
iterative binary search
Key 0 not found
Key 1 found at postion 1
Key 4 not found
Key 7 found at postion 4
Key 11 found at postion 6
Key 12 not found

SPARK does not allow recursion, so only the iterative solution is provided. This example shows the use of a loop assertion.

All the code for this task validates with SPARK GPL 2010 and compiles and executes with GPS GPL 2010.

The Binary_Searches package is shown first. Search is a procedure, rather than a function, so that it can return a Found flag and a Position for Item, if found. This is better design than having a Position value that means 'item not found'.

There are two versions of the package provided, although the Ada code of the two versions is identical.

The first version has a postcondition that if Found is True the Position value returned is correct. This version also has a number of 'check' annotations. These are inserted to allow the Simplifier to prove all the verification conditions. See the SPARK Proof Process.

procedure Search (Source : in Array_Type; Item : in Item_Type; Found : out Boolean; Position : out Index_Type);--# derives Found,--# Position from--# Source,--# Item;--# post Found -> Source (Position) = Item;-- If Found is False then Position is undefined.

The second version of the package has a stronger postcondition on Search, which also states that if Found is False then there is no value in Source equal to Item. This postcondition cannot be proved without a precondition that Source is ordered. This version needs four user rules (see the SPARK Proof Process) to be provided to the Simplifier so that it can prove all the verification conditions.

subtype Array_Type5 is Binary_Searches.Array_Type(Index_Type5);subtype Array_Type7 is Binary_Searches.Array_Type(Index_Type7);subtype Array_Type9 is Binary_Searches.Array_Type(Index_Type9);-- Needed to pass an array literal to Run_Search.-- SPARK does not allow an unconstrained type mark for that purpose.

This algorithm is tail recursive, which means it is both recursive and iterative (since tail recursion optimizes to a jump). Overflow is not possible because Ints (64 bit) are a lot bigger than the max length of a list.

Sorted values: L(1,3,5,7,9,11)
Not found: 0
found 1 at index 0
Not found: 2
found 3 at index 1
Not found: 4
found 5 at index 2
Not found: 6
found 7 at index 3
Not found: 8
found 9 at index 4
Not found: 10
found 11 at index 5
Not found: 12