The iOS bug chase

This article tells a story of chasing an iOS bug – a bug hidden so deep that it
required many different skills and debugging on different levels to identify it.
I think every native mobile app developer (not only an iOS developer) will find
this text interesting. Non-mobile developers may find it an intriguing read as
well.

Bugs

A mobile device is a fully-fledged computer these days and as such it always
does what it has been programmed to. If something fails, this is because a
computer did things exactly as it was commanded to. We – as humans – make
errors, which is part of our nature. Without errors, we would not be able to
learn.

I don’t want to insist on it, Dave, but I am incapable of making an error.

HAL 9000

In a novel by Arthur C. Clarke, a potential AE-35 unit failure could be detected
before it even occurred. In the real world, we diagnose bug symptoms using
various tools:

crash loggers — crashes and non-fatal errors,

tracking tools — detecting user flow anomalies,

remote configuration — possibility of disabling problematic features.

Often, these tools are sufficient to analyze an issue. Occasionally, though,
they can hardly tell whether anything is wrong at all, and for a very complex
problem even the most sophisticated tools are of no help. One should be aware
that even a single undetected non-fatal error can affect the experience of
thousands of users. That is why here, at Allegro, we have to be very serious
about all potential issues.

This article describes an iOS MapKit bug analysis and its resolution, starting
with source code, through network stack, down to the assembly, and ending up in
San Francisco.

The Bug

One of our quality assurance engineers reported a strange bug. No one but him
was able to reproduce it. The problem could be seen in the process of selecting
a parcel machine for shopping delivery. Apparently the problem occurred on a
single device only.

The MKMapView had trouble displaying a map. The Internet connection was working
fine on the device. Restarting or reinstalling the app did not fix the problem.

After we had played with the bug for some time, the issue suddenly
disappeared. The situation was terrifying. Our iOS app has tens of thousands of
daily active users. Even if the problem affected only a small percentage of
them, it could prevent those users from selecting a parcel machine. This would
block shopping for a large number of buyers, the last thing we wanted. Every
attempt to break MKMapView again failed (e.g. simulating a poor network or
stress testing MKMapView). Then the bug re-appeared on two other devices, and
we finally had a chance to catch it.

Strangely enough, the iOS Maps app showed the same symptoms, and so did all the
installed third-party apps. This seemed to be a bug in iOS, but we could not
ignore it anyway.

Each time you start ignoring a bug, you end up looking just like this owl.

Now seriously… Although we could not fix the bug directly in iOS, we could at
least try to bypass it, so that it would no longer occur in our app. The
analysis began.

The Code

I tried the most basic level of debugging — logging an error. MapKit provides a
handy delegate method:

-[MKMapViewDelegate mapViewDidFailLoadingMap:withError:]

so I implemented it with some NSLog logging inside.

I also added a breakpoint there, so I could debug and inspect the error in
depth using Xcode Variables View. The breakpoint was reached almost immediately.

The delegate method was invoked with an error from the GEOErrorDomain domain.
Its userInfo dictionary had a single entry: an array of underlying errors
under the SimpleTileRequesterUnderlyingErrors key. Each underlying error had
two values in its userInfo dictionary: the first one under the HTTPStatus
key and the second one under the NSErrorFailingURLStringKey key. It was a
clear sign that a network error was the direct cause of the failure to display
map tiles.

The Man In The Middle

When it comes to the analysis of network communication, one of the simplest,
yet most powerful tools you will ever need is mitmproxy. Never
heard of it? You should really check it out, as it can save you hours of
debugging in the future. In this case, I only used it to intercept network
requests, but mitmproxy has many more features (e.g. scripting).

I started to intercept network traffic and displayed the map to trigger its
network activity. Mitmproxy showed a lot of map tile requests.

Indeed, a lot of the requests finished with the 410 response code. But
wait… what? 410?

410 Gone
Indicates that the resource requested is no longer available and will not be
available again. This should be used when a resource has been intentionally
removed and the resource should be purged.

I compared two tile requests – each corresponding to the same tile with the
same x, y and z coordinates, but the former finished with the 410 code
and the latter with the 200 code. Filtering the mitmproxy flow list with the
style=1.*&z=14&x=8962&y=5377 limit filter gave rewarding results.

Only one map tile request parameter looked suspicious: the v parameter. I was
99% certain that v stood for some kind of version number. An analysis of a
longer request log confirmed my suspicion: the parameter went up every few
minutes while I was browsing Apple Maps.
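
The comparison of the two tile requests can also be done mechanically. The
sketch below parses two request URLs and reports only the query parameters
whose values differ. The host, path, size parameter and v values are
placeholders made up for illustration; only the style, x, y, z and v parameter
names come from the intercepted traffic described above.

```python
from urllib.parse import urlsplit, parse_qs


def differing_params(url_a: str, url_b: str) -> dict:
    """Return {name: (values_in_a, values_in_b)} for query parameters that differ."""
    query_a = parse_qs(urlsplit(url_a).query, keep_blank_values=True)
    query_b = parse_qs(urlsplit(url_b).query, keep_blank_values=True)
    diff = {}
    # Look at every parameter that appears in either request.
    for name in sorted(query_a.keys() | query_b.keys()):
        values_a = query_a.get(name)
        values_b = query_b.get(name)
        if values_a != values_b:
            diff[name] = (values_a, values_b)
    return diff


# Hypothetical URLs: the host, path, size and v values are placeholders.
failed = "https://tiles.example.com/tile?style=1&size=2&z=14&x=8962&y=5377&v=100"
succeeded = "https://tiles.example.com/tile?style=1&size=2&z=14&x=8962&y=5377&v=101"

print(differing_params(failed, succeeded))  # {'v': (['100'], ['101'])}
```

With real captures, the two URLs would be copied from the 410 flow and the 200
flow selected by the mitmproxy filter.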

That was a great design! Imagine a world without the v parameter, where a user
browses a map region while that region is being edited. That would result in
serious glitches, and map glitches are the last thing a car driver wants.

The question was: what caused v to increment? A couple of requests happened
in between the 410 and 200 responses, just while the v was being changed.

One request looked particularly suspicious: the request for
/geo_manifest/dynamic/config. It was also the only request that retrieved a
substantial amount of data. Unfortunately, inspecting the response revealed
binary data that contained neither 11040529, the new v value, nor its
hexadecimal form 0xA87711 (in any endianness). Even though the new v value was
not clearly visible in the geo_manifest data, it could still be present there.
Anyway, it would be hard, if not impossible, to understand binary data of an
unknown format. But I still had a few more tricks to use.
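
Searching a binary response for a number "in any endianness" can be sketched
as follows. This is only an illustration of the idea, not the tooling used
during the investigation: the set of encodings tried (ASCII decimal plus
fixed-width big- and little-endian integers) is an assumption, and the sample
blob is made up.

```python
def candidate_encodings(value: int) -> dict:
    """Map a label to the byte pattern `value` would have in that encoding."""
    patterns = {"ascii decimal": str(value).encode("ascii")}
    # Smallest number of bytes that can hold the value (3 for 0xA87711).
    min_width = max(1, (value.bit_length() + 7) // 8)
    for width in (min_width, 4, 8):
        if width < min_width:
            continue  # the value does not fit into this width
        for order in ("big", "little"):
            patterns[f"{width}-byte {order}-endian"] = value.to_bytes(width, order)
    return patterns


def find_value(blob: bytes, value: int) -> dict:
    """Return {label: offset} for every encoding of `value` found in `blob`."""
    hits = {}
    for label, pattern in candidate_encodings(value).items():
        offset = blob.find(pattern)
        if offset != -1:
            hits[label] = offset
    return hits


# Made-up blob that embeds 11040529 (0xA87711) as a 4-byte little-endian integer.
blob = b"\x00\x01" + (11040529).to_bytes(4, "little") + b"\xff"
print(find_value(blob, 11040529))
```

An empty result does not prove the value is absent: a variable-length integer
encoding or a compressed payload would defeat this kind of search.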

The Machine Code

MapKit.framework was the one that should understand geo_manifest, so the
obvious option was to look for this understanding in the framework code. The
MapKit source code is obviously not publicly available, but there were two
things that helped overcome that obstacle.

Secondly, Hopper makes decompilation nothing but pure pleasure. Hopper is such
a powerful, yet simple and intuitive tool that any person, even one without
knowledge of assembly or Mach-O, could easily analyze any executable. The basic
Hopper usage scenario is as simple as this:

Use “Read Executable to Disassemble” and wait until Hopper processes the
binary.

Use the symbols panel to find the method of your interest.

Use “Show Pseudo Code of Procedure” to see selected method logic.

What method to look for in order to find a geo_manifest trace? The obvious
choice was to filter symbols using the geomanifest filter at first, and that
was it!

GEOResourceManifestManager caught my eye. Unfortunately, no method for that
class was visible, only an external symbol reference
_OBJC_CLASS_$_GEOResourceManifestConfiguration. This meant that MapKit used
another framework underneath. I listed the shared libraries that the MapKit
dylib links against using otool -L.

Among the many dylibs used by MapKit, GeoServices.framework looked like the
obvious owner of GEOResourceManifestConfiguration. GeoServices.framework is a
private system framework, so no wonder I had never heard of it before. So I
tried to inspect the GeoServices dynamic library using Hopper. I used
GEOResourceManifestManager as a symbols filter and Hopper showed a bunch of
GEOResourceManifestManager methods. One of them was the method:

-[GEOResourceManifestManager forceUpdate]

Once again, I was very lucky.

The Hack

By pure coincidence I got another device affected by the maps problem. Having
the knowledge of GeoServices.framework internals, I could run the debugger
and try to perform some magic.

Then, a miracle happened — mitmproxy showed a request for
/geo_manifest/dynamic/config followed by a nice bunch of successful tile
requests.

I was so close, yet so far from the fix. This was a one-time device state fix,
a symptomatic treatment, not a fix to the root cause of the problem. Still, it
was worth doing, if only to check that the whole investigation was not totally
wrong.

Later, using another affected device and just to play around, I ran the test
app with the following code:

Note that using a private API is a violation of the iOS Developer Program
Agreement. Any app found using a private API is rejected by Apple. Even if such
an app passes the Apple review, for example by hiding selectors with simple
ROT13, it can be unstable.
Checking with respondsToSelector: first? Still unsafe, because the behavior of
any private method could change at any time, or the method could trap after
detecting an illegal flow.
Do not ever try to release such code!

The Radar

The investigation showed one thing: the bug was clearly in iOS, it affected
the whole system, and it could not be properly fixed in the app. The only thing
that could and should be done was to file an Apple bug report (aka radar). An
external user (a non-Apple employee) can only see the bug reports they have
filed themselves, so it may also be a good idea to file an openradar so that
everyone else can find the report and see its status. This way, “the 410
MapKit” issue described above was reported as rdar://25267344. The issue was
also described on the Apple Developer Forum.

The Engineer

WWDC is full of sessions about the latest iOS topics, but the real value lies
in the labs. I visited a MapKit lab to ask what was going on with
rdar://25267344. I met an Apple engineer and told him about the “410 MapKit
issue”. He opened the Radar on his iPhone and searched for the bug report. As
it turned out, there were a lot of comments under the bug report, visible only
to Apple engineers. He told me that my bug report had helped to capture a
4-year-old bug regarding incorrect 410 HTTP status handling and that the bug
was fixed in iOS 10 beta 1.

Shortly after that conversation, I received an update regarding the bug report
— it asked me to test the issue in iOS 10 beta and to let Apple know if it
still occurred. My first thought was: “It will really be hard to test this
non-deterministic bug…”, but was it? The engineer told me the bug was about
incorrect 410 HTTP status handling, so I thought I could reproduce the 410
status codes using mitmproxy. I just had to write a simple mitmproxy script
that would change the response of every tile request by replacing the status
code with 410.
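
A script of that kind could look roughly like the sketch below. It is not the
script used at the time: it targets the module-level response hook of current
mitmproxy releases, and the rule for recognising a tile request (a query string
carrying x, y, z and v) is an assumption based on the parameters seen in the
intercepted traffic.

```python
"""Force every map tile response to 410 Gone (sketch of a mitmproxy script)."""
from urllib.parse import urlsplit, parse_qs

# Assumption: a tile request is any request whose query string carries the
# tile coordinates (x, y, z) together with the version parameter v.
TILE_PARAMS = {"x", "y", "z", "v"}


def is_tile_request(url: str) -> bool:
    """Return True when the URL looks like a map tile request."""
    query = parse_qs(urlsplit(url).query)
    return TILE_PARAMS.issubset(query)


def response(flow) -> None:
    """Hook that mitmproxy calls once a server response has been received."""
    if is_tile_request(flow.request.pretty_url):
        # Rewrite the status code before the response reaches the device.
        flow.response.status_code = 410
```

Saved under a name such as force_410.py (a hypothetical file name), the script
can be loaded with mitmproxy -s force_410.py.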

Then, by adding the script to mitmproxy, I could test the map behavior in iOS
10 beta 2 (latest beta at that time).

Mitmproxy changed the status code of each tile request to 410. Once the first
tile request finished with the 410 status code, the geod daemon immediately
updated its manifest by requesting a fresh /geo_manifest/dynamic/config. It
worked just as expected! The bug was resolved!
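
The observed behavior, where a 410 response triggers a manifest refresh, can be
modelled with a toy client. This is only an illustration inferred from the
traffic described above, not how geod is actually implemented: the class names,
the callables and the single retry after the refresh are all assumptions.

```python
class TileClient:
    """Toy model of a client that treats 410 as 'my manifest is stale'."""

    def __init__(self, fetch_manifest_version, fetch_tile):
        self._fetch_manifest_version = fetch_manifest_version  # () -> int
        self._fetch_tile = fetch_tile  # (coords, version) -> (status, body)
        self.version = fetch_manifest_version()

    def load_tile(self, coords):
        status, body = self._fetch_tile(coords, self.version)
        if status == 410:
            # Tiles for the cached version are gone: refresh the manifest
            # and retry once (the retry is an assumption of this sketch).
            self.version = self._fetch_manifest_version()
            status, body = self._fetch_tile(coords, self.version)
        return status, body


class FakeServer:
    """Hypothetical server that only serves tiles for its current version."""

    def __init__(self):
        self.current = 1
        self.manifest_requests = 0

    def manifest_version(self):
        self.manifest_requests += 1
        return self.current

    def tile(self, coords, version):
        if version != self.current:
            return 410, b""
        return 200, b"tile"


server = FakeServer()
client = TileClient(server.manifest_version, server.tile)
server.current = 2  # the map data is republished under a new version
print(client.load_tile((14, 8962, 5377)))  # (200, b'tile')
```

Retrying only once keeps the client from looping forever when every tile
request returns 410, as in the forced-410 test described above.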

What could go wrong?

A bug chase is often a long and hard process. In the one I have described, luck
was a big contributor to success, because – as usual – many things could have
gone wrong:

the issue could just not have occurred on our test devices at all,

maps API could have been secured with SSL-pinning,

Apple could have ignored the report for such an ephemeral bug,

the investigation could have gone in a wrong direction,

the investigation could have required jumping through a decompiled framework call hierarchy — it is often very easy to get lost there.

Summary

Download Hopper, play with the trial version and add Hopper Personal License to your wish list.

File Apple bug reports — the Bug Reporter is not /dev/null, the whole Apple staff are just waiting for your reports.

“Stay Hungry. Stay Foolish.” + Learn internals… internals of everything — this will make the Force strong with you.

This was a happy-ending story — the bug has been resolved the right way, Apple
Maps will once again work seamlessly and the Allegro iOS app will provide the
best user experience. Nothing will stop our users from shopping. Or so it
seems… Crashes happen and we examine each crash report very carefully.
Maintaining the iOS app of the largest e-commerce business in Poland is a
challenge, but developers at Allegro do their best to protect users from any
obstacle to shopping.