Open Source Software in the Geographic and Location Industries: A Case Study

By
Philip Abrahamson

Open source software promises an attractive range of benefits:
it is free, vendor independent, and open to continual evolution and improvement
by a global development community.

For the geographic and location industries, the question is: does open
source software work in practice? To offer some possible answers,
we're going to provide a practical case study of software development for
the geographic and location industries using open source products and operating
systems. We'll describe our experiences and provide some commentary
on where other companies could benefit from the introduction of open source
into their services or products.

Window Pains

So what prompted us to start using, let alone developing, open source
software? In 1995 we needed a web site [1] to publish our Windows
software, an Internet based, three-dimensional map viewer, and made the
simple decision to go with the cheapest web account we could find.
Web accounts based on open source operating systems such as Linux [2] and
BSD [3] were half the price of Windows accounts, so we started on BSD in
complete ignorance, and as we needed to extend our web site's functionality,
were forced to extend our knowledge of the open source ecosystem.

Over time our Windows client development started to run into problems.
Our product was ambitious even by geographic software standards, and its
development schedule put too much strain on a small company. What
seemed like a constantly changing operating system environment under Microsoft
forced constant changes to our software that started to wear on our patience.
On the other hand, our Web site development was easy and convenient.
Once over the initial open source learning curve, we realized that any
software written to the Unix platform would, to all intents and purposes,
last forever. While Windows would crash hourly during our C and assembly
programming, our server accounts on BSD, and later Linux, never crashed.
It wasn't long before we started asking ourselves whether we should develop
on some form of open operating system also, and installed Linux on one
of our desktops. Eventually it came time to formulate a new business
plan, and we gravitated naturally away from our inconveniences with Windows
and toward open source operating systems and software.

Development Goals

Our new plan was to develop a Location Search for the Web. The
software would be open source, the operating system would be open source,
and the tools would be open source. We finally felt secure that whatever
time we invested in it would not go to waste due to some sudden operating
system change. Any competitors we would attract would be forced to
compete on their merits, in the open environment of the Web, and as a bonus
we would no longer have to write "what ifs" about Microsoft moving into
our application space in our business plans.

The business plan was to make money from "Location Advertising" [4],
which we predicted would eventually be more valuable than ordinary advertising,
because of its improved 'targeting', as advertisers call it. We charted
the business weakness of banner advertising, and the growing strength of
"keyword" text advertising, and adapted our Location Advertising plans
accordingly. Open source software was a perfect fit for us because
our business model would be strengthened if we gave the software away for
free, and it would provide a powerful competitive edge in future.

"Mobilemaps" [1], our Location Search, would allow people to find information
on Web pages that were geographically identifiable as local to them.
We needed to "spider" the Web, crawling from one Web site to another, indexing
the information, and then "geo-code", or geographically locate, each Web
site. We needed to provide a map of the user's area, and plot all
the Web sites on the map, as well as list their titles and descriptions
like a regular search engine.

Choosing Open Source Tools

Our operating system development platform would be Linux, which we
already knew to be considerably more stable than Windows. However,
because of the open nature of the software we would use, our product would
run seamlessly on Windows, and other Unix versions also. This cross
platform compatibility would provide us with an edge over "Windows only"
competing products.

For our Web server it was the easy choice of Apache [5], the world's
leading Web server, with a larger share of the market than all other Web servers
combined, including Microsoft's IIS. It has been our experience that
text based configuration files for server products such as Apache (which
we would recommend for Windows users also) are much more convenient than
complicated menu systems, such as those used in IIS. Before making
a series of changes you can keep a copy of your old configuration, so if
you make a mistake you can revert to it in a moment. Setting up a
new server takes very little time: simply upload an existing configuration file.

Our language of choice was Perl [6], and where necessary we were prepared
to optimize using C. While Java has been gaining a lot of attention,
Perl's cross-platform capabilities are as good as or better than Java's, and Perl is
truly open source, which removes any concern about its future. Perl allows
C programmers to get up to speed in a short space of time, offers excellent
productivity for experienced programmers, and has an active development
community which contributes open source libraries to a common Web repository
called CPAN [7]. We selected open source CVS [8] for version control,
because it is popular and supports graphical interfaces as well as the
command line. It has a reputation for stability, but its functionality
is sometimes considered limited compared with some professional version control
systems, such as BitKeeper [9], which the Linux kernel developers have
selected.

We favor an iterative development methodology, because while developing
a radically new product, it is difficult to predict the problems ahead
of time, and better to find them quickly during prototyping. We decided
to build a rapid search engine prototype using the open source MySQL database
[10], with the intention of replacing it with our own code later.
We were so impressed with MySQL's performance and stability, however, that
we've kept it as an integral part of our engine since. It has helped
minimize the quantity of "low-level" C code in our product, and enabled
the majority to be "higher-level" Perl, which is faster to develop with.

The latest release of MySQL, version 4.1, includes geographic database
capabilities, which we imagine will see heavy use in a wide range of geographic
and location based services and products in future (MySQL is also available
under a closed source license), but these capabilities were not available
during our development.
An older open source database, called PostgreSQL [11], has had geographic
capabilities for some time, but we were unfamiliar with PostgreSQL and
were looking more for performance and convenience than serious functionality
(a PostgreSQL user might tell you this is unwarranted stereotyping).

Solving the Problems

While a built-in database geographic capability was a tempting prospect,
we eventually came to realize our Location Search would require a unique
geographic and text algorithm to be truly scalable.

Traditional search engines, like Google [12], are designed to have
an approximately constant look-up time irrespective of their database size,
but earlier geographic search engines such as Northern Light's [13] appeared
to have a weakness here. To achieve scalability Google must perform
most of their sorting of results ahead of the search itself, which can
be accomplished by "relevance scoring" words in Web pages and presorting
those scores. The question we faced was how to do this for a geographic
search that involved proximity. Proximity is constantly changing
depending on the location of the searcher, and if we needed to calculate
it for every result at every search, our engine would not scale at all.
Initially in our prototype we did what other database vendors have done,
and limited our results to a defined area, but as the size of our database
grew, the area we could search had to keep shrinking to maintain our speed.

Following further research, we realized we would need to use what is
called a "space filling curve" or "Peano code", named after Giuseppe Peano, the
19th century mathematician who first described such curves. Such a curve allows
proximity to be presorted by forming one continuous index that runs through space
in a zig-zag pattern. The problem was that the curves proved inaccurate by themselves
under certain conditions. An elegant solution was to overlay two
space curves on each other, but slightly offset them, and use the best
results from either. Currently we combine proximity with a traditional
word relevance score in our sort order, so that both closer and more relevant
results are found. In this way we succeeded in creating a Location
Search engine that is as scalable as a traditional search engine.
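
The two-curve idea can be sketched in a few lines. This is an illustration under stated assumptions rather than the Mobilemaps code: it assumes a Z-order (Morton) curve as the zig-zag curve, a 16-bit grid per axis, a second curve shifted by 0.5 degrees, and a fixed candidate window along each curve. All function names are hypothetical, and the sample coordinates are approximate city centres. Relevance scoring is left out so that only the proximity presort is shown.

```python
import bisect
import math

BITS = 16             # grid resolution per axis (assumed)
OFFSETS = (0.0, 0.5)  # degrees; the second curve is shifted slightly (assumed value)


def interleave(x, y, bits=BITS):
    """Interleave the bits of x and y into a single Z-order (Morton) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


def quantize(value, span, cells):
    """Scale value from [0, span] to an integer cell number, clamped to the grid."""
    return min(max(int(value / span * cells), 0), cells - 1)


def curve_key(lat, lon, offset=0.0):
    """Position of a latitude/longitude along one (optionally shifted) curve."""
    cells = 1 << BITS
    x = quantize(lon + 180.0 + offset, 360.0, cells)
    y = quantize(lat + 90.0 + offset, 180.0, cells)
    return interleave(x, y)


def build_index(sites, offset):
    """Presort (key, site_id) pairs once, ahead of any search."""
    return sorted((curve_key(lat, lon, offset), sid) for sid, (lat, lon) in sites.items())


def candidates(index, lat, lon, offset, window):
    """Take the sites stored closest to the query point along one curve."""
    pos = bisect.bisect_left(index, (curve_key(lat, lon, offset),))
    return {sid for _, sid in index[max(0, pos - window):pos + window]}


def distance_km(a, b):
    """Great-circle (haversine) distance between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest(sites, indexes, lat, lon, k=3, window=4):
    """Merge the candidates from every curve, then keep the k truly closest."""
    found = set()
    for offset, index in indexes.items():
        found |= candidates(index, lat, lon, offset, window)
    return sorted(found, key=lambda sid: distance_km(sites[sid], (lat, lon)))[:k]


# Approximate city-centre coordinates, for illustration only.
SITES = {
    "los_angeles": (34.05, -118.24),
    "pasadena": (34.15, -118.14),
    "san_diego": (32.72, -117.16),
    "san_francisco": (37.77, -122.42),
    "sacramento": (38.58, -121.49),
}
INDEXES = {offset: build_index(SITES, offset) for offset in OFFSETS}
print(nearest(SITES, INDEXES, 34.0, -118.3))
```

Each search costs two binary searches plus a distance calculation over a small, fixed number of candidates, so the work per query does not grow with the size of the index. Two points that are close on the ground but fall either side of a large jump in one curve are likely to be adjacent on the shifted curve, which is why the candidates from both are merged.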

Mapping Solutions

Initially our prototype was a pure Location Search, but we wanted to
visually map the search results to provide users with a picture of where
the results were. We prototyped this on our old three-dimensional
map technology, but needed a regular two-dimensional map-server for general
use. Fortunately our need was met by the timely discovery of the
open source MapServer [14] product developed in Minnesota.

While MapServer provides a functional street-level map server, and we
can recommend it, the big expense in providing maps over the Internet is
usually the data rather than the map-server. North America seems
to be alone in adopting the policy that government data funded by the tax-payer
should be made available for free to the tax-payer. In European countries,
for example, governments sell this data at substantial ongoing cost,
which effectively cripples start-up geographic software companies in these
territories. In order to demonstrate and develop our system adequately,
we needed to use free TIGER data [15] from the United States Census Bureau.
We decided to showcase our technology with Californian Web pages, but we
could have picked any other American geography. An additional mention goes
to the venerable open source GRASS GIS package [16], which proved useful
while manipulating some of our shapefile data.

Harvesting Web Pages

We would have welcomed an open source Web spider, but could not find
one, although we have heard of others becoming available more recently
such as Grub [17]. The Perl LWP libraries provided us with a considerable
head start in developing our own, and we made use of Perl's Tk windowing libraries
to add a simple, cross platform graphical user interface. Currently
different Web sites, or "Portals", using a copy of our Mobilemaps software
must spider independently of each other, with some overlapping effort.
To improve this we are designing distributed spider functionality that
allows different Portals to combine their results, and collaboratively
spider the Web, each becoming an authority on Web pages in their own local
area. Unlike the Looksmart design, there will be no central server,
and the spidered data will be openly available for anyone to use.
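
The basic spidering loop behind this section can be sketched as follows. This is a minimal illustration, not the Mobilemaps spider: it uses Python's standard library rather than Perl's LWP, the `fetch` argument is a hypothetical caller-supplied function that returns a page's HTML (or None on failure), and the in-memory `FAKE_WEB` stands in for real network access. A production spider would also need to honor robots.txt, rate-limit its requests, and handle character encodings.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl from start_url; returns {url: html} in visit order."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page could not be retrieved; skip it
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            link = urldefrag(urljoin(url, href)).url
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# A tiny in-memory "Web" so the example runs without network access.
FAKE_WEB = {
    "http://a.example/": '<a href="/b">b</a> <a href="http://c.example/">c</a>',
    "http://a.example/b": '<a href="/">home</a>',
    "http://c.example/": "no links here",
}
print(list(crawl("http://a.example/", FAKE_WEB.get)))
```

The `seen` set is what stops the spider from looping when pages link back to each other; in a distributed design, that set and the queue are the state that cooperating Portals would need to share or partition.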

Determining Locations

A key component of our Location Search is its "geo-coding" capability,
which could be used as a stand-alone open source tool. We needed
to identify street addresses that were published on Web pages, and then
convert those addresses into geographically coded locations, in degrees
of latitude and longitude. In the United States and the United
Kingdom at least, there are recognizable ZIP code or postcode patterns
that an autonomous Web spider can find. In the United States there
are also certain patterns to addresses that help to exactly identify what
part of the seemingly random text on a Web page is an address. Once
an address is extracted from an American Web page it can be matched against
the TIGER Census database to identify its location. When a United
Kingdom address is found it can either be cheaply matched against a post-code
database, or matched against a more expensive complete list of geographically
positioned addresses. The powerful "regular expression" text matching
capabilities of the Perl language were very helpful in this respect, and
we used MySQL to store the address databases to match against.
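
As a rough illustration of the pattern matching step, the sketch below uses Python's `re` module rather than Perl. The two patterns are deliberately simplified assumptions: the US pattern looks for a two-letter state abbreviation followed by a ZIP code, and the UK pattern covers the common postcode shapes but not the complete official grammar. The `LOOKUP` table is a hypothetical stand-in for the address database held in MySQL, and its coordinates are approximate.

```python
import re

# "CA 90012" or "CA 90012-1234": a state abbreviation followed by a ZIP code.
US_ZIP = re.compile(r"\b[A-Z]{2}\s+(\d{5})(?:-\d{4})?\b")
# Simplified UK postcode shape such as "SW1A 2AA" (upper case only).
UK_POSTCODE = re.compile(r"\b([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z]{2})\b")


def find_postal_codes(text):
    """Return (country, code) pairs for every postal code pattern found in text."""
    found = [("US", m.group(1)) for m in US_ZIP.finditer(text)]
    found += [("UK", f"{m.group(1)} {m.group(2)}") for m in UK_POSTCODE.finditer(text)]
    return found


def geocode(text, lookup):
    """Convert the postal codes found in text to (lat, lon) using a lookup table."""
    return [lookup[code] for _, code in find_postal_codes(text) if code in lookup]


# Hypothetical lookup table with approximate coordinates, for illustration only.
LOOKUP = {
    "90012": (34.05, -118.24),
    "SW1A 2AA": (51.50, -0.13),
}

page_text = "Visit us at 200 N Spring St, Los Angeles, CA 90012."
print(find_postal_codes(page_text))
print(geocode(page_text, LOOKUP))
```

Requiring the state abbreviation in front of the five digits is one cheap way to avoid treating every five-digit number on a page as a ZIP code. Matching a full street address against TIGER data would need considerably more than this.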

User convenience has marked the popularity of search engines that offer
a single search box, so we developed "Sox", a "Single search box", to provide
a single input box for Location Searches, and enable a user to type "los
angeles wedding photography" and have it do what they expect. This
is one of the rare areas of our code that needed the speed of the C language.
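
The internals of Sox are not described here, so the following is only one plausible way to split a single-box query, written in Python rather than C for brevity. It assumes a gazetteer of known place names (the hypothetical `PLACES` set) and picks the longest run of query words that matches one; everything else is treated as search keywords.

```python
# Hypothetical gazetteer; a real one would be loaded from a place-name database.
PLACES = {"los angeles", "san diego", "pasadena"}


def split_query(query, places, max_words=3):
    """Split a query into (place, keywords) using the longest place-name match."""
    words = query.lower().split()
    # Try the longest candidate phrases first so "los angeles" beats "angeles".
    for size in range(min(max_words, len(words)), 0, -1):
        for start in range(len(words) - size + 1):
            candidate = " ".join(words[start:start + size])
            if candidate in places:
                keywords = words[:start] + words[start + size:]
                return candidate, " ".join(keywords)
    return None, " ".join(words)  # no recognizable place in the query


print(split_query("los angeles wedding photography", PLACES))
```

A query with no recognizable place falls through with `None` as its location, at which point an engine could fall back to an ordinary keyword search or to the user's last known location.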

Business Models and Results

Open source software typically favors service oriented business models,
although MySQL is a notable example that successfully combines an open
source license with an alternative commercial license. IBM is a typical
example of a company that profits from open source in its consulting services
arm, where Linux provides them with superior profit margins. Mobilemaps
is based on an advertising service model that offers a free sign-up to
its distributed Location Advertising network in exchange for a small percentage
commission on advertising "clicks". While it is of course possible
for companies deploying Mobilemaps software to use whatever advertising
they wish, Mobilemaps offers the immediate convenience of not having to
write a Location Advertising system, and is ultimately likely to offer
better income to Portals due to economies of scale. Before embarking
on an open source project with commercial backing, we recommend finding
a suitable business model that will add value to the entire community who
might use the software, and not just the original investor.

Mobilemaps has successfully released its software after approximately
three man-years of development. Its Location Advertising network
is in the process of deployment, and time will tell whether its open source
business model is competitive. Open source software is an evolutionary
creature, and there is no knowing what uses may be made of our software
in future by the development community. For example, we see both its
search and advertising translating easily from the wired Web to the wireless
Location services industry. The great strength of open source development
is the ability to make use of others' contributions, as witnessed by the
numerous open source software packages that helped build our system.
We see this as the most practical network development model, and are confident
that future advances in open source Location Search can now take days or
months rather than years.