Thursday, February 28, 2008

If you could do geospatial analysis 50 to 100 times faster …

… than you can today, what compelling new things would this enable you to do? And yes, I mean 50 to 100 times faster, not 50 to 100 percent faster! I’m looking for challenging geospatial analytical problems that would deliver a high business value if you could do this, involving many gigabytes or terabytes of data. If you have a complex analysis that takes a week to run, but you only need to run it once a year for regulatory purposes, there is no compelling business value in being able to run it in an hour or two.

But if you are a retail chain and you need to run some complex analysis to decide whether to buy an available site for a new store within the next three days, it makes a huge difference whether you can run just one analysis which takes two days, or dozens of analyses which take 30 minutes each, allowing you to try a range of assumptions and different models. Or if you’re a utility or emergency response agency running models to decide where to deploy resources as a hurricane approaches your territory, being able to run analyses in minutes rather than hours could make a huge difference to your ability to adjust your plans to changing conditions. There may be highly valuable analyses that you don’t even consider running today because they would take months, but which would have very high value if you could run them in a day.

If you have problems in this category I would be really interested to hear about them, either in the comments here, or feel free to email me if you prefer.

Update: I wanted to say that this is not just a hypothetical question, but I can't talk about any details yet. See the comments for more discussion.

20 comments:

In a world where everyone seems to only care about building another spinning globe, it's nice to see that someone still cares about geoprocessing. From my perspective there are two areas in need of attention: 1) Large overlay analysis - unions, intersects, tabulate areas, etc. using massive datasets. 2) Automated feature extraction using GB to TB of remotely sensed data (imagery and LIDAR). The lack of both these capabilities makes working in urban areas, where the datasets are very large, extremely inefficient. Try summarizing 0.5m resolution land cover data by parcel for a large city. One cannot even do such an operation in ArcGIS without tiling the data.

Oil hit $103 a barrel yesterday. Many inactive wells are now once again economically viable, especially with directional drilling improvements. So maybe try processing all existing well locations with respect to lease info to find opportunities. Lots of data vendors there in Denver to work with. Maybe something similar to Spotfire http://spotfire.tibco.com/ but with a cluster on the back end. Also HyperTable might be an alternative to Hadoop on EC2?

@brian: your thoughts on divide and conquer are on the right lines, but the technology approach is different from what you describe - though I can't talk about specifics yet. But I can assure you it is feasible now and will be coming to the market soon!

@jarlath: your first area is definitely in the scope of what I am looking at now, the second is interesting for the future but probably needs a little more work.

@kirk: thanks, this is the kind of thing I'm after, more specific applications with a high business value.

Maybe I should clarify a little that I'm really talking about database-centric analysis - the types of problem where a large portion of the analysis could be formulated in spatial SQL (spatial and non-spatial selection, buffers, intersection, aggregation, etc, etc). And I'm looking for specific applications involving large datasets (millions or billions of records) where there would be a high business value in being able to run these types of analysis 50 to 100 times faster.
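As a toy illustration of that kind of workload, here is a minimal sketch in plain Python (the coordinates and the 300-meter radius are made up for the example) of the buffer-plus-anti-join pattern that spatial SQL would express as a single query: find candidate sites with no competitor within a given radius.

```python
import math

def within(p, q, radius):
    """True if points p and q (x, y in meters) are within radius of each other."""
    return math.hypot(p[0] - q[0], p[1] - q[1]) <= radius

def sites_clear_of(candidates, competitors, radius):
    """Spatial-SQL-style anti-join: candidates with no competitor inside radius.
    Naive O(n*m) scan - real engines use spatial indexes to avoid this."""
    return [c for c in candidates
            if not any(within(c, k, radius) for k in competitors)]

candidates = [(0, 0), (500, 0), (1000, 1000)]
competitors = [(100, 0), (950, 1050)]
print(sites_clear_of(candidates, competitors, 300))  # -> [(500, 0)]
```

The naive nested scan is exactly where the time goes at millions or billions of records, which is why the 50-to-100x question is about the engine underneath, not the query on top.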

I would try to merge large wetlands datasets. Not exciting but would be good for my wind farm NEPA prescreening siting analysis. Pretty lame but something I can't do with current computational restrictions.

I've actually thought about this quite a bit; it seems that nobody's thought really big yet... a few off the top of my head:

One application is doing extremely detailed spatial search across huge datasets. E.g., a small-business location service: find me locations in the US that are up for lease, zoned commercial, adjacent to a grocery store and have no existing liquor store within 300 meters. Another example, for UN delivery of emergency potable water supplies: areas with remote-sensed surface disturbance up-watershed from population centers.

Another is summarization and analysis: summarizing spatial correlations at very high resolutions nationally or globally over demographics, crime, education, etc. would be a sociologist's dream; slightly different datasets and it's perfect for epidemiologists... Super-useful derived datasets could become possible: e.g. a high-res national/global travel-time map (e.g. from nearest major metro or from nearest airport etc.) that's accurate to the same degree as those created very painstakingly by hand for urban centers...

As well: the more independent variables you can cram into a statistical modelling / machine learning problem, the better; which boils down to fast geospatial analysis. (In particular, learning from features that are 'nearby', 'upstream', etc. is currently very limited by the lack of good/fast analysis.)

It will be interesting to see what happens in this space. A lot of this stuff is possible but extremely expensive with no off-the-shelf way of doing things.

Moore's law is on our side, but to be honest I also feel that there have been major shortcomings on the computer science/algorithmic front: Compared with e.g. video game graphics algorithms or what Hollywood FX shops are up to, or even academia, the GIS community is really quite behind. (Both of these share huge overlaps with GIS in terms of subject matter...)

If I had a venture capitalist behind me I would be targeting mobile communications to leverage geospatial capabilities against dynamic/precision marketing. Providing such a capability to 1.8 billion cell phones might call for some horsepower on the backend.

Dan, thanks for the detailed and thoughtful response - you are thinking on the sort of scale I had in mind! I agree with your observation that we haven't had the same degree of heavyweight computer science innovation in the geospatial field as many others, which is one reason I'm especially excited by the technology I'm looking at in this area. Thanks to everyone else too for the thoughts so far.

Awww, shoot. Am I going to have to make a serious contribution? Alright, how about this:

Bzzzzzzzzzzzzz! Is it an Azul server appliance, which provides transparent acceleration to existing Java apps across the network by doing heavy-duty infrastructure work like garbage collection in hardware instead of in a software VM, thus reducing your server requirements by a factor of ten while improving your performance by a similar amount?

First a secret Facebook app, now a secret analysis technology... Who knew that blogs and secrets would go so well together?

"Big science" research teams are potential customers, especially if you eliminate assumptions about fast geospatial analysis applying only to very large datasets. There's also a fast processing need when running the same analysis many times on a moderately sized dataset. For instance, it's common in exploratory spatial data analysis to run hundreds of thousands of iterations of the same routine but with slight variations of the analytical parameters in each iteration, or to run an identical analysis against a dataset that's slightly geographically altered each time in order to measure margin of error. Faster processing enables more iterations and reduced (or better understood) margin of error. For instance, see the summary of Michael Choy's work in the article at http://www.geospatial-solutions.com/geospatialsolutions/article/articleDetail.jsp?id=61481&sk=&date=&pageID=3

Similarly, large sensor arrays generating scientific data, such as the ocean floor "VENUS/NEPTUNE" project (see a good explanation of the project at the Barrodale Computing Services web site) involve large and growing data volumes to which researchers apply multiple analyses. This sort of dataset is used by a large community of scientists for different analytical purposes, so acquiring a costly but very fast analysis engine could be justified by group needs and would centralize what might otherwise be a difficult-to-maintain distributed management and processing setup.

Another (dark side) use for a fast geospatial analysis engine would be military applications such as predictive battlefield analysis. Who knows -- it might end the US occupation of Iraq sooner.

@Jonathan: yes, I'm not normally a secretive person, it has just worked out that way :)! And the Facebook app is not secret any more, I just haven't had time to discuss it on the blog yet, though will get to that soon. I appreciate all your thoughts though, thanks!

@Tartley: I'm really not at liberty to say whether any guesses are correct or not just yet ... perhaps if you use whereyougonnabe to identify when we will be in the same location shortly you can try to buy me a beer or margarita as appropriate, and see if my poker face changes as you guess :).

I was just playing with some LiDAR gridding and attempting to adapt the hadoop EC2 ami scripts for some batch processing. Sounds like you are seeing the same thing as some others in the EC2 realm. Configurable supercomputing on a budget could be an interesting business model. Lots of big data in geospatial!

Doing the math on the solar analysis for Los Angeles County - I hope this math is all correct. 4,084 square miles = 10,577,511,443 square meters, or 10,577,511,443 one-meter pixels. A one-year analysis at 30-minute intervals would be 17,520 intervals. This equates to 185,318,000,481,360 calculations.
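For what it's worth, those figures do check out; a few lines of Python reproduce them (assuming the international mile of 1,609.344 m):

```python
SQ_METERS_PER_SQ_MILE = 1609.344 ** 2         # exactly 2,589,988.110336
pixels = round(4084 * SQ_METERS_PER_SQ_MILE)  # one 1 m pixel per square meter
intervals = 365 * 48                          # 30-minute steps over a year

print(pixels)              # 10577511443
print(intervals)           # 17520
print(pixels * intervals)  # 185318000481360
```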

Good question, there are a few things that I often have to leave going over night, or the weekend, or the week I'm on vacation...

Clipping of vector data (particularly polygons) using multiple clippers. That task takes up huge overhead. Also, area-of-interest raster trimming (padding the outside with 0s) with large datasets - where only a small portion is needed - takes forever. I wish there was a way to remove the unneeded portions without performing an encode operation.

Peter Batty

About me

Peter Batty is a co-founder and CTO of the geospatial division at Ubisense. He has worked in the geospatial industry for 25 years and has served as CTO for two leading companies in the industry (and two of the world's top 200 software companies), Intergraph and Smallworld (now part of GE Energy). He served on the board of OSGeo from 2011 to 2013 and chaired the FOSS4G 2011 conference in Denver. He serves on the advisory board of Aero Glass. See here for a more detailed bio. You can email Peter at peter@ebatty.com, and can see videos of some of his conference presentations here.