File size

139.6 MB

Blaise Aguera y Arcas is an architect of PhotoSynth, which is a super cool 3-D image "tourism" application that enables a new methodology for exploring related groups of images, using a complex imaging algorithm developed in part by Microsoft Research. Here, Charles sits down with Blaise to learn the details of the what, how, and why of this highly innovative distributed 3-D image processing application. PhotoSynth enables a whole new way of exploring images! Wow.

This technology is outrageously cool. Listen carefully to what Blaise says about distributed imaging over the network. Imagine navigating through a 3-D image composed of a collection of images created by thousands of people all over the world. The implications here are gigantic. I'm still interpreting what I learned. It's amazing.

Charles, I'm still a little confused about the first part. He stated that as the images are navigated (zooming in and out), they are simultaneously being opened and closed? Is he implying that each picture has several versions of differing quality? I don't think he is, because it would seem ridiculous to store 20 versions of 20 differing qualities for every one image on your PC. Could you please explain what is taking place behind the scenes as the user navigates the canvas of canvases?

What I think is the best part of all this is not the ideas in general. Breaking images into pieces (like a pyramid) to zoom in and out of images efficiently, or stitching photos together to give a 3-D-like experience, have been around for a long time. Two jobs ago we used a fish-eye lens and ViewPoint to do this. WHAT IS MOST AMAZING (to me anyway) about this stuff is that what it generates is in 3-D space, but most importantly any Joe Schmoe can take pictures with their digital camera and use this. That takes the audience for this type of thing from not very many to, like, half the planet (just a guess). This is huge! Then add the distributed imaging that Charles mentioned on top of that, and the implications for what all this could affect are mind-blowing.

Microscopy is certainly an area where this could be highly useful. Consider also astronomy... Navigating galaxies and other celestial bodies will never be the same!
C

Astronomy. Hmm. That raises an interesting question which I don't believe was addressed in the interview. Of course this type of image manipulation will work great with highly diversified imagery: a building front, or St. Peter's Basilica. You have so many unique aspects of those buildings, dare I say it is easy to line up the images.

Compare that to an evening sky, where every star is very much visually ambiguous. Of course, inspecting a satellite closely will give us a higher degree of diversity, but from a distance, wouldn't it be too much of a feat for the application to properly distinguish one star from another? I suppose this problem may even exist in microscopy, too. Obviously a colony of bacteria will look very much the same in many different locations.

10 years ago I used an application that would build panoramas from images taken along a horizontal plane. Each image had to be highly unique in order for the software to line up the images, and sometimes we got fuzzy couplings, especially in places where trees were dominant. Perhaps I'm allowing my old knowledge to govern my understanding of this new technology too much.

I am still wondering how exactly it shows "versions?" of the images as you zoom in and out, if it's drawing from a single image...does it only display a portion of the image's binary?

I can only guess, but I think that's exactly what's happening. When zoomed out, you only need to load, e.g., every 10th pixel of the image. If you zoom in a little you have to add, e.g., every 5th pixel, and so on, until you are at zoom level 1:1, which displays all pixels. At the same time the portion of the image in view gets smaller and smaller, so the overall amount of data that has to be in memory basically stays the same. So you either see all of the image in low resolution, or a small portion of the image in high resolution. The tricky parts, of course, are deciding when to load what, and how to do it quickly.
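For what it's worth, the "every Nth pixel" intuition is usually implemented as a power-of-two tile pyramid. Here is a sketch (the numbers and function names are illustrative, not Seadragon's actual scheme) of how a viewer could pick a pyramid level so that the amount of data fetched stays roughly constant across zoom levels:

```python
import math

def pyramid_level(zoom, max_level):
    """Choose which power-of-two pyramid level to fetch for a zoom factor.

    zoom = 1.0 means the image is shown at full (1:1) resolution;
    zoom = 0.25 means it is shown at a quarter of its native size,
    so a level holding every 4th pixel already suffices.
    """
    if zoom >= 1.0:
        return 0  # level 0 = full resolution
    # each level halves the resolution, so level k holds every 2**k-th pixel
    return min(max_level, int(math.floor(math.log2(1.0 / zoom))))

def pixels_needed(view_w, view_h, zoom, max_level):
    """Approximate pixel count fetched for a viewport: roughly constant."""
    scale = 2 ** pyramid_level(zoom, max_level)
    # the viewport covers view_w/zoom source pixels, sampled every `scale` pixels
    return (int(view_w / zoom) // scale) * (int(view_h / zoom) // scale)
```

Notice that an 800x600 viewport fetches about the same number of pixels whether you are zoomed out to 25% or in to 100%, which is the point of the scheme.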

I wonder if someone has already thought about combining this with Microsoft Max. That would be an interesting new twist on the whole 3-D presentation stuff. Though the problem with Max currently is that there haven't been any new versions for a very long time. It once started as a showcase application for WPF, but according to the Max forum, people are very upset that the last supported WinFX CTP is the January(!) CTP.

But the amount of data cannot stay the same over the entire experience. He's got roughly 100 pictures on his screen. As he moves any one of them to a higher Z-index, that picture has to display more and more of its full binary data. As he moves it to a lesser Z-index, it displays less of its binary data. That would require the machine to first traverse the entire binary source, select only what it needs to keep in memory for that moment, and then display the selected information. However, if you're going to be zooming in and out rapidly, then it would be best to have the entire picture's memory in cache; but then you would need EVERY picture's highest-resolution binary data in cache at all times to be able to rapidly navigate from one picture to the other and quickly modify the Z-index of any given image... that simply cannot work, it would utterly kill your processing speed.
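For what it's worth, tile-pyramid viewers usually square this circle by caching only the tiles visible at the current level and evicting the rest, so no picture's full-resolution data ever has to sit in memory all at once. A minimal sketch of such a bounded cache (the key layout and names are hypothetical, not Photosynth's actual internals):

```python
from collections import OrderedDict

class TileCache:
    """A bounded LRU cache of image tiles keyed by (image_id, level, x, y).

    Only tiles intersecting the current view at the current level are fetched,
    so memory stays bounded no matter how many source images exist.
    """
    def __init__(self, capacity=256):
        self.capacity = capacity
        self._tiles = OrderedDict()

    def get(self, key, load):
        if key in self._tiles:
            self._tiles.move_to_end(key)     # mark as recently used
            return self._tiles[key]
        tile = load(key)                     # fetch from disk/network on miss
        self._tiles[key] = tile
        if len(self._tiles) > self.capacity:
            self._tiles.popitem(last=False)  # evict least recently used tile
        return tile
```

Rapid zooming then costs at most a few tile fetches per frame, not a traversal of every image's full data.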

Imagine using this technology for "touring" 3-D image compositions of planetary bodies, like Titan. The usefulness of this technology for scientific imaging is incredible. Astronomy, biology, ecology, you name it. I am so excited right now.

Believe me, I share your excitement. We just gotta get people up there with their digital cameras, and put a few statues in place to catalyze their snapshot-fever.

Oh, one more thing. I was just thinking: I used to live in Arizona several years ago, and you'll notice that the further away you get from a mountain, the more its figure stays the same regardless of how far left or right you go. I wonder how the software handles perspective and distance at this magnitude.

Buildings, which have unnatural forms, will look rather different depending on where you are. But mountains tend to retain their figure when you are further away, walking left or right.

How the heck is he doing this!? Blaise! Come to my rescue!

Remember that zooming in and out happens gradually. When zooming in, you first see a blurry version of the image until loading of the additional detail is complete. The faster you zoom in, the blurrier (is that a word?) the image gets. Look at Google Earth; I think it's similar.

﻿Imagine using this technology for "touring" 3-D image compositions of planetary bodies, like Titan. The usefulness of this technology for scientific imaging is incredible. Astronomy, biology, ecology, you name it. I am so excited right now.

You know how they use cameras to go in for really precise surgery through a really small incision? It would be badass if they took pictures of the problem the surgery was supposed to fix, from different angles, before and after the surgery, so the patient could see what was done, and also for students in the future to study.

Going along the astronomy angle, but exactly opposite, it would be really cool to see things in 3D at the molecular or atomic level.

I believe Google Earth actually stores blocks of images from several different altitudes and performs a double pass on them while rendering, which isn't what this video demonstrates (from my understanding).

Very briefly, the multires issues raised by jsampsonPC and others basically revolve around the question of how an image can be represented in a way that allows the whole thing to be accessed at very low resolution, or a small part to be accessed at high resolution,
or anything in between. Also, each such access needs to involve limited server disk IO and limited processing on the client side. The ideas first used to implement this efficiently and elegantly were developed in part by my old thesis advisor, Ingrid Daubechies.
Do searches on "wavelets", "wavelet image compression", and "JPEG2000" to learn more. We don't use JPEG2000 (or, in fact, wavelets) in Photosynth, although when Seadragon was an independent company we did use JPEG2000. The basic ideas are similar, though.
The Seadragon collection model generalizes these multiresolution concepts to collections of images (or other, non-image visual content, like text and vector maps), not just single images. Again, the design requirement was to allow massive collections of massive
images to be opened remotely without too much work either for the server or the client.
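To make the multiresolution idea concrete, here is the classic one-dimensional Haar decomposition, the simplest member of the wavelet family that JPEG2000-style codecs build on. This is illustrative only; as noted above, Photosynth itself does not use wavelets, but the access pattern is the same: the coarse coefficients alone give you a low-resolution view, and adding details refines it.

```python
def haar_step(signal):
    """One level of the 1-D Haar transform: averages (coarse) and details.

    The coarse half is a half-resolution version of the signal; the detail
    half is exactly what is needed to reconstruct the finer level.
    """
    coarse = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return coarse, detail

def haar_inverse(coarse, detail):
    """Reconstruct the finer level from averages and details."""
    out = []
    for a, d in zip(coarse, detail):
        out += [a + d, a - d]
    return out
```

A server storing coefficients this way can answer "whole image at low resolution" by sending only the coarse levels, and "small region at high resolution" by sending the details for just that region, which is the access property described above.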

Astronomy and microscopy applications: YES. Among many others. We hope to get to a point where there's an open platform to develop on, so that these ideas and applications can be developed the right way by the people who really understand what should be made
and how.

Multi-resolution images! Utterly brilliant. It seems so obvious after the fact, like all great ideas.

Aside from ignoring coarse-grained backgrounds which finer-grained images don't need to store, doesn't scalable vector graphics let you render at those various resolutions? There must be a lot of room for compression; I wonder if you can expand on that (I'm only 20 minutes into the video, however).

I'm a former neuroscience student and I think this is very interesting stuff and can imagine the idea being applied to many things.

Ultimately, you could have a unified realtime 3-D model of the entire world, and it would be completely robust and accurate, verified by millions or billions of video cameras. And you could still stream it to anyone, because of the bandwidth optimizations. All you need is cameras everywhere. There are privacy issues there, but surely in a few years we could have a non-realtime model of the entire world.

Jonathan

PS: Was anybody able to convince Blaise that he needed a C9 account?

How could it be more difficult to line up stars, which are clear points of light, especially with high-res telescope photographs? If they can do it with window sill corners (which seem a lot more visually ambiguous to me), why not stars? Besides, you just need to put up a frame of reference, like a grid, don't you?

This isn't the only problem I see with astronomy. Of course these are mere hurdles that will be surmounted, but astronomy is very time-specific, unlike buildings.

St. Peter's Basilica is not under construction, nor does it matter what time of year you go there. It will always be there, and it will always be the same.

On the other hand, suppose I snapshot the moon, and then you snapshot the moon. Can we overlay our images? No. There are many variables that you have to take into consideration.

In astronomy, and I'm sure everybody will agree, location and timing are very crucial. More than likely, everybody has a unique image of the Moon (from the US, from Japan, from Scandinavia, etc.). You cannot simply overlay pictures of the Moon.

You are missing an important ingredient in both your understanding of astronomical digital photos and PhotoSynth...

As of today, there are millions of photos in various databases of astronomical objects. So, let's say there are 10,000 images of Galaxy X taken at different times, from different angles, from different telescopes or the same telescope (like Hubble). This is very much in the realm of PhotoSynth's compositional requirements: they are digital images of the same object taken from different points of view at different times. Consider the examples in the video...
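As a toy illustration of why repeated imagery of the same object is matchable, here is about the simplest possible alignment scheme: voting on pairwise offsets between two point sets. Real feature matchers use scale- and rotation-invariant descriptors rather than raw positions, but the underlying principle, consensus among many candidate correspondences, is the same.

```python
from collections import Counter

def estimate_offset(points_a, points_b):
    """Estimate the integer translation aligning two point sets by voting
    on all pairwise offsets. The true offset is voted for once per real
    correspondence, so it wins even among many spurious candidate pairs."""
    votes = Counter()
    for ax, ay in points_a:
        for bx, by in points_b:
            votes[(bx - ax, by - ay)] += 1
    offset, _ = votes.most_common(1)[0]
    return offset
```

With three points of "Galaxy X" shifted by a constant amount, the correct offset collects three votes while every wrong pairing collects at most one.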

I don't believe I'm mistaken. Notice that the example you went to immediately was a galaxy captured by Hubble; this will obviously display a high level of unique composition. You will find that it's rather easy to overlay Galaxy X onto Galaxy X. You may have to rotate it 23 degrees here and there, but you will get it.

My examples, a large area of smaller lights (stars in this example), offer little-to-no uniqueness. If you are going to line up a snapshot of stars with their true location, you would have to survey ALL of the star photos ever taken (unless they have a frame of reference, such as Galaxy X).

Mapping the sky is something that is being done today, with images. There will of course be a default perspective: the Earth...

I'm fully aware that I am either wrong, or not communicating my thoughts clearly enough. Mapping the sky is done, yes, but how is it done? With reference points, not here on Earth, but out there. "Look, there's Orion's Belt," or "Isn't the Moon beautiful? What's that little star beside it... there on the right?"

What I'm saying is, if you give the computer some random picture of a cluster of stars which doesn't contain an easily identifiable reference (the Moon, Saturn, etc.), then the computer will have to take your little image and compare it to every single image of stars that exists... After all, maybe the stars you caught on camera are actually the ones just to the side of the Moon, but you didn't include the Moon in the picture.

Do you understand what I mean? To be overly simple, a cluster of white dots doesn't really present any unique characteristics which we, or computers, can immediately point to and say "that's over there." Before you mention constellations, keep in mind that this only works when all of the required points (or most of them) are in view.

Where is Blaise? I wanted him to relieve me of my ignorance... In all seriousness, I still don't fully comprehend the magnitude of this application's potential. It blows me away already, the things I DO comprehend. I could be wrong about some of my thoughts, and I don't want to come off as an annoyance.

Blaise is very good at explaining things; he's an excellent interviewee. Because I'm not a programmer and graphics are more my thing, this kind of product just makes more sense to me than some of the other things we get to see in C9 videos, but that's not to say that I don't like the technical side of these interviews. That's the whole reason why C9 is so great: the interviewers get technical, it's not just a tech demo on tape.

Someday all cameras will probably have GPS built into them, and it would definitely add to projects like this. EXIF already has a place where GPS coordinates can be stored, so it's not like any changes will have to be made to incorporate such information into photos.

If you take an image from the Earth or a satellite, it has a cone of view. If you know the time and position of the camera on the Earth which takes the photo, you can calculate exactly which star is which.

Could you see stars from the side, as if the cone of view turned 90 degrees off one axis? If you know the distances between stars along the direction from the cone's origin to the center of the cone's base (and that is measurable), then you can reconstruct an image from the side. For the back you could do the same thing, but you don't really know what is behind, say, the Moon without a photo from its far side. Still, you could fill in the blanks for stars and such, and rotating bodies obviously can be extrapolated, so the only limitation is distance and "hidden" bodies that are never in the cone of view of any of our current photos, which is of course a lot of stuff in space.
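The point about knowing the time and camera position can be made concrete with the standard spherical-astronomy conversion from a star's catalogued RA/Dec to the altitude and azimuth an observer would photograph. This is a textbook formula, not anything Photosynth is known to use; given the local sidereal time and latitude, it says exactly which star should appear where in the frame.

```python
import math

def radec_to_altaz(ra_deg, dec_deg, lst_deg, lat_deg):
    """Convert a star's equatorial coordinates (RA/Dec) to the altitude and
    azimuth seen by an observer, given the local sidereal time and latitude.
    Standard spherical-astronomy formulas; all angles in degrees."""
    ha = math.radians(lst_deg - ra_deg)          # hour angle
    dec = math.radians(dec_deg)
    lat = math.radians(lat_deg)
    sin_alt = (math.sin(dec) * math.sin(lat)
               + math.cos(dec) * math.cos(lat) * math.cos(ha))
    alt = math.asin(sin_alt)
    cos_az = (math.sin(dec) - math.sin(alt) * math.sin(lat)) / (
        math.cos(alt) * math.cos(lat))
    az = math.acos(max(-1.0, min(1.0, cos_az)))  # clamp for rounding safety
    if math.sin(ha) > 0:
        az = 2 * math.pi - az                    # west of the meridian
    return math.degrees(alt), math.degrees(az)
```

For example, a star on the celestial equator crossing the meridian, seen from latitude 45 N, comes out at altitude 45 degrees, due south, as expected.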

For microbiology and beyond, if you use MRI, electron microscopy, and more, then eventually you could imagine the massive distributed images of all these "visible" things being available, so that software is able to reason about what should "fill in the blanks" (similar to how the mind operates when it fills in the blanks, i.e., optical illusions and such). You could zoom in and out of the entire structure of an organism and cell, down to the chemical and atomic level itself and beyond, extrapolated from images; applying a bunch of physics and graphics would let someone step through the operations of a cell, watch it develop, and witness the energy transfer... am I crazy?

Basically it seems like a real start to wiring up visual information on the web, which is really deep. 20 years from now I think learning about the world and things around us in detail will be enabled by this.

The searching idea Blaise mentions is awesome, because imagine searching for whatever or whoever you want and being able to see and analyse it at any resolution, real-time or static. If you think beyond images there are many applications, because it has a similarity to the idea of the mind being multi-resolutional. There's been research into language that points to language and understanding being multi-resolutional in nature (and much older research into the multi-resolutional nature of thought). An example would be that when you hear a word you start to process the sound immediately, branching off continuously as more sound "comes in" until there is a match. It doesn't just take a whole word, then search.

Imagine being able to call someone's name, and the 'net locates them and you just speak to them. It can also locate them visually, and you are there in your nano-skin suit, connected remotely, but it feels like you're there. On the other end, a billion nano-bots gather from the dust and form a shell of "you" (SecondSkin (TM)) that senses everything around it in realtime, but handles the data, because it filters out information in a multi-resolutional way, using the net to compute information determined worthy of note, like a touch, or a fast-moving object in the field of view, in anticipation of you reacting, as humans do to see what's about to smack you in the head. Now you can move around and touch and hear and see. Smell and taste would be further out because they involve chemicals or stimulation of neurons, but heck, if we ever do sci-fi stuff like this then we'll probably have cracked that nut a long time ago.

Richard, you are basically correct in everything you said. It's commonly known that we can find distance through parallax trigonometry; however, JPEGs on the internet don't contain information about the yaw, pitch, and roll of the camera when it took a picture. Nor are we given the coordinates of the photographer from the camera, either.

IF we had all of this information, it would be no problem matching imagery up. BUT WE DON'T. That's why I'm saying I don't think the technology can be implemented in the astronomy field with regard to random pictures of stars which don't contain an identifiable reference point (like a satellite).

Perhaps in the future cameras will give us all this information, but until then, I cannot see this technology working for some aspects of astronomy. For other parts of astronomy, this would work wonderfully, such as in Charles' example of running around on Titan, etc.

IDEA! I was in the restroom, looking in the mirror and washing my hands, when all of a sudden I got an idea regarding this technology. It's simple:

"Why just photos?"

Imagine navigating around your photo gallery, err, walking around St. Peter's Basilica via a series of images all found within the "StPeterBasilica" directory. But guess what, there aren't just JPEGs in there, sucka; there are WMV, MOV, and plenty of other video formats, and those videos have a start frame!

Suppose the FIRST frame of a movie was also surveyed while you're walking around. You look to the front of the Basilica, click a tiny camera icon, and whammo... you're linked up to a 20-second movie of a child feeding birds in front of the Basilica, with the backdrop matching the relevant photography.

i can't resist! i must post this: this is a great video about great and fascinating technology! for me one of the most interesting applications of this might be a more ad-hoc kind of assessment of a given location: let's say you need to know more about the geometry of a place of interest as quickly as possible. video footage or a detailed plan might not be available, but chances are always higher to acquire some photos, or maybe shoot a couple of photos yourself... the barrier of photos is just so much lower compared to detailed cartography methods. then you go ahead and import the stack of photos into the application. using reference heights (which would be easy to define; take normed window heights for instance), the application might be able to determine the 3d position of each spot in the location. you could perform measurements based on the reconstructed 3d model in a visual way without ever having used a measuring device (good old meter band (or whatever they're called in the US..), laser distance measurement or others). you could in no time acquire detailed geometrical knowledge of a location. that'd be interesting for all kinds of professions.
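In the simplest fronto-parallel case, the reference-height idea reduces to a ratio; a sketch of that (real photogrammetry must also account for perspective and camera pose, so treat this as the idealized case only):

```python
def scale_from_reference(ref_pixels, ref_meters):
    """Meters-per-pixel implied by an object of known size, e.g. a normed
    window height, measured in the reconstructed view."""
    return ref_meters / ref_pixels

def measure(pixels, ref_pixels, ref_meters):
    """Estimate a real-world length from a pixel measurement and a reference
    lying in (roughly) the same plane as the thing being measured."""
    return pixels * scale_from_reference(ref_pixels, ref_meters)
```

So if a 1.2 m window spans 100 pixels, a wall spanning 400 pixels in the same plane is about 4.8 m wide.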

again... i can understand why the demo was voted no. 1 at microsoft recently. really great advances we're seeing here.

jsampsonPC wrote: I had thought about something slightly different. What about all the old photos I find of my great-grandparents standing by some barbershop door, or down at their local food mart?

What would the results be if I put one of those in? It would be neat to see locations from 80-100 years ago. Imagine how many photographs of Disney there are, spanning decades.

This could be used as a historical learning device, too.

Cool, Virtual Time Travel!

Completely! Feed this into some goggles as a full fisheye panoramic device and you could actually walk around the place, watching the video unfold before your very eyes. That would be absolutely amazing.

Another way of using this technology might involve capturing digital stills from video. Apps like Movie Maker already have scene detection algorithms built in, and video is likely to cover a large portion of a location. Once the video is converted to still
images, Photosynth can use the same techniques to construct the 3D landscape.

The combined effort would make a killer data visualization and organization interface inside Windows Explorer, potentially making media/document thumbnails "windows" into the actual data, and allowing in-situ viewing/editing of content directly in Explorer,
with a seamless transition to the user's application of choice via a few clicks if necessary.

What about a predictive algorithm for those low-res pictures: a mathematical equation to help the computer guess what's between the dots, like extrapolation of information based on the surrounding pixels?
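The simplest such "guess what's between the dots" is bilinear interpolation, which blends the four surrounding pixels weighted by proximity. A sketch using a plain list-of-lists grayscale image (fancier predictors, like bicubic or learned upscalers, follow the same pattern with more neighbours):

```python
def bilinear(img, x, y):
    """Estimate the value between pixels by blending the four neighbours,
    weighted by proximity. img is a row-major list of lists; (x, y) may be
    fractional coordinates inside the image."""
    x0, y0 = int(x), int(y)
    x1 = min(x0 + 1, len(img[0]) - 1)  # clamp at the right/bottom edge
    y1 = min(y0 + 1, len(img) - 1)
    fx, fy = x - x0, y - y0
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bot = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bot * fy
```

The midpoint of a 2x2 patch comes out as the average of its four corners, which is exactly the "fill in between the dots" behaviour asked about.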

The technology is amazing, but I would like to see some proof that this works in real-life circumstances too (I have not seen this C9 video yet; sorry if Blaise explained it all).

My questions:

- How does Photosynth deal with different light conditions? Not only day and night (which is fine, you can have two 'worlds', one day, the other night) but 11am and 5pm, clouds and sunshine, rain and snow, wear and tear, paint jobs and graffiti.
- How much work is put on the user to group the right photos together?
- What cluster do you think would make the most sense for a global database of 'photosynth space'? Country? City? District? etc.? You are not going to match all pictures in the world against all... are you?
- As a research company, when are you publishing the algorithms used to match pictures together?
- Is the demo WPF-based? (I see DirectX mentioned with Seadragon, so maybe not. Why are you not yet on the WPF bandwagon?)

That's all for now. Blaise, I couldn't find a biography of you, although I saw your work related to the Gutenberg Bible... So cool what technology enables. Probably the Princeton physics and applied math background helps. Were you born in the US or in Europe?

As I expected, most of my questions were answered by Blaise. My other question: are you already leveraging the 1.0 final draft of WMP (Windows Media Photo)? Not sure what is available in that space today...

If you are planning to really publish an API for all this technology (something I didn't expect), create well-designed PIAs (Primary Interop Assemblies) for the API. Well, if you don't then I will, and push it out into the public commons. Still, sometimes I am shocked at how different groups at MSFT design COM APIs in 2006 with absolutely NO consideration of how they would work and interoperate in a managed environment. Even if the COM API uses constructs often used in the unmanaged world, such as in/out parameters, shared buffers, etc., it is not hard work to come up with a nice wrapper/interop class. Been there, done that.

Anyways, can't wait to see this in action and see how it deals with spaces that are not as wide and open as St. Peter's. How would this work with an apartment that is put up for rent? 5-megapixel images of all rooms, all details. I bet you thought about that too.

Gabor

New here - I'm a PM working for Blaise on this and thought I'd help out a bit (since he'll be on his way to SIGGRAPH shortly).

Lighting conditions have no bearing on the photo matching - the algorithms look for point features in the photos, so brightness and contrast don't matter at all. The photos you see in the demo were ones I took, and I did some retouching of them after the fact, and that made no difference at all. Right now the only thing you can't do to photos is crop them - we rely on focal lengths being true to locate the camera positions. (I've wanted to play around and make a collection of all tweaked-out photos and turn this into an art form - imagine San Marco in various black/white, high-contrast, pushed colors, etc.)
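A rough sketch of why retouching doesn't hurt the matching (this is illustrative, not Photosynth's actual code): point-feature matchers typically normalize each image patch to zero mean and unit variance before comparing, so an affine brightness/contrast change cancels out of the match score entirely.

```python
import math

def normalize(patch):
    """Zero-mean, unit-variance version of a flat list of pixel values."""
    n = len(patch)
    mean = sum(patch) / n
    var = sum((p - mean) ** 2 for p in patch) / n
    std = math.sqrt(var) or 1.0  # guard against perfectly flat patches
    return [(p - mean) / std for p in patch]

def ncc(p1, p2):
    """Normalized cross-correlation of two patches: 1.0 is a perfect match."""
    a, b = normalize(p1), normalize(p2)
    return sum(x * y for x, y in zip(a, b)) / len(a)

patch = [10, 40, 90, 160, 90, 40, 10, 5, 2]
brighter = [2 * p + 30 for p in patch]  # contrast doubled, brightness +30

print(round(ncc(patch, patch), 6))     # 1.0
print(round(ncc(patch, brighter), 6))  # also 1.0 - the lighting change cancels
```

Cropping is different: it changes the effective focal length implied by the EXIF data, which is exactly what the camera-position solver relies on.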

When we do the matching right now, we submit a set of photos that we believe should match due to the overlaps included - other than that, no effort is required of the user to do any manual alignment.

Clusters - yes - all of the above. That's the goal. Of course - there are ways to cheat and not just match points - many photos have text that can be extracted, GPS coordinates baked into EXIF can be used, etc.
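One way the GPS "cheat" could work, as a hedged sketch (names and the 300 m radius are illustrative, not Photosynth's actual pipeline): bucket candidate photos by the coordinates in their EXIF before doing any expensive point matching, so only nearby photos are ever compared.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    R = 6_371_000  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

def cluster_by_gps(photos, radius_m=300):
    """Greedy single-link grouping of (name, lat, lon) tuples by proximity."""
    clusters = []
    for name, lat, lon in photos:
        for c in clusters:
            if any(haversine_m(lat, lon, la, lo) <= radius_m for _, la, lo in c):
                c.append((name, lat, lon))
                break
        else:
            clusters.append([(name, lat, lon)])
    return clusters

shots = [
    ("sanmarco_1.jpg", 45.4340, 12.3388),  # Venice
    ("sanmarco_2.jpg", 45.4342, 12.3390),  # ~27 m away
    ("stpeters_1.jpg", 41.9022, 12.4539),  # Rome - a separate cluster
]
print([len(c) for c in cluster_by_gps(shots)])  # [2, 1]
```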

WPF - we use DirectX in the viewer, but we definitely use WPF, specifically Photon for encoding images - if you see Blaise's earlier comment on JPEG 2000, we are using Photon in a similar way, since it offers the multi-res capabilities we take advantage of.
-jonathan
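The multi-res capability Jonathan mentions can be sketched in Deep Zoom / Seadragon style (this is an illustration of the pyramid math, not Photon's actual file format, and the 256-pixel tile size is an assumed, typical value): each level halves the previous one until the image fits in a single tile, so the viewer only ever decodes the tiles covering the current view at the current zoom.

```python
import math

def pyramid_levels(width, height, tile=256):
    """Yield (level, w, h, tiles_x, tiles_y) from full resolution down to one tile."""
    w, h = width, height
    level = math.ceil(math.log2(max(width, height)))
    while True:
        yield level, w, h, math.ceil(w / tile), math.ceil(h / tile)
        if w <= tile and h <= tile:
            break
        w, h = max(1, (w + 1) // 2), max(1, (h + 1) // 2)  # halve, rounding up
        level -= 1

# A 5-megapixel shot (2592 x 1944): the full-res level needs 88 tiles,
# but the coarsest level is a single tile, cheap to fetch and decode.
for lvl, w, h, tx, ty in pyramid_levels(2592, 1944):
    print(lvl, w, h, tx * ty)
```

This is why zooming feels like images "opening and closing": finer levels stream in only for the region you are looking at.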

Clusters - yes - all of the above. That's the goal. Of course - there are ways to cheat and not just match points - many photos have text that can be extracted, GPS coordinates baked into EXIF can be used, etc.

I can see the GPS EXIF information being used to identify the location where the photos were taken, by looking up the coordinates in Virtual Earth and writing the location into the files' metadata automatically - just like Media Player automatically finds artist and track names for audio CDs and writes them into the ID3 tags of MP3s.

Jonathan, have you guys considered leveraging video in your application? By extracting the first frame of local video files, you could include those in your atmosphere, no? If somebody took a picture of their local school, and somebody else had a short video of themselves skateboarding in front of the same school, you could overlay a camera icon at the video's location?

OK, geeky question: how much processing horsepower does this thing take? If I feed it 1,000 5 MP photos of a tourist trap, how long (roughly) does it take to churn through them and create a 3D model like you showed in the demo? What if, instead of individual photos, I feed it a video of a walkthrough of a house? Could it use each frame and build a high-resolution 3D model of the inside of the house?

IDEA! I was in the restroom, looking in the mirror and washing my hands, when all of a sudden I got an idea regarding this technology. It's simple:

"Why just photos?"

Imagine navigating around your photo gallery - er, walking around St. Peter's Basilica - via a series of images all found within the "StPeterBasilica" directory. But guess what: there aren't just JPEGs in there, sucka, there are WMVs, MOVs, and plenty of other video formats - and those videos have a start frame!

Suppose the FIRST frame of a movie was also surveyed. When you're walking around, you look to the front of the Basilica, click a tiny camera icon, and whammo... you're linked up to a 20-second movie of a child feeding birds in front of the Basilica, with the backdrop matching the relevant photography.

What do you guys think?

Hi -

I have actually seen this idea in action (done a little differently) - it was on an Australian-produced show called 'Beyond Tomorrow' (which I believe is shown by the Discovery Channel in other countries).

The concept was that you walk around a landmark with a VR headset (with audio) and combined GPS, which switches itself on when you reach particular vantage points (which it guides you to).

In the example they chose an old castle in Italy, and used actors to 're-enact' ancient scenes (such as soldiers doing their rounds or arresting someone), so you could get a first-hand look at what happened in that place x hundred years ago (complete with audio, etc.).

I can't for the life of me remember what the invention was called (but someone else might remember here?)..

From a practical point of view, I don't think this is a real concern. Why? Random snapshots of the sky I might take from my backyard won't provide any additional useful info to an astronomical database, since pretty well anything visible with the naked eye has already been photographed and catalogued by real astronomers (not to mention the huge number of objects which are not visible).

However, huge collections of photos taken by astronomers (both professional and amateur) do exist - and guess what? They're already tagged! Every object in the collections is catalogued and has been assigned coordinates (though there are competing astronomical coordinate systems - I remember that from the Jim Gray video in which he discussed SkyServer, among other things). So images can be grouped according to the coordinate metadata associated with each photo. Essentially, they'd just have to point this sucker at the SkyServer library (with mods to allow it to use the coordinate info) and let it rip...

The Photosynth apps are myriad. Uh, it's not about the photos as end products; it's the apps, and the apps are not merely economic at their core. Brain-up to genetic imagery in motion: a strand of DNA observed, tilted, twisted and fully appreciated. Brain-up. :O

Is this technology also able to create a 3D version from pictures of a person? I would guess this is more difficult, since it's practically impossible for a person to keep the same position between two pictures at different angles.

Oh, that will be cool! I wanna do that right now! I'd imagine it's a simple task of just standing there for a second while a friend walks around you and takes 30 pictures or so.

Heh, I see a medical/portrait art piece, "Dental Checkup" to be realized in the near future...

Say, Niall, that invention is called "augmented reality", yeah?

And re: astrophotogrammetry - the program Celestia is out there working wonderfully right now. It's not based on photo location, but it uses photos as texture maps on coordinate-located 3D models of planets, stars and galaxies.

I have been waiting for this for a few years, reading all the "computer vision" whitepapers as research continues.

I am currently on a project which aims to use 3D photogrammetry to build a simulation of an art gallery. You just saved me from having to invent the software myself!

I'm loving the

IDEAS!!

Time travel is going to be the big one. Time Traveler in a GeoWall-based holodeck... (dual polarized USB projector for Xbox 360?)

Anyhow, I'm curious whether there are more apps from this team on the drawing board - VideoSynth being an obvious one. Pixel tracking from video processing yields point-cloud results similar to Photosynth. Ya know.

Photosynth would make the ideal front end for a new-media quickening.

In my gallery simulator project, we are striving to provide the user with a rich new-media experience inside the 3D model: videos that can be watched, paintings with amazing zoom and artist metadata, an authentic sound environment, etc.

PhotoSynth is a step in that direction eh?

I would LOVE to use it sooner rather than later. I'll definitely use it for an exhibition in October. Is there a special artist beta I can use today?!? Please release this as soon as possible.

I liked what someone was saying about not having an economic core; it bases this work more in creativity. But you know, once you can fly around inside stores, clicking something to buy it will be the hot e-commerce way!

Another idea I can share from the gallery simulator project is that opening events could be webcast. You just pixel-track the camera person through the environment and path the feed to a moving "perspective window". This will be a new-media technique for live newscasts as well. (Imagine a Google Earth... er, Microsoft Earth, with GPS + video tracking of a live reporter in the field; when they cover the next hurricane, you can zoom out and see where other networks are reporting from, or zoom WAY OUT to see the weather data.)

Live video perspective windows will be unspeakably cool when I'm flying around the football stadium deciding which "camera" to watch. From the cool instant replays they had last season, I think they're already using this technology.

Get that launched in time for the Super Bowl, y'all. Microsoft partners with the NFL. Did I earn my beta yet??

Is this technology also able to create a 3D version from pictures of a person? I would guess this is more difficult, since it's practically impossible for a person to keep the same position between two pictures at different angles.

This technology is already in effect, and has been for well over a decade. It is the very technology toy manufacturers use when developing models of superheroes, action figures, etc.

I saw a demonstration of it; really exciting.

My wife works in the dental field, and tells me of very interesting software they use for constructing crowns. She works on a large military base, so they have some pretty neat toys.

She just told me that when building a crown, they take a few pictures of the offending teeth with an intraoral camera, cut the teeth, take a second set of photos of the modified teeth, and then feed the imagery into a computer.

At that point, a laser carves the new crown at a 1:1 ratio - pretty neat, huh?

Fantastic - this could be a much better way of recording archaeological sites than the current 2D photos and drawings. I am guessing the moment any archaeologists see this, they will be knocking down your door - well done.

We're heading off in a bit on a four-week vacation in Europe... I guess we'll up our expected digital camera shots from the 10,000 level to 100,000 or more, to have some pix to experiment with when we get Photosynth.

You're very welcome, Blaise. We were bowled over by the video. We are currently building a new site which offers virtual interactive tours down streetscape panoramas of the world's most famous streets. Coming soon are London's Oxford Street and NYC's 5th Ave, and there are already a few beta streets on there right now. You can even click on the shop doors and 'walk inside' their online stores.

Would people like to see Superhighstreet.com's and Live Labs' technology merged? Perhaps now is an ideal time to discuss this further with you guys at Microsoft - that is, if you agree there's an interesting partnership possibility there, Blaise? Our mutual goals seem to be the same: the ultimate dream of an augmented reality where people from Indonesia can walk down a street in London and see all the 'sites'.

We'd be very interested in feedback from anyone here on our site's usability and potential future ideas too. We're working hard to release Oxford Street within a couple of weeks, so join our updates list for a spam-free announcement then.

Keep up the amazing work. Can't wait to see the beta. Drop us a line if you like the idea of beating Google to it, as I hear they are working on something similar to A9's photo street mapping (not quite what we're doing, as ours is of course higher resolution, with sound effects, etc.).

Hi, it's about time Microsoft got involved. I have been looking for a platform on which to build out plamashow.tv as a visual HDTV channel experience, to see and review any kind of product in detail, tagged with audio or text descriptions. I have looked at the Kakadu software that is based on JPEG 2000; I hope we can test a beta server/client soon.

Forget all the possibilities for medical uses or astronomy. When I saw this video (and later checked out the website), this whole... thing (idea, program, heck, even the icon) screamed Apple! It seems like something Apple would have done. It makes me a little sad MSFT (er, Seadragon) did it, and not Apple. But I am happy nonetheless (and very much impressed) that Microsoft did do something like this. It is truly stunning.

But what I would really like to see is for this to open up somehow / get ported to other platforms. I don't care if it goes open source, but I would just LOVE to use this instead of iPhoto (yet still have it integrate with all the other apps).

Obviously, I am a Mac user, but that doesn't mean I have to hate everything else. It really feels like something I would see on a Mac. Blaise, if you're listening, I hope that if it can't be brought over to the good side, it at least gets the integration it deserves on its own platform. I think that's one thing MS could really learn from Apple (well, NeXT actually): apps should work together, and it shouldn't be hard for one thing to work with another.

Suggestion: If a bunch of photos are stitched together in 3D space, some of which contain GPS info in the EXIF metadata, it should be pretty easy to calculate GPS data for the untagged photos in the mesh and assign them appropriately.

I don't have a GPS and have been searching for an easy way to blast that info into my photos.
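The suggestion above can be sketched very simply (a hedged illustration, not how Photosynth actually does it): once the cameras are placed in the reconstruction's model space, an untagged camera's GPS can be estimated as the inverse-distance-weighted average of its tagged neighbours. A real implementation would fit a full model-space-to-Earth similarity transform; this is just the minimal version.

```python
import math

def interpolate_gps(untagged_xyz, tagged):
    """Estimate (lat, lon) for a camera at untagged_xyz in model space.

    tagged: list of ((x, y, z), (lat, lon)) for the GPS-bearing photos.
    Uses inverse-distance weighting in model space.
    """
    weights = lat_acc = lon_acc = 0.0
    for xyz, (lat, lon) in tagged:
        d = math.dist(untagged_xyz, xyz)
        if d == 0:
            return (lat, lon)  # coincides with a tagged camera
        w = 1.0 / d
        weights += w
        lat_acc += w * lat
        lon_acc += w * lon
    return (lat_acc / weights, lon_acc / weights)

# Two tagged cameras 10 model units apart; the untagged one sits midway,
# so it should land at the geographic midpoint.
tagged = [
    ((0.0, 0.0, 0.0), (41.9020, 12.4537)),
    ((10.0, 0.0, 0.0), (41.9024, 12.4541)),
]
lat, lon = interpolate_gps((5.0, 0.0, 0.0), tagged)
print(round(lat, 4), round(lon, 4))  # 41.9022 12.4539
```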

Suddenly, taking a lot of essentially free pictures (no film) of a structure with a digital camera makes a lot of sense. Once Photosynth has been released and is easy to use, it will revolutionize real estate, because it will be very easy to show a 3D visit of a house once enough pictures are taken. It will also increase sales of digital camera memory modules (one module per set of pictures of a structure). It will affect Flickr in part, too: Flickr mostly has two kinds of pictures, people and structures, and the structures side will require something like this, because now (after Photosynth) a set of pictures together is more important than any individual picture.

The phrase "paradigm shift" is often misused, overused and abused. It takes only the slightest imagination and awareness to see that the insights that led to Photosynth are indeed the beginnings
of a paradigm shift, part of the steadily accelerating advance towards the so-called
technological singularity.

While much of its application is embodied and embedded in the web, I believe the impact of Photosynth and its cloud of applications will be at least that of Google, and perhaps even that of the web itself. It is a bold statement, but it may even rank higher than the advent of television.

It's not as though pieces of this technology haven't been brewing for quite some time in the minds of many (including my big empty brain). But sometimes magic happens... someone asks "I wonder if..." and proceeds to test their hypothesis. Others recognize the
discovery's significance and the future is changed. Not in a small evolutionary step, but in a huge revolutionary flight. And it comes in a rush and a flash as connections are made in the imagination of another, then another, and still another.

While the web brought us connections and information, and Google brought us needles from the haystack, nothing is more potent than immediately touching the senses directly, without a process of interpretation. Photosynth will do that; it will bring us the world,
visually. And even more...

The universe is a hierarchy of emergent systems, including each of us individually and our emergent societies. Photosynth will modify us individually and socially. And even more... it will bring massive and
coherent sight and visual correlation across space and time (omnipresence, or at least omni-sightedness) to whatever that next level of emergence is, or is shaping up to be.

Whew! Got carried away there... um, maybe.

Now, be sure to take photos with those 15-second sound bites. They'll create a wash of 3D aural delight when the similarly registered bells toll in the cathedral square, or the sax man plays his tune as you window-shop down the Mag Mile in Chicago. Oh, and tell them to hurry and get those bio- and enviro-sensors embedded in the next-gen cellphones - man does not virtualize by sight alone!

Penn State researchers have "taught" computers how to interpret images using a vocabulary of up to 330 English words, so that a computer can describe a photograph of two polo players, for instance, as "sport," "people," "horse," "polo."

Interestingly, I know for a fact that Blaise gave a demo of the zooming technology to Steve Jobs at Apple HQ in early June 2003 - Steve decided to pass on the technology at the time, saying that Apple was 'already doing that'. I guess we'll see if Leopard's
resolution independence provides anything as cool and smooth as this. I hope it does, but I haven't yet seen anything as good as Blaise's demo.

