Sunday, October 24, 2004

A new paper (PDF) on MapReduce, the programming model used by Google for processing and generating large data sets, is now up at Google Labs. I found this passage interesting:

We wrote the first version of the MapReduce library in February of 2003, and made significant enhancements to it in August of 2003, including the locality optimization, dynamic load balancing of task execution across worker machines, etc.

The Google index update that followed was in mid-November, and it was the most talked-about update ever. Coincidence?

Fidelity Investments, the world’s largest mutual fund manager, bought $549 million of stock in search engine Google’s IPO, about 23 percent of the shares offered during the initial public offering. Fidelity reported in a filing with the U.S. Securities and Exchange Commission that it now holds 5.21 million Google Class A shares. That’s about 16 percent of Class A stock and 1.9 percent of Google’s total shares outstanding.

The Boston Herald reports

“We don’t comment on buying or selling of individual securities, nor on SEC filings,” said a Fidelity spokesman yesterday. Fidelity’s fund managers tend to be shrewd, but on balance conservative, stock pickers. “Fidelity has a very good research department, and a particularly good group of technology analysts,” adds John Bonannzio of the Wellesley-based research firm Fidelity Insight.

The Herald adds that Fidelity could be sitting on a $100 million profit in a month - wow.

This last line is the best! Not $100 million but at least $325.2 million in profit in around 40 days, if you look at these numbers:

5.21 million shares (bought in August and September) x $110* = $573.1 million
5.21 million x $172.43** = $898.3 million

* Highest price reached since the IPO at the time the purchase was disclosed
** Price at today's close
Difference: at least $325.2 million in less than two months!!!
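The arithmetic above is easy to verify; a quick Python sketch using the figures from the post:

```python
shares = 5_210_000        # Class A shares Fidelity reported holding
buy_price = 110.00        # highest price since the IPO when the buy was disclosed
close_price = 172.43      # today's closing price

cost = shares * buy_price        # roughly $573 million
value = shares * close_price     # roughly $898 million
profit = value - cost            # roughly $325 million, far more than $100 million

print(f"profit: ${profit / 1e6:.1f} million")
```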

Thursday, October 21, 2004

Is it enough to measure success by revenues alone? How about when your company name actually becomes a hip verb, when your site is used for everything from getting the lowdown on possible love interests to finding critical medical information? Or when your IPO becomes the cause célèbre of the business world?

Google is an Internet phenom if ever there was one. But unlike so many earlier phenoms that tanked, Google has revenues. Indeed, they posted $220,000 in revenues in 1999 and $961,874,000 in 2003, an increase over five years of 437,115 percent.
A play on the word googol, which refers to the numeral 1 followed by 100 zeros, Google is now an unlikely household name. The search engine company first opened its doors in 1998, and while in beta mode, answered 10,000 search queries a day. Word got out and in one year, the service was answering more than half a million queries daily and had Red Hat as its first commercial search customer. Oh, and they tossed the beta label, as well.

Tuesday, October 19, 2004

Microsoft founder and chairman Bill Gates must see Google everywhere he looks these days. He must even see Google when he closes his eyes, and enters that lucid dreaming state from which all of Microsoft's great strategies eventually emerge. What he sees at that moment, we imagine, is a Tellytubby landscape that looks a lot like the Windows XP default wallpaper - perhaps with Chairman Bill himself as the sun. But bouncing across this happy vista are the red, green and blue colored balls that have rolled out of the Google playpen.

But after reading the other ones, for one of the first times I agree with a Reg writer's take.

Meanwhile, and this is even more astonishing, digital TV has become an object of widespread derision in the UK. For the first time in its history, the word "digital" has negative brand connotations. Such is the pushback against glitchy digital TV streams, full of drop outs and hiccups, and hard-to-use controls, that people are beginning to clamor for the analog signal to remain on. "Digital" now means "crap", which should give lazy marketeers some pause for thought. TV is becoming associated with the kinds of problems people associated with PCs. It's true a few programme formats lend themselves naturally to some form of interactivity: particularly live TV which invites vox pop polls or comments. But these gimmicks actually get in the way of programming with a conventional narrative pull: such as a movie, a drama or a footie match.

That one reminds me of a party I had a few days ago with friends I hadn't seen in a while. They were around 28 to 45 years old, and most of them are fairly new (around 1 to 3 years) to "new" technology. I was shocked to see the whole party end up focused on a joystick:

They were completely addicted to an Xbox boosted with a 200 GB drive full of MP3s, video clips and recorded TV shows, all handled by an easy-to-use GUI program. Almost 75% of the people there already own a DVD player and use computers at work, but they were totally obsessed by the Xbox hack on the TV, where a joystick gives you total control. It made me a bit sick to watch the TV and the Xbox take control of the party ;-)

A little-known Microsoft (Windows 2000 and up) service is the Indexing Service catalog. It is not very easy to set up, and it builds an index, like GDS and CDS, before you can use it. After that, it shows you this form to perform your query:

Well, it's fast, and that's it. In a test query I ran, MCI versus a normal (slow) search with the search tool provided by Microsoft, it found only 8 of the 19 files I have: MCI missed 7 Word documents and 4 Word backup documents. But it is the only tool I have seen so far that also searches inside CSS files.

The lack of a nice GUI, the number of clicks needed to reach it, and a glitch in the Advanced Query Syntax help screen gave me a very bad first impression.

Monday, October 18, 2004

Good news for the SEO and SEM industry. At the Web 2.0 Conference, the presentation (PowerPoint) by Gian M. Fulgoni of comScore reveals that power users of search engines are clearly better customers than light searchers or non-searchers.

Friday, October 15, 2004

I just tested the Slogger Firefox extension, which saves every kind of page (PHP, WebSphere, ASP, etc.) as HTML in a folder of your choice, to let them be crawled by Google Desktop. It works amazingly well. Even that page, generated by WebSphere, which will never appear in the Google index because of its uncrawlable URL and construction, now appears in my Google Desktop queries!

With that extension, Google Desktop can now see ALL the pages you surf, and you can get the textual information back with a GDS search. Yes!

Here is a Google query looking for a few words from the page linked above. No URLs point to www.entreprises.gouv.qc.ca, because Google cannot crawl that page. But with GDS teamed up with Firefox and the Slogger extension, I can see it in my "Results stored on your Computer".

If you install Google Desktop with the default settings and then check your bank account via the Web, there is a good chance you will see all your account information indexed by the desktop application.

If you then search for your bank's name, you have a good chance of seeing cached results with your full credit or debit balance at the top of your screen. Frightening!!!

You should uncheck this box fast, or at the very least Google Desktop should display a warning each time you enter a secure zone over HTTPS, especially because it is not so easy to find and delete that information from your hard disk.

Can anybody tell me why the people at Google leave this option checked in the default preferences?

Update: I know that anyone a bit tech-savvy could see all that information using my computer; the point is that GDS makes it so easy and fast to do, even without searching specifically for that private information!!!

Here again, Google remains vague and doesn't offer a minimum of clarification!

Time to crawl

Twice as long as Copernic to crawl fewer files!

Partial file crawling

Seems to have the same 101 KB limit as on the Web. That makes me really mad.

Html crawling

Yes, but only if your file has a .html extension; if it's .htm, you cannot see it! That really sucks!

IE only for Web browsing history

Even if it detects Firefox or Opera as your default Web browser, it does not index the major hot thing in those programs: the Web history!

Meta Indexing of your HTML files

No

Date of file indexed

The dates of Web history files are OK. But every computer file I haven't opened since today's indexing run shows the same date, August 5??? Yet when I transferred 30k files a few months ago from an English NT 4 with international date settings (dd/mm/yy) to my new computer running a French version of XP with dates as (yy/mm/dd), there was no problem at all!!!

I think I will stop here, take a cold shower, and try a bit more tomorrow before uninstalling it!

Tuesday, October 12, 2004

It seems to be a smaller database than they claim, but it is very good at handling over-optimized and SE-spammer sites, with very nice results for clean little sites, fair-to-good results for small-to-middle sites (200 - 1k pages), but poorer results for big ones, and also for government web sites. The clustering is far from Clusty's, and the web site location is just average, like all the other services (and, with current technology, it is very hard to do better for now).

But it is still in beta, and it is already giving very good, effective results considering the really small delay they had to put it up (and from sources different from all the other "new" ones), with very good new tools and options to dig deeper, and a not-so-bad user interface (they should take a slightly more minimalist view here). With all the good and numerous advanced search features not seen anywhere else, you should keep an eye on it.

I hope, for Google's sake, that Exalead will not merge or make a deal with Vivisimo, because if they can manage better clustering, merge databases, or crawl the biggest sites deeper with a little tweak in the algo, the Goog will fall under a hundred soon ;-)

My first thought was that Google would come back with related searches or a clustering engine, but I was not sure enough, because relpage is also used in some database query languages.
But now, with the screenshot taken two days ago and posted today on Marc Duval's blog, we have proof that Google will come back with it soon.

Exalead supports regular expression patterns. Patterns are introduced by a slash ('/') character. Within a regular expression, '.' is a special character that can represent any character, '*' stands for character repetition, '|' stands for 'or', and parentheses are used to group characters. '?' is placed at the end of a character group to make it optional.

Example:
/s.ren..pi.y/ searches for documents with words that match the pattern s.ren..pi.y; this can be very useful to finish your crossword puzzles!
/mpg(1|2|3)?/ searches for documents containing any of the following: mpg, mpg1, mpg2, or mpg3.
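Exalead's pattern syntax maps directly onto ordinary regular expressions, so both examples can be checked in Python (the sample words are mine):

```python
import re

# /s.ren..pi.y/ : '.' stands for any single character
assert re.fullmatch(r"s.ren..pi.y", "serendipity")

# /mpg(1|2|3)?/ : '?' makes the parenthesized group optional
for word in ("mpg", "mpg1", "mpg2", "mpg3"):
    assert re.fullmatch(r"mpg(1|2|3)?", word)
assert not re.fullmatch(r"mpg(1|2|3)?", "mpg4")
```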

Friday, October 08, 2004

When you use the Firefox search box, Google adds the client parameter to the URL, and sometimes another one named rls. When the &client= parameter is there, you get a lot fewer results. Here is a query for the word "the" (2.5 billion results); without the &client= parameter, you get 5.9 billion results!!!
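A sketch of the difference, assuming the client value firefox-a (the exact string your Firefox build sends may differ):

```python
from urllib.parse import urlencode

base = "http://www.google.com/search?"

with_client = base + urlencode({"q": "the", "client": "firefox-a"})
without_client = base + urlencode({"q": "the"})

print(with_client)     # the URL the Firefox search box builds (fewer results)
print(without_client)  # the same query without the parameter (full count)
```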

If Yahoo really starts this project, I think it will be harmful to them, or to RSS if other SEs also go in that direction. I hope they will find another way to make money instead of corrupting the first easy way to make the web more semantic!

A little history of Sequoia Capital venture capitalist Michael Moritz, who put $12.5 million into Google in '99.

Moritz is too modest to say how much he has made personally from the sale - he is reported to have received up to $280 million in cash, plus stock worth more than $1 billion - although he dismisses reports he is now a billionaire as "wickedly overstated".

If you could command sites like Google to make one or two big changes, what might you require?

Clearly textual clustering or summarization, which is what we are seeing from companies like Vivisimo, Autonomy, Semio, and Verity. This technology on top of a Google-type search engine, which reaches even further into the rich data sources available over the internet, will make the web even more useful than it is today.

Are there major trends in WM standardization that you think people should be watching more closely?

XML is already the de-facto standard for improving the ability for machines to interpret textual repositories. I would anticipate that XML will become as pervasive as HTML. We in fact may still see a de facto standard emerge.

I hope so, because since '99, XML seems to me to be used only to produce sites from databases, not really to structure the web for easier machine interpretation of the pages produced.

Sunday, October 03, 2004

Only 7% of the sources Topix.net crawls have XML feeds. I'd estimate that only a few hundred of the top 3,000 newspapers we crawl have RSS support.

After reading the post on the Topix blog, I'm a bit surprised that no engine except Yahoo clearly encourages or stimulates the use of RSS on our sites. RSS is not only useful for news sites, but for a lot of sections of any site. Every one of those pages (press releases, portfolio, calendar, careers, new products, investor relations, and more) should have an RSS feed.

You save bandwidth, and your clients save a lot of time by not visiting a page that hasn't changed since their last visit, and by not having to manage the spam they receive in their email box because they subscribed to your press releases.

Why not give bonus points in the rankings, or a special category, or a logo on the SERPs (like Yahoo does), to those who take the time to give search engines a more structured feed? If the Google crawler sees that feed, it doesn't have to run complex algos to calculate the "weight" of the words on the page, which saves computing time! Why not encourage those structured feeds?
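Producing such a feed is cheap; a minimal RSS 2.0 sketch in Python, with made-up channel and item values:

```python
import xml.etree.ElementTree as ET

def make_rss(title, link, items):
    """Build a minimal RSS 2.0 feed; items is a list of (title, link) pairs."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    for item_title, item_link in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = item_title
        ET.SubElement(item, "link").text = item_link
    return ET.tostring(rss, encoding="unicode")

# Hypothetical press-release section of a company site
feed = make_rss(
    "Example Corp - Press Releases",
    "http://www.example.com/press/",
    [("Q3 results announced", "http://www.example.com/press/q3.html")],
)
print(feed)
```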