Update: Microsoft rechristened Windows Live Classifieds as Windows Live Expo on or about 1/4/2006. A new MSN Spaces Windows Live Expo blog, which has two entries as of 1/19/2006, announced the name change.
Steve's post links to Greg Yardley's "Microsoft's 'Fremont' a Craigslist competitor" item, which includes marketing information about Fremont obtained from a Chinese MSN Spaces posting. Greg links to Adam Herscher, a Microsoft program manager (PM), who said, inter alia:

Unfortunately, Froogle Local isn't much help yet -- I can't find these products anywhere in Los Angeles county. I guess it's going to take some time and effort to woo over merchants. [I'm also wondering whether this is a market Windows Live Fremont intends to compete in, or if Fremont's goal is just to be a "Craigslist-killer".] I imagine other competing services will crop up over time too. Google Base, which powers Froogle Local, provides a mechanism for merchants to do bulk imports. But in addition to allowing merchants to regularly push this information to Google Base and n other services, it'd make sense to support a standardized pull mechanism too.

Note: Adam removed the first paragraph's sentence in brackets from his current online posts, but Greg had the foresight to add a link to the cached version from which the missing sentence was extracted.

Unfortunately, access to the beta version currently is limited to folks with pre-registered group e-mail addresses in the microsoft.com domain:

Products Listings vs. Classified Ads
Google got the jump on Microsoft by signing up organizations who could quickly add large numbers of entries by bulk-uploading files to Google Base.

As an example, most of the more than 13 million Products items come from bulk uploads of ShopLocal LLC information on retail chain stores—CircuitCity, CompUSA, KMart, OfficeDepot, RadioShack, Staples, Target, and Walgreens—as of November 29, 2005. SF Mobile, Inc.'s single store appeared with 2,523 items on November 30. Products listings power the Froogle Local shopping service, which has more in common with yellow pages than newspaper-style classified advertising.

Note: The number of Products items grew by about one million (to 14,480,050) from November 29 to November 30, 2005. The number then dropped to 10,005,921 as of December 2, 2005, possibly as the result of removing spam or inactivating items that don't comply with Google Base's Program Policies.

Google relies on bulk uploads for the majority of entries for common classified item types. Online rental agents/aggregators—such as Southern California's ApartmentHunters (851 of 52,406 entries)—dominate the Rentals category. The New York Times contributes its real estate classifieds to the Housing category (10,610 of 432,575 entries). Similarly, used-car wholesalers and marketers—such as CarCast—upload bulk listings to Google Base's Vehicles category. Item types less amenable to bulk uploads have far fewer entries; for example, the entire Services item type has only 8,567 entries and Events and Activities has 3,991 items.
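The bulk-upload mechanics are simple: Google Base accepts a tab-delimited file whose first row names the item attributes and whose remaining rows each describe one item. The attribute names and values below are an illustrative sketch of such a file, not a verified Google Base schema:

```
title	price	location	description
3BR apartment near downtown	1450 usd	Los Angeles, CA	Hardwood floors, parking included
2001 Honda Civic LX	6900 usd	Pasadena, CA	One owner, low mileage
```

This low barrier to entry explains why aggregators with existing databases can flood a category overnight, while manually entered item types lag far behind.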

Note: This entry was updated on December 2, 2005. Most item counts were made on November 29, 2005.

Microsoft appears to be aiming at social networking to drive traffic, and thus entries to the Fremont database. Here's the Fremont promo text quoted from the Chinese blog:

This product represents a unique offering by Microsoft to address the person-to-person marketplace. The product, code-named “Fremont,” is a dynamic new listing service that enables people to easily buy, sell, or swap among friends, co-workers, or the public. Fremont enhances your ability to:

* Connect with those you trust – your messenger buddies and your coworkers,
* Locate items in your neighborhood or across the country through integration with MapPoint and Windows Live™ Local (formerly Virtual Earth),
* List easily, instantly, and for free.

Note: Virtual Earth's name has changed to Windows Live™ Local. (Pinging local.live.com returns 65.55.241.141; tracert times out at an msn.net router, so prepare for a move in the near future.)

Dan Farber's ZDNet "Between the Lines" blog entry of November 30, 2005, "Google Base and Fremont–signs of the Web 2.0 plateau", quotes Microsoft product unit manager Gary Wiseman, who describes Fremont as "a free listing service, with a bunch of twists to make it very unique, such as integration with social networks, in particular integration with MSN Messenger." (Microsoft promises a new Windows Live Messenger beta is "coming soon.") Farber expects Google Base to "become more of a classifieds engine." CNet's Elinor Mills adds more details from Wiseman in her November 29, 2005 "Microsoft tests classifieds service" article.

Windows Live Fremont appears to encourage manual entries by individuals—typified by Craigslist—rather than the bulk uploads that contribute the majority of Google Base entries (e.g., the 13-million Products entries). According to Wiseman, Fremont posters can limit item visibility to those in "their MSN Messenger buddy list ..., in their MSN Spaces network, or in a specific domain name e-mail group."

Dare Obasanjo's December 2, 2005 "Windows Live Fremont: A Social Marketplace" post goes deeper into Fremont's social networking aspect by likening the service to "the bulletin boards in our [college] dorm hallways." Obviously Microsoft doesn't want Fremont to poach users from the MSN Shopping service, which received a major facelift this year. Thus Fremont appears to be targeted as, in Adam Herscher's original words, a "Craigslist-killer." It's not surprising that Microsoft has Craigslist in its sights; according to the Pew Internet & American Life Project, Craigslist is #1 in the "Top classified sites" list posted by ZDNet on November 30, 2005. Craigslist grew its unique audience 156% between September 2004 and September 2005, while second-place Trader Publishing Company grew "only" 90%.

MSN Shopping now uses product feeds from eBay, PriceGrabber and Shopping.com in addition to the feeds it has always used from major merchants. The result is more than 27 million product offers from over 7,000 stores. But MSN goes beyond simply aggregating product offerings.

MSN Shopping has a team of category managers who specialize in particular areas, such as consumer electronics and jewelry, and tailor the search/browse experience to meet the needs of consumers shopping for those types of products. The company also aggressively "cleans" data from product feeds to avoid duplicate offerings.

Inventory on eBay is constantly changing and in order to bring our consumers the freshest catalog of choices possible, we parse, load, classify and match the tens of millions of eBay items on a daily basis. We have invested in building out our software platform to handle such high churn workloads and have expanded our server infrastructure to efficiently and quickly ingest the inventory available on eBay, along with the catalogs of our existing merchants and aggregators. All told, we sift through hundreds of millions of items every day.

Obviously, adding "high-churn" items from product auctions requires real-time updates to the back-end database. It appears to me that latency of the current Google Base back end wouldn't support real-time processing of auction items.

Thus it's certain that Fremont will have far fewer entries—and, consequently, fewer page views—than Google Base or MSN Shopping. Google's combination of Froogle Local shopping, a Craigslist clone, and other item types—Events and Activities, News and Articles, Recipes, Reference Articles, and Reviews that aren't related to ecommerce—will result in many more page views and thus greater opportunity for Google Base advertising revenue. However, BusinessWeek's Robert Hof gives Froogle thumbs down in his December 5, 2005 "Froogle: Shopping Made Complex" product review. According to Hof, "Froogle still can't match the leaders [Shopping.com, Shopzilla, and Yahoo! Shopping] in what really matters: quickly researching and finding just the products you want." Forbes' Rachel Rosmarin is even less enthusiastic in her earlier "Google's Empty Stocking" article about three-year-old Froogle, which is still in beta. Obviously, the problems need fixing if Google wants to give Craigslist or competing retail sites a run for the money.

Local search and shopping clearly is today's hot topic among Web marketers and analysts. Even CNet Networks has climbed on the bandwagon with local shopping features for technology products.

Channel Intelligence provides local availability data for more than 1,600 top-selling products. CNet has a group of participating consumer-electronics retailers—Best Buy, Circuit City, CompUSA, and OfficeMax—similar to that originally reported for Google Base Products listings.

Note: It's likely to be a challenge to select appropriate (related but non-competitive) ads for placement on Froogle Local pages, which are sure to be the largest consumer of Google Base's database server resources.

Open vs. Closed Databases
Craigslist has a much more granular set of predefined classified advertising categories than Google Base, a pattern that Fremont will follow with a predetermined—probably SQL Server 2005—database schema. Mills quotes Wiseman: "We started this before anyone knew about Google Base. Having seen what Google Base is doing, I don't think they were aiming for a classifieds service. They don't have a taxonomy of listings like we do. They see it as an open database."

The concept of an open database where online users can define their own schema/metadata is interesting, to say the least. Google Base undoubtedly will be of more interest to developers than Fremont, especially if Google releases a Google Base API. Stay tuned for future posts on customizing Google Base Item Type categories and their attribute name/value pairs.

UIs for Open and Closed Databases
Google Base began life as a general-purpose online database that's intended to serve as a back end to a multitude of Google—and, potentially, third-party—front ends. Thus classified advertisers and buyers aren't likely to be forced to deal with Google Base's current arcane data-entry UI.

People have also been saying how Google Base is a limited or comparatively poor user experience (I made some earlier comments along these lines).

After chatting with Google about it, I think the company is not that focused on the Google Base user experience (although there undoubtedly will be refinements). Google isn’t going to create hundreds of competitive vertical experiences around the data it collects.

Ben Charney wrote "Microsoft Testing Its Own 'Google Base'" for PC Magazine on November 29, 2005. This article version sheds no new light whatsoever on the Fremont beta. However, the "full story" on eWeek says, "Microsoft plans the first public test sometime later this month, according to a Microsoft spokeswoman." Obviously "this month" should read "next month," as there was only one day left in November.

Sunday, November 27, 2005

If the lack of real-time addition of your items to Google Base doesn't discourage you from using the beta version of this new service, perhaps delayed disappearance of items that you bulk-upload to Google Base beta 1 will.

You might also be put off by spurious indication of bulk-upload failures and no indication from the Google gods that they don't like your uploaded content. Google Groups' Google Base Help Discussion group has several active threads from beta testers who've lost the entries they uploaded (click here and here for examples).

Since the original Atom 0.3 XML upload I documented in my "Google Base and Atom 0.3 Bulk Uploads" post, all subsequent uploads of News and Articles or Reference Articles items change from Active Items to Inactive Items when published. These uploads were made with second and third Google accounts associated with my sbcglobal.net (DSL) e-mail addresses rather than my usual compuserve.com (dialup) address. Publishing usually takes several hours; I often wait overnight before reviewing item status.
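For context, the Atom 0.3 entries discussed here carry the dateTime elements (issued, modified, and created) that figure later in the duplicate-item workaround. A minimal entry looks roughly like the following; the Google Base namespace declaration and the g:item_type element are my reconstruction, not copied from the actual upload file:

```xml
<entry xmlns="http://purl.org/atom/ns#"
       xmlns:g="http://base.google.com/ns/1.0">
  <title>Sample news item</title>
  <link rel="alternate" type="text/html"
        href="http://example.com/posts/sample.html"/>
  <id>tag:example.com,2005:sample-news-item</id>
  <issued>2005-11-25T08:00:00-08:00</issued>
  <modified>2005-11-25T08:00:00-08:00</modified>
  <created>2005-11-25T08:00:00-08:00</created>
  <content type="text/html" mode="escaped">Brief description of the item.</content>
  <g:item_type>News and Articles</g:item_type>
</entry>
```

Two uploads of identical entries produce byte-for-byte identical items, which matters for the inactivation behavior described below.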

My first two bulk uploads of the same transformed Atom 0.3 XML Atom03Dest.xml file to the second and third Google accounts succeeded. Occasionally, uploads will indicate a Failed status (0 Items, 0 Errors) as a result of "Bad data." Despite the reported errors, the fdbd page displays Active Items (50), Inactive Items (0), as shown here:

Here's a capture of the Details page for the preceding upload:

The "Bad data" error message is less than informative. There is no valid reason for encountering bad data from a known-good Atom 0.3 XML file, which tested OK in a subsequent bulk upload as shown here:

Here are the first few of the 50 unpublished but Active items:

Label attribute values appear as expected for the News and Articles item type in edit and preview modes.

About 1.5 hours after I uploaded the Modified items, Google Base marked the added items with Published status:

Rechecking the fdbd page the next morning showed that all the above 50 entries had been marked inactive:

What's more, the same 50 entries added to the Products list that I bulk-uploaded with a tab-separated values file to an account with 69 items also became Inactive:

Attempting to open any Inactive Items list by clicking the Inactive Items(50) link displays the following error page:

Although a message indicates that Inactive Items will be removed in the future, that hadn't happened after more than 72 hours. And because attempting to display inactive items raises the preceding error page, there's no way to remove them in order to re-upload the original or a modified file.

There were no e-mail or on-line messages from Google indicating the reason for inactivating these bulk-uploaded items. If the reason for inactivation is item duplication, users should be so advised.

Eliminating Item Duplication with Code
I concluded that Google Base uses a simple hash function or cryptographic hash function to create a unique n-byte field value for each item. After publication, it appears that Google tests items for duplication by comparing hash values with those of previously posted items. However, Google doesn't publish what value(s) they include in the hash, nor whether duplication tests apply to inactive items.

I added code to my Atom.xml file transformation application to make minor changes to the Atom 0.3 dateTime element values (issued, modified, and created), as well as to the text of the description attribute, appending [Modified for Google Base bulk upload on 2005-11-29T07:48:51-08:00] with the system date/time to the end of the content element value to assure uniqueness. Click here to display the 50 News and Articles items uploaded to the Oakleaf_Systems alias.
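The transformation step can be sketched in a few lines of Python. This is an illustrative reconstruction, not the actual application: the element names assume a plain Atom 0.3 document, the `item_hash` function is only my guess at how a hash-based duplicate test might work, and the appended marker text matches the bracketed stamp quoted above:

```python
import hashlib
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM_NS = "http://purl.org/atom/ns#"  # Atom 0.3 namespace


def uniquify_entries(xml_text: str) -> str:
    """Touch each entry's dateTime elements and append a timestamp
    marker to its content so a hash-based duplicate test sees new items."""
    ET.register_namespace("", ATOM_NS)
    root = ET.fromstring(xml_text)
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S+00:00")
    for entry in root.iter(f"{{{ATOM_NS}}}entry"):
        # Refresh the three Atom 0.3 dateTime elements, if present.
        for name in ("issued", "modified", "created"):
            el = entry.find(f"{{{ATOM_NS}}}{name}")
            if el is not None:
                el.text = stamp
        # Append the uniqueness marker to the content element's text.
        content = entry.find(f"{{{ATOM_NS}}}content")
        if content is not None and content.text:
            content.text += f" [Modified for Google Base bulk upload on {stamp}]"
    return ET.tostring(root, encoding="unicode")


def item_hash(entry_text: str) -> str:
    """Hypothetical duplicate test: hash the item's serialized text."""
    return hashlib.md5(entry_text.encode("utf-8")).hexdigest()
```

Whether touching the dateTime values alone would suffice is untested; appending the marker is what guarantees the serialized content differs between uploads.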

Interim Conclusion
Altering item content appears to prevent inactivation of subsequent uploads of otherwise duplicate News and Articles and Reference Articles items. Arbitrarily and silently inactivating users' bulk-uploaded items that comply with Google Base's Editorial Guidelines and Program Policies indicates to me that Google Base beta 1 isn't even close to useful for its advertised applications. (I don't believe that item duplication violates the Editorial Guidelines' "No Repetition: Avoid gimmicky repetition" rule, and I can find no restriction on duplicate items in Google Base's terms of service.) My tests indicate that Google even includes inactive items in tests for duplication.

However, Google Base doesn't enforce its Program Policies' "Affiliates: Posting is not permitted for the promotion of affiliate sites or products sold through an affiliate marketing relationship." As an example, operators of "The Mall, Online Store" at http://www.free-poker-tips.org/ have posted 233,348 referral links (as of November 29, 2005) to Amazon.com items.

"Ordinary users" are those who haven't cut the special deals apparently reserved for retail chains—such as CircuitCity, CompUSA, KMart, OfficeDepot, RadioShack, Staples, Target, and Walgreens—or affiliate spammers that bulk upload hundreds of thousands or more items for Froogle Local links.

Question: Why is BestBuy missing from a Froogle Local stores list that includes CircuitCity, CompUSA, RadioShack, and Target? Most early trade press articles and blog entries on Froogle Local include BestBuy as a major player in the Froogle Local beta.

Probable answer: BestBuy isn't a client of ShopLocal LLC, which supplies the data for products carried by each store of a retail chain. All other stores listed at the bottom of Froogle Local pages are ShopLocal clients.

--rj

Note: As of November 29, 2005, there were 13,745,669 Products items in Google Base. News and Articles reported 8,078 items and there were 32,801 Reference Articles.

Wednesday, November 23, 2005

The Aussie SQL Down Under site features frequent podcasts about new features of SQL Server 2005. Their latest (November 22, 2005) one-hour production features Microsoft Research's Jim Gray discussing the future of SQL Server, LINQ, and T-SQL, among a host of other SQL Server 2005 topics. To make access to particular topics easier, I've logged the WMA version of the podcast with brief descriptions of most major topics:

00:00 - Introduction, Jim Gray's CV, and how he came to Microsoft Research.

05:20 - Why it took five years to release SQL Server 2005. "Database systems have become ecosystems in which you have the traditional tabular data store, an XML store, data mining, cubes, an extract-transform-load service, a whole security model, management [applications], and self-tuning [features]."

07:00 - Unification of SQL Server and programming languages. The SQL Server team expected to ship V.Next in 2003, but underestimated the effort required to unify SQL Server and the .NET Framework. It was a very painful experience.

09:30 - Issues with feature currency and large development teams. The currency inside SQL Server is a dataset or a Tabular Data Stream, although we're gradually moving away from TDS toward the Web services model. The T-SQL command-in/dataset-out model is today's key to unifying access to relational data, text, and XML.

11:15 - Release frequency. Annual releases are very destabilizing, but less frequent releases result in huge changes instead of "lots of little ones."

13:45 - CLR, LINQ, and "T-SQL is dead" - "FORTRAN isn't dead." Any CLR program has T-SQL at its root. T-SQL is loosely typed and late-bound, so it's very easy to write.

17:05 - Looseness of T-SQL typing. LINQ is wonderful, but it's compiled and its data definitions are static.

19:30 - DB2 and Oracle are much more strict about data typing. T-SQL uses data-type coercion. Jim mentions an ANSI flag to prevent coercion (but I'm not aware such a flag exists).

20:45 - LINQ. "I'm wildly enthusiastic about LINQ." Microsoft isn't very good at supporting embedded SQL because of type conflicts between T-SQL and programming languages. LINQ treats tables as a class; rows as objects. Tables are enumerable; you can do a For Each on a table or the answer to a query. Tables are collections, so cursors go away. "The syntax is a little screwy to make IntelliSense work."

25:00 - What's the story on DLinq and XLinq? Both will become extremely important to folks who like to program in VB and C#. "It's one of the things that might attract you away from T-SQL because it really [offers] early binding. The amount of gunk you need to write for ADO.NET to get the null program to work is just disgusting." The big selling point for LINQ is it's so easy to get started.

26:45 - CLR types vs. SQL types. "It may seem like a mismatch, but I declare everything to be a SQL type and everything works out great. ... A friend wrote the 'Null Memo,' which was an impassioned plea that we get rid of null values, but we had a group of theory guys who loved three-state logic. We're stuck with nulls."

29:00 - Object purists want to treat the database as a repository for objects. "Just put my objects in the database." The result is a fat table with many sparse columns, which pivots to a skinny table with three columns.

31:45 - Inheritance in LINQ. A LINQ table is a minimalist class that doesn't support much inheritance. The specification is mute about how inheritance works in the LINQ model. One way inheritance would work is with "a universal relationship at the bottom."

34:00 - Inheritance in T-SQL. T-SQL doesn't have a class concept at all. It's so loosely typed that its only classes are tables, but you can't pass tables as parameters. "T-SQL is a great scripting language, but it's never going to be as clean as C#, ever, period."

35:15 - Break

37:00 - TerraServer, SkyServer, and spatial indexing in SQL Server 2005.

44:45 - Where are spatial applications heading? Billions of cellphones mean that location services are central to future applications. Going beyond four dimensions (latitude, longitude, altitude, and time) is difficult.

48:10 - Very large databases. VLDBs are in our future and most will be spatially oriented. The goal is to tell users about things that are nearby.

49:40 - Evolution of SQL Server. "I came to Microsoft to scale up SQL Server. We've done a reasonable job of scaling up and scaling down, but we haven't done a good job of scaling out to self-organizing arrays of SQL Server instances. Over the next five years, we'll deliver on scale-out; what Oracle calls 'rack'. We're getting beat up pretty badly about that, because it's the one thing we don't do."

52:00 - SQL Server parity with DB2 and Oracle. "We made a decision not to chase DB2 or Oracle tailpipes. Instead, we made SQL Server solve the next-generation rather than the last-generation problems. So we added data mining, automanagement, XML support, and a bunch of things we think are forward-looking."

52:20 - Limited resources caused a few things to slide. "In the next five years you'll see many things that were thrown out of the lifeboat just before SQL Server 2005 shipped: WinFS, LINQ, better integration with Visual Studio, more data mining rules, deeper XML support, and more Web services."

54:00 - Scaleout will have a major Web services component story. "Having Web services and Service Broker built into SQL Server means that you don't need IIS any more."

55:15 - What's coming up in Jim Gray's world? "We are working very hard to get scientific literature as well as scientific data online. PubMed Central is run by the National Library of Medicine (NLM) on SQL Server and has the abstracts—mostly in XMLish format—of all of the [NLM's] medical literature. The U.S. Congress has mandated that any research that the National Institutes of Health (NIH) sponsors be deposited with the NLM and be published within six months of its publication in a journal. This is called taxpayer access, so if you get some exotic disease, you can go to the NLM and see the research that your tax dollars paid for, instead of paying $50 to get a copy of it. We've made a portable version of PubMed that's been installed in the U.K., Italy, and South Africa, and will be installed in Japan and elsewhere. The copies federate with one another using Web services. When a document is deposited in one place, it goes to all the other places. PubMed is a poster child for XML Web services."

59:03 - End

Technorati tags: Databases, SQL Server 2005, LINQ, DLinq, XLinq

OpenOffice.org and its cohorts—Adobe, Corel, IBM, KDE, and Sun Microsystems—chose the Organization for the Advancement of Structured Information Standards (OASIS) as the initial standards body for the OASIS Open Document Format for Office Applications (OpenDocument format or ODF). OASIS submitted the ODF standard to the ISO International Electrotechnical Commission's Joint Technical Committee (ISO/IEC JTC1) on September 30, 2005.

Microsoft has had a long-term relationship with ECMA, having participated in the ECMAScript (JavaScript) standardization process and later submitted .NET's C# language and the Common Language Infrastructure (CLI) for ECMA and ISO/IEC standardization. Sun Microsystems, the commercial force behind OpenOffice.org, has favored OASIS since the origins of the Electronic Business XML (ebXML) and Universal Business Language (UBL) standardization process. OASIS's aegis extends to many Web services standards, such as UDDI 2.0 and 3.0.2, WS-Security, and Web Services Distributed Management (WSDM). Microsoft refused to support ebXML and UBL but was a very active participant—together with IBM—in the development of the UDDI and WS-Security specifications, plus the OASIS standards processes.

Microsoft's choice of ECMA, which has no history in the document standards business, appears to me to be an example of standards-body "forum shopping."

Forum Shopping Defined
Wikipedia defines "forum shopping" as "the practice adopted by some plaintiffs to get their legal case heard in the court thought most likely to provide a favourable judgment, or by some defendants who seek to have the case moved to a different court." For example, it's a common practice for individual and class-action plaintiffs to try product liability cases in southeastern Texas state courts because these courts have a history of making unusually large awards to plaintiffs. Similarly, defendants in employee non-competition actions prefer California state courts, because California state law generally favors individual rights and disdains corporate non-compete clauses. An example is the recent action filed in Washington state court by Microsoft against Google and Kai-Fu Lee, a former vice president of Microsoft's Interactive Services Division. Google, in turn, filed a motion in the California state court to throw out Microsoft's Washington complaint. At present, the Washington action is scheduled for trial in early 2006.

Note: The same CNet News.com article also describes Microsoft's non-compete actions against Adam Bosworth and Tod Nielsen, who eventually ended up at Google and Borland, respectively. Click here for more on Bosworth at Google.

Shopping for Standards Bodies that Support Your View of Intellectual Property Rights
Primary candidates for "open" XML document format standards bodies are the W3C, OASIS, ECMA, and, possibly, the IETF. (The IETF is the standards body for the Atom 1.0 XML syndication format.) There is, however, considerable controversy over what constitutes an "open standard." Microsoft patent attorney Nicos L. Tsilas contends in his recent "The Threat to Innovation, Interoperability, and Government Procurement Options From Recently Proposed Definitions of 'Open Standards'" paper that standards with reasonable and non-discriminatory (RAND) patent licenses qualify as "open standards."

Note: "Government Procurement Options" in the paper's title obviously refers to the Commonwealth of Massachusetts' decision to restrict state purchase of office productivity applications to those that support ODF.

Following are links to the intellectual property rights policies of the preceding four standards bodies, plus ISO/IEC:

W3C "seeks to issue Recommendations that can be implemented on a Royalty-Free (RF) basis." However, section 7.5.3 of W3C Patent Policy, "Alternative Licensing Terms," permits a Patent Advisory Group (PAG) to "propose that specifically identified patented technology be included in the Recommendation even though such claims are not available according to the W3C RF licensing requirements of this policy."

Here's the IETF's IPR policy:

In general, IETF working groups prefer technologies with no known IPR claims or, for technologies with claims against them, an offer of royalty-free licensing. But IETF working groups have the discretion to adopt technology with a commitment of fair and non-discriminatory terms, or even with no licensing commitment, if they feel that this technology is superior enough to alternatives with fewer IPR claims or free licensing to outweigh the potential cost of the licenses.

Thus ECMA and OASIS became the finalists in the shopping list, but Sun pre-empted OASIS with ODF. Only ECMA remains to Microsoft as an unabashed champion of RAND licensing. However, Microsoft's initial offer of RF-mode licenses for the Office 2003 XML schema and subsequent change to a "covenant not to sue" moots the RAND-mode licensing issue for all forums.

ECMA International is a European organization, so the European Union's xenophobic bureaucrats might consider ECMA preferable to US-based OASIS as the standards body for the millions (or billions) of mostly superfluous Microsoft Office documents the EU produces per year.

An advantage of ECMA appears to be the speed at which standards emerge from TCs. For example, Microsoft, Hewlett-Packard, and Intel submitted the C# and CLI specifications to ECMA on October 31, 2000 and ECMA ratified the two standards on December 14, 2001. The ECMA standard's gestation period was 1-1/8 years.

In contrast, Arbortext, Boeing, Corel, CSW Informatics, Drake Certivo, National Archive of Australia, New York State Office of the Attorney General, Society of Biblical Literature, Sony, Stellent and Sun Microsystems founded the OASIS Open Office XML Format TC in December 2002. The first TC draft was approved in March 2003, the second in December 2004, and the third in March 2005. OpenDocument was approved as an OASIS Standard in May 2005, almost 2-1/2 years after formation of the TC and more than twice as long as the ECMA process.

Will Two Similar XML Document Standards Emerge?

The real issue—as I see it—is: How will ISO/IEC JTC1 react when the ECMA working group submits in 2006 an almost-identical (or at least very similar) set of XML document standards as OASIS's 2005 submittal of ODF? Will JTC1 require the competing ECMA and OASIS "standards" to be rationalized into a single ISO/IEC standard?

Sun Microsystems' Tim Bray questions the need for two XML document standards in his recent "Thought Experiments" post (updated November 27, 2005). His distaste for the Office Open XML format undoubtedly derives from his employer's position as the promulgator of OpenOffice and ODF. This conclusion is supported by Bray's position as co-chair of the IETF Atom Working Group. As Microsoft's Dare Obasanjo points out in his "Tim Bray's Hypocrisy and Competing XML Formats" post:

I find it extremely ironic that one of the driving forces behind creating a redundant and duplicative XML format for website syndication would be one of the first to claim that we only need one XML format to solve any problem. For those who aren't in the know, Tim Bray is one of the chairs of the Atom Working Group in the IETF whose primary goal is to create a competing format to RSS 2.0 which does basically the same thing. In fact Tim Bray has written a decent number of posts attempting to explain why we need multiple XML formats for syndicating blog posts, news and enclosures on the Web.

Note: I'll believe that IBM is a legitimate "open source" and "open standards" proponent when they open-source their current version of DB2, and other commercial applications and operating systems under royalty-free, fully sublicensable terms.

The degree of openness of the "open standards" process is difficult to resolve. As an example, the OASIS ODF Technical Committee (TC) has 14 members. Three members are Sun employees, three work for IBM, and Adobe Systems, Intel, and OASIS employ one each. Three are listed as individuals: Patrick Durusau is Director of Research and Development at the Society of Biblical Literature; Gary Edwards is principal of Open Business Stack Systems; and David Faure is the maintainer of the KWord and KOffice libraries.

As an example of ECMA TC membership, the TC39 - TG2 - C# technical group has the following 14 nominated representatives: BEA Systems, Borland, COSC of the University of Canterbury, HP, Hitachi, IBM, Indiana University, Intel, IT University of Copenhagen, Macromedia, Mainsoft, Microsoft, Novell, and Plum Hall.

Presumably, the 11 backing organizations listed in the press release will join the future ECMA TC/TG. However, only the 18 Ordinary Members of ECMA have a right to vote. Of the backing organizations, only Microsoft, Intel, and Toshiba are Ordinary Members. How membership status affects an individual organization's right to insist on modifications to a proposed standard isn't clear from ECMA's Web site.

Traditionally, Microsoft has favored reasonable and non-discriminatory (RAND) patent licensing but has granted royalty-free (RF) licenses for some IP, such as the Office 2003 XML schemas. The terms of these licensing modes require application developers and users to sign a written contract. Brian Jones says, regarding changes to Microsoft's licensing approach:

[I]n order to clear up any other uncertainties related to how and where you can use our formats, we are moving away from our royalty free license, and instead we are going to provide a very simple and general statement that we make an irrevocable commitment not to sue. I'm not a lawyer, but from what I can see, this "covenant not to sue" looks like it should clear the way for GPL development which was a concern for some folks.

Robert Scoble asked Jean Paoli, "Do I need to sign, or agree to, any licensing agreements to use the formats?" Paoli responded:

No, for the specifications and in our work with Ecma International, we are offering a broad “covenant not to sue” to anyone who uses our formats. This is a new approach that continues our open and royalty-free approach. We think it will be broadly appealing to developers, including most open source developers. ([B]y the way you did not have to sign anything even before this announcement.)

Judging from most of the 50+ comments on the Scobleizer post and the 450 or so on Slashdot, few—if any—freelance Open Source proponents would accept any free license or no-license-required assertion from Microsoft for the Office Open XML format, no matter who endorses it. The issue for these hard-core folks won't be settled until Microsoft open-sources Office and Windows. eWeek's David Coursey takes up this issue in his "Bill Gates Is Not the Next Linus Torvalds" op-ed article.

Thursday, November 17, 2005

Adding Web pages or blog posts as News or Articles or Reference Articles item types to Google Base is problematic for content owners. Google's draconian Terms of Service for the content you upload gives Google carte blanche to "reproduce, modify, adapt, publish, and otherwise use, with or without attribution such Content," as well as "to use your trademarks, service marks, trade names, proprietary logos, domain names and any other source or business identifiers."

I'm certainly not enthusiastic about the Google folks modifying or adapting and then publishing my content without attribution. ZDNet's Garrett Rogers discusses these and related issues in his November 16, 2005 post, "Google Base: Preparing for the worst?."

Note: If you're searching on "Google Base," you're likely to see references to "All your base are belong to us," an idiosyncratic Internet message (apparently in pidgin, akin to "him belong me") that's explained in this lengthy Wikipedia entry. It's an interesting sidelight that al-Qaeda means "the base" in Arabic. The Wikipedia entry also mentions common derivatives, such as "all your data are belong to us," which is more germane to the Terms of Service issue.

Despite my misgivings about Google's Terms of Service, I decided to invest a few hours of Visual Studio 2005 programming time to clean up Blogger's Atom 0.3 XML file for this site, add some optional tags and attributes, and publish it to Google Base. I'll provide details on the VB 2005 code I used to manipulate the Atom.xml XmlDocument object in a future post. As I mentioned in the "Initial Conclusions" section of my earlier "Google Base and Bulk Uploads with Microsoft Access" post:

Initial tests with an Atom 0.3 (Atom.xml) file generated by Blogger for the OakLeafBlog, saved as a local XML file with FireFox 1.5 RC2, and bulk-uploaded as the Reference Articles item type showed several problems. The description attribute contains HTML markup, and error messages state that the value is limited to a maximum of 10,000 characters. Thus, only the shorter OakLeafBlog articles publish to the list; HTML markup contributes substantially to description length. Help Center's "What do I include in 'Description'?" topic says "Please ensure that the description does not contain any HTML as we don't currently recognize or display HTML tags in your item." Help Center also says the maximum description length is 1,000 characters.

I was surprised by the inconsistencies between the help topics and the result of an initial test with a moderate-size Atom.xml document from a Google application (Blogger). So I temporarily increased the size of the main page to include all OakLeafBlog posts (50 as of this post), which would permit more complete tests and let me evaluate issues that relate to creating Google Base-enabled XML files.

The ultimate objective of this exercise is to determine whether any benefits accrue to Web site publishers—or, for this example, bloggers—by publishing copies of linked content on Google Base. Much of the initial Google Base content—such as real-estate listings—consists of links to existing Web pages. Presumably, Google will have spidered the source site's pages previously. Technorati's Niall Kennedy posits:

Why should you go to the trouble of submitting your information to Google Base? You will be completely sure that Google has all your latest content complete with the appropriate link back to your site. Feeding the content directly to Google may help your posts place better in Google search results.

Completing Your Personal Profile
If you have or create a Google account, which you need for most Google applications, you'll probably find it worthwhile to add the additional default attribute values that apply to Google Base only. See this section in the preceding "Google Base and Bulk Uploads with Microsoft Access" post for details.

Creating the Raw XML Bulk Upload File
The http://oakleafblog.blogspot.com/atom.xml document contains data for 50 posts (<entry> groups) in a 498-KB file, for an average of about 10,000 characters per <entry>. FireFox 1.5 RC2 displays the HTML tags in the <content> elements, as shown here, which transform to Google Base description attribute values.

FireFox 1.5's View Page Source command displays the Atom 0.3 source code and enables saving it to a physical file, which is required for bulk XML file uploads:

The stylesheet employed by Internet Explorer 5+ strips the HTML markup from the XML document's content element but won't display or enable saving the unformatted <content> value locally, as shown here:
Thus, you'll need to substitute FireFox for IE to generate and save a file—OakLeafBlogAtom.xml for this example—for the Bulk Upload operation. (Only FireFox 1.5 RC2 and RC3 have been tested to date.)

Uploading the Atom 0.3 XML File as a Reference Articles Item Type
The Specify a Bulk Upload page's Choose an Existing Type list doesn't offer the News and Articles Item Type, which would be more appropriate for a list of blog posts. (News and Articles and Wanted Item Types appear in the Choose an Existing Item Type list on the Post an Item page for ad hoc items.) News and Articles supports the following standard attributes, in addition to title and description: author, expiration_date, label, news_source, pages, and publish_date. (It's unfortunate that Google didn't adopt standardized metadata terms, such as those of the Dublin Core Metadata Initiative—DCMI.)

Update 11/25/2005: You're no longer stuck with Reference Articles as the Item Type for Blogger Atom 0.3 feeds. The Bulk Upload page's Choose an Existing Type list now includes News and Articles and Wanted Ads Item Types. Google also added Blogs, Coupons, Rentals, and Comic Books as standard search categories to the default home page. Rapid ad hoc changes like this demonstrate another advantage of Web-based services.

The process for uploading an Atom 0.3 XML file is similar to that for uploading a tab-separated value text file to create a list of the Products Item Type:

1. After logging in with your Google account, navigate to the Google Base home page and click the Post Multiple Items with a Bulk Upload File link to open the My Items page.

2. Click the Specify a Bulk Upload File link, type the FileNameAtom.xml file name in the text box, select Reference Articles in the Item Type list, and click Specify Bulk Upload File to open the My Items page.

3. Click Browse, then navigate to and double-click the file you saved with FireFox to specify it as the source of the registered FileNameAtom.xml file, as shown here:

4. Click Upload and Process This File. Wait a few minutes (or hours), and then press F5 to determine the publication status of the file. If you can't stand the wait, click the Active Items link after it displays a count of 1 or more to review unpublished items in the list:

5. Click one of the Edit links to display the item in the standard editing form for the Reference Articles Item Type:

Notice the HTML markup in the Description attribute textarea. This example has a substantially lower proportion of markup characters to content than most OakLeafBlog posts. It would be possible—but certainly tedious—to remove the tags manually and add Details attribute-value pairs and Labels keywords tags.

Viewing the Items as a Google Base User
To emulate a search by an ordinary Google Base user, follow this drill:

1. Sign out of your account, navigate to the Google Base home page, type a unique search term, such as xlinq for OakLeafBlog posts, and click Search Base to display the results. Alternatively, click here.

As expected, clicking the OakLeaf Consulting link or here displays all active items for authorid=1063521.

2. Click one of the titles to open the linked page whose URL appears in green, or click here.

Fixing Feed Errors
The inclusion of HTML markup in the description attribute isn't a problem for ordinary users, because they don't see the attribute value. However, large amounts of markup combined with lengthy content can result in failure to post overlength entry groups. In this case, the My Items page displays an error message:

Note: It might take several hours for the preceding warning to appear. Bulk Updates don't occur in real time.

Clicking the Details link displays this page with error messages:

To overcome this problem, you must edit the content element of overlength entries: remove the HTML tags, test the content length, and trim the string value if it exceeds 10,000 characters.
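A minimal sketch of that cleanup (the original work used VB 2005; this Python equivalent is my own illustration, with the 10,000-character limit taken from the error messages described above and a crude regex-based tag stripper as a simplifying assumption):

```python
import re
import xml.etree.ElementTree as ET

ATOM_NS = "{http://purl.org/atom/ns#}"  # Atom 0.3 namespace
MAX_LEN = 10_000  # description limit reported by Google Base error messages

def clean_description(html: str, max_len: int = MAX_LEN) -> str:
    """Remove HTML tags, collapse whitespace, and trim to the length limit."""
    text = re.sub(r"<[^>]+>", " ", html)      # crude tag stripper (assumption)
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text[:max_len]

def clean_feed(xml_text: str) -> str:
    """Clean every <content> element in an Atom 0.3 feed string."""
    root = ET.fromstring(xml_text)
    for content in root.iter(ATOM_NS + "content"):
        if content.text:
            content.text = clean_description(content.text)
    return ET.tostring(root, encoding="unicode")
```

Blogger's Atom 0.3 feeds carry the post's HTML as escaped text inside <content>, so the parser hands the markup back as a plain string that the tag stripper can process.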

Serious Bug in g:label Custom Attributes Documentation
Google has created its own taxonomy of Atom 0.3 extensions that's identified by an xmlns:g="http://base.google.com/ns/1.0" namespace attribute added to the feed element. The Google Base - Atom 0.3 Specification page includes an example of using this namespace to add several predefined elements—g:image_link, g:expiration_date, g:job_function, g:location, and g:label—to specify non-standard attributes for a specific Item Type. The example for the <g:label> elements is incorrect. The label item of the Google Base - XML Attributes page has the same error.

Following is an abbreviated version of a Blogger Atom 0.3 test file with the Google Base extension namespace attribute and multiple g:label elements added in accordance with the preceding XML document example and attribute specification. Technorati tag names provide the values of the multiple g:label elements.

Note: Some line-breaks have been inserted at illegal positions to prevent exceeding the left frame width limit.

Click here for a more readable version of the preceding sample file from Google Groups (in print format).
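For illustration, here's a sketch (my own reconstruction in Python, with hypothetical title and tag values) of building an entry the way the spec's example suggests—one <g:label> element per Technorati tag—using the actual Google Base namespace URI:

```python
import xml.etree.ElementTree as ET

ATOM = "http://purl.org/atom/ns#"        # Atom 0.3 namespace
GBASE = "http://base.google.com/ns/1.0"  # Google Base extension namespace

ET.register_namespace("", ATOM)   # default namespace for Atom elements
ET.register_namespace("g", GBASE) # g: prefix for Google Base extensions

feed = ET.Element(f"{{{ATOM}}}feed")
entry = ET.SubElement(feed, f"{{{ATOM}}}entry")
ET.SubElement(entry, f"{{{ATOM}}}title").text = "Sample OakLeaf post"

# One <g:label> element per tag, as the spec's example shows
for tag in ("LINQ", "XLinq", "VB2005"):  # hypothetical tag values
    ET.SubElement(entry, f"{{{GBASE}}}label").text = tag

print(ET.tostring(feed, encoding="unicode"))
```

This is the multiple-element form that, as described below, produced "Bad data" failures until the labels were consolidated.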

Uploading the complete 256-KB file as a Reference Article resulted in a Failure status report in the My Items page with a single instance of "Bad data" as the reason for the failure. The Upload page reported 0 Items Processed, 0 Items Succeeded, and 0 Active Items. However, after a few hours (overnight), the Active Items page reported all items had Published status. (The Upload page data didn't change.)

Fixing the g:label Attribute Specification Bug
Opening the few entries that had a single Technorati tag—and thus a single <g:label> element, typically LINQ—in the Edit page showed the tag name in the Label textarea. The text associated with the Label control suggests "Keywords or phrases that describe your item. Maximum of 10. Separate with commas." Based on this hint, I changed the multiple <g:label> elements, one per tag, to a single element containing a comma-separated list of tag names.

This change solved the Failure problems, reported Success as the status, and processed all 50 items, as shown here:

Note: The Google Base - XML Attributes page's image item states that a comma-separated list—such as <g:label> leater, power locks, sunroof, ABS </g:label>—is Not acceptable. (It's doubtful that the list is unacceptable because of leading or trailing spaces or a missing "h" in "leater".)
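The transformation that fixed the failures can be sketched in Python (my assumption of the edit described above; the original work was done in VB 2005): merge each entry's multiple <g:label> elements into one comma-separated element.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://purl.org/atom/ns#}"        # Atom 0.3 namespace
G = "{http://base.google.com/ns/1.0}"      # Google Base extension namespace

def merge_labels(entry: ET.Element) -> None:
    """Replace an entry's multiple <g:label> elements with a single
    comma-separated <g:label>, the format Google Base actually accepted."""
    labels = entry.findall(G + "label")
    if len(labels) > 1:
        labels[0].text = ", ".join(l.text for l in labels if l.text)
        for extra in labels[1:]:
            entry.remove(extra)
```

Running merge_labels over every <entry> before the bulk upload produces the single comma-separated Label value that the Edit page displays.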

The fix to the g:label attribute format also fixed the missing Labels entries problem with multiple <g:label> elements, as shown here:

The Label tags appear on the edit page immediately after Google processes the upload, so you don't need to wait for Published status to test your editing application.

Use Labels to Refine Google Base User Searches
When you add Label tags to your entries, users can refine their searches by clicking links that return entries that match all tags for an entry as shown here:

Notice that comma-separated Names (tag) values appear under the Titles. Click here to open the preceding interactive Google Base page, click the More... link to display all Names combinations, and try the various refinement choices. Click the publisher's moniker—Roger Jennings for this example—or click here to display a list of all items (not just Reference Articles) contributed by the publisher (authorid=1071203).

Conclusion
Google needs to clean up its Atom 0.3 documentation to minimize developers' wild-goose chases. The current (beta) UI undoubtedly will confuse potential users. For example, I wouldn't have known the benefit of adding Label tags to search refinement if I hadn't written a simple VB.NET 2005 project to clean up the description attribute (<content> element) value and add the Google Base namespace and <g:label> elements in the correct format.

The World Resources Institute (WRI) claims to have submitted information to Google Base "on a 5 million-record database on sustainable development for 200 countries over a period of up to a century." However, a search of Google Base on "World Resources Institute" returns only 4,253 items that were entered between November 15, 2005 and November 28, 2005 as Research Studies and Publications Item Type. This Item Type appears to have been replaced by Reference Articles. The status of the remaining 4.996 million (purported) items isn't clear as of December 2, 2005.

Watch for updates to this post as other developers add their content to Google Base and keep an eye on the Google Base Help Discussion group to see what problems users encounter.


About Me

I'm a Windows Azure Insider, a retired Windows Azure MVP, the principal developer for OakLeaf Systems and the author of 30+ books on Microsoft software. The books have more than 1.25 million English copies in print and have been translated into 20+ languages.

Full disclosure: I make part of my livelihood by writing about Microsoft products in books and for magazines. I regularly receive free evaluation software from Microsoft and press credentials for Microsoft Tech•Ed and PDC. I'm also a member of the Microsoft Partner Network.