A .data Top-Level Internet Domain?

January 10, 2012

There’s been very little change in top-level internet domains (like .com, .org, .us, etc.) for a long time. But a number of years ago I started thinking about the possibility of having a new .data top-level domain (TLD). And starting this week, there’ll finally be a period when it’s possible to apply to create such a thing.

It’s not at all clear what’s going to happen with new TLDs—or how people will end up feeling about them. Presumably there’ll be TLDs for places and communities and professions and categories of goods and events. A .data TLD would be a slightly different kind of thing. But along with some other interested parties, I’ve been exploring the possibility of creating such a thing.

With Wolfram|Alpha and Mathematica—as well as our annual Data Summit—we’ve been deeply involved with the worldwide data community, and coordinating the creation of a .data TLD would be an extension of that activity.

But what would be the point? For me, it’s about highlighting the exposure of data on the internet—and providing added impetus for organizations to expose data in a way that can efficiently be found and accessed.

In building Wolfram|Alpha, we’ve absorbed an immense amount of data, across a huge number of domains. But—perhaps surprisingly—almost none of it has come in any direct way from the visible internet. Instead, it’s mostly from a complicated patchwork of data files and feeds and database dumps.

But wouldn’t it be nice if there was some standard way to get access to whatever structured data any organization wants to expose?

Right now there are conventions for websites about exposing sitemaps that tell web crawlers how to navigate the sites. And there are plenty of loose conventions about how websites are organized. But there’s really nothing about structured data.
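For reference, the existing sitemap convention is just a simple XML listing of a site's pages, per the sitemaps.org protocol (the URL and date below are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/products/</loc>
    <lastmod>2012-01-05</lastmod>
  </url>
</urlset>
```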

Now of course today’s web is primarily aimed at two audiences: human readers and search engine crawlers. But with Wolfram|Alpha and the idea of computational knowledge, it’s become clear that there’s another important audience: automated systems that can compute things.

There are product catalogs, store information, event calendars, regulatory filings, inventory data, historical reference material, contact information—lots of things that can be very usefully computed from. But even if these things are somewhere on an organization’s website, there’s no standard way to find them, let alone standard structured formats for them.

My concept for the .data domain is to use it to create the “data web”—in a sense a parallel construct to the ordinary web, but oriented toward structured data intended for computational use. The notion is that alongside a website like wolfram.com, there’d be wolfram.data.

If a human went to wolfram.data, there’d be a structured summary of what data the organization behind it wanted to expose. And if a computational system went there, it’d find just what it needs to ingest the data, and begin computing with it.

Needless to say, as we’ve learned over and over again in building Wolfram|Alpha, getting the underlying data is just the beginning of the story. The real work usually starts when one wants to compute from it—so that one can answer specific questions, generate specific reports, and so on.

For example, in our recent work on making the Best Buy product catalog computable, the original data (which came to us as a database dump) was perfectly easy to read. The real work came in the whole rest of the pipeline that was involved in making that data computable.

But the first step is to get the underlying data. And my concept for the .data domain is to provide a uniform mechanism—accessible to any organization, of any size—for exposing the underlying data.

Now of course one could just start a convention that organizations should have a “/datamap.xml” file (or somesuch) in the root of their web domains, just like a sitemap—rather than having a whole separate .data site. But I think introducing a new .data top-level domain would give much more prominence to the creation of the data web—and would provide the kind of momentum that’d be needed to get good, widespread, standards for the various kinds of data.
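As a sketch of what such a datamap might look like (purely hypothetical: no such standard exists, and every element name and URL below is invented for illustration):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sketch only: there is no datamap standard; all
     element names and the URL here are invented for illustration. -->
<datamap>
  <dataset>
    <name>product-catalog</name>
    <format>csv</format>
    <location>http://example.data/product-catalog.csv</location>
    <updated>2012-01-01</updated>
  </dataset>
</datamap>
```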

What is the relation of all this to the semantic web? The central notion of the semantic web is to introduce markup for human-readable web pages that makes them easier for computers to understand and process. And there’s some overlap here with the concept of the data web. But the bulk of the data web is about providing a place for large lumps of structured data that no human would ever directly want to deal with.

A decade ago I suggested to early search engine pioneers that they could get to the deep web by defining standards for how to expose data from databases. For a while there was enthusiasm about exposing “web services”, and now there are all manner of APIs made available by different organizations.

It’s been interesting for me in the past few years to be involved in the emergence of the modern data community. And from what I have seen, I think we’re now just reaching a critical point, where a wide range of organizations are ready to engage in delivering large-scale structured data in standardized forms. So it is a convenient coincidence that this is happening just when it becomes possible to create a .data top-level domain.

We’re certainly not sure what all the issues about a .data TLD will be, and we’re actively seeking input and partners in this effort. But I think there’s a potentially important opportunity, so I’m trying to do what I can to provide leadership, and further help to accelerate the birth of the data web.

55 comments.

I like the idea (especially if .data itself had more structure); it could then become the consumer semantic web long sought by many, since before the name was coined, and perhaps serve as another on-ramp to the neural network economy. I think it would be useful as an incentive to move some who are on the fence to share structured data, while with others it may serve as an educational tool, moving them in the direction we’ve long envisioned @Kyield.

Bringing everyone’s data as close to “computable” as possible is an all-round win so I hope this takes off.

A big problem is how to ETL these datasets between organizations, and I think Hadoop is a key technology there. It provides the integration point for both slurping the data out of internal databases, and transforming it into consumable form. It also allows for bringing the computations to the data, which is the only practical thing to do with truly big data.

Currently there are no solutions for transferring data between different organizations’ Hadoop installations. So some publishing technology connecting Hadoop’s HDFS to the .data domain would be a powerful way for forward-thinking organizations to participate.

Another path toward making things easier is to focus on the cloud aspect. Transferring terabytes of data is non-trivial. But if the data is published to a cloud provider, others can access it without having to create their own copy, and it can be computed upon within the provider’s high-speed internal network. Again, bringing the computation to the data.

Your idea sounds really good. It would bring together your world of computation and data harvesting and analysis (which, incidentally, is my passion and “hobby”, together with Mathematica) and my world of domain names (I am an executive in a domain-and-hosting company).

I hope you will present the idea to ICANN; it would be good to promote such a domain one day!

Hello Stephen,
Did you look at the .tel TLD? It is radically different from the other TLDs, and its main purpose is to directly expose structured data at the DNS level, utilizing the DNS zone as a data store.
Our goal (I am the CTO of Telnic Ltd., the registry for .tel) was specifically to publish at the root all possible communications channels for an entity inside its .tel DNS zone.
Today a .tel domain can store any number of key/value pairs for communication channels, textual descriptive info and geolocation.
Discoverability is the key concept, and I wrote an article about this at: http://www.telnic.org/blog/2010/11/09/the-internet-of-things-discoverability-first/

I came here to say what Brad said. Namespacing is only valuable when it’s not fragmented, i.e. it is rooted in a single place. If we are saying that organization is our main identifier, wolfram.com should be the authoritative place to find all things Wolfram: www.wolfram.com is for web, ftp.wolfram.com is for FTP, mail.wolfram.com is for mail, data.wolfram.com is for data.

Wouldn’t it be much simpler if Wolfram|Alpha took a closer look at what’s happening on the Linked Data front? Bottom line, the entire effort is about unveiling the Web’s structured data dimension via data spaces.

At first glance, a new TLD is exciting; however, using a subdomain is a better approach. Why must we have hundreds of TLDs to begin with? It’ll be nice when www.x13 and mail.x13 work without the list of TLDs we must retain defensive registrations of.

Moving on to the question at hand: Stephen, you’re on the right track, and after defining a simple way to share data (XML?) you’re off to the races. X13 would love to help; you have our developer resources at your disposal for testing and implementing. A data web helps everyone: commerce, education, and pushing the human race forward.

Trackbacks don’t seem to be working Stephen so you can find my opinion on this subject on my own blog – it was too long to publish as a comment here and it’s focused on “MetaCert” – which I didn’t want to “promote” through your post.

I don’t think it’s correct to state that “…there are conventions for websites about exposing sitemaps …there’s really nothing about structured data” when we already have RDFa, HTML5 microdata, and many open vocabularies, including schema.org.

The problem is not having a place to put the data; it is agreeing on its format. In particular, it is agreeing on what the schemas mean, not the details of XML or JSON or CSV or HTML.
What we’ve tried to do at microformats.org is document existing examples of structured data that is widely published, and converge schemas based on what is actually being published, not some idealized idea of what the data could look like. We look for the intersection between sets of properties, not the union of all possible ones. Then we encourage publication of data in these formats as part of HTML web pages so that they are bound up with the human-read web and not in some parallel, obsolete form.
Have a read of http://microformats.org/wiki/process to see what this empirical standardization approach is like.

“The central notion of the semantic web is to introduce markup for human-readable web pages that makes them easier for computers to understand and process.”

This could be read into the 2001 SciAm article that introduced most people to the semantic web, but it is hard to square with such markup only becoming a W3C recommendation in 2008 (RDFa, on which work started in 2004). The vast majority of work on the semantic web has *not* been about marking up human-readable web pages.

I agree with folks saying this proposal is backwards and more constructively, urge following up on Kingsley Idehen’s suggestion to take a look at what’s happening on the Linked Data front.

Starting a discussion about data interoperability is great, but isn’t the .data TLD the least of our worries?
Multiple committees have developed “standards” for machine readability, e.g. RDF, semantic markup in HTML5, embedded metadata, etc., but only a small subset of sites integrates them correctly. And let’s be honest: there are no one-definition-fits-all standards we could come up with and use everywhere. So we end up with a semi-finished standard that some will implement correctly and others won’t, and the usefulness of the .data TLD degrades toward zero. Which leaves us with the status quo and a lot of wasted time.

That being said, it shouldn’t be about the format; projects like the ones @Kingsley linked to are working in that area just fine. The discussion should be about centralizing the data so that everyone is able to use it without the offering company ending up paying for it, as @Kovas Boguta mentioned; sort of an S3-content-requester-pays model.

The Web, as it exists today, already allows for a parallel data web — via content negotiation and augmented markup. Adding a layer of separation via a separate TLD would serve to drive a wedge between the “human” and “data” web, when both can and should exist in happy harmony.

I feel like the whole web should be this .data TLD: a sea of machine-readable feeds and queryable databases sitting behind every human-readable website. There’s no reason why the two can’t coexist in the same place.

As Johannes Schmidt says, there is little added value over content negotiation as supported by HTTP for ages. Also, HTML is data, too – just a different syntax with a different focus. And would HTML+RDFa or HTML+microdata be data or non-data?
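To make the content-negotiation point concrete, here is a minimal, hypothetical sketch (in Python) of how a server could serve two representations from one URI by dispatching on the client’s Accept header. The data and the Accept parsing are deliberately simplified; q-values are stripped rather than ranked.

```python
# Minimal, hypothetical sketch of server-driven content negotiation:
# one URI, several representations, chosen by the client's Accept
# header. Data and Accept parsing are deliberately simplified.

REPRESENTATIONS = {
    "text/html": "<html><body><h1>Revenue</h1><p>42</p></body></html>",
    "text/csv": "metric,value\nrevenue,42\n",
}

def negotiate(accept_header):
    """Return (media_type, body) for the first acceptable media type."""
    for entry in accept_header.split(","):
        media_type = entry.split(";")[0].strip()  # drop any q-value
        if media_type in REPRESENTATIONS:
            return media_type, REPRESENTATIONS[media_type]
    # Fall back to HTML, as many servers do for unknown Accept values.
    return "text/html", REPRESENTATIONS["text/html"]

# A browser and a data client fetch the *same* URI but receive
# different representations:
print(negotiate("text/html,application/xhtml+xml;q=0.9")[0])  # text/html
print(negotiate("text/csv")[0])                               # text/csv
```

The point of the sketch: the naming scheme stays uniform, and the choice of representation lives in the HTTP layer rather than in a parallel TLD.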

Another concern is that .data resources would be ideal targets for denial-of-service attacks, because you can be almost sure to get large resources. If fetching http://company-name.data/*.* typically returns large dump files, then you have the perfect target for a DoS attack. You can even check whether there is a .data domain for a given .com domain name, and whether the two point to the same IP, to better target your attack.

All in all, I think the proposal fundamentally misses the point of separating the naming schema of the Web from the content and syntax of a representation accessed via a single identifier.

IMO, the proposal would also violate Tim Berners-Lee’s “Opacity Axiom” for the Web.

The problem with this is that it breaks the principle of One Web, just as .mobi did. There’s not a mobile web and a desktop web, nor should there be a “data web” and a “document web”. Instead, One Web!

I love this idea. Most of what I’m working on right now is data driven, and much of your work has inspired mine. A rallying point for pure data sources would be fascinating. Since everything is quantifiable, access to data is the lifeblood of a computational model of the universe (for various values of “universe”). I’m not sure if a TLD is the way to go, but a nexus of data sources would be great.

I’m glad you were up front about this being primarily a marketing exercise for structured data. Because there’s no technical need when user agents can do content negotiation, or define well-known URLs, or simply use data.example.com, or an SRV record.

The whole new TLD thing is just a gimmick for the registries to make a coin out of people desperate enough to protect their trademarks. Witness the “.xxx” debacle. In the real world, we’d call this a protection racket.

Hi Stephen, last year I also thought about this very same idea. I’ve been working with IATI, and previously to that IDML, within the aid world, but then went to the ICANN meeting in San Francisco in March, which opened my eyes to all the TLDs coming soon.

Whether it’s really needed or not isn’t the issue to me, as much as knowing that if I were to go to a .data domain I could find machine-readable code. Perhaps one of the arguments for it is that the location can persist over time: you can always find an organization’s open data at organization.data. I would keep the subdomain for the specific standard/format, so iati.organization.data or idml.organization.data. Tools and systems change a lot, and things in directories like something.org/10/10/data tend to change their location over time.

The .tel domain exists already. It is perfect for storing machine-readable data; that was what it was designed for, the DNS is used, and security is rock solid. Why re-invent this idea? Surely .tel has patents over this DNS storage anyway.

Sorry, I like the idea of a common data structure but I don’t like the domain.

Unless you’re proposing to give it away for free?

Open-source projects are not likely to want to pay for an extra domain to support the data standard. Also, unless you are going to do something like .co.uk.data, you run into problems with two valid companies and only one being able to expose its data. And as somebody else has mentioned, you risk cybersquatters, or people offering data pretending to be the authority but with spam inserted inside it.

I personally believe using a particular TLD for a particular kind of representation is counter to the uniformity of the Web. In particular, if a given resource, e.g. http://blog.stephenwolfram.com, has two representations, one as an HTML doc and another as (say) a CSV table, the appropriate place to get both representations is via HTTP using that URI but changing the media type, not http://blog.stephenwolfram.data/.

I like the idea that the .data domain would as a general concept also include standards for access and format, discovery, access control and authentication, and maintenance (perhaps even to include versioning). With subdomains as used today there is no standard at all, and you never know what you’re going to get.

Arguably, yes, one could say that we could apply the same standards to the use of subdomains (as in ‘data.x.y.z.com’), but the semantics become much more complicated as a result of using a subdomain, which could be nested arbitrarily deep, rather than a TLD, where the ‘depth’ is fixed.

As a software architect and developer, this makes a lot of sense to me, especially when we are talking about the wild, wild web…

I largely concur with the comments of Sean Palmer (and would add that the days when TLDs clarified types of content or provider are unfortunately behind us, not that it was ever great), Brad Threatt, Brent Ashley, Kingsley Idehen, Tony Wilson, John Giannandrea, Mike Linksvayer, Philipp Küng, and Tom Morris, among others. To that, I would add:

1.) “Data” just means too many different things to different people. To some, it’s anything “machine-readable”; to others, that as opposed to human-readable; to others, it’s gotta be numeric. From my experience as an academic data librarian, end users (not to get into what that means) seeking “data” increasingly really mean “information” or “statistics” (human-comprehensible, not to compute on) by that term. So I think the .data TLD would become misunderstood, if not misused, especially by “non-technical” audiences. And who would judge whether a .data domain registration is appropriate, i.e. whether it’s “really” for data, if that term is so ambiguous?

2.) Stephen writes “And there are plenty of loose conventions about how websites are organized. But there’s really nothing about structured data.” Actually, I think there *are* web-actionable, partly quite extensive specifications for data; the emerging problem seems to me to be that there are too many of them, something touched upon in the presentation at http://hdl.handle.net/1813/28192 (from an academic/research-data perspective), with insufficient coordination. Even Google is at it with their own DSPL: Dataset Publishing Language (http://code.google.com/apis/publicdata/).

I do not see much distinction between this idea and the idea of the semantic web. I think you have mischaracterized the mission and capability of the technologies underlying the semantic web in describing them as “markup” capabilities. RDF/OWL/SPARQL enable any data owner to expose structured data and describe its meaning precisely enough to enable machine computation across a distributed collection of endpoints. I believe this is your goal. It may be worth looking more deeply into the capabilities of the existing W3C recommendations.

As a veteran developer, I have been experiencing an exponential rate of obfuscation since the artificially induced dot-com bubble. I am daily resolving a tower of Babel made by idiots just to get simple work done. Simple is good; yet another official TLD is not only reckless and vain, it’s breaking the trees further into virtually unfathomable graphs that make it that much harder to focus on the objectives for which I use this technology.
So you create another TLD, another language, another framework so complex no one can learn it before it is obsoleted by the next scheme. I won’t use it unless it is the best way for me to accomplish my objective.
You would think that after 30 years we’d have some decent security. But no, good security is rare. Yet another TLD and a thousand thousand slimy one-off patent-dodging languages and frameworks live on, and so do I. The austerity of the mathematical heart of the computer, contrasted with this scheming feeding frenzy, convinces me that George Carlin was right: we’re going away, people. We’re just going away, another evolutionary dead end. The corruption of logic that I’m inundated with is not the work of a successful species.

I can see a “data web” as being particularly handy for slow-changing data. For example: a programmer like myself trying to keep track of miscellaneous standards and specifications as they evolve. A certain spec might remain the same for a year or more, then suddenly a minor adjustment is made. As it is currently, I would have to regularly check the specification’s documentation to see if anything has changed. I might not even notice a change if it’s something like a minor version-number change. With a data web, this could conceivably become an automated notification sent to me by some application that does the monitoring for me.

While the idea of new TLDs (top-level domains) is great, what do you think about the potential Trojan in the new TLDs? The Trojan known as the equal-treatment clause (with existing TLDs like ‘.com’) could result in price caps being removed, because the new TLDs don’t have price caps. That would enable registry operators to charge any rate they desire (without limits) on domain registration, putting the re-registration price of an existing site beyond the existing domain owner’s reach. The owner of a website like www.IOpposeBigBrotherGovernment.com (a made-up example) that is a headache for a government could be shut down easily and legally by getting the registry operator to raise the re-registration price of the domain to a level the domain owner cannot afford.

Don’t think this can’t happen, or that the principles of supply and demand would prevent it. If a domain registry operator were made an offer by the Feds to either raise the registration price on IOpposeBigBrotherGovernment to something unaffordable like $250K in exchange for thousands or more in tax credits, or to decline the offer and make maybe $100 on the domain re-registration, which do you think the operator would pick, especially in the current economy?

I think the arguments against a .data TLD are the same as for just about any new TLD. Why do it when everything can be under .com? .net was misused, and so was .org. Now the internet is changing again, and anyone with deep enough pockets can create a new TLD. Is that bad? No, I don’t think so, nor is creating .data. What might be different is who manages .data and for what purpose. Just as country-level TLDs have restrictions on use, so could .data.

I think a .data could work, but only if its purpose was clear and its use restricted to that purpose. If it were simply another .com that anyone can buy and sell, then it would have no value at all.

New domains will always keep coming up, but security remains most of the issue… Because of this, web hosting sometimes creates lots of problems. Different domains only make things costlier for the client, since to keep others from acquiring the benefit of their names they have to purchase all the domain names, like .com and .org, and now .data…
Let’s say we evolve faster than we are evolving now.

If you want to make a difference, please invest your time in setting up another root DNS hierarchy, independent of the corporate-compromised, censorship-armed, for-profit fiasco that has overrun the free web.

Dear Mr. Wolfram: Well, I do really dig what you’re going for. It’s just that when I merge the capabilities of Wolfram|Alpha and the potential scope of the .data TLD project, I’m more than just a little bit reminded of 1984 and Brave New World. I’m sure you’ve noticed that foreshadowing, too. I just hope the boss can’t tell I stay up until 3 am every night playing online, and the USDA doesn’t send the Food Pyramid Police after me! I can just hear it now: a voice that sounds like my grandmother telling me that we need to eat more dark leafy greens. Please don’t misunderstand, though; I’m not being reactionary. Rather, I am sharing my black humor!
On the serious side, though, I can’t wait to see what W/A can do in my life! I think it might be something I can use to help me teach the scientific method to a group of fifth-graders. We have one month to prepare for the Science Olympiad, where they’ll build a suspension bridge from drinking straws, predict how much a foil barge will hold, and do several other events which involve the hypothesize-design-predict-test process. I’m going to assume you’re a busy man, so I won’t expect you to answer this post, but just in case I might as well ask a boon: It would be awesome if you had recommendations to share as to how to introduce the system to students, or specific ideas for applying it to meet their needs. Thank You for all of your special work ~ Liz Uelmen