This was/is actually an extremely controversial project. The corporation (basically the Executive Director) pursued the grant and the idea without soliciting input or really disclosing it to the community of editors, and eventually one of the community-elected trustees was removed for questioning the lack of transparency. The community has a long list of software improvements that they'd like to see to the core platform.

A recent employee survey showed only 10% of WMF staff approved of the Executive Director, probably in large part due to things like this.

The biggest problem is the lack of data about what people are searching. It's a catch-22 that's very hard to break in the face of Google's search dominance and ubiquity.

Because Google is the best, it only gets better, which creates a huge barrier to entry for competitors. It used to be possible to know what people were searching for when they landed on a given Wikipedia article, but that data is now available only asynchronously (and in limited form) through Webmaster Tools.[1]

In my mind, the most interesting aspect of the announcement should not be how much money they have to spend, but how they plan on solving this paradox.

They already have a Facebook search engine. I've noticed that when I start typing a friend's name in the search bar, matching friends appear below pages I haven't liked and discussions; if I search for an artist I've liked, other results appear above it in the quick-search dropdown. I haven't noticed sponsored search results, but I'm sure they're working on it.

Despite many real (though also some exaggerated) counter-examples, Facebook does have features to protect privacy.

One of those things is you generally can't get very useful results from search unless you're friends with someone, or a friend of a friend (depending on the user's privacy settings). You can't see general trends, or even search for every person with a given first and last name in an area, for example.

It used to be more open, but they've heavily restricted the breadth of data returned from searches within the past few years.

Ditto. I have seen this as a crippling deficiency in the Facebook platform. Not being able to search properly in my own posts or in the groups I'm a member of really sucks. For all the engineering prowess, open sourced tools, etc., shown by Facebook, the lack of a working search makes the company seem incompetent from the top down.

Some time ago I started storing important information (like others' links, my own comments, etc.) outside of Facebook, where I can find it easily. I also started a Facebook page and added Notes to it to make it easier to document, find, and share things. It seems ridiculous that I'd have to do this just to have access to my information, but that's been the sad state for years.

FB search is terrible because there is no incentive for an engineering team to work on it; in fact, they may be quietly discouraged from doing so. FB makes money from the news feed, so any feature that distracts users from scrolling down their feed will come up losing in an A/B test where the ultimate metric is ad views and clicks.

That's what security trimming is for. You index everything, including the permissions for each item and provide the searcher's permissions (whether it's an acl, a claim, or some other form) as part of the query. Anonymous or unauthenticated users get a "public" claim that only returns results available to everyone.
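The approach described above can be sketched with a toy in-memory index; the documents, ACL labels, and claims here are illustrative, not any particular product's schema (real engines like Elasticsearch or Solr store ACLs as a filter field and apply them at query time):

```python
# Minimal sketch of security trimming: index documents along with their
# permissions, then filter results by the searcher's claims at query time.

documents = [
    {"id": 1, "text": "company picnic photos", "acl": ["public"]},
    {"id": 2, "text": "quarterly revenue report", "acl": ["finance", "execs"]},
    {"id": 3, "text": "engineering wiki", "acl": ["engineering"]},
]

def search(query, user_claims):
    """Return only documents whose ACL intersects the searcher's claims."""
    claims = set(user_claims)
    return [
        d for d in documents
        if query in d["text"] and claims & set(d["acl"])
    ]

# Anonymous or unauthenticated users carry only the "public" claim:
public_hits = search("report", ["public"])    # empty: ACL not satisfied
finance_hits = search("report", ["finance"])  # matches the revenue report
```

The key point is that the permission check happens inside the query, not as a post-filter over an unrestricted result set, so the index never leaks even the existence of restricted items.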

Google uses a highly sophisticated set of data and algorithms to determine what to present to whom when they search for what from where. The algorithms are obviously secret, but any user paying a bit of attention will notice the results are clearly not generated from the text in the text field alone.

The Mozilla / open directory project tried this. Curation doesn't scale and often assumes a single unifying ontology. This is particularly problematic in a cross-cultural context. Besides, 'quality' is not a unidimensional metric in a result set: consider timeliness, authority, notability, uniqueness, comprehensibility, etc.

Most search engines already include a URL with each hit. I can see a [crawldate] button, like the [cache] or [translate] buttons, adding some information, but it would be of dubious additional utility for most searches.

3) Open data access to metadata, giving users the exact date source of the information;

We already have DuckDuckGo; more are welcome, but it's hardly a unique offering, nor a trustworthy one given Snowden's revelations about the scale of systematic Five Eyes traffic monitoring/recording.

5) No advertising, which assures the free flow of information and a complete separation from commercial interests;

DDG or Google or Bing with plugins can supply this. Not groundbreaking.

6) Internalization, which emphasizes community building and the sharing of information instead of a top-down approach.

This is so amorphous as to be a non-point.

So out of six points, 2 things (33%) are only useful in edge cases, 1 thing (16%) is too vague to be useful, and the other 3 things (50%) are currently implemented by others and have been tried before.

Wikipedia is a pretty big exception to that assertion. Perhaps DMOZ (a clone of Yahoo circa 1996) is not the only way to do curation. Perhaps Wikipedia could apply what has worked for Wikipedia, i.e. develop a set of POV-neutral criteria for organizing collections of links and then invite everyone to participate.

It's really easy to be negative. But that's something that might at least be an interesting research project for the #1 open-curation system in the world.

You make a fair point. I'm not rubbishing Wikipedia, just questioning the supposed USP. I would also point out in response to your argument that a Wikipedia article and a set of search results are apples and oranges.

The article is written once then modified or evolved occasionally by (almost exclusively) humans, but very frequently read. It is intended to be intelligible, being structured and based in natural language. It has a very well defined scope within a flat namespace, and often clear relations to multiple formal ontologies. It is structured to be consumed in part or in whole, and may contain rich media and strong supporting contextual information (related pages).

By contrast a search result summarizes a set of potential information sources that may answer a search query in whole or in part, to various definitions of "answer". It is generally written once, by a computer, and thrown away after some period of caching. It is intended to be concise. Each component result has relatively poor context, relying upon the searcher to interpret timeliness, authority, notability, uniqueness, comprehensibility, etc. with the limited information presented, typically a very short content excerpt. It is structured to be scanned, classically in a ranked fashion from "best hit" to "worst hit", and is generally a wall of text.

Wikipedia successfully attracts people to contribute to the former, but the latter - where the information product is generated on the fly and lasting impact is amorphous (nothing particularly concrete for contributors to point to and say "I did that! Warm and fuzzies!") - is a very different beast.

I too believe there is room for innovation ... there are potentially low hanging fruit like inter-linguistic semantic queries (not keyword search) ... but there are no such key problem areas identified in the paper's summary.

The other big problem is that curating search results is inherently about prioritising a position rather than establishing a sourced and reasonably neutral version of the truth.

I'm imagining the edit wars and debates that take place on contentious wordings or facts in some parts of Wikipedia, but on a much wider scale involving hundreds of SEO consultants each aware that changing a particular criterion will have a quantifiable impact on their clients' bottom line. It doesn't sound like it would be fun to police.

Wikipedia already curates links to some extent on every page under "External Links". So there is a seed there.

And even the page text is not immune from the problem you describe. Grading and prioritizing sources is a fundamental part of producing a "reasonably neutral version of the truth." It's what determines what gets cited and how prominently it influences the article.

So while I wouldn't equate text and links in terms of the difficulty of managing POV-neutrality, I would say they sit on a spectrum.

There was a remark recently that most Jeopardy answers are Wikipedia titles. Consider Wikipedia as an ontology, with Wikipedia titles as the vocabulary. A search engine could associate articles with relevant Wikipedia titles, and try to do the same with queries. The first step of search is then relatively straightforward.
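That first step could be sketched as naive entity linking against a title vocabulary; the titles and texts below are made-up examples, and a real system would need disambiguation and redirect handling:

```python
# Sketch of using Wikipedia titles as a controlled vocabulary: tag both
# documents and queries with the titles they mention, then match on
# shared titles instead of raw keywords.

titles = {"Jeopardy!", "Wikipedia", "Search engine"}

def extract_titles(text, vocabulary):
    """Naive entity linking: keep vocabulary entries appearing in the text."""
    lowered = text.lower()
    return {t for t in vocabulary if t.lower().rstrip("!") in lowered}

doc = "The history of Wikipedia as a search engine alternative"
query = "wikipedia search engine"

doc_tags = extract_titles(doc, titles)
query_tags = extract_titles(query, titles)
overlap = doc_tags & query_tags  # rank documents by shared-title overlap
```

Matching on shared titles rather than raw tokens is what makes the first step "relatively straightforward": the vocabulary does the normalization work.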

DuckDuckGo is a meta-search engine! It relies mainly on the Yahoo BOSS API, which uses Bing search (for most countries). The Yahoo BOSS API went from free to expensive in early 2015, and the future of Yahoo (the tech company, not the Alibaba stock) is uncertain.

We definitely need more search engines; only 6-7 exist that cover a wide (international) range. Search HN for the list; we've had this discussion before.

Curation scales for some topics. It was hard to build a curated list of Linux sites, they come and go. But there are only a couple of new good, comprehensive health websites per year.

I was Blekko's founder/cto. And it's worth noting that our founding team was the Open Directory Project's founding team. Blekko's curation data was even better than dmoz in its day. Check it out: https://github.com/blekko/slashtag-data

Nonsense. It means content will be filtered through the lens of one or more individuals, and the results vary dramatically. Mob rule is one possibility. Yahoo Directory used to be a great example of mid-level quality: it gave a nice starting point and surfaced obscure stuff people overlooked. On the high end, the link below shows the Stanford Encyclopedia of Philosophy set a pretty awesome precedent for high-quality curation:

I'm pretty sure that the spammer argument is just an excuse used by Google to allow them to keep their business practices out of public scrutiny. Google search results are biased in favour of content produced by those who have money and power.

Google ranks everything based on popularity - Not based on quality. Popularity and quality are two independent concepts and not necessarily related. That's something which Wikimedia understands but which Google doesn't.

Google does take quality into account; that's the whole point behind static ranking algorithms. However, quality isn't some universal concept. I'd say the inevitable paper on gravitational wave detection is the highest-quality source on the subject, but it certainly isn't popular, because it's impenetrable to the lay masses, unlike, say, a Wikipedia article, which falls into look-at-me-I'm-oh-so-smart territory when it gets into double gradients and other math formulas with more letters than digits.

Do a Google search for anything even slightly obscure, and you're likely to find the first page or so of results filled with highly-SEO'd sites that offer little in the way of deep, detailed content. The smaller sites which do have that content, but just haven't been SEO'd much, have been eclipsed. They're still there, but rendered nearly inaccessible.

This may or may not be what you're looking for, but my freelance site just can't hit the front page in my industry when dozens of highly funded agencies can dominate it. I've published a 50k word industry-specific book on my site, have SEO'd as much as possible and have an older domain than those better funded. Won't link to it here, but it seems to be a real issue to me.

If a search engine let you differentiate and sort between content match vs pagerank match vs Adwords spend we might be able to mitigate the issue somewhat.

btw, your site's redirect to the HTTPS version doesn't seem to work correctly in Firefox, Safari, or IE. After reading your comment, I was curious to learn more. When I typed in just your domain name plus CMD+ENTER (which adds "www." and ".com" to the address bar text in Firefox), I got a 404 page, not the 301 redirect to the HTTPS site. When I add "http://", the redirect seems to work.

Thanks for letting me know, and for the more detailed info - the redirect worked for the basic permutations I tested for (www.domain.com, and http://domain.com etc.) but I am redirecting non-www and www traffic to https://www. in Nginx. (Solution found, see edit below.)

I'd not heard of the CMD+ENTER method before, so thanks for the heads up. Still not entirely sure what Firefox is submitting in that case. Will test.

I wasn't referring to this site in the parent comment, but to my freelance site. I'll put that site in my profile for 24 hours just in case anyone wants to take a look.

EDIT: Fixed, as noted in reply to nl's comments. A recent change led to a redirect line being mistakenly commented out.

I don't work in this area, but 99% of the time I hear someone complaining about how Google is favoring sites that pay for advertising over them, I find that they are making incredibly basic errors like this.

That's all well and good, and thanks for checking, but I wasn't referring to that site in the parent comment. The other site does have SSL enabled, but only recently and via Let's Encrypt. The issue is much longer standing than this.

So I'm afraid it's not quite as simple as you make out in this case.

In other news, I've just pinpointed the missing line in the recently changed nginx config for Linguaquote: the http://www block had its redirect commented out. Still, for this site in particular, Google Webmaster Tools is set up for the https version, where no errors have been reported, and SSL Labs gave an A+ for the stapling, PFS, Heartbleed, etc. efforts I went to. I don't think this redirect was having an adverse effect on ranking, but I don't expect this site to hit the front page just yet - much more content to add before aiming for that.

Well, they do have some quality metrics, like duplication with other content, words used, and so on. I suspect more, e.g. writing-style measurements correlated with other things that are found useful. There is a lot that could be done without actually understanding the content, although of course it can be gamed.

The primary measurement used for Google's PageRank algorithm is the number and "value" of backlinks that a page has. The "value" of a backlink is determined by the cumulative number of descendant sub-backlinks it has. This is common knowledge among SEO professionals.

Basically, when judging quality, Google is making assumptions like: "This page has a lot of backlinks, and those backlinks themselves have a lot of backlinks... Therefore this page is of high quality." This approach puts all the power in the hands of content providers (bloggers) who are funded by big companies (or well-funded startups) and who serve the interests of those companies.
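The recursive "links from well-linked pages count for more" idea can be sketched with a tiny power-iteration version of PageRank; the link graph and damping factor here are illustrative toy values, not Google's actual data or parameters:

```python
# Minimal power-iteration PageRank over a toy link graph.

links = {            # page -> pages it links out to
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}

def pagerank(links, damping=0.85, iterations=50):
    """Iteratively redistribute each page's rank across its outlinks."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            share = rank[p] / len(outs)
            for q in outs:
                new[q] += damping * share
        rank = new
    return rank

ranks = pagerank(links)
# "c" collects links from both "a" and "b", so it ends up ranked highest.
```

Note how rank flows through the graph: a page's score depends on its backlinks' scores, which is exactly why well-funded sites that can attract (or buy) links from other well-linked sites accumulate the most "value".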

Google wrongly assumes that content-providers serve the interest of consumers and that they can be trusted - Which is not the case.

This is a very, very small amount of money if you want to build a search engine, let alone one "to rival Google" (source?). The goals look realistic, though: explore how Wikipedia search could be extended beyond results from wikipedia.org, build some test sets, and get a better idea of what it really is that's supposed to be built.

I run a knowledge engine project that actively mines facts from web-based sources and third-party data dumps. It was featured on the front page of HN a while ago, and has a total of $10 in funding (from a single donation; not a typo). I have, however, put a ton of time into it, and it's something I'm very passionate about. I'm fairly confident Wikipedia can have success making initial headway on their grant objectives with $250,000.

Over 15 years ago, for my undergraduate thesis I set up a "Hypermedia Textbook" on the history of my field. I had to manually collate all the info, manually scan in every photo, and type in every last bit of text and html. The end result was a couple hundred pages that looked very, very similar to what your Einstein page looks like! At the time, I knew a better way would emerge, but didn't know how or when. It's moments like this that I (a) feel old :( and (b) am amazed by the times we live in and the speed at which things are happening! :) Thank you for providing such a wonderful, if unintentional, moment of self-reflection!

Cuil spent about $30M before they went bust. By the time they went under, due to a total lack of a revenue model, they had a halfway decent search engine that did its own web crawl. So that's a data point on how much it costs.

Building a basic search engine is relatively easy. Building one that rivals Google is extremely difficult, and not just because they're so big and convincing people to switch is hard. It's much easier to have good results when you know that the websites you're indexing don't care about you at all. Once you get popular enough to rival Google everyone and their mother will be trying to game you and that changes the problem significantly.

The original implementation of the Google search engine would get obliterated today, though I guess you have to start somewhere.

There's another incumbent advantage here: I imagine it's much easier to provide good results when you also have data on which results thousands or millions of people clicked on for millions of search terms.

It is a small amount of money, and the database will have to be huge to index a lot of websites. So the servers will need large storage and enough capacity to serve users without suffering outages.

I imagine they will build a proof of concept and get more money for it later. Have volunteers work on building it to save money. Open source the project and have others look into fixing the issues with it.

Most non-profits, especially the ones that are always asking, usually have a lot of funds.

Before I give, I go to GuideStar, hit the free preview (they try to trick you into a paid membership), download the last few years of 1040s, and see if everything looks copacetic. I look at who is making the most money; there is usually one person making a very good living. California non-profits are much easier to scrutinize than Delaware non-profits.

I don't know if you can trust anyone, I mean bias is everywhere. You can pick a side.

For example, the Portuguese Wikipedia is shared by Brazil, Portugal, and other countries, with many controversial matters concerning colonization over the last 500 years, where both sides have academic work to support their contradictory views. Which views prevail?

An example of things being done differently is some of the ex-Yugoslav countries (Serbia, Croatia, Bosnia, Montenegro, etc.), whose languages are more similar than Portuguese and Brazilian Portuguese, yet each has its own Wikipedia, with different articles on the same subject depending on its point of view. Lately, I've been seeing more of the Serbo-Croatian Wikipedia, which I think aims to unite more of the others.

I don't know which way is better, I'm just a user.

Another reason you can't trust anyone, and this is general to the Internet, is that shilling, commercial and political interests aiming to change perception are everywhere. On reddit or facebook, with or without sources. It's the worst aspect of the internet for me these days.

Google has been the dominant search engine for a solid 12 years. They have a practical monopoly. They're under constant anti-trust scrutiny - of one form or another - in all of their major markets. And you think they're going to start taking advantage (as in particularly egregious behavior) of their position soon?

No need. It's generating $23 billion per year in operating income and growing. There are no serious challengers. It's far more likely their search engine will be constrained under piles of government oversight in the coming years. About the time governments start worrying about products like this in tech, is about the time they're just beginning to become less central. The exact same thing happened to both IBM and Microsoft.

Not only is it a search engine, but it is a grant application that has had WMF staff leaving in droves, and has greatly upset many, many others - who will quite likely also leave.

It's very, very sad. And it's also a shameful moment for the WMF.

edit: and don't just think it's me saying it. The WMF has had a mass exodus of staff in the last week or so. If you speak to any WMF non-executive staff members directly, you'll quickly find out that morale is at an all time low, and confidence in the WMF Board is sitting at something like 12%.

Can you say more/explain? What about the application is so upsetting? What is shameful about this?

The Knight Foundation is about as upstanding as you can get, so it can't be that (full disclosure, I've received funding from them, so I'm definitely not unbiased on that point). So, what exactly is it that's so shameful here?

To be clear: my issue (and in fact, most peoples' issues) are not with the Knight Foundation. In fact, they appear to have been above board in every way in this whole debacle. It is the WMF board who are the problem here.

As an aside, I think the internet is simultaneously great at spreading absolute bs and disinformation and pushing people to have citations handy... paradoxically, both seem to be getting more frequent. (It all depends on where you browse, clearly).

Yeah, nobody knows this more than myself. I created [citation needed] and I've watched it be misused for years. I am glad I came up with the idea, but I'm resigned to the fact that it's human nature to misuse a valuable idea.

Ironically, it's not the project itself that is the problem; it is the way it was done. It is causing massive, massive problems internally within the WMF, and frankly it's spilling over into the wider Wikipedia community. I can't speak for the other projects, though - I don't know enough about them to say what the general feeling is around there.

It's not nice to be the only one here on HN pointing out that there are some absolutely massive problems going on at the WMF at the moment, but I'm an outsider who was once an insider and I still know enough influential people through Facebook and other mechanisms to see enough to know that there is a crisis happening right now within the WMF.

edit: I should note that, as an outsider who doesn't ever really want to be hugely involved in Wikipedia-related matters again (for various personal reasons not necessarily related to Wikipedia or the WMF), I don't really have any fear in stating what I see - nobody can really come back at me so I have no fear of any reprisals.

The grant application you are looking at was only revealed due to a MASSIVE amount of controversy and pressure within the Wikimedia Foundation.

The community representative (James Heilman) on the board was let go the other day, in part because of concerns around this grant. You might want to look at the Wikipedia Signpost article he wrote about this:

Many people have questioned this. Lila, their Executive Director, seems to have conjured this up out of thin air, without consulting any WMF staff members or anyone in any of the various communities. Even highly influential, well-respected people like Tim Starling appear to have been blindsided by this.

Here is what Lila Tretikov wrote about the search engine:

It was my mistake to not initiate this ideation on-wiki. Quite honestly, I really wish I could start this discussion over in a more collaborative way, knowing what I know today. Of course, that’s retrospecting with a firmer understanding of what the ideas are, and what is worthy of actually discussing. In the staff June Metrics meeting in 2015, the ideation was beginning to form in my mind from what I was learning through various conversations with staff. I had begun visualizing open knowledge existing in the shape of a universe. I saw the Wikimedia movement as the most motivated and sincere group of beings, united in their mission to build a rocket to explore Universal Free Knowledge. The words “search” and “discovery” and “knowledge” swam around in my mind with some rocket to navigate it. However, “rocket” didn’t seem to work, but in my mind, the rocket was really just an engine, or a portal, a TARDIS, that transports people on their journey through Universal Free Knowledge.

From the perspective I had in June, however, I was unprepared for the impact uttering the words “Knowledge Engine” would have. Can we all just take a moment and mercifully admit: it’s a catchy name. Perhaps not a great one or entirely appropriate in our context (hence we don’t use it any more). I was motivated. I didn’t yet know exactly what we needed to build, or how we would end up building it. I could’ve really used your insight and guidance to help shape the ideas, and model the improvements, and test and verify the impacts.

However, I was too afraid of engaging the community early on.

Why do you think that was?

I have a few thoughts, and would like to share them with you separately, as a wider topic. Either way, this was a mistake I have learned enormously from.

That's a very, very real problem. An executive director of the Wikimedia Foundation should never have felt too afraid of engaging with the wider community on an issue as fundamental as this one.

It's even more concerning that a half-thought through idea didn't get discussed and yet a grant application was made. All those who say that the application is only for $250,000 are entirely missing the point - the entire project would be $2.5 million, this is just the first, initial stage.

It's even worse when Jimmy Wales states that:

"To make this very clear, no one in top positions has proposed or is proposing that WMF should get into the general 'searching' or 'try to be Google'." [1]

Yet that is precisely what is being done here.

The WMF appear to have known about this, because they seem to have made a large number of hires dedicated to search - which I hear through contacts was questioned at the time, as it seemed an odd way to allocate WMF resources.

There have been, in the last week, I believe 5 or 6 influential staff departures from the WMF. In fact, they appear to be haemorrhaging staff currently, with no real sign of any abatement.

None of this is at all satisfying to me. I was very, very involved in Wikipedia years ago. I started their Admin Noticeboard, and I did lots of article work, and helped kick off some key things, one of which was the [citation needed] tag which I have to admit I have some mixed feelings about. But for such an important project, it saddens me greatly to say that as an outsider now, it looks like things are being badly mismanaged.

I hope for everyone's sake (and not just the folks at the WMF) that this can be resolved. It's not like governance issues can't be addressed - when Sue Gardner was in charge of the WMF, things not only ran like clockwork, but she ensured maximum transparency, and we all trusted her implicitly because she earned that trust. I can't say the same for the current Executive team.

It may be of little importance next to your excellent references and what they show, but... does anyone else notice how the way she worded that message is just... so... weird? The wording comes off as a combination of academia, PR, and email scams to me. Just straight BS that no normal, caring person in a mission-oriented organization should ever say.

I mean, there's certainly styles I'm unfamiliar with. I'm always open to new experiences. Could be the case here. Hers just instantly set off red flags in my intuition. I hope she didn't always write like that as it might mean whoever brought her in either fell for a con or were part of it.

Yeah, it's super weird. I'm usually pretty understanding of corporate communication, especially when it's going to be public, but those paragraphs are just bizarrely phrased. "Ideation," "retrospecting", "beings," "universe," frequent references to a "rocket." It reads like one of those crazy-people websites from the 90s with a dozen different fonts.

Her native language is Russian, and she was born in Russia. I always assumed that was the reason the language she uses is like this.

I've been critical of her "rocket" imagery, but I like to think I'm understanding about the odd use of English in the rest of her comments. Especially as I'm a monolinguist, heavens only knows how I would sound if I tried to learn and speak Russian amongst Russians...

DuckDuckGo has also started to crawl the web with its own bot (right now they're using Yandex's API).

We need more competition from different countries. Just think about the censorship done by Baidu or how Google never plays by its own rules.

It's also interesting to think about a way to monetize a search engine. For kairos.xyz I was thinking about paid accounts (1 euro per month) providing more features, like the ability to search from the command line. For example you write "kairos Richard Stallman" and it prints basic information about Richard Stallman on your terminal.

The only thing I can think of right now is that the HTTP "Host" header field is not being sent. I have several sites on the same server; Nginx is used as a reverse proxy and uses the Host field to route traffic to different ports.
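A hypothetical sketch of the redirect arrangement discussed in this thread (server names and ports are placeholders, not the actual site's configuration): nginx selects a server block by the Host header, and the plain-HTTP block needs its own redirect line, which is the sort of line that was accidentally commented out here.

```nginx
# Redirect all plain-HTTP traffic (www and non-www) to https://www.
server {
    listen 80;
    server_name example.com www.example.com;
    return 301 https://www.example.com$request_uri;
}

# The HTTPS vhost; requests whose Host header matches no server_name
# fall through to whichever block is marked default_server.
server {
    listen 443 ssl;
    server_name www.example.com;
    # ... ssl_certificate, proxy_pass to the backend port, etc.
}
```

A request arriving with a missing or unexpected Host header would hit the default server block rather than the intended vhost, which could explain a 404 instead of the 301.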

Anyone want to hazard a guess at the technology they plan to implement to get this started?

Surely this is not designed to be written from scratch, so..

- Are they using known lexical & semantic scanners?
- Is it focused on English language first?
- What crawlers will scan content?
- I'll assume it's an open platform, but license for contributors?
- What database architecture will hold the graph?
- How does it know the mark of authority, and is this primarily based on human input learning or machine learning?

I'm sure $2.5M won't touch the sides, but maybe, if it's a well-directed project with healthy user contribution, based on interesting technologies, they might develop a good backbone architecture. Ambitious, for sure.

Computing is so cheap now that Google isn't going to stay dominant in text search for long. Their money is needed for video, pictures, and audio, but the text internet can now be cached whole by small entities.

Maybe Wikipedia should launch a video encyclopedia to try to provide a 5 minute video of every article, for people who like videos more than reading.

That's a darned shame. I'm aware that there are a lot of areas that people want to fix on MW Core.

I get really concerned when I hear that the person who holds the vision and direction for the Wikimedia Foundation didn't really participate in it beforehand, and I get even more concerned when I see that she branches off into proposals for search technology that appear to be far outside the scope of Wikimedia projects.

Nobody has ever thought search in Wikipedia or the various projects was particularly effective. However, bringing everything together doesn't just involve searching, and frankly there are a number of more pressing governance and community issues that need to be managed.

Perhaps I'm being a bit unfair here, but she was profiled when she first joined the WMF Board, and the following was said about her:

At the meeting, she described the impact on friends and family of the Chernobyl nuclear disaster, and the difficulty of getting reliable information in the face of “so much secrecy.”

Yet we see that this is precisely what happened with this grant proposal. A major grant was applied for and awarded and not even WMF staffers knew about it. You can see on the mailing list that it was a total shock when it was finally revealed.

I'm watching this train wreck from afar, but closer than others, because some of my friends are deeply involved in Wikipedia and the WMF. I'm always amazed that a leadership change can completely kill an organisation. I've seen it in the corporate world, and I see it all the time in the volunteer world as well. The Wikimedia Foundation seems to be yet another victim of the appointment of a clueless leader with no experience in the area or with the group they are meant to be leading: thrashing around, making changes without really understanding how the systems work or the history of the organisation, failing to rely on the experience and sage advice of the many expert and dedicated people around them, and ultimately leaving a great deal of unnecessary turmoil, ill-will, and frankly destruction in their wake.

Turns out that is the only thing this project is about. There is no web crawler and there is no external content. The grant money is being spent to improve internal search at Wikipedia.

Good luck. We definitely need more search engines. (Google's announcement that it will lower the PageRank(R)/site score of non-HTTPS sites is a clear indicator that it is about to cross the line into monopoly territory. And no, DDG and most others are "just" meta search engines that rely on Yahoo BOSS ($$$), whose future is uncertain and which itself relies on Bing.)

There was "Wikia Search" by Wikipedia founder Jimmy Wales:

"Wikia Search was a short-lived free and open-source Web search engine launched by Wikia, a for-profit wiki-hosting company founded in late 2004 by Jimmy Wales and Angela Beesley.

Wikia Search followed other experiments by Wikia into search engine technology and officially launched as a "public alpha" on January 7, 2008. The roll-out version of the search interface was widely criticized by reviewers in mainstream media. After failing to attract an audience, the site closed by 2009."

Google doesn't want to lower the PageRank of http sites; it just wants to use http vs. https as a feature in ranking. That isn't particularly surprising. I would be willing to bet Google already uses hundreds of such features (one of which might be PageRank).
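To make the distinction concrete, here is a toy linear ranking model showing how HTTPS can be one small weighted signal rather than a penalty switch. All feature names and weights here are invented for illustration; this is not Google's actual model.

```python
# Toy linear ranking model: HTTPS is just one weighted signal among
# many, so a plain-http page with strong content can still outrank
# an https page. Features and weights are invented for illustration.
WEIGHTS = {
    "pagerank": 5.0,        # link-graph authority
    "text_relevance": 3.0,  # query/document match
    "is_https": 0.1,        # small boost, not a hard penalty
}

def score(features):
    """Weighted sum of feature values (all in [0, 1])."""
    return sum(WEIGHTS[name] * value for name, value in features.items())

http_page  = {"pagerank": 0.9, "text_relevance": 0.8, "is_https": 0.0}
https_page = {"pagerank": 0.5, "text_relevance": 0.6, "is_https": 1.0}

# The http page wins despite lacking the https boost.
print(score(http_page), score(https_page))
```

With these made-up weights the http page scores 6.9 against 4.4, which is the point: a "ranking signal" nudges scores rather than demoting a page outright.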

Given the terrible state of advertising, I would welcome a search engine that penalizes pages with popovers, animated ads, auto playing audio, and so on. Google would never build this, given its business model.

I hope Wikipedia brings some innovation to search, untethered from advertising revenue.

As a former NLP engineer and former wikiHow engineer, I have some perspective on this. Google has included more and more information from Wikipedia. Furthermore, Google includes snippets of external websites in the knowledge box on more and more pages.

How long will it be until Google can algorithmically generate its own Wikipedia articles? Wikipedia relies upon people coming to its site for contributions and donations. Without search, Wikipedia risks being subsumed by Google. They are in the difficult position of having to think about the future without pissing off Google.

Computers are getting more and more powerful. Wikipedia needs to do this to stay relevant. I think this is the right decision.

I've read a lot of the inside-Wikimedia links here, and I'm confused about all the talk of gnashing of teeth and rending of cloth. This is controversial because some want to pay down technical debt rather than have a small team do knowledge graph search?

Yeah, the relative merits of the initiative seem to be beside the point. If you have a toxic environment, even a proposal to cure all disease for everyone for free will attract derision.

It's hard to get worked up over some other team's morale. If it's such a crappy place, just quit. They could probably literally go across the street and get a new job. I don't really care. It's all way too inside baseball for me.

You cared enough to comment. This wasn't just a place of work for me, I volunteered my time because I believed in what they were doing.

You might never have contributed in a significant fashion to Wikipedia and other WMF projects, but I did. Sure, I didn't get employed, but then again I know a lot of people I met and have continued to be friends with who are still deeply involved. You may not care about your friends' morale, and you might think it's easy for people to "go across the street and get a new job", but then you seem like a pretty thoughtless person.

Of course, you've not understood at all what the larger issues are. You must have a bit of a comprehension issue, because I supplied quite a few links that you apparently read that explained the underlying problems.

0) Stop with the personal attacks. It's rather unbecoming, and I don't appreciate it.

1) Being genuinely confused about what the big deal is isn't the same thing as caring. It's asking for confirmation of a conclusion.

2) If you're in a situation where you're unhappy, then you have a responsibility to make yourself happy. Staying around in a crappy situation and whining about it doesn't help, and neither does insulting people.

3) Wikimedia is in San Francisco. If I had to take a guess, I would say there are literally a hundred other tech organizations in that city alone, including nonprofit organizations with a societal purpose. 18F comes to mind. Again, see 2.

2) They aren't whining. Saying so is pretty much a personal attack. It's certainly insulting. I don't appreciate it, and I'd say neither do they. Funny how that works both ways.

But interestingly enough, as has been pointed out already - people ARE leaving in droves.

I'm no longer involved in Wikipedia, but I can still be unhappy with the direction they are taking.

3) If you think that just leaving a non-profit you have emotionally invested in is an easy decision, then you really haven't thought things through. If you think it's elementary to just step out of one job and into another, that's also thoughtless.

That's how much 5 qualified software engineers would cost to employ for a year (gross, including compensation, benefits, payroll taxes, office space, hardware, etc, and that's on the low end of the range). Good luck with that.

So? I've found 1:10 to be a good ratio between leads and grunts. 1:5 if lead is technical and also contributes work and not just leadership/management. And that's when those 10 (or 5) people are working full time.

A software engineer qualified to work on this kind of thing is worth about $350K in combined compensation on the market right now. Typically half of that is base, while the other half is stock and other taxable benefits. The number can be higher. This is the cost of just compensation to the company, excluding the payroll tax. You can, of course, find someone a lot cheaper, but then you'd be a fool to expect the result to be anywhere near as good as what Google can pull off, because if that someone could do what Google can, why would she work for half the compensation instead of applying to Google or FB or whoever pays competitive salaries these days.

As a former Wikia employee, I am somewhat of a MediaWiki insider. I sped Wikia's search engine up by several orders of magnitude and then went on to pilot a number of NLP/machine learning initiatives in the company.

Jimmy Wales already tried to make a "Google killer" ten years ago. It was tilting at windmills, to say the least. Letting individuals help manage algorithmic search results was harder than you could imagine. Let's not even get into the difficulty of building an effective crawler.

One of Wikia's former CEOs, Gil Penchina, notoriously undervalued search as a result of this very public gaffe. By the time I came in, it took over five seconds to do a simple on-wiki search. Searching across wikis took so long they actually just sent the search to Google and had you abandon the site. I personally fixed a lot of these problems, and that part was pretty cool.

So now let's get to the subject at hand, which is a search feature based on an authoritative knowledge graph. Something like this should adequately surface factual information in an intuitive manner -- optimally based on natural language. Wikia already tried this, too. They brought on a very seasoned advisor who played a crucial role in the semantic web movement far back into the early oughts. I remember going to semantic web meetups in Austin when I was in grad school quite some time ago now to hear this guy talk.

This guy was essentially the SF-based manager or lead for a small team located in Poland whose job it was to take some of the "structured data" at Wikia and attempt to build some kind of knowledge graph on top of it. This project was unsuccessful.

So why did it fail? We'll start with a lack of product direction. Wikia had, and probably still has, a very junior product organization that is mostly interested in the site's UI and (recently) a focus on "fandom" (yuck). The team allocated to the project was based in Poland (Poznan, to be exact) and consisted primarily of kids fresh out of a technical school on their first job. Your assumption about communication being a problem would be correct. However, the subject matter expert was so entrenched in his area of specialization that the problem was compounded even further on the native-English-speaker side. There was too much getting into the weeds, and not enough focus on incremental progress.

To make things worse, they tried using a proprietary, not-ready-for-primetime data store because it most closely matched the SME's preconceptions on how the data should be structured. There was absolutely not an existing business use case for this data store, and problems getting it to work turned even building a simple demo into a death march.

Either way, what I'm saying is, $250,000 is not enough to solve this problem. We have attempted to solve this problem before in the MediaWiki world. It's not going to magically get better. To make something like this work, you need:

1) Best-in-class UX people who would know how a knowledge graph provides a significant improvement over existing solutions
2) Leadership that can bridge the gap between SMEs and implementers
3) Very skilled engineering resources with backgrounds in less conventional technologies

This is a massive investment that no one is willing to make in what is essentially a media play.

About six months later, I had built a proof-of-concept that sucked data out of MediaWiki Infobox templates into Neo4j, a well supported graph database. I was able to answer questions like, "Which cartoon characters are rabbits", and "What movie won the most Oscars in 1968" using the Cypher query language.
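The first step of that proof-of-concept, pulling structured key-value data out of Infobox templates, can be sketched in a few lines. This is my own simplified illustration, not Wikia's actual pipeline: the wikitext sample and field names are invented, and real infoboxes have nested templates and multi-line values that this naive parser ignores.

```python
import re

def infobox_triples(title, wikitext):
    """Extract (subject, predicate, object) triples from a MediaWiki
    Infobox template. Deliberately simplified: assumes one infobox,
    no nested templates, and one '|key = value' pair per segment."""
    match = re.search(r"\{\{Infobox[^|]*\|(.*?)\}\}", wikitext, re.S)
    if not match:
        return []
    triples = []
    for segment in match.group(1).split("|"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            key, value = key.strip(), value.strip()
            if key and value:
                triples.append((title, key, value))
    return triples

# Illustrative wikitext, not real wiki content
sample = """{{Infobox character
| species = Rabbit
| first_appearance = 1940
}}"""

print(infobox_triples("Bugs Bunny", sample))
```

Triples like these map naturally onto graph nodes and relationships, which is what makes questions such as "which characters are rabbits" straightforward to express in a graph query language like Cypher.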

At that point in time, Wikia had decided they were tired of investing in structured data, and wanted to re-skin the site for a third time in as many years to make it look more like BuzzFeed.

Structured data is cool. In many cases, unsupervised learning may be what you're actually looking for. But in the end it has to satisfy a real user's needs.

Wikipedia has five million English articles. Wikia has over 20 million. As far as capitalizing on this wealth of knowledge goes, the devil is truly in the details. But it's a real shame that all of that information isn't put to better use than encouraging the socially maladjusted to take quizzes about which anime character they're most like.

How did you arrive at 20 million? This sounds like one of those "technically true" facts that are cooked up for investors. http://wikis.wikia.com/wiki/List_of_Wikia_wikis puts the combined total of the top 1,000 wikis (in all languages) at 12.4m.

Google's advantage isn't just that they were first, or that their algorithm is the best: it's the CPU resources they have available to keep their data updated faster.

Search for any news item and you'll have all articles published more than 2 minutes ago included in your results, all blog posts, everything. They consume it all, and offer the output in near-real-time.

Wikimedia don't have the resources to do that. And they especially won't without advertising to pay for it.