Tag Archives: Search Engines

Yahoo is finally pulling the plug on AltaVista on July 8th. It appears as a one line entry in Yahoo’s latest list of closures (http://yahoo.tumblr.com/post/54125001066/keeping-our-focus-on-whats-next) with the comment “Please visit Yahoo! Search for all of your searching needs”. AltaVista was started by Digital Equipment in 1995 and quickly became the default search engine for many of us. I still meet people who have remained loyal to AltaVista even though it lost its unique search features a long time ago. Danny Sullivan has written a short history and eulogy for the search engine at http://searchengineland.com/altavista-eulogy-165366 – “A Eulogy for AltaVista, The Google of its Time”. Great though it was, some of us had already defected to the Inktomi powered search engine HotBot by the time Google had arrived on the scene. Alas, HotBot is now a shadow of its former self and AlltheWeb, which Yahoo had also acquired, was closed down in April 2011.

April is going to be a very busy month for me this year. As well as speaking at conferences I am also giving six full day workshops so am having to prepare the presentations, handouts and notes well in advance. When it comes to the Google sessions the material the delegates receive never matches exactly what they see on their screens during the practicals. That’s par for the course where Google is concerned and it’s a great way of getting across to people how Google messes up, sorry, “enhances” search results. The problem I had yesterday, and am still having this morning, is that Google seems to have dumped me into several major ‘live experiments’ and results keep changing second by second. The consequence is that it is impossible for me to pull together a set of consistent screen shots but I, and the delegates, will just have to live with that. And it makes a good story on the day!

If you don’t know what Google’s ‘live experiments’ are the YouTube video at http://www.youtube.com/watch?v=J5RZOU6vK4Q will enlighten you. In essence, Google tests out changes to its search and ranking algorithms on users before deciding whether or not to go ahead with the changes. It could be me or you who ends up being one of Google’s lab rats. We are not asked if we want to be part of the test nor are we told. Most of the time the changes are so minor that we don’t notice the difference but occasionally they lead to some very bizarre results. See my blog posting from a couple of years ago when Google decided that coots were really lions (http://www.rba.co.uk/wordpress/2011/02/12/google-decides-that-coots-are-really-lions/). What I’ve been seeing over the last couple of days is not in that league but extremely irritating all the same.

One of my test searches is fairly straightforward – copper extraction north wales. This is what I saw:

Search in Google Chrome – no emboldened terms in the extract

What’s wrong with that you might ask. At first glance it looks as though Google is dropping terms from my search because none of them are emboldened in the extracts. On closer inspection, though, the terms and their synonyms are present. I ran Verbatim on the search and saw a similar set of results with no emboldening apart from words in the title.

I use Chrome as my default browser and, wondering whether that was the issue, I tried Firefox. The emboldened terms reappeared.

Search in Firefox – emboldened terms present in the extracts

Internet Explorer also displayed emboldened terms.

I went back to Chrome and ran the search in an Incognito window. The search terms appeared emboldened in the extracts.

Thinking the problem was due to me being signed in to a Google account I signed out and ran the search. No emboldened terms. I cleared the cache and cookies. No emboldened terms. I disconnected Chrome from my Google account. No emboldened terms. I disabled all of the extensions. No emboldened terms. It was clear that Google was not going to show me emboldened terms when using a normal Chrome window. Why is it so important? Because it is a quick way of initially assessing the relevance of the results. No emboldened terms in the extract suggests that they were not found in the text of the page. If this is indeed an experiment and not a local glitch on my system, and Google decides to roll this out to all users we are all going to waste a lot of time wading through irrelevant results.

On to possible experiment number 2. Google sometimes ignores the setting that tells it how many results to display on a page. I have set mine to 100 but occasionally it reverts to just 10. Refreshing the page or going into settings and saving them again usually works for me. This is a minor irritant, unlike experiment number 3.

I didn’t see 6 results but 4! (As an aside, the emboldened search terms in the extract have returned). The fifth was a result for similar searches with an annotation that indicated ‘& co second hand’ had been omitted. A couple of the results were OK-ish but I was hoping for more detailed information. Is there really so little information for this query? Like Phil, when I clicked on to the next page I was back to sensible results. Unlike Phil, using Verbatim on the search worked for me and overrode the experiment, so again I was back to sensible results. This morning, I could not replicate the 6 results per page display.

Experiment number 4: annotations below the extracts. Some of these annotations look like headings from the pages themselves but others do not. I cannot replicate what I saw yesterday and didn’t take any screenshots of this one. I am sure I didn’t dream it because a couple of people in my network on Twitter have reported similar experiences.

This continual round of disappearing, reappearing, disappearing “features” is infuriating. Yes, we can all go off and use other search engines but there are times when the type of content and level of coverage tempts us back. You do have to know how to use the advanced search commands to get anything sensible out of Google, but even then success is not guaranteed. This is an area I concentrate on in my workshops. The next one on Google is being organised by UKeiG in Manchester (see the UKeiG web site for details). The title “Make Google behave: techniques for better results” may seem a little overoptimistic given my own and other people’s experiences, but there are plenty of tricks we can employ to get usable results.

Given that Google is now just over 13 years old and a teenager it is not surprising that it has become somewhat truculent. It’s when it starts going through the silent grunting phase that we need to really start worrying.

Google has put together a site showing how Google search works (http://www.google.com/insidesearch/howsearchworks/thestory/). The main page is a scrolling animated graphic that just gives you some elementary facts but there are links to more detailed information and videos on the main topics of crawling and indexing, the searching and ranking algorithms, fighting spam and Google’s general policies. They are a useful set of pages for anyone who does not already know the basics of how Google works, but if you are looking for something that tells you how to get sensible results from Google you’ll be disappointed. As Phil Bradley says:

“…. boils down to ‘we find some stuff, do magic to it, filter out the crap that our magic didn’t get and then give it to you.’ Yes folks, an entire site to say that. Wasted opportunity.”

Anyone who has attended one of my workshops knows that I ask the group to propose at the end of the session their top tips. These are the Canterbury group’s top 10 tips.

1. What’s going on?
Try and find out what’s going on behind the scenes and how the different search tools work. For example, Google and Google Scholar are quite different in the way they manage your search. Understanding how they operate means that you can adapt your search strategy accordingly and also manage your expectations; for example Google Scholar does not use the publishers’ metadata so author and date searches are unreliable.

2. Personalisation and ‘unpersonalisation’
Google personalises your search based on past activity, who is in your social networks, and a whole host of other ‘stuff’. You can quickly ‘unpersonalise’ your results by using a separate browser window that does not use cookies or your web history as part of the search algorithm.

If you use Chrome as your browser, open what is called an incognito window. In the top right hand corner of your screen there is an icon with three lines. Click on it and from the drop down menu select New incognito window. Alternatively press Ctrl+Shift+N on your keyboard.

If you use Firefox, from the menu at the top of the screen select Tools followed by Start Private Browsing.

In Internet Explorer select Tools followed by InPrivate Browsing. If you cannot see InPrivate under Tools try looking under the Safety option.

3. Advanced search commands
Use Google advanced commands such as filetype: to focus on PDFs, presentations or spreadsheets containing data, and site: to look for information on just one site or a range of sites such as UK government. Although the advanced search screen has boxes for you to fill in for the commands, the file format or filetype option is limited. It does not include options for the newer Microsoft Office formats such as .pptx and .xlsx. Use filetype: as part of your search strategy, for example:

nasa dark energy dark matter filetype:pptx

Google Scholar commands are more limited – see slide 28 of the presentation.
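If you run a lot of filetype: and site: searches, a few lines of Python can put the query together for you. This is only a minimal sketch: the function names are my own invention, and the base search address is an assumption about how Google accepts a query in a URL.

```python
from urllib.parse import quote_plus


def build_query(terms, filetype=None, site=None):
    """Join plain search terms with optional filetype: and site: commands."""
    parts = [terms.strip()]
    if filetype:
        # Drop a leading dot so ".pptx" and "pptx" both work.
        parts.append("filetype:" + filetype.lstrip("."))
    if site:
        parts.append("site:" + site)
    return " ".join(parts)


def search_url(query, base="https://www.google.com/search"):
    """Encode the query into a URL (the base address is an assumption)."""
    return base + "?q=" + quote_plus(query)


query = build_query("nasa dark energy dark matter", filetype="pptx")
print(query)
print(search_url(query))
```

The first line printed is the same search as the example above; the second is that search encoded for pasting into the browser address bar.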

4. intext:
Google automatically looks for variations on your terms and sometimes omits words from your search if it thinks the number of results is too low. Prefixing a term with intext: tells Google that it must be included in your search and exactly as you have typed it in. For example:

UK public transport intext:biodiesel statistics

tells Google that biodiesel must be included in the search and exactly as typed in.
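The same prefixing can be done automatically if you keep a list of the terms that must appear. A minimal sketch in Python follows; the function name is hypothetical and it only handles single words, not phrases.

```python
def require_terms(query, required):
    """Prefix each required word in the query with intext:."""
    required_lower = {word.lower() for word in required}
    out = []
    for word in query.split():
        if word.lower() in required_lower:
            out.append("intext:" + word)
        else:
            out.append(word)
    return " ".join(out)


print(require_terms("UK public transport biodiesel statistics", ["biodiesel"]))
```

This prints the search shown above, with biodiesel marked as a term that must be present.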

5. Reading Level
Use Reading level if Google is failing to return any research oriented documents for a query. Run the search and from the menu above the results select Search tools, All results and then from the drop menu Reading level. Options for switching between basic, intermediate and advanced reading levels should then appear just above the results. Google does not give much away as to how it calculates the reading level and it has nothing to do with the reading age that publishers assign to publications. It seems to involve an analysis of sentence structure, the length of sentences, the length of the document and whether scientific or industry specific terminology appears in the page.

6. Date options
In Google web search, use the date options in the menus at the top of the results page to restrict your results to information that has been published within the last hour, day, week, month, year or your own date range. Click on Search tools, then Any time and select an option. This works best with news, discussion boards, and blogs and web sites that use blogging software to generate pages but Google is getting better at identifying the correct date of a web page.

Google Scholar handles publication dates differently. On the results page you can select a date range from the menu on the left hand of the page. Alternatively, you can run a Google advanced search and enter your publication years. However, Google Scholar looks for publication years in the area of the document where the date is most likely to be. As a result it may identify a page number or part of an author’s address as a year!
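To see why this kind of misfire happens, here is a deliberately naive illustration in Python. It is not how Google Scholar works (Google does not publish that); it is simply an assumed rule that treats any four digit number between 1500 and 2099 as a year, applied to a made-up article header.

```python
import re

# Naive rule (an assumption for illustration only): any standalone
# four digit number from 1500 to 2099 is treated as a possible year.
YEAR_PATTERN = re.compile(r"\b(?:1[5-9]\d{2}|20\d{2})\b")


def candidate_years(text):
    """Return every number in the text that the naive rule accepts as a year."""
    return [int(match) for match in YEAR_PATTERN.findall(text)]


# Invented header: one real year, a page range and a street number.
header = "J. Example Studies 1987, pp. 2013-2020. Dept. of Chemistry, 1600 Campus Road"
print(candidate_years(header))  # [1987, 2013, 2020, 1600]
```

Only 1987 is a publication year here; the page range and the street number pass the test just as easily.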

7. Google Scholar alerts
To be used with caution as the searches periodically stop without warning, and so have to be set up again, and they sometimes include documents that are several years old. Whatever your search you can set up an alert by selecting Create alert from the menu on the left hand side of the results page.

If the author has created a profile on Google Scholar, from their profile page you can follow new articles and/or new citations for that author. From past experience I warn you that this is not entirely reliable.

8. Metrics – top publications
Although it claims to search all scholarly literature Google Scholar does not always cover all of the key journals in a subject area. There is no complete source list but there is a list of top publications for subjects and languages under the ‘Metrics’ link in the upper right hand corner of the Scholar home page.

9. Microsoft Academic Search – visualisations
Microsoft Academic Search (http://academic.research.microsoft.com/) is a direct competitor to Google Scholar. The site is sometimes slow to load and it often assigns authors to the wrong institution. Nevertheless, the visualisations such as the co-author and citation maps can be useful in identifying who else is working in a particular area of research. The visualisations can be accessed by clicking on the Citation Graph image to the left of the search results or author profile.

Author Citation Graph

10. Mednar visual
Deep Web Technologies has developed in conjunction with various institutions a number of science and research specific portals, some of which are publicly available. The sources that they cover are different but they all have similar search and display options. Results are automatically ranked by relevance but this can be changed to date, title or author. In addition to the standard relevance ranked list of results the portals create clusters of topics on the left hand side of the screen. The topics include broad subject headings, authors, publications, publishers, and year of publication and are a useful tool for narrowing down a search. Some of the portals, such as Mednar (http://mednar.com/), offer a clickable ‘visual’ of topics and sub-topics.

Fed up with seeing the same results from Google again and again? Wondering if that elusive document is buried somewhere at the bottom of Google’s 2,000,000 hits? Then get thee hence to Million Short (http://millionshort.com/). Million Short runs your search and then removes the most popular web sites from the results. Originally it removed the top 1 million, as its name suggests, but the default has changed to the top 10,000. The principle remains the same, though: exclude the more popular sites and you could uncover a real gem. The page that best answers your question might not be well optimised for search engines or might cover a topic that is so “niche” that it never makes it into the top results. Million Short does not say what it uses for search results or how it determines what are the most popular web sites. According to Webmonkey “Sanjay Arora, founder of Exponential Labs, tells Webmonkey that Million Short is using “the Bing API… augmented with some of our own data” for search results. What constitutes a “top site” in Million Short is determined by Alexa and Million Short’s own crawl data.” (http://www.webmonkey.com/2012/05/million-short-a-search-engine-for-the-very-long-tail/).

Using Million Short is straightforward. Type in your search and select how many sites you want to exclude (top 10K, top million, top 100). The results page includes a list of the sites that have been removed and you can opt to add one or more back in. You can also block a site using a link next to it in the results or click on “Boost!” so that pages from the site go to the top.
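The principle of removing the most popular sites is easy to sketch in Python. This is not Million Short’s own code: the function names are hypothetical, and the popularity ranking and result URLs below are invented placeholders standing in for whatever data the service really uses.

```python
from urllib.parse import urlparse


def bare_host(url):
    """Return the host name of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def remove_popular(result_urls, ranked_domains, top_n):
    """Split results into those kept and those removed as 'too popular'."""
    blocked = set(ranked_domains[:top_n])
    kept, removed = [], []
    for url in result_urls:
        if bare_host(url) in blocked:
            removed.append(url)
        else:
            kept.append(url)
    return kept, removed


# Invented data: domains ranked from most to least popular.
ranked = ["bigsite.example", "newsgiant.example", "smallblog.example"]
results = [
    "https://www.bigsite.example/biofuels",
    "http://smallblog.example/biodiesel-buses",
    "https://newsgiant.example/transport/1",
]
kept, removed = remove_popular(results, ranked, top_n=2)
print(kept)
print(removed)
```

Keeping the removed list as well as the kept list mirrors the way the results page shows which sites were excluded so that you can add them back in.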

Million Short automatically tries to detect which country you are in but you can change it under “Manage Settings and Country”. I didn’t notice much difference when I changed countries but then most of the queries I pass through Million Short tend to be scientific or technical. On the same page you can manage sites that you have blocked, added or boosted.

Does it work? I would not use it instead of the existing major search engines such as Google, Bing or DuckDuckGo but as an additional tool to surface material that is not easily found in the likes of Google. As well as web search there are image and news searches, but I’m not convinced that I’d find those all that useful.

If you are interested in comparing Million Short with Google try Million Short It On at http://www.millionshortiton.com/index.html. I had several goes at this and most of the results were a draw. That is no surprise as the searches I ran were very specific and I wanted to see if Million Short would pull up additional information, which it did. Million Short won outright on a couple and Google on one. The Google win was by default because Million Short did not come up with anything for comparison (the search in question was biofuels public transport carbon emissions).

There are a number of techniques that you can use to improve Google results, for example changing the order of the words in your search or using Verbatim, filetype: or Reading Level, but I would also recommend trying Million Short. The results should at least be different and may reveal vital information for your research.

There was a time when Google would aggregate pages from the same website in your search results. There might be just a couple of entries for the site with a “More from….” link next to the result.

Alternatively you might see a mini sitemap:

This has the advantage that you are not swamped with results from a single website but are given instead a variety of options that might provide you with a better answer to your question.

Not any more.

You may have noticed that multiple entries from single websites have started appearing in your results. For example, rather than just one Wikipedia entry you see 4, 5, 6 or even more. On the other hand, you might not have noticed anything at all. Some of my colleagues are seeing this and some are not. Google tests new features and algorithms on a small percentage of its users to see how they react so new or test features are not seen by everyone (see How Google makes improvements to its search algorithm – YouTube http://www.youtube.com/watch?v=J5RZOU6vK4Q). As far as I’m concerned this particular “improvement” is a disaster.

I was running a very general search on the use of biofuels by public transport in the UK. I just wanted to get an idea of some of the issues that were being discussed before refining my search and went, by default, to Google. My first screen had nothing but results from the UK government Department for Transport (DfT).

I scrolled down and saw more DfT pages. I scrolled down further and yet MORE dft pages. OK, Google, so dft.gov.uk is a good place for me to look at biofuels in public transport. I get the message. STOP! There were 27 DfT pages in total flooding the top of my results page, which I have set to display 100 entries at a time. Creeping in at number 28 came the Guardian with 5 results.

The Friends of the Earth website had 7 results, and then at last I started to see more variety in my results at around number 40, but still with a lot of repetition.
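If you have the result URLs in a list, counting how many come from each site takes only a few lines of Python. The sketch below uses the standard library only; the function name is my own and the URLs are invented placeholders, not the actual results from this search.

```python
from collections import Counter
from urllib.parse import urlparse


def results_per_site(result_urls):
    """Count results per host name, most frequent first."""
    hosts = []
    for url in result_urls:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        hosts.append(host)
    return Counter(hosts).most_common()


# Placeholder URLs for illustration only.
sample = [
    "http://www.dft.gov.uk/page-a",
    "http://www.dft.gov.uk/page-b",
    "http://www.dft.gov.uk/page-c",
    "http://www.guardian.co.uk/article-1",
    "http://www.guardian.co.uk/article-2",
    "http://www.foe.co.uk/briefing",
]
for host, count in results_per_site(sample):
    print(host, count)
```

A tally like this makes it obvious when one or two sites are crowding out everything else on the page.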

Google may think that the DfT is a very important source of information on the topic but I want to decide whether or not to explore more of a particular site. Spamming my results list annoys me and makes me want to go elsewhere. So I did.

DuckDuckGo (http://www.duckduckgo.com/) is my main Google alternative and it came up with a decent and varied set of results without repetition, hesitation or deviation.

Blekko (http://www.blekko.com/) came up with some interesting alternative pages for me to consider. These would not have been that useful to me in the earlier stages of my research but this test confirmed my feeling that Blekko is good at pulling up information that explores more than the mainstream issues.

If you want to stay with Google how do you deal with multiple listings of sites? The most obvious approach would be to incorporate a ‘-site:’ command in your search, for example:

biofuels public transport -site:dft.gov.uk
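When several sites need excluding, the exclusions can be appended from a list rather than typed each time. A minimal Python sketch, with a hypothetical function name:

```python
def exclude_sites(query, domains):
    """Append a -site: exclusion to the query for each domain."""
    exclusions = ["-site:" + domain for domain in domains]
    return " ".join([query.strip()] + exclusions)


print(exclude_sites("biofuels public transport", ["dft.gov.uk"]))
```

Passing more than one domain, for example ["dft.gov.uk", "guardian.co.uk"], adds one -site: command per domain.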

If you are conducting in depth research and are likely to be running many variations on a search, incorporating ‘-site:’ each time can become a chore. Google’s own browser Chrome has a Personal Blocklist extension that enables you to block selected sites from results (https://chrome.google.com/webstore/detail/nolijncfnkgaikbjbdaogikpmpbdcdef). Once installed a block link appears next to each entry in your results. Click on the link to block the site from all future results. A message appears at the bottom of searches that would normally contain pages from the blocked site warning you about exclusions.

The ‘show’ link displays and highlights the previously blocked pages and offers an option to unblock them.

Neither the -site: option nor the Blocklist approach should be necessary. There was nothing wrong with the previous ways of offering additional pages from a site in search results. It wasn’t broke but Google did break it by trying to fix it. For me, there are now several Google alternatives that produce quality results and with less irritation. I shall be using them more in future.

My talk at the recent INFORUM 2012 conference held in Prague was about the issue of personalisation and the impact of our social network activities on search results. I believe that personalisation, and in particular contributions from our social and professional networks and even Google+, can present us with an alternative view of a topic or person that can be an important part of our analysis of a situation. I always have two different browsers open. One is not logged in to any account of any sort, has all cookies cleared at the end of each research session, and has search history disabled. The other is permanently logged in to a Google+ enabled account, social and professional accounts, and has web history enabled. This enables me to quickly switch between two very different environments to give me very different results when I am conducting research on Google or even Bing. Demonstrating this at a workshop or conference can be difficult, though, because postings and comments from the social elements of the search results may have been restricted to friends or limited circles.

For the INFORUM 2012 conference I decided to generate word clouds for personalised and non-personalised results for a Google.co.uk search on the single word Prague. The titles and up to the first 250 words of the top 20 results for the searches were scraped into a document from which the clouds were generated. In the graphic below, which has been taken from my presentation, the first word cloud represents a search that is as non-personalised as I could make it and the second has been personalised by several weeks of research on what to do and see in Prague. There are no prizes for guessing what we were interested in visiting!
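The counting step behind a word cloud can be sketched in Python. This is not the tool used for the presentation; it is an assumed, simplified version that takes the title and up to the first 250 words of each of the top 20 results and counts how often each word appears. The stop word list and the sample results are invented for illustration.

```python
import re
from collections import Counter

# A very short stop word list, for illustration only.
STOPWORDS = {"the", "and", "for", "with", "that", "this", "from", "are"}


def cloud_counts(results, max_results=20, max_words=250, top=10):
    """Count words in the titles and opening text of the top results.

    `results` is a list of (title, body_text) pairs.
    """
    counts = Counter()
    for title, body in results[:max_results]:
        opening = " ".join(body.split()[:max_words])
        text = (title + " " + opening).lower()
        for word in re.findall(r"[a-z']+", text):
            if len(word) > 2 and word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(top)


# Invented results standing in for scraped pages.
sample = [
    ("Prague Castle", "Prague castle tours and the castle gardens"),
    ("Prague beer", "beer halls in Prague"),
]
print(cloud_counts(sample))
```

Running the same function over the personalised and non-personalised result sets gives two frequency lists that can be compared directly or fed into a word cloud generator.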

A paper is also available on the INFORUM web site at http://www.inforum.cz/en/proceedings. It covers much of what I said but bear in mind it was written a few weeks beforehand and the presentation was updated with new developments the night before I gave the talk.

Many of us have been saying for a while that the search engine that will kill Google is Google itself. It has come so close in the past, two of the more recent incidents being the removal of the plus sign from general web search and stopping the ‘ANDing’ of search terms. Prefixing search terms with the plus sign enabled searchers to disable Google’s synonym and variation search so that it carried out an exact match search. It still works in Google Scholar but not in general web search; Google is now using the ‘+’ prefix within Google+ to help users find Google+ business pages, for example +BASF will quickly take you to the BASF business page. Google redeemed itself to some extent by hastily bringing in the Verbatim option, which can be found in the left hand menu of your results page. This will run your search exactly as you specify it (Google: Verbatim for exact match search http://www.rba.co.uk/wordpress/2011/11/18/google-verbatim-for-exact-match-search/). However, while it works with Google commands such as ‘filetype:’ and ‘site:’ it gives up as soon as you start using some of the options in the left hand menu on the results page, such as date.

Initially I was in two minds about SPYW (Search plus Your World, or Search+). I thought I might find it useful if I wanted to check what people in my Google circles were saying about a particular issue but then realised that most of them prefer to post on Twitter rather than in Google+ and Google+ does not cover Twitter! The Search+ results include:

listings from the web

pages from the web that have been given priority because of your search behavior

pages from the web given priority because of your social connections

both public and private (or limited) Google+ posts, photos and Google Picasa photos

When it comes to serious research Search+ includes far too much irrelevant information. So how easy is it to turn it off? If you are logged in when you run your search you will see a message above your results that tells you the number of personal results and “other results” that have been found. There is also a toggle that enables you to switch between personalised and unpersonalised results. You can also switch it off permanently within your search settings.

You can of course just log out of your Google account before you run a search, or never sign up for Google+ in the first place. But Google is making the latter increasingly difficult. Let’s look at the results that might be popping up on your screen and as an example I’ll use a search on Phil Bradley, search and social media expert and President of CILIP. First of all a search on Phil Bradley before Search+ arrived:

On my screen I see pages from his web site, his blog and a Wikipedia entry (which is not the Phil Bradley I am looking for!). When I sign in to a Google account that has Google+ associated with it I see something completely different:

Phil’s Google+ profile is given priority above everything else and takes up most of the screen regardless of whether or not it is the most relevant or most up to date (Real-Life Examples Of How Google’s “Search Plus” Pushes Google+ Over Relevancy http://searchengineland.com/examples-google-search-plus-drive-facebook-twitter-crazy-107554). And don’t think you can escape with a Google account that does not include Google+. Google has ways of enticing you to “upgrade”:

Search+ has even tainted the suggestions that pop up as you type in your search:

Phil’s Google+ profile is given prominence and if you click on the link without having an account yourself you are invited to join:

To see what the suggestions should look like a group called Focus on the User (http://www.focusontheuser.org/) has produced a bookmarklet for Chrome, Firefox and Safari and extensions for Chrome and Firefox. This tries – and succeeds most of the time – to display your search results without the intrusion of Google+ results. For my search on Phil his Google+ profile is replaced with Twitter.

When I run a search on my own name my Google+ entry is supplanted by my LinkedIn profile.

“What Google should be” does not, though, remove the extra “content” that Search+ sometimes adds to the right of your results. Run a subject search and you may see “People and Pages on Google+” that are supposedly related to your search terms.

I have not yet found these entries to be more relevant than standard search results and the link “Learn how you could appear here too” indicates that Google sees this as another way of persuading people and organisations to join Google+. Switching it off is not easy. It is still there if you are logged out of your Google account. It is still there if you add &pws=0 to the search URL (in fact &pws=0 does not seem to work any more at all for depersonalising results). It does disappear, though, if you use Incognito in Chrome. The intrusion of Google+ is most obvious when running searches with just one or two terms or more consumer biased searches. As soon as you start building more complex searches involving filetype: or site: for example, or research more scientific subjects then Google+ takes a back seat.
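For anyone who wants to test the &pws=0 parameter for themselves, here is a small Python sketch that adds it to a search URL, or replaces the value if the parameter is already there. The function name is hypothetical and the example URL is only an assumed form of a Google results address; as noted above, the parameter itself no longer seems to have any effect.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def set_param(url, key, value):
    """Return the URL with the query parameter `key` set to `value`."""
    parts = urlparse(url)
    params = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k != key
    ]
    params.append((key, value))
    return urlunparse(parts._replace(query=urlencode(params)))


print(set_param("https://www.google.co.uk/search?q=biofuels+public+transport", "pws", "0"))
```

Comparing the results from the original and the modified URL side by side is a quick way of checking whether the parameter makes any difference.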

Search+ is not all that is affecting how Google presents results. Google is simplifying its privacy policies and combining user data from all of its services (Official Google Blog: Updating our privacy policies and terms of service http://googleblog.blogspot.com/2012/01/updating-our-privacy-policies-and-terms.html). It sounds innocent enough but I’ve already spotted major changes. Google knows I live in Reading because I have told it and I do find that useful when I am carrying out local searches for restaurants, builders etc. Google has now decided, though, to bombard my YouTube home page with videos about Reading.

The videos of the Reading railway station redevelopment are vaguely interesting but I see enough of that in real life on a daily basis when I pass through the centre of town. The football videos are of no interest to me whatsoever. So the crossover of content has already started and I am not looking forward to what Google decides to put in my web search results as a consequence of my YouTube activity!

It is becoming increasingly difficult to make Google behave. Using advanced search commands is one way but many searches do not require them. The best method I have found so far is to use Chrome as your browser and open an incognito window. This depersonalises your results, ignores your web history and existing cookies, and leaves no traces of your search activity. Alternatively, since Google has clearly lost the plot when it comes to search, try another service. The three that I would currently recommend are Bing (http://www.bing.com/), DuckDuckGo (http://duckduckgo.com/) and Blekko (http://blekko.com/).

A reminder that Yahoo Site Explorer is closing down tomorrow (November 21st) and I assume that the link and linkdomain commands will go with it, although they are not specifically mentioned (http://www.ysearchblog.com/2011/11/18/site-explorer-reminder/). Webmasters are being told to use Bing Webmaster Tools. This enables you to analyse links to your own domains but is no use if you want to find who links to other web sites as part of research. Bing, or Live.com as it then was, removed its link and linkdomain commands in November/December 2007 and Yahoo was left as the only reliable alternative. The link command enabled you to find who linked to a specific page on the web and linkdomain found links to anywhere on a specified web site. Both were useful ways of finding other sites containing similar content and discovering what others were saying about a page. Google’s link command is useless as it picks up a minuscule number of results, which now leaves Blekko (http://blekko.com/) as the only realistic alternative.

Blekko enables you to track down linked pages in two ways but both lead to the same results. The first is to use their slashtags ‘/links’ and ‘/domainlinks’ with a URL or domain name. For example http://www.rba.co.uk/sources/registers.htm /links will find pages that link to my official company registers page whereas http://www.rba.co.uk/ /domainlinks finds all inbound links to my site rba.co.uk.
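If you have a list of pages or sites to check, the two slashtag searches can be generated rather than typed. A minimal Python sketch, with a hypothetical function name; it only builds the text to paste into the Blekko search box.

```python
def blekko_link_query(target, whole_site=False):
    """Build a Blekko search for links to a page (/links) or a site (/domainlinks)."""
    slashtag = "/domainlinks" if whole_site else "/links"
    return target + " " + slashtag


print(blekko_link_query("http://www.rba.co.uk/sources/registers.htm"))
print(blekko_link_query("http://www.rba.co.uk/", whole_site=True))
```

These print the same two searches given in the examples above.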

The second route is via your search results. Below each entry is a downwards pointing arrow. Click on this and select ‘links’ from the pop-up box.

You will then see a list of sites that link to that page.

To view inbound links to the whole of the web site click on the seo option below the result and you will see some statistics together with the total number of inbound links.

Click on the inbound links number and Blekko presents you with a list of domains containing links to yours and how many.

To see exactly where the links are located and where they go to on your site just click on the number in the links column.

I have only looked in detail at a couple of sites but Blekko seems to do a good job and is certainly far superior to Google. The Blekko data on my own site seems to correspond with that available from Bing Webmaster Tools but of course I cannot compare other sites in the same way. My initial thoughts are that for link searching Blekko is definitely worth adding to your research toolkit.