The report was developed using an innovative approach to capture the day-to-day patterns of information use in seven research teams from a wide range of disciplines, from botany to clinical neuroscience. The study, undertaken over 11 months and involving 56 participants, found that there is a significant gap between how researchers behave and the policies and strategies of funders and service providers. This suggests that attempts to implement such strategies have had only a limited impact. Key findings from the report include:

Researchers use informal and trusted sources of advice from colleagues, rather than institutional service teams, to help identify information sources and resources

The use of social networking tools for scientific research purposes is far more limited than expected

Data and information sharing activities are mainly driven by needs and benefits perceived as most important by life scientists rather than top-down policies and strategies

There are marked differences in the patterns of information use and exchange between research groups active in different areas of the life sciences, reinforcing the need to avoid standardised policy approaches

“XGDBench: A Benchmarking Platform for Graph Stores in Exascale Clouds”

This research conducts a performance evaluation of four well-known graph data stores, AllegroGraph, Fuseki, Neo4j, and OrientDB, using XGDBench on the Tsubame 2.0 HPC cloud environment. XGDBench is an extension of the popular Yahoo! Cloud Serving Benchmark (YCSB).

OrientDB is the fastest graph database among the four products tested. In particular, OrientDB is about 10x faster (!) than Neo4j in all the tests.

Researchers are free to pick any software packages for comparison, but the selection here struck me as odd before reading a comment on the original post asking for ObjectivityDB to be added to the comparison.

For that matter, where are GraphChi, Infinite Graph, Dex, Titan, FlockDB? Just to name a few of the other potential candidates.

It will be interesting when a non-winner on such a benchmark cites it for the proposition that ease of use, reliability and lower TCO outweigh brute speed in a benchmark test.

The Harvard Dataverse Network is open to all scientific data from all disciplines worldwide. It includes the world’s largest collection of social science research data. If you would like to upload your research data, first create a dataverse and then create a study. If you already have a dataverse, log in to add new studies.

MindMup is a zero-friction mind map canvas. Our aim is to create the most productive mind mapping environment out there, removing all the distractions and providing powerful editing shortcuts.

This git project is the JavaScript visualisation portion of MindMup. It provides a canvas for users to create and edit mind maps in a browser. You can see an example of this live on http://www.mindmup.com.

This project is relatively standalone and you can use it to create a nice mind map visualisation separate from the MindMup Server.

After 24 hours of staring at their screens, the teams that participated in our Disrupt NY 2013 Hackathon have now finished their projects and are currently presenting them onstage. With more than 160 hacks, there are far too many cool ones to write about, but one that stood out to me was NewsRel, an iPad-based news app that uses machine-learning techniques to understand how news stories relate to one another. The app uses Google Maps as its main interface and automatically decides which location is most appropriate for any given story.

The app currently uses Reuters’ RSS feed and analyzes the stories, looking for clusters of related stories and then puts them on the map. Say you are looking at a story about the Boston Marathon bombings. The app, of course, will show you a number of news stories about it clustered around Boston, then maybe something about the president’s comments about it from Washington and another article that relates it to the massacre during the Munich Olympics in 1972.

In addition to this, the team built an algorithm that picks the most important sentences from each story to summarize it for you.

No pointers to software, just the news blurb.

But it does raise an interesting possibility.

What if news video streams were tagged with geolocation and type information?

So I could exclude “train hits parade float” stories from several states away, automobile accidents and crime stories, and replace them with substantive commentary from the BBC or Al Jazeera.

Now that would be a video feed worth paying for. Particularly if for a premium it was commercial free.

Freedom from Wolf Blitzer’s whines in disaster areas should come as a free pre-set.

Just a small amount of additional semantics could lead to entirely new markets and delivery systems.
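
Here is a toy sketch, in Java, of the kind of filtering such tags would enable. The Story schema and the tag values are hypothetical, my invention for illustration:

```java
import java.util.List;

// Toy sketch: once stories carry geolocation and type tags,
// excluding far-away filler becomes a couple of stream filters.
// The schema and tag vocabulary here are hypothetical.
record Story(String headline, String state, String type) {}

public class FeedFilter {
    public static void main(String[] args) {
        List<Story> feed = List.of(
            new Story("Train hits parade float", "KS", "accident"),
            new Story("Sanctions and trade: an analysis", "DC", "commentary"));

        feed.stream()
            .filter(s -> s.type().equals("commentary"))  // keep substantive coverage
            .filter(s -> !s.state().equals("KS"))        // drop far-away local stories
            .forEach(s -> System.out.println(s.headline()));
    }
}
```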


Fast response times generate cost savings and greater revenue. Enterprise data architectures are incomplete unless they can ingest, analyze, and react to data in real time as it is generated. While previously inaccessible or too complex, scalable, affordable real-time solutions are now finally available to any enterprise.

Read Infochimps’ newest whitepaper on how Infochimps Cloud::Streams is a proprietary stream processing framework based on four years of experience with sourcing and analyzing both bulk and in-motion data sources. It offers a linearly scalable and fault-tolerant stream processing engine that leverages a number of well-proven web-scale solutions built by Twitter and LinkedIn engineers, with an emphasis on enterprise-class scalability, robustness, and ease of use.

The price of this whitepaper is disclosure of your contact information.

Annoying, considering the lack of substantive content about the solution. The use cases are mildly interesting but admit of any number of similar solutions.

If you need real-time data aggregation, skip the white paper and contact your IT consultant/vendor. (Including Infochimps, who do very good work, which is why a non-substantive white paper is so annoying.)

On a recent engagement, we were posed with the problem of sorting through 6.5 million foreign patent documents and indexing them into Solr. This totaled about 1 TB of XML text data alone. The full corpus included an additional 5 TB of images to incorporate into the index; this blog post will only cover the text metadata.

Streaming large volumes of data into Solr is nothing new, but this dataset posed a unique challenge: Each patent document’s translation resided in a separate file, and the location of each translation file was unknown at runtime. This meant that for every document processed we wouldn’t know where its match would be. Furthermore, the translations would arrive in batches, to be added as they come. And lastly, the project needed to be open to different languages and different file formats in the future.

Our options for dealing with inconsistent data came down to: cleaning all data and organizing it before processing, or building an ingester robust enough to handle different situations.

We opted for the latter and built an ingester that would process each file individually and index the documents with an atomic update (new in Solr 4). To detect and extract the text metadata we chose Apache Tika. Tika is a document-detection and content-extraction tool useful for parsing information from many different formats.
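
The post shows no code, but the atomic update mechanism deserves a quick illustration before turning to Tika. A minimal SolrJ sketch against Solr 4, assuming hypothetical field names and a core configured with a uniqueKey and <updateLog/>:

```java
import java.util.Collections;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class TranslationUpdater {
    public static void main(String[] args) throws Exception {
        // Atomic updates require a uniqueKey field and <updateLog/> in solrconfig.xml.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/patents");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "EP0000001");  // hypothetical document key
        // "set" replaces just this field on the stored document, so a translation
        // can be attached long after the original document was indexed.
        doc.addField("translation_en",
                Collections.singletonMap("set", "Translated abstract text..."));

        solr.add(doc);
        solr.commit();
    }
}
```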

On the surface Tika offers a simple interface to retrieve data from many sources. Our use case, however, required a deeper extraction of specific data. Using the built-in SAX parser allowed us to push Tika beyond its normal limits, and analyze XML content according to the type of information it contained.
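
Their actual Tika integration isn't shown, but the element-aware extraction described comes down to a SAX ContentHandler. A standalone sketch using the JDK's own SAX parser (the abstract element name is hypothetical; real patent DTDs vary by authority):

```java
import java.io.File;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Collects the text content of one element type from a patent XML file.
public class PatentAbstractHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private boolean inAbstract = false;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("abstract".equalsIgnoreCase(qName)) inAbstract = true;  // hypothetical element
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("abstract".equalsIgnoreCase(qName)) inAbstract = false;
    }

    @Override
    public void characters(char[] ch, int start, int len) {
        if (inAbstract) text.append(ch, start, len);
    }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        PatentAbstractHandler handler = new PatentAbstractHandler();
        parser.parse(new File(args[0]), handler);
        System.out.println(handler.text.toString().trim());
    }
}
```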

No magic bullet but an interesting use case (patents in multiple languages).


Recently, I was tasked with evaluating LingPipe for use in our NLP processing pipeline. I have looked at LingPipe before, but have generally kept away from it because of its licensing – while it is quite friendly to individual developers such as myself (as long as I share the results of my work, I can use LingPipe without any royalties), a lot of the stuff I do is motivated by problems at work, and LingPipe-based solutions are only practical when the company is open to the licensing costs involved.

So anyway, in an attempt to kill two birds with one stone, I decided to work with the LingPipe tutorial, but with Scala. I figured that would allow me to pick up the LingPipe API as well as give me some additional experience in Scala coding. I looked around to see if anybody had done something similar and I came upon the scalingpipe project on GitHub, where Alexy Khrabov had started by porting the Interesting Phrases tutorial example.

Now there’s a clever idea!

It delivers a deeper understanding of the LingPipe API and Scala coding experience at the same time.

The focus of the Atlas of Design is on the aesthetics and design involved in mapmaking. Tim Wallace and Daniel Huffman, the editors of the Atlas of Design, explain the book’s focus in its introduction:

Aesthetics separate workable maps from elegant ones.

This book is about the latter category.

My personal suspicion is that aesthetics separate legible topic maps from those that attract repeat users.

The only way to teach aesthetics (which varies by culture and social group) is by experience.

Now we have our webapp that can read JSON from the outside world and store it inside MongoDB. But during my daily job, what I usually need to do is talk to some REST service and get, manipulate and store some arbitrary JSON. Fortunately for us, Haskell and its rich, high-quality library ecosystem make the process a breeze.

Alfredo continues his series on building a basic web app in Haskell.

Promises a small DSL for describing recipes in the next episode.

Which reminds me to ask: is anyone using a DSL to enable users to compose domain-specific topic maps?

That is, we say topic, scope, association, occurrence, etc., only because that is our vocabulary for topic maps.

No particular reason why everyone has to use those names in composing a topic map.

For a recipe topic map the user might see: recipe (topic), ingredient (topics), ordered instructions (occurrences), measurements, with associations being implied between the recipe and ingredients and between ingredients and measurements, along with role types, etc.

To a topic map processor, all of those terms are treated as topic map information items but renamed for presentation to end users.

If you select an ingredient, say fresh tomatoes in the salads category, it displays other recipes that also use fresh tomatoes.
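
As a toy sketch of the renaming idea, here is a presentation vocabulary layered over topic map construct names (all of the names are hypothetical):

```java
import java.util.Map;

// Sketch: end users compose in domain terms, while the topic map
// engine keeps its own construct names. All names are hypothetical.
public class RecipeVocabulary {
    static final Map<String, String> TO_TOPIC_MAP = Map.of(
        "recipe", "topic",
        "ingredient", "topic",
        "instructions", "occurrence",
        "measurement", "occurrence",
        "uses", "association");  // the implied recipe-ingredient association

    public static void main(String[] args) {
        TO_TOPIC_MAP.forEach((domain, construct) ->
            System.out.printf("%s -> %s%n", domain, construct));
    }
}
```

The engine works with the right-hand side; end users only ever see the left.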

I’ll apologise for the title right away: this post isn’t about a Frankenstein-like attempt at creating a living being in Excel, I’m afraid. Instead, it’s about my attempt to implement John Conway’s famous game ‘Life’ using Data Explorer, how it didn’t fully succeed and some of the interesting things I learned along the way…

When I’m learning a new technology I like to set myself mini-projects that are more fun than practically useful, and for some reason a few weeks ago I remembered ‘Life’ (which I’m sure almost anyone who has learned programming has had to write a version of at some stage), so I began to wonder if I could write a version of it in Data Explorer. This wasn’t because I thought Data Explorer was an appropriate tool to do this – there are certainly better ways to implement Life in Excel – but I thought doing this would help me in my attempts to learn Data Explorer’s formula language and might also result in an interesting blog post.
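
For reference, the rules being implemented are easy to state in a conventional language. One generation of Conway's Life as a minimal Java sketch (the post itself, of course, works in Data Explorer's formula language):

```java
// One generation of Conway's Life on a bounded grid: a live cell survives
// with 2 or 3 live neighbours; a dead cell becomes live with exactly 3.
public class Life {
    static boolean[][] step(boolean[][] g) {
        int rows = g.length, cols = g[0].length;
        boolean[][] next = new boolean[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                int n = 0;  // count live neighbours of (r, c)
                for (int dr = -1; dr <= 1; dr++) {
                    for (int dc = -1; dc <= 1; dc++) {
                        if (dr == 0 && dc == 0) continue;
                        int rr = r + dr, cc = c + dc;
                        if (rr >= 0 && rr < rows && cc >= 0 && cc < cols && g[rr][cc]) n++;
                    }
                }
                next[r][c] = g[r][c] ? (n == 2 || n == 3) : (n == 3);
            }
        }
        return next;
    }
}
```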

We’re very happy to announce the 2.3 release of Hue, the open source Web UI that makes Apache Hadoop easier to use.

Hue 2.3 comes only two months after 2.2 but contains more than 100 improvements and fixes. In particular, two new apps were added (including an Apache Pig editor) and the query editors are now easier to use.

…

Here’s the new features list:

Pig Editor: new application for editing and running Apache Pig scripts with UDFs and parameters

After nearly 2 years of hard work, the Qi4j Community today launched its second generation Composite Oriented Programming framework.

Qi4j is Composite Oriented Programming for the Java platform. It is a top-down approach to writing business applications in a maintainable and efficient manner. Qi4j lets you focus on the business domain, removing most impedance mismatches in software development, such as object-relational mapping, overlapping concerns and testability.

Qi4j’s main areas of excellence are its enforcement of application layering and modularization, the typed and generic AOP approach, affinity-based dependency injection, persistence management, and indexing and query subsystems, but there is much more.

The 2.0 release is practically a re-write of the entire runtime, according to co-founder Niclas Hedhman: “Although we are breaking compatibility in many select areas, most 1.4 applications can be converted with relatively few changes.” He continues: “These changes are necessary for the next set of planned features, including full Scala integration, the upcoming JDK8 and Event Sourcing integrated into the persistence model.”

“It has been a bumpy ride to get this release out the door,” said Paul Merlin, the 2.0 Release Manager, “but we are determined that Qi4j represents the best technological platform for Java to create applications with high business value.” Not only has the community re-crafted a remarkable codebase, but it has also created a brand new website, fully integrated with the new Gradle build process.

Within complex scientific domains such as pharmacology, operational equivalence between two concepts is often context-, user- and task-specific. Existing Linked Data integration procedures and equivalence services do not take the context and task of the user into account. We present a vision for enabling users to control the notion of operational equivalence by applying scientific lenses over Linked Data. The scientific lenses vary the links that are activated between the datasets, which affects the data returned to the user.

Two additional quotes from this paper should convince you of the importance of this work:

We aim to support users in controlling and varying their view of the data by applying a scientific lens which governs the notions of equivalence applied to the data. Users will be able to change their lens based on the task and role they are performing rather than having one fixed lens. To support this requirement, we propose an approach that applies context-dependent sets of equality links. These links are stored in a stand-off fashion so that they are not intermingled with the datasets. This allows for multiple, context-dependent, linksets that can evolve without impact on the underlying datasets and support differing opinions on the relationships between data instances. This flexibility is in contrast to both Linked Data and traditional data integration approaches. We look at the role personae can play in guiding the nature of relationships between the data resources and the desired effects of applying scientific lenses over Linked Data.

and,

Within scientific datasets it is common to find links to the “equivalent” record in another dataset. However, there is no declaration of the form of the relationship. There is a great deal of variation in the notion of equivalence implied by the links both within a dataset’s usage and particularly across datasets, which degrades the quality of the data. The scientific user personae have very different needs about the notion of equivalence that should be applied between datasets. The users need a simple mechanism by which they can change the operational equivalence applied between datasets. We propose the use of scientific lenses.

BTW, the notion of equivalence being represented by “links” reminds me of a comment Peter Neubauer (Neo4j) once made to me, saying that equivalence could be modeled as edges. Imagine typing equivalence edges. Will have to think about that some more.
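
A toy sketch of what typed equivalence edges, selected by a lens, might look like (the identifiers and type names below are hypothetical):

```java
import java.util.List;
import java.util.Set;

// Sketch: equivalence as typed edges, with a "lens" as the set of
// edge types a user's task activates. All names are hypothetical.
record EquivalenceEdge(String from, String to, String type) {}

public class LensDemo {
    public static void main(String[] args) {
        List<EquivalenceEdge> links = List.of(
            new EquivalenceEdge("chembl:25", "drugbank:DB00945", "exact-structure"),
            new EquivalenceEdge("chembl:25", "pubchem:2244", "same-parent-compound"));

        // Changing the lens changes which links are active, without
        // touching the underlying datasets.
        Set<String> structureLens = Set.of("exact-structure");

        links.stream()
             .filter(e -> structureLens.contains(e.type()))
             .forEach(e -> System.out.println(e.from() + " <-> " + e.to()));
    }
}
```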

The fourth Open PHACTS Community Workshop was held at Burlington House in London on April 22 and 23, 2013. The Workshop focussed on “Using the Power of Open PHACTS” and featured the public release of the Open PHACTS application programming interface (API) and the first Open PHACTS example app, ChemBioNavigator.

The first day featured talks describing the data accessible via the Open PHACTS Discovery Platform and technical aspects of the API. The use of the API by example applications ChemBioNavigator and PharmaTrek was outlined, and the results of the Accelrys Pipeline Pilot Hackathon discussed.

The second day involved discussion of Open PHACTS sustainability and plans for the successor organisation, the Open PHACTS Foundation. The afternoon was attended by those keen to further discuss the potential of the Open PHACTS API and the future of Open PHACTS.

During talks, especially those detailing the Open PHACTS API, a good number of signup requests to the API via dev.openphacts.org were received. The hashtag #opslaunch was used to follow reactions to the workshop on Twitter (see storify), and showed the response amongst attendees to be overwhelmingly positive.

This summary is followed by slides from the two days of presentations.

Not like being there but still quite useful.

As a matter of fact, I found a lead on “operational equivalence” with this data set. More to follow in a separate post.

Data scientists, that peculiar mix of software engineer and statistician, are notoriously difficult to interview. One approach that I’ve used over the years is to pose a problem that requires some mixture of algorithm design and probability theory in order to come up with an answer. Here’s an example of this type of question that has been popular in Silicon Valley for a number of years:

Say you have a stream of items of large and unknown length that we can only iterate over once. Create an algorithm that randomly chooses an item from this stream such that each item is equally likely to be selected.

The first thing to do when you find yourself confronted with such a question is to stay calm. The data scientist who is interviewing you isn’t trying to trick you by asking you to do something that is impossible. In fact, this data scientist is desperate to hire you. She is buried under a pile of analysis requests, her ETL pipeline is broken, and her machine learning model is failing to converge. Her only hope is to hire smart people such as yourself to come in and help. She wants you to succeed.

Remember: Stay Calm.

The second thing to do is to think deeply about the question. Assume that you are talking to a good person who has read Daniel Tunkelang’s excellent advice about interviewing data scientists. This means that this interview question probably originated in a real problem that this data scientist has encountered in her work. Therefore, a simple answer like, “I would put all of the items in a list and then select one at random once the stream ended,” would be a bad thing for you to say, because it would mean that you didn’t think deeply about what would happen if there were more items in the stream than would fit in memory (or even on disk!) on a single computer.

The third thing to do is to create a simple example problem that allows you to work through what should happen for several concrete instances of the problem. The vast majority of humans do a much better job of solving problems when they work with concrete examples instead of abstractions, so making the problem concrete can go a long way toward helping you find a solution.

In addition to great interview advice, Josh also provides a useful overview of reservoir sampling.
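
For the record, the classic single-item version fits in a few lines. A minimal Java sketch:

```java
import java.util.Iterator;
import java.util.List;
import java.util.Random;

public final class ReservoirSampler {
    // One pass, O(1) memory: keep the i-th item with probability 1/i.
    // A short induction shows each of the n items ends up selected
    // with probability 1/n, without knowing n in advance.
    public static <T> T sampleOne(Iterator<T> stream, Random rng) {
        T chosen = null;
        long seen = 0;
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (rng.nextDouble() < 1.0 / seen) chosen = item;
        }
        return chosen;  // null only if the stream was empty
    }

    public static void main(String[] args) {
        List<String> items = List.of("a", "b", "c", "d", "e");
        System.out.println(sampleOne(items.iterator(), new Random()));
    }
}
```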

Whether reservoir sampling will be useful to you depends on your test for subject identity.

I tend to think of subject identity as being very precise but that isn’t necessarily the case.

Or should I say that precision of subject identity is a matter of requirements?

For some purposes, it may be sufficient to know the gender of attendees, as a subject, within some margin of statistical error. With enough effort we could know that more precisely but the cost may be prohibitive.

Think of any test for subject identity as being located on a continuum of subject identification, where the notion of “precision” itself is up for definition.

Russia’s warning about one of the Boston Marathon bombers, a warning that used his name as he spelled it, not as captured by the US intelligence community, was a case of a mistaken level of precision.

Most likely the result of an analyst schooled in an English-only curriculum.

While writing about Drake, I was struck by the attractiveness of the project logo:

So I decided to look at some other project logos, just to get some ideas about what other projects were doing:

But the most famous project at Apache has the simplest logo of all:

To be truthful, when someone says web server, I automatically think of the Apache server. Others exist and new ones are invented, but Apache server is nearly synonymous with web server.

Perhaps the lesson is the logo did not make it so.

Has anyone written a history of the Apache web server?

A cross between a social history and a technical one, illustrating how the project responded to user demands and requirements. That could make a very nice blueprint for other projects to follow.

Paul Butler, a self-described Data Hacker, recently published an article called “Make for Data Scientists”, which explored the challenges of managing data processing work. Paul went on to explain why GNU Make could be a viable tool for easing this pain. He also pointed out some limitations with Make, for example the assumption that all data is local.

We were gladdened to read Paul’s article, because we’d been hard at work building an internal tool to help manage our data workflows. A defining goal was to end up with a kind of “Make for data”, but targeted squarely at the problems of managing data workflow.

A really nice introduction to Drake, with a simple example and pointers to more complete resources.

Not hard to see how Drake could fit into a topic map authoring workflow.

LevelGraph is a Graph Database. Unlike many other graph databases, LevelGraph is built on the uber-fast key-value store LevelDB through the powerful LevelUp library. You can use it inside your node.js application.

A couple of months ago we reported that the White House was planning an executive cyber security order, and official sources indicated that U.S. President Barack Obama also planned to re-introduce the Cyber Intelligence Sharing and Protection Act (CISPA). Today that plan came to fruition as the US House of Representatives passed the controversial bill. This is the second time CISPA has passed the House; the first version was rejected by the Senate on the grounds that it did not do enough to protect privacy. Yet again a substantial majority of politicians in the House backed the bill.

There is still a strong chance of rejection: according to some relevant sources, CISPA could fail again in the Senate after threats from President Obama to veto it over privacy concerns. The White House wants amendments so that more is done to ensure the minimum amount of data is handed over in investigations. The law is passing through the US legislative system as American federal agencies warn that malicious hackers, motivated by money or acting on behalf of foreign governments such as China, are among the biggest threats facing the nation. “If you want to take a shot across China’s bow, this is the answer,” said Mike Rogers, the Republican politician who co-wrote CISPA and chairs the House Intelligence Committee.

Don’t be distracted by the privacy/civil liberties/cybersecurity dance in Washington, D.C.

Why would you trust a government with a kill list to balk at listening to your phone or reading your email traffic?

A government that does those things and lies to the public about them, is unworthy of trust.

Guard your privacy as best you can.

No one else is going to do it for you.

PS: Topic maps may be able to help you watch the watchers. See how they like a good dose of sunshine.

Search is a conversation: a dialogue between user and system that can be every bit as rich as human conversation. Like human dialogue, it is bidirectional: on one side is the user with their information need, which they articulate as some form of query.

On the other is the system and its response, which it expresses as a set of search results. Together, these two elements lie at the heart of the search experience, defining and shaping much of the information seeking dialogue. In this piece, we examine the most universal of elements within that response: the search result.

Basic Principles

Search results play a vital role in the search experience, communicating the richness and diversity of the overall result set, while at the same time conveying the detail of each individual item. This dual purpose creates the primary tension in the design: results that are too detailed risk wasting valuable screen space while those that are too succinct risk omitting vital information.

Suppose you’re looking for a new job, and you browse to the 40 or so open positions listed on UsabilityNews. The results are displayed in concise groups of ten, occupying minimal screen space. But can you tell which ones might be worth pursuing?

As always, a great post by Tony, but a little over the top with:

“…a dialogue between user and system that can be every bit as rich as human conversation.”

Not in my experience but that’s not everyone’s experience.

Has anyone tested the thesis that dialogue between a user and search engine is as rich as between user and reference librarian?

Exploring bioactive chemistry requires navigating between structures and data from a variety of text-based sources. While PubChem currently includes approximately 16 million document-extracted structures (15 million from patents), the extent of public inter-document and document-to-database links is still well below any estimated total, especially for journal articles. A major expansion in access to text-entombed chemistry is enabled by chemicalize.org. This on-line resource can process IUPAC names, SMILES, InChI strings, CAS numbers and drug names from pasted text, PDFs or URLs to generate structures, calculate properties and launch searches. Here, we explore its utility for answering questions related to chemical structures in documents and where these overlap with database records. These aspects are illustrated using a common theme of Dipeptidyl Peptidase 4 (DPPIV) inhibitors.

Results

Full-text open URL sources facilitated the download of over 1400 structures from a DPPIV patent and the alignment of specific examples with IC50 data. Uploading the SMILES to PubChem revealed extensive linking to patents and papers, including prior submissions from chemicalize.org as submitting source. A DPPIV medicinal chemistry paper was completely extracted and structures were aligned to the activity results table, as well as linked to other documents via PubChem. In both cases, key structures with data were partitioned from common chemistry by dividing them into individual new PDFs for conversion. Over 500 structures were also extracted from a batch of PubMed abstracts related to DPPIV inhibition. The drug structures could be stepped through each text occurrence and included some converted MeSH-only IUPAC names not linked in PubChem. Performing set intersections proved effective for detecting compounds-in-common between documents and/or merged extractions.

Conclusion

This work demonstrates the utility of chemicalize.org for the exploration of chemical structure connectivity between documents and databases, including structure searches in PubChem, InChIKey searches in Google and the chemicalize.org archive. It has the flexibility to extract text from any internal, external or Web source. It synergizes with other open tools and the application is undergoing continued development. It should thus facilitate progress in medicinal chemistry, chemical biology and other bioactive chemistry domains.

A great example of building a resource to address identity issues in a specific domain.

The result speaks for itself.

PS: The results were not delayed awaiting a reformation of chemistry to use a common identifier.

For our flagship product, Searchbox.com, we strive to bring the most cutting-edge technologies to our users. As we’ve mentioned in earlier blog posts, we rely heavily on Solr and Lucene to provide the framework for these functionalities. The nice thing about the Solr framework is that it allows for easy development of plugins which can greatly extend the capabilities of the software. We’ll be creating a set of slideshares which describe how to implement 3 types of plugins so that you can get ahead of the learning curve and start extending your own custom Solr installation now.

There are mainly 4 types of custom plugins which can be created. We’ll discuss their differences here:

Sometimes Andrew says three (3) types of plugins and sometimes he says four (4).
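
Either way, here is a minimal sketch of one common plugin type, a custom SearchComponent, to give a feel for what is involved (the class and registration names are mine, not from Andrew's post):

```java
import java.io.IOException;

import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

// Registered in solrconfig.xml with
//   <searchComponent name="noteComponent" class="com.example.NoteComponent"/>
// and listed in a request handler's <arr name="last-components">.
public class NoteComponent extends SearchComponent {

    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
        // Runs before the query executes; inspect or rewrite the request here.
    }

    @Override
    public void process(ResponseBuilder rb) throws IOException {
        // Runs after the query executes; attach extra data to the response.
        rb.rsp.add("note", "hello from NoteComponent");
    }

    @Override
    public String getDescription() {
        return "Minimal example search component";
    }

    // Abstract in Solr 4.x, so it must be implemented there; harmless later.
    public String getSource() {
        return null;
    }
}
```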

At first glance, the Apache HBase architecture appears to follow a master/slave model where the master receives all the requests but the real work is done by the slaves. This is not actually the case, and in this article I will describe what tasks are in fact handled by the master and the slaves.

You can use a tool or master a tool.

Recommend the latter.
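
A sketch of why that distinction matters: in the classic HBase client API, reads and writes never touch the master, which handles region assignment and schema operations instead (the table and column names below are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ClientPathDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // The client locates the region server for a row via ZooKeeper and
        // the META table; the master is not on this path at all.
        HTable table = new HTable(conf, "mytable");  // hypothetical table

        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
        table.put(put);  // goes straight to the region server

        Result result = table.get(new Get(Bytes.toBytes("row1")));
        System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))));

        table.close();
    }
}
```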
