Organizations are struggling with a fundamental challenge – there’s far more data than they can handle. Sure, there’s a shared vision to analyze structured and unstructured data in support of better decision making but is this a reality for most companies? The big data tidal wave is transforming the database [...]

Tony Agresta's insight:

This recently published article discusses graph databases and why they matter. It lists examples of how graph databases are part of the search engines we all use, as well as some interesting features that integrate the results of text analysis into the graph database.

Learning analytics - especially adaptive learning using semantic analysis - is starting to gain momentum with some of the academic, scientific and educational publishers. Some of the core concepts are covered here.

In the not too distant past, analysts were all searching for a “360 degree view” of their data. Most of the time this phrase referred to integrated RDBMS data, analytics interfaces and customers. But with the onslaught…

Tony Agresta's insight:

Semantic pipelines allow for the identification, extraction, classification and storage of semantic knowledge creating a knowledge base of all your data. Most organizations have struggled to create these pipelines primarily because the plumbing hasn't existed. But now it does.

This post discusses how free flowing text streams into graph databases using concept extraction processes. A well-coordinated feed of data is written to the underlying graph database while updates are tracked on a continuous basis to ensure database integrity.

Other important pipeline plumbing includes tools for disambiguation (used to resolve the definition of entities inside the text), classification of the entities, structuring relationships between entities and determining sentiment.

Organizations that deploy well-functioning semantic pipelines have an added advantage over their competitors. They have instant access to a complete knowledge base of their data. Research functions spend less time searching and more time analyzing. Alerting notifies critical business functions to take immediate action. Service levels are improved using accurate, well-structured responses. Sentiment is detected, allowing more time to react to changing market conditions.

In general, the REST Client API calls out to a GATE-based annotation pipeline and sends back enriched data in RDF form. Organizations typically customize these pipelines, which can consist of any GATE-developed set of text mining algorithms for scoring, machine learning, disambiguation or any other text mining technique.
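As a rough sketch of what calling such a pipeline looks like from the client side, here is a minimal Python client. The endpoint URL, payload fields and response shape are all hypothetical - the actual REST Client API will differ per deployment - but the flow (post text, get enriched annotations back) is the same.

```python
import json
import urllib.request

# Hypothetical endpoint -- the real service URL, request payload and
# response fields will differ per deployment.
ANNOTATION_ENDPOINT = "https://example.com/annotation/pipeline"

def build_annotation_request(text, mime_type="text/plain"):
    """Package raw text as a POST request for an annotation service."""
    payload = json.dumps({"document": text, "mimeType": mime_type}).encode("utf-8")
    return urllib.request.Request(
        ANNOTATION_ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json",
                 "Accept": "application/rdf+json"},
    )

def extract_entities(response_json):
    """Pull (label, type) pairs out of a simplified annotation response."""
    return [(e["label"], e["type"]) for e in response_json.get("entities", [])]

# Example with a canned response instead of a live network call:
canned = {"entities": [{"label": "Ontotext", "type": "Organization"}]}
print(extract_entities(canned))  # [('Ontotext', 'Organization')]
```

In practice the enriched response would come back as RDF (RDF/XML, Turtle or JSON-LD) rather than this simplified JSON, ready to load into the graph database.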

It is important to note that these text mining pipelines create RDF in a linear fashion and feed GraphDB™. Once the RDF is enriched in this fashion and stored in the database, these annotations can then be modified, edited or removed. This is particularly useful when integrating with Linked Open Data (LOD) sources. Updates to the database are populated automatically when the source information changes.

For example, let’s say your text mining pipeline is referencing Freebase as its Linked Open Data source for organization names. If an organization name changes or a new subsidiary is announced in Freebase, this information will be updated as reference-able metadata in GraphDB™.

In addition, this tightly-coupled integration includes a suite of enterprise-grade APIs, the core of which is the Concept Extraction API. This API consists of a Coordinator and Entity Update Feed. Here’s what they do:

The Concept Extraction API Coordinator module accepts annotation requests and dispatches them to a group of Concept Extraction Workers. The Coordinator communicates with GraphDB™ in order to track changes leading to updates in each worker's entity extractor. The API Coordinator acts as a traffic cop, allowing approved, unique entities to be inserted into GraphDB™ while preventing duplicates from taking up valuable real estate.

The Entity Update Feed (EUF) plugin is responsible for tracking and reporting on updates about every entity (concept) within the database that has been modified in any way (added, removed, or edited). This information is stored in the graph database and query-able via SPARQL. Reports can be run notifying a user of any and all changes.
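Conceptually, the EUF behaves like an append-only log of entity changes that can then be queried. The sketch below models that idea in plain Python; the real plugin stores these events as RDF statements queried via SPARQL, and the field names here are illustrative only.

```python
from datetime import datetime, timezone

# A toy entity-update feed: each change to an entity is recorded as an
# event, analogous to the statements the EUF plugin exposes via SPARQL.
class EntityUpdateFeed:
    def __init__(self):
        self.events = []

    def record(self, entity_uri, action):
        """Track one change to an entity: added, edited, or removed."""
        assert action in ("added", "edited", "removed")
        self.events.append({
            "entity": entity_uri,
            "action": action,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def changes_for(self, entity_uri):
        """Report every tracked change for one entity, oldest first."""
        return [e for e in self.events if e["entity"] == entity_uri]

feed = EntityUpdateFeed()
feed.record("http://example.com/entity/AcmeCorp", "added")
feed.record("http://example.com/entity/AcmeCorp", "edited")
print([e["action"] for e in feed.changes_for("http://example.com/entity/AcmeCorp")])
# ['added', 'edited']
```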

As mentioned, the value of this tightly-coupled integration is in the rich metadata and relationships which can now be derived from the underlying RDF database. It's this metadata that powers high performance search and discovery or website applications – results are complete, accurate and instantaneous.

A group of computer scientists are developing a smarter academic search engine in support of geoscientists. This will allow them to find the exact data sets and publications they want in the blink of an eye, instead of spending hours and days scrolling through pages of irrelevant results on Google Scholar.

GeoLink is the name of the recently launched project, and it's part of EarthCube, an initiative funded by the National Science Foundation (NSF) to upgrade the cyberinfrastructure for the geosciences - this translates into semantic data integration of earth sciences data.

Computer programs will also be created to extract information from conference abstracts, NSF awards, and geoscience data repositories and then digitally connect these resources in ways that make them more accessible to scientists.

In the not too distant future, this will culminate in a one-stop search hub for the geosciences.

This is a perfect use case for a graph database that can store all of the metadata, integrate a variety of data sources, maintain connections between the semantic statements in the graph and link back to the originating documents and publications. Everything becomes discoverable.

Search has changed dramatically over the past year and semantic technology has been at the center of it all. Consumers increasingly expect search engines to understand natural language and perceive the intent behind the words they type in, and search engine algorithms are rising to this challenge. This evolution in search has dramatic implications for marketers, consumers, technology developers and content creators — and it’s still the early days for this rapidly changing environment. Here is an overview of how search technology is changing, how these changes may affect you and what you can do to market your business more effectively in the new era of search.

Here's a graph that shows how human diseases are connected. In this graph, nodes are diseases. The links show how they are connected to one another. The larger the node, the more diseases it is connected to.

You can also see the genes associated with the diseases which indicate the common genetic origin of the diseases.

This graph allows you to filter diseases by disease categories. To search the graph for a specific category, use the "hide all" filter and then select the disease category that's of interest. The diseases are highlighted on the graph. Then you can zoom in to see related diseases and gene associations.

Everything in our digital universe is connected. Every single day, you wake up and start a series of interactions with people, products and machines. Sometimes, these things influence you, and sometimes you play the role of the influencer. This is how our world is connected, in a network of relationships [...]

Tony Agresta's insight:

This recent article by Scott Gnau of Teradata does a great job at discussing Graph Analytics. For example, Scott writes

"The ability to track relationships between people, products, processes and other 'entities' remains crucial to breaking up sophisticated fraud rings."

He also talks about the fact that graph analytics:

"Allow companies to detect, in near real-time, the cyber-threats hidden in the flood of diverse data generated from IP, network, server and communication logs – a huge problem, as we know, that exists today."

But what's powering the analysis? Where is the data stored that drives the visual display of the graph? Today, graph databases are becoming more and more popular. One type of graph database is the native RDF triplestore.

Triplestores store semantic facts in the form of subject - predicate - object using the Resource Description Framework. These facts might be created using natural language processing pipelines or imported from Linked Open Data. In either case, RDF is a standard model for data publishing and data interchange on the Web. RDF has features that facilitate data merging even if the underlying schemas differ. RDFS and OWL are its schema languages; SPARQL is the query language, similar to SQL.

RDF specifically supports the evolution of schemas over time without requiring all of the data transformations or reloading of data. A central concept is the Uniform Resource Identifier (URI). These are globally unique identifiers, the most popular variant of which are the widely used URLs. All data elements (objects, entities, concepts, relationships, attributes) are identified with URIs, allowing data from different sources to merge without collisions. All data is represented in triples (also referred to as “statements”), which are simple enough to allow for easy, correct transformation of data from any other representation without the loss of data. At the same time, triples form interconnected data networks (graphs), which are sufficiently expressive and efficient to represent even the most complex data structures.
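A tiny illustration of why URI-based triples merge so cleanly: represent each statement as a tuple of URI strings, and combining two sources is just a set union - shared facts collapse into one, and nothing collides. (The resource URIs below are made up for the example; foaf:knows is a real vocabulary term.)

```python
# Triples as (subject, predicate, object) tuples of URI strings.
# Because every element is a globally unique URI, two independently
# produced datasets can be merged with a plain set union.
source_a = {
    ("http://example.com/David", "http://xmlns.com/foaf/0.1/knows",
     "http://example.com/Wendell"),
}
source_b = {
    ("http://example.com/David", "http://xmlns.com/foaf/0.1/knows",
     "http://example.com/Wendell"),  # duplicate of source_a's fact
    ("http://example.com/Wendell", "http://xmlns.com/foaf/0.1/knows",
     "http://example.com/Kevin"),
}

merged = source_a | source_b
print(len(merged))  # 2 -- the shared statement is stored only once
```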

When Scott talks about tracking "entities", one way to do this is using graph visualization (graph analytics) that sits on top of a native RDF triplestore. The data in the triplestore contains the relationships between the entities. They can be displayed using relationships graphs (a form of data visualization). Since graphs can get very complex very fast (and since triplestores can hold billions of RDF statements, all of which are theoretically eligible to be displayed in the visual graph), users apply link analysis techniques to filter the data, configure the graph, change the size of the entities (nodes) on the graph and the links (edges), search the graph space and interact with charts, tables, timelines and geospatial views.

SPARQL, the powerful query language that can be used with triplestores, is more than adequate to create subsets of RDF statements which can then be stored in smaller, more nimble triplestores that make graph analysis easier.
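A sketch of that idea: a CONSTRUCT query carves a focused subgraph out of the full store. The Python below imitates the behavior with a simple filter; the SPARQL in the comment shows the equivalent query shape (the predicate name is invented for the example).

```python
# A CONSTRUCT query materializes a new, smaller graph from the full store.
# Equivalent SPARQL (predicate is illustrative):
#   CONSTRUCT { ?s <http://example.com/connectedTo> ?o }
#   WHERE     { ?s <http://example.com/connectedTo> ?o }
def construct_subgraph(triples, predicate):
    """Select only the statements using one predicate -- a tiny stand-in
    for carving a focused subset out of a billion-statement store."""
    return {t for t in triples if t[1] == predicate}

store = {
    ("ex:Diabetes", "ex:connectedTo", "ex:Obesity"),
    ("ex:Diabetes", "ex:hasGene", "ex:TCF7L2"),
    ("ex:Obesity", "ex:connectedTo", "ex:Hypertension"),
}
subset = construct_subgraph(store, "ex:connectedTo")
print(len(subset))  # 2 of the 3 statements survive the filter
```

The resulting subset is itself a valid graph, so it can be loaded into a smaller triplestore and explored visually without the weight of the full database.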

So, while graph analysis is getting hot, the power is in the blend of the visual aspect and the underlying RDF triplestore. It's also fair to say that creating the triples by analyzing free flowing text is also an important part of this solution.

To read more about native triplestores, graph visualization and text mining, get the white paper published by Ontotext called "The Truth About Triplestores" which outlines all of this and goes deeper into text mining and other semantic technology.

GraphDB 6.1 is the latest version of Ontotext's flagship RDF Triplestore product.

Tony Agresta's insight:

GraphDB 6.1, a native triplestore, is now available from Ontotext. You can get a free copy of Lite, Standard or Enterprise here: http://www.ontotext.com/products/ontotext-graphdb/. This is worth trying, especially since it comes with the Knowledge Path Series, which guides you through the entire evaluation: http://www.ontotext.com/graphdb-knowledge-path-series/

Take a look at the types of entities that the semantic biomedical tagger (SBT) can identify from complex text. The biomedical tagger has a built-in capability to recognize 133 biomedical entity types and semantically link them to a knowledge base system. In this case it is Linked Life Data (LLD). The SBT can load entity names from the LLD service or any other RDF database with a SPARQL endpoint.

What does this mean for you? You can analyze free flowing text that has complex biomedical terms. Ontotext can analyze the text, identify entities and match those entities to our Linked Life Data service. By doing so, we enrich the terms identified in the Biomedical Tagger. Entity names can then be loaded into GraphDB (an RDF database) or any other RDF database in support of search and discovery or analytics applications. Your documents are discoverable at the ENTITY LEVEL, allowing analysts and researchers to find precisely what they are looking for - instantly.
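To make "discoverable at the entity level" concrete, here is a toy index in Python: documents are keyed by the (entity, type) pairs a tagger found in them, so a lookup by entity returns every matching document instantly. The entity names, types and document IDs are invented for illustration.

```python
from collections import defaultdict

# Entity-level discoverability: instead of indexing whole documents by
# keyword, index them by the entities a tagger identified in them.
tagged_docs = {
    "doc1": [("TP53", "Gene"), ("breast cancer", "Disease")],
    "doc2": [("TP53", "Gene"), ("aspirin", "Drug")],
}

index = defaultdict(set)
for doc_id, entities in tagged_docs.items():
    for name, etype in entities:
        index[(name, etype)].add(doc_id)

# Every document mentioning the gene TP53, instantly:
print(sorted(index[("TP53", "Gene")]))  # ['doc1', 'doc2']
```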

Last week I had a chance to present the Self Service Semantic Suite (S4) at the LT-Accelerate conference in Brussels. LT-Accelerate is a new event focusing on language technology and its applications in various domains: social media…

Tony Agresta's insight:

The presentation by Marin Dimitrov from Ontotext is worth a review. Ontotext has been delivering solutions across verticals that share some common themes:

Uncovering insight from text using search and discovery applications that pinpoint specific parts of free flowing text based on semantic indexing and classification.

Interlinking unstructured text AND structured data in the same semantic repository, allowing for complete and accurate search results.

Reducing the impact of schema evolution while also integrating heterogeneous data sources.

Applying the vast amounts of Linked Open Data available to enrich internal data sources and also disambiguate entities.

Revealing implied relationships (new facts) from existing RDF statements and then using these facts to answer queries faster and uncover new insights.

Today, these common themes have been used to deliver solutions in Media & Publishing, Life Sciences, Government, Financial Services, Healthcare, Claims Management and other areas.

S4, the Self Service Semantic Suite, allows developers to build their own applications using proven, enterprise-class tools that run on demand in the cloud. If you are looking for a low-cost way to adopt semantic technology, try Ontotext S4 for free. If you want to learn more about the complete suite of tools in the Ontotext portfolio, visit www.ontotext.com

Northern Va. organizations form committee focusing on big data Loudoun Times-Mirror “Industry, academic and research leaders in this region, and the NVTC's leadership in particular, truly believe that we have a unique set of powerful resources,...

Tony Agresta's insight:

With 70% of the world's daily internet traffic passing through Loudoun every day, one of the most beautiful counties in the US is also a center for big data.

Berners-Lee envisioned a Web where all sites inherently included this capability. A Web that involved far less effort, time, and expense for both Web site producers and users. A Semantic Web. On social networks, this idea of starting with one piece of data (i.e.: a user of Facebook) and finding your way to other data (that user's friends, and then their friends, and who they work for and where they live) is often referred to as a social graph. A social graph is an example of a data graph and the foundational element of a data graph is something called a triple. "David is a friend of Wendell" is a triple. It involves two objects (David and Wendell) and the explanation of the relationship. In true Semantic Web vernacular, "David" is the subject, "is a friend of" is the predicate, and "Wendell" is the object. When linked together (David knows Wendell who knows Kevin and so on..), triples form the basis of graphs.
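That chain - David knows Wendell, who knows Kevin, who works somewhere - is easy to sketch in code. The toy breadth-first walk below starts at one node of a triple set and finds everything reachable from it, which is exactly the "start with one piece of data and find your way to other data" navigation described above. Names are illustrative.

```python
from collections import deque

# "David is a friend of Wendell" as a triple, plus two more links.
triples = [
    ("David", "isFriendOf", "Wendell"),
    ("Wendell", "isFriendOf", "Kevin"),
    ("Kevin", "worksFor", "ExampleCorp"),
]

def reachable(start, triples):
    """Breadth-first walk over the graph: friends, friends of friends,
    their employers, and so on."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for subj, _, obj in triples:
            if subj == node and obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return seen - {start}

print(sorted(reachable("David", triples)))  # ['ExampleCorp', 'Kevin', 'Wendell']
```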

Companies have more data than they think, they need less data than they think, and predictive models consistently outperform human decision-making abilities.

Tony Agresta's insight:

All of the model types described in this article will help smaller companies predict outcomes, deploy resources more effectively and save time. Given some of the less expensive data mining tools on the market today, you would be surprised how low the cost of building these models can be, given good input data and someone familiar with predictive analysis.

Predicting churn, retention, up-sell, or money spent should be left to predictive analytic products that focus on response or performance models. Visualizing networks of activity through connection points is best done with data visualization tools and link analysis.

Borislav Popov, head of Ontotext Media & Publishing, will show you how news & media publishers can use semantic publishing technology to more efficiently generate content while increasing audience engagement through personalisation and recommendations.

Tony Agresta's insight:

This webinar is recommended for those interested in how to apply core semantic technology to structured and unstructured data. In this webinar you will learn:

The importance of text analysis, entity extraction and semantic indexing - all directly linked to a graph database

The significance of training text mining algorithms to create accuracy in extraction and classification

The power of semantic recommendations - delivering highly relevant content using a blend of semantic analysis, reader profiles and past browsing history

How "Semantic Search" can be applied to isolate the most meaningful content

This webinar will show live demonstrations of semantic technology for news and media. But if you are in government, financial services, healthcare, life sciences or education, I would still recommend the webinar. The concepts are directly applicable and most of the technology can be adapted to meet your needs.

The Semantic Web Journal was launched 5 years ago. There's a wealth of information here on semantics.

Below is the abstract for the number two entry - GraphDB (formerly OWLIM). You can download the paper on the site.

"An explosion in the use of RDF, first as an annotation language and later as a data representation language, has driven the requirements for Web-scale server systems that can store and process huge quantities of data, and furthermore provide powerful data access and mining functionality. This paper describes OWLIM (now called GraphDB), a family of semantic repositories that provide storage, inference and novel data-access features delivered in a scalable, resilient, industrial-strength platform."

The term “Semantic Search” is certainly not new. However, it has taken on a new dimension and implications in both search and social engines today. In addition, it has had a strong impact on targeted semantic advertising.

This special series of forthcoming articles on semantic search will take a look at the history behind the development of semantic technology and why it has now become so commercially viable and topical. It will also take a look at how the technology enables “answer engines,” rather than simple search engines, to improve the user experience.

This Ontotext webinar is designed to provide a summary of the value of Semantic Technology for smarter data management, as well as a brief technical introduction to the Self-Service Semantic Suite (S4) by Ontotext, which provides on-demand capabilities in the Cloud for text analytics, RDF data management and access to knowledge graphs.

Tony Agresta's insight:

I thought some of my followers might want to attend this webinar which will be given by our CTO, Marin Dimitrov. Marin will be talking about important use cases for text mining and knowledge graphs running in the cloud. This is worth attending.

Used to be that medical researchers came up with a theory, recruited subjects, and gathered data, sometimes for years. Now, the answers are already there in data collections on the cloud. All researchers need is the right question.

Tony Agresta's insight:

Through semantic analysis of free flowing text and the indexing of results, fine-grained details about diseases, treatments, symptoms, clinical trials and current research can be made accessible to medical practitioners in real time. How does this work? It typically involves creating a text mining or natural language processing "pipeline" that is used to analyze the text, identify entities (even complex biomedical terms), classify them, develop relationships between them and then "index everything."
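A heavily simplified sketch of those first pipeline steps, in Python: spot known terms with a gazetteer lookup, classify them, and emit subject-predicate-object triples ready for a graph database. Real pipelines use trained models and curated resources rather than a hard-coded dictionary, and the predicate names here are invented.

```python
import re

# Toy gazetteer: surface form -> entity type. A production pipeline
# would use trained extractors and curated terminologies instead.
GAZETTEER = {
    "metformin": "Drug",
    "type 2 diabetes": "Disease",
}

def tag(text):
    """Find gazetteer entities mentioned anywhere in the text."""
    found = []
    for name, etype in GAZETTEER.items():
        if re.search(re.escape(name), text, re.IGNORECASE):
            found.append((name, etype))
    return found

def to_triples(doc_id, entities):
    """Emit statements linking the document to its entities and each
    entity to its type -- ready to load into a triplestore."""
    return ([(doc_id, "mentions", name) for name, _ in entities] +
            [(name, "hasType", etype) for name, etype in entities])

text = "Metformin remains first-line therapy for type 2 diabetes."
print(to_triples("doc42", tag(text)))
```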

The way we have done this successfully is by using proven text mining algorithms and tuning them to highly specific domains like life sciences, healthcare and biotech. We use curation tools and trained curators to read the text, annotate it and gain agreement on the annotations. Then the results are used to refine the text mining algorithms, test and validate.

This process may seem cumbersome to some but the reality is, when done by trained pros, it is not. It has the added benefit of being done one time and then being applied for long periods of time without interruption. Results are highly accurate.

Paste text into the box from an article or research paper on healthcare or life sciences - make sure the article is replete with complex biomedical terms that you don't think any automated algorithm can figure out.

Select Bio Medical Tagger (by the way, you can also do this for general news or Tweets)

Click Execute

Analyze the results

Pretty cool.

Organizations that don't semantically enrich their content are operating at a disadvantage. The benefits are real - saving patients lives, finding new treatment strategies, developing drugs faster and much more.

If you would like to learn more about semantics, we suggest you visit www.ontotext.com where there's a wealth of information, demos, customer stories and news about this important subject.

This post is a bit technical but I would encourage all readers to look this over, especially the conclusions section. The key takeaway has to do with the ability for a graph database (GraphDB 6.1 in this case) to perform updates (inserts to the database) at the same time queries are being run against the database. Results here are impressive.

The Applied Research and Communications Fund together with Enterprise Europe Network – Bulgaria and KIC InnoEnergy awarded the ‘Innovative Enterprise of the Year 2014’ to Ontotext. The contest is supported by the Bulgarian Ministry of Economy and Energy…

Tony Agresta's insight:

The growth in unstructured data and the need to discover contextual insights in your data are fueling the growth in natural language processing, text mining, graph databases and discovery interfaces. The vertical application of this technology is widespread. It can include patient data, lab results, insurance claims data, clinical trials and research - all of which can be analyzed and accessible in one solution designed to improve patient outcomes, expedite claims processing or quickly find current, relevant research in support of new drug development.

The media and publishing world applies semantic technology in a different way. Entity extraction is still used to identify and disambiguate specific people, places, events and other attributes from within free flowing text. But this is often combined with a digital footprint of visitor behavior and past searches to deliver highly targeted, relevant articles and facts all of which are stored within a centralized knowledge base.

Other core use cases include curating new content, automated tagging, enrichment using Linked Open Data and enhanced authoring tools designed to prompt authors with relevant content they can use to add color to their current articles.

There is no limit to the application of semantic technology including manufacturing (fast access to manuals and plans), customer service (analysis of customer call notes), financial services (targeted know-your-customer and compliance-based search) or semantic ad targeting (analyzing on line news followed by targeted ads that pinpoint places to visit, hotels, restaurants).

Ontotext has been doing this longer than anyone - 15 years - and has built a complete portfolio of semantic tools to analyze text, extract and classify entities, enrich the data, resolve identities, optimize the storage of tens of billions of facts and make ALL of your data discoverable. For these reasons, Ontotext has been recognized as the Innovative Enterprise of the Year for 2014.

To learn more about semantic technology and try it for free, visit www.ontotext.com

Strategic hires for Ontotext USA indicate Ontotext's expansion in the North American marketplace.

Tony Agresta's insight:

Ontotext has long had a presence in North America but recently expanded operations for a number of reasons: support for the growing installed base in this region, expansion into key US markets and building out alliances. Success in EMEA and wide adoption of Ontotext have driven this growth. Recently, Ontotext released version 6.0 of its native RDF triplestore, GraphDB. GraphDB is widely regarded as the most powerful RDF triplestore in the industry, with support for inferencing; optimized data integration through owl:sameAs; an enterprise replication cluster; connectors to Lucene, Solr and Elasticsearch; query optimization; SPARQL 1.1 support; RDF Rank, to order query results by relevance or other measures; simultaneous high-performance loading, queries and inference; and much more.

Organizations have gravitated toward Ontotext more than other NoSQL vendors and pure triplestore players because of the broad portfolio of semantic technology Ontotext provides beyond GraphDB. This includes Natural Language Processing, Semantic Enrichment, Semantic Data Integration, Curation and Authoring tools. Ontotext's experience working with Linked Open Data sets extends back to the beginning of the LOD movement. When these tools and technologies are blended with GraphDB, they offer a powerful combination of semantic technologies that delivers a single-vendor solution while lowering maintenance costs, shortening time to delivery and offering proven deployment options.

Graph databases, also known as RDF triplestores, have unique benefits over other databases. They allow users to store linked data, query the graph as they would a NoSQL database, and infer new meaning using reasoning engines, thereby creating facts that can be used to answer questions very quickly and enhance the search and discovery user experience.

The underlying technologies (RDF, ontologies and SPARQL as the query language) are often not well understood. Here is a set of classes that users interested in this topic can take. The link takes you to a description of the classes.

In the 1960s, statisticians used terms like "Data Fishing" or "Data Dredging" to refer to what they considered the bad practice of analyzing data without a prior hypothesis. The term "Data Mining" appeared around the 1990s in the database community. I coined the term "Knowledge Discovery in Databases" (KDD) for the first workshop on the topic (1989), and this term became popular in the academic and research community. The KDD conference, now in its 21st year, is the top research conference in the field, and there are also KDD conferences in Europe and Asia.

However, because the term "data mining" is easier to understand, it became more popular in the business community and the press.
