Archive for the 'Database' Category

UMBC Data Science Graduate Programs

UMBC’s Data Science Master’s program prepares students from a wide range of disciplinary backgrounds for careers in data science. In the core courses, students will gain a thorough understanding of data science through classes that highlight machine learning, data analysis, data management, ethical and legal considerations, and more.

Students will develop an in-depth understanding of the basic computing principles behind data science, including, but not limited to, data ingestion, curation, and cleaning, and the 4Vs of data science (Volume, Variety, Velocity, and Veracity), as well as the implicit 5th V: Value. By applying principles of data science to the analysis of problems within the specific domains expressed through the program pathways, students will gain practical, real-world, industry-relevant experience.

The MPS in Data Science is an industry-recognized credential and the program prepares students with the technical and management skills that they need to succeed in the workplace.

Scientists and casual users need better ways to query RDF databases or Linked Open Data. Using the SPARQL query language requires not only mastering its syntax and semantics but also understanding the RDF data model, the ontology used, and URIs for entities of interest. Natural language query systems are a powerful approach, but current techniques are brittle in addressing the ambiguity and complexity of natural language and require expensive labor to supply the extensive domain knowledge they need. We introduce a compromise in which users give a graphical “skeleton” for a query and annotate it with freely chosen words, phrases and entity names. We describe a framework for interpreting these “schema-agnostic queries” over open domain RDF data that automatically translates them to SPARQL queries. The framework uses semantic textual similarity to find mapping candidates and uses statistical approaches to learn domain knowledge for disambiguation, thus avoiding the expensive human effort required by natural language interface systems. We demonstrate the feasibility of the approach with an implementation that performs well in an evaluation on DBpedia data.
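To make the gap between a user's words and a SPARQL query concrete, here is a toy sketch in Python. It is not the framework described above: the hand-written TERM_MAP stands in for the semantic similarity and statistical disambiguation steps, the function name skeleton_to_sparql is hypothetical, and the DBpedia terms are chosen only for illustration.

```python
# Toy illustration only. A hand-written lookup table replaces the learned
# mapping from user words to vocabulary terms (an assumption for this sketch).
PREFIXES = {
    "dbo": "http://dbpedia.org/ontology/",
    "dbr": "http://dbpedia.org/resource/",
}

TERM_MAP = {
    "book": "dbo:Book",            # a class the user called "book"
    "written by": "dbo:author",    # a property the user called "written by"
    "Mark Twain": "dbr:Mark_Twain" # an entity named by the user
}


def skeleton_to_sparql(subject_type, relation, entity):
    """Turn a one-edge annotated skeleton into a SPARQL query string."""
    lines = [f"PREFIX {prefix}: <{uri}>" for prefix, uri in PREFIXES.items()]
    lines += [
        "SELECT ?x WHERE {",
        f"  ?x a {TERM_MAP[subject_type]} .",
        f"  ?x {TERM_MAP[relation]} {TERM_MAP[entity]} .",
        "}",
    ]
    return "\n".join(lines)


query = skeleton_to_sparql("book", "written by", "Mark Twain")
print(query)
```

The user supplies only "book", "written by" and "Mark Twain"; everything else in the generated query (prefixes, class and property names, the entity URI) is what a person writing SPARQL by hand would have to know in advance.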

Wild Big Data is data that is hard to extract, understand, and use due to its heterogeneous nature and volume. It typically comes without a schema, is obtained from multiple sources and provides a challenge for information extraction and integration. We describe a way of subduing Wild Big Data that uses techniques and resources that are popular for processing natural language text. The approach is applicable to data that is presented as a graph of objects and relations between them and to tabular data that can be transformed into such a graph. We start by applying topic models to contextualize the data and then use the results to identify the potential types of the graph’s nodes by mapping them to known types found in large open ontologies such as Freebase and DBpedia. The results allow us to assemble coarse clusters of objects that can then be used to interpret the links and perform entity disambiguation and record linking.

“President Obama signed an Executive Order directing historic steps to make government-held data more accessible to the public and to entrepreneurs and others as fuel for innovation and economic growth. Under the terms of the Executive Order and a new Open Data Policy released today by the Office of Science and Technology Policy and the Office of Management and Budget, all newly generated government data will be required to be made available in open, machine-readable formats, greatly enhancing their accessibility and usefulness, while ensuring privacy and security.”

Stanford is experimenting with an interesting idea — offering some of their most popular undergraduate computer science courses online for free and simultaneously with their regular offerings. An AI course was announced several weeks ago and now there are similar offerings for databases and machine learning. These are taught by first rate instructors (who are also top researchers!) and are the same courses that Stanford students take.

“A bold experiment in distributed education, “Introduction to Artificial Intelligence” will be offered free and online to students worldwide during the fall of 2011. The course will include feedback on progress and a statement of accomplishment. Taught by Sebastian Thrun and Peter Norvig, the curriculum draws from that used in Stanford’s introductory Artificial Intelligence course. The instructors will offer similar materials, assignments, and exams.”

“A bold experiment in distributed education, “Introduction to Databases” will be offered free and online to students worldwide during the fall of 2011. Students will have access to lecture videos, receive regular feedback on progress, and receive answers to questions. When you successfully complete this class, you will also receive a statement of accomplishment. Taught by Professor Jennifer Widom, the curriculum draws from Stanford’s popular Introduction to Databases course.”

“A bold experiment in distributed education, “Machine Learning” will be offered free and online to students worldwide during the fall of 2011. Students will have access to lecture videos, lecture notes, receive regular feedback on progress, and receive answers to questions. When you successfully complete the class, you will also receive a statement of accomplishment. Taught by Professor Andrew Ng, the curriculum draws from Stanford’s popular Machine Learning course.”

If successful, this might be a game changer. Two weeks after the online AI course was announced, 56,000 students had signed up! The approach might work for many disciplines, not just CS. The Khan Academy is a related effort.

Universities should keep an eye on them and think about how to adapt if they are successful. Most of our students will probably benefit from taking our traditional courses. If so, we should be able to explain the benefits from taking them (and make sure we deliver those benefits). At the same time, we may want to leverage the online material from these courses in a synergistic way.

Google announced today that it has acquired Metaweb, the company behind Freebase — a free, semantic database of “over 12 million people, places, and things in the world.” This is from their announcement on the Official Google blog:

“Over time we’ve improved search by deepening our understanding of queries and web pages. The web isn’t merely words — it’s information about things in the real world, and understanding the relationships between real-world entities can help us deliver relevant information more quickly. … With efforts like rich snippets and the search answers feature, we’re just beginning to apply our understanding of the web to make search better. Type [barack obama birthday] in the search box and see the answer right at the top of the page. Or search for [events in San Jose] and see a list of specific events and dates. We can offer this kind of experience because we understand facts about real people and real events out in the world. But what about [colleges on the west coast with tuition under $30,000] or [actors over 40 who have won at least one oscar]? These are hard questions, and we’ve acquired Metaweb because we believe working together we’ll be able to provide better answers.”

In their announcement, Google promises to continue to maintain Freebase “as a free and open database for the world” and invites other web companies to use and contribute to it.

Freebase is a system very much in the linked open data spirit, even though RDF is not its native representation. Its content is available as RDF and there are many links that bind it to the LOD cloud. Moreover, Freebase has a very good wiki-like interface allowing people to upload, extend and edit both its schema and data.

Here’s a video on the concepts behind Metaweb, which are, of course, also those underlying the Semantic Web. What’s the difference? I’d say a combination of representational details and centralized (Metaweb) vs. distributed (Semantic Web).

ComputerWorld has an article on the “nosql” movement and a recent nosql meetup held in San Francisco, No to SQL? Anti-database movement gains steam. Nosql systems are distributed, non-relational data stores that typically use a simple key-value approach to indexing and retrieving data and use a simple procedural query API rather than a sophisticated declarative query language.
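The contrast between a procedural key-value API and a declarative query language can be sketched in a few lines of Python. This is a minimal illustration using only the standard library: a plain dict stands in for a distributed key-value store, SQLite stands in for a relational database, and the put and get helpers and the sample records are made up for the example.

```python
import sqlite3

# Key-value style: the store only knows how to put and get by key.
store = {}


def put(key, value):
    store[key] = value


def get(key):
    return store.get(key)


put("user:1", {"name": "Ada", "city": "London"})
put("user:2", {"name": "Alan", "city": "London"})
put("user:3", {"name": "Grace", "city": "New York"})

# Looking up by key is trivial.
ada = get("user:1")

# Finding records by a non-key attribute is up to the application,
# here a full scan written by hand.
londoners_kv = sorted(v["name"] for v in store.values() if v["city"] == "London")

# Declarative style: say what is wanted and let the engine decide how.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "Ada", "London"), (2, "Alan", "London"), (3, "Grace", "New York")],
)
londoners_sql = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM users WHERE city = ? ORDER BY name", ("London",)
    )
]

print(londoners_kv, londoners_sql)
```

Both approaches return the same names. The difference is where the work lives: in the key-value version the application owns the access path, which is simple to distribute but pushes query logic into code.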

“The inaugural get-together of the burgeoning NoSQL community crammed 150 attendees into a meeting room at CBS Interactive. Like the Patriots, who rebelled against Britain’s heavy taxes, NoSQLers came to share how they had overthrown the tyranny of slow, expensive relational databases in favor of more efficient and cheaper ways of managing data.

“Relational databases give you too much. They force you to twist your object data to fit a RDBMS [relational database management system],” said Jon Travis, principal engineer at Java toolmaker SpringSource, one of the 10 presenters at the NoSQL confab (PDF). NoSQL-based alternatives “just give you what you need,” Travis said.”

There were presentations on nine different ‘nosql’ databases: Voldemort, Cassandra, Dynomite, HBase, Hypertable, CouchDB, VPork, MongoDB, as well as general presentations by Google’s Jonas Karlsson and Cloudera’s Todd Lipcon.

“The relatively young but rapidly growing “nosql” community met last Thursday in San Francisco. The idea was to give attendees a solid introduction to how distributed, non relational databases work as well as an overview of the various projects out there.”

I learned of this meeting on Hacker News, where you can find some interesting comments.

Of course there are many popular key-value stores that are not designed to support the highly scalable distributed needs of many Web applications. I found, for example, that as a persistent RDF store for rdflib, Sleepycat outperformed MySQL.
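For readers who have not used an embedded key-value store, here is a small Python sketch of the idea. It uses the standard library's dbm.dumb module as a stand-in for Sleepycat (Berkeley DB), and the "subject|predicate" key layout and the ex: names are assumptions made for this example, not how rdflib's store is organized.

```python
import dbm.dumb
import os
import tempfile

# A throwaway location for the database files.
path = os.path.join(tempfile.mkdtemp(), "triples")

# Write one triple: the key is "subject|predicate", the value is the object.
with dbm.dumb.open(path, "c") as db:
    db[b"ex:Metaweb|ex:acquiredBy"] = b"ex:Google"

# Reopen read-only to show that the data was persisted to disk.
with dbm.dumb.open(path, "r") as db:
    value = db[b"ex:Metaweb|ex:acquiredBy"].decode("utf-8")

print(value)
```

There is no server, no schema and no query language here, just keys and values in local files, which is a large part of why such stores can be fast for simple lookups.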

Price Waterhouse Coopers is one of the largest “professional services” organizations and has always been strong on technology consulting and advice. The Spring issue of their quarterly Technology Forecast journal focuses on the Semantic Web. This is from the table of contents:

46 Semantic technologies at the ecosystem level. Frank Chum of Chevron talks about the need for shared ontologies in the oil and gas industry.

You can download the free 58-page report here. You can also read a note on the issue in ReadWriteWeb, which focuses on linked data and interoperability.

“A new PricewaterhouseCoopers Technology report explains how the Semantic Web and Linked Data can help enterprises manage their large scale data better. The PwC Center for Technology and Innovation team spent several months researching and analyzing the problem of data silos in enterprises – and what solutions are being developed to help with that problem. The answer, according to PwC, is Semantic Web techniques. PwC believes that the Semantic Web offers a practical way to address the problem of large-scale data integration. …”

The group defines its geographic location as Columbia MD and their first HUG meetup was held last Wednesday at the BWI Hampton Inn. In addition to informal social interactions, it featured two presentations:

Amir Youssefi from Yahoo! presented an overview of Hadoop. Amir is a member of the Cloud Computing and Data Infrastructure group at Yahoo!, and will be discussing Multi-Dataset Processing (Joins) using Hadoop and Hadoop Table.
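As a rough idea of what joining two datasets in a MapReduce setting involves, here is a plain Python sketch of a reduce-side join. It does not use Hadoop or its API, and it is not taken from the presentation: the sample datasets and the map_phase, shuffle and reduce_phase functions are made up to mimic the three stages in a single process.

```python
from collections import defaultdict

# Two small datasets that share a key (the user id).
users = [(1, "Ada"), (2, "Alan")]
orders = [(1, "book"), (1, "pen"), (2, "laptop")]


def map_phase():
    """Emit (key, tagged value) pairs so the reducer can tell sources apart."""
    for user_id, name in users:
        yield user_id, ("user", name)
    for user_id, item in orders:
        yield user_id, ("order", item)


def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """For each key, pair every user record with every order record."""
    for key in sorted(groups):
        names = [value for tag, value in groups[key] if tag == "user"]
        items = [value for tag, value in groups[key] if tag == "order"]
        for name in names:
            for item in items:
                yield key, name, item


joined = list(reduce_phase(shuffle(map_phase())))
print(joined)
```

Tagging each record with its source is the key trick: after the shuffle, all records for one key arrive at the same reducer, which can then combine them.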

If you’re in Maryland and interested you can join the group at meetup.com and get announcements for future meetings. It might provide a good way to learn more about new software to exploit computing clusters and cloud computing.

Databases are a fundamental technology for most information systems and especially those based on the web. A group of senior database researchers met recently to assess the state of database research, as documented in site. So, where did the Semantic Web fit into their vision?

“In late May, 2008, a group of database researchers, architects, users and pundits met at the Claremont Resort in Berkeley, California to discuss the state of the research field and its impacts on practice. This was the seventh meeting of this sort in twenty years, and was distinguished by a broad consensus that we are at a turning point in the history of the field, due both to an explosion of data and usage scenarios, and to major shifts in computing hardware and platforms. Given these forces, we are at a time of opportunity for research impact, with an unusually large potential for influential results across computing, the sciences and society. This report details that discussion, and highlights the group’s consensus view of new focus areas, including new database engine architectures, declarative programming languages, the interplay of structured and unstructured data, cloud data services, and mobile and virtual worlds.”

It’s a good report with lots of interesting things in it and definitely worth reading, but I was disappointed to find that it makes no mention of the Semantic Web, RDF, OWL, ontologies, AI, knowledge bases, or reasoning. Here’s a word cloud (generated with Wordle) from the report, which provides a 10,000-foot view of its content.

The report says that it was “surprisingly easy for the group to reach consensus on a set of research topics to highlight for investigation in coming years”. Those topics are:

Revisiting Database Engines

Declarative Programming for Emerging Platforms

The Interplay of Structured and Unstructured Data

Cloud Data Services

Mobile Applications and Virtual Worlds

There is clearly overlap between the database and semantic web communities in the first three topics.

“Hypertable is a high performance distributed data storage system designed to support applications requiring maximum performance, scalability, and reliability. Hypertable will be particularly invaluable to any organization that needs to manage rapidly evolving data to support demanding real-time applications. Modeled after Google’s well known Bigtable project, Hypertable is designed to manage the storage and processing of information on a large cluster of commodity servers, providing resilience to machine and component failures. Hypertable seeks to set the open source standard for highly available, petabyte scale, database systems.” (link)

Like most research labs, we rely on MySQL whenever we need a database. And, as in most labs (I’m guessing here), it’s common to overhear something like the following in ours: “We really need to replace MySQL with Oracle or DB2 in X so it can handle the load.” But we never get around to it.

“In mid 2006, YouTube served approximately 100 million videos in a single day. To maintain a website of that scale, one would imagine YouTube has hundreds of DBAs. But in fact, there are just three people that make it all work. Paul Tuckfield, the MySQL DBA at YouTube shares horror stories about scalability at YouTube and how he coped with them to keep the show going everyday, while learning important lessons along the way. … According to him, the three important reasons for YouTube’s scalability are Python, Memcache and MySQL replication, the last having the most impact. Most people think that the answer to scalability is in upgrading hardware and CPU power. Adding CPUs doesn’t work on its own; wisdom is in getting the maximum amount of RAM for the CPU and then fine tuning.” (src)
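The role a cache like Memcache plays in front of a database can be sketched in Python with the common cache-aside pattern. This is only an illustration, not YouTube's setup: a dict stands in for a memcached client, SQLite stands in for MySQL, and the videos table, the get_title function and the stats counter are invented for the example.

```python
import sqlite3

# SQLite stands in for the database server (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE videos (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO videos VALUES (1, 'First upload')")

cache = {}               # stands in for a memcached client
stats = {"db_reads": 0}  # counts how often a lookup reaches the database


def get_title(video_id):
    """Return a video title, checking the cache before the database."""
    if video_id in cache:
        return cache[video_id]
    stats["db_reads"] += 1
    row = conn.execute(
        "SELECT title FROM videos WHERE id = ?", (video_id,)
    ).fetchone()
    title = row[0] if row else None
    if title is not None:
        cache[video_id] = title  # populate the cache for later requests
    return title


first = get_title(1)   # misses the cache and reads the database
second = get_title(1)  # served from the cache
print(first, second, stats["db_reads"])
```

The second lookup never touches the database, which is the point of keeping hot data in RAM. A real deployment would also need expiry and invalidation when rows change, which this sketch leaves out.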