What is a Database, really? Data Storage for Librarians

When a librarian talks about a “database” he or she usually means a website that will let you search for information not normally found in Google. A techy would probably correct on this and say “actually, Google is a database too.” And a real pedant would say something horrible like “actually, both of you are wrong. What you are talking about are interfaces that USE a database to store, modify and retrieve data.”

*** Yawn! ***

The problem is that we have jargon based on jargon that gets passed around until it means nothing to nobody. Here is my attempt to explain my pedant for you, and I will do it by answering the question: “What are my options for storing data?” Ooops, I’m hearing my inner pedant again. What are my options for storing electronic data?” would be more precise. Then I will explain why a relational database is one of the best options available for doing so.

So, how can you store data? Here are the options:

A Text File

Good old text is still not a bad way to store data. In fact, if you count scripting code like PHP, a heck of a lot of data is actually stored as plain text. It’s all pretty simple. You type some information in a text file and then save it on your harddrive.

If you want to retrieve that information, you can use a script to tell the computer to organize the information in a particular way. This is called parsing aka manipulating data according to specific rules.

There’s only one problem: you need ways to separate some kinds of information from others. Here are some possibles:

spaces — this is not bad, except you will not be able to distinguish phrases from plain words. (eg. “cracker jack in the box of tissues” parsed using only spaces would not help us make sense of the data).

commas — this is precisely what csv (comma separated values) does — separate each data cell with a comma.

indents/carriage returns — this is slightly better than the others, but still quite inadequate. How will distinguish a paragraph from a list of things, for instance?

Text is good for basic data storage, but it does not give us much flexibility in its use. For instance, what if we wanted to classify the data into groups like cats (siamese, manx, tabby)? In order to do this, we need a more complex structure.

Structured Text

As you climb the data food chain, complex systems get developed to organize information. Basically, this means you use symbols to help a computer understand how the data can be structured. There are many subsets to structured text.

Mark-up: Marked information is data that has marks or signals to let a computer distinguish one type of data from the other. Python and C computer languages use a # to indicate that the information in the line following is intended to be read by humans only. Javascript Object Notation (JSON) uses quotation marks to encase data, and squiggly brackets {} to show datasets that should be grouped.

A variable: A variable happens when you give some data a name. In PHP, $pie = 3.14; would assign the value 3.14 to the name “pie” which could be manipulated later on.

A string: A string is a kind of variable that has text. For instance in PHP you can say $myName = “Ryan”;.

An array: An array is a type of variable that includes a list used by computer programs for later manipulation. $myBooks = array(“the blue one”, “the red one”, “the yellow one”); would provide a numbered list that could be manipulated and organized.

An object: Explaining objects in full requires that someone read up on object-oriented programming. But one thing you can do with an object is store a broad taxonomy like animals->[cats->[siamese, manx, tabby], dogs->[collie, poodle->[toy, miniature, standard], doberman]].

Tree-based Structures

Tree-based structures, or cluster models are a subset of “structured text” data storage models. XML is probably the best example. In tree-based stuctures, data is organized in “parents”, “siblings” and “children”. Library-thesaurus addicts might understand these better as “broader, related and narrower” terms. Take the following code:

In this example, “librarybranch” is a child of “library” which is a child of “libraries.” Conversely, “libraries” is the parent of “library” and so on. If I wanted, I could add another child “library” within “libraries” and call it “New York Public Library” or “Ann Arbor Library” or whatnot and then add some respective branches.

The benefit to this kind of framework is that I use an XML parser to extract information such as “tell me all the branches of libraries that begin with the letter “A.” It is useful because the information is grouped into nice categories (like “Libraries” and “branches”).

The downside to all this is the sheer verbosity of it all. Say I add a parent to this tree file and call it “organizations.” Then I add a whole list of other organization types like “hospitals,” “businesses,” and “schools.” Depending on the scope of my data, this file could be huge in size and, even if you just want to find information about “libraries” you’d have to load up the entire file everytime you did a search (yes, programmers, I know there are ways around this, but they are generally quite complicated ways).

Also, tree-based structures do not always account for data anomalies. For instance, a library could be a branch department in a hospital, business or school. So, do you put the “library” underneath “school” or leave it on its own. Or maybe you do both? Ahh– but then you have to spend alot of time maintaining data in two, three or even more places. In database terms, this is called redundancy.

The Relational Model

The relational model is better than a tree-model when your dataset is large and complicated. The way it works is, instead of thinking in terms of “parents, siblings and children,” you think in term of relationships. This provides alot of flexibility for your database design, and helps to reduce the amount of “places” where you will have to keep information updated.

How does it work?

I won’t go too far into database design, but there are some basic principles. The first is called the “Primary Key.” This means that every data object (such as a library) has a way of identifying itself in a unique way (usually through a number). The second is called a “Foreign Key.” The Foreign key is a way to associate one dataset (eg. libraries) with another dataset (eg. library branches, hospitals or businesses). This association is called a relationship. Once each entity/data object has its own unique identifier, you can make just about any relationship you want, either by creating a foreign key within your table.

By using the unique identifier from table “LibraryTypes,” we are able to create a relationship between a library and a libraries purpose or type. Once we have this relationship established, any data we include in the “Library Type” table, will automatically be associated with the library. For instance, we could add a bit of data that defines what an “academic,” “public” or “special” library is, or define library associations that support/govern them. Then, everytime a library is called “public” all the information pertaining to the “public” librarytype also belongs to that library. This really helps reduce the amount of maintenance required for updating your database.

In more complicated relationships (called the Many-to-Many relationship), you might have to create a third table to associate two entities. I won’t go into this too far, but the principle is the same — you create relationships among entities using the unique identifiers among the tables.

Within the relational database model, there are at least two different kinds of databases:

TEXT/XML Based

Text files can be organized in such a way as to accommodate some of the benefits of the relational model. This is done by constructing keys within the XML files and associating two or more files together. Once very popular, this is not very common anymore. The advantage of this model is that text files are quite easy to share with others across a network. Text files are also readable by humans. The disadvantage is that, because the computer has to parse the text as well as compute the relationships among files, text models tend to be slower to operate, update and maintain.

The more common relational database type is a piece of software running on a server, rather than a set of text files. They are generally accessed using a standard language called SQL (Structure Query Language), or more specifically SQL as supported by a popular scripting language like PHP, Python or Java.

What you choose in terms of an SQL-type database will depend on your needs. Here are some examples of popular Open Source databases and why you would use them:

SQLite: SQLite is excellent for a database that is updated by only a few people. It is the simplest of all SQL database types, requires very little configuration and is stored in a single file, which makes it easy to backup, secure, share and manage. Also, it is a public domain database, meaning there are hardly any restrictions on how it can be used and/or modified. It’s major drawback is that because it comes in a single file only, it will not permit many updates from different people. In short, SQLite would be horrible for a large scale wiki project.

MySQL: MySQL is probably the most popular of open source databases. It takes a little more effort to configure on a server, but it also provides more power for people to manage and update information. MySQL handles multiple updates easily, so it is the database of choice for most CMS systems, including wikis. THe Koha ILS uses MySQL as well.

PostgreSQL: PostgreSQL is the most powerful of databases IMHO, providing just about everything MYSQL does, with some added back-end configuration options that make it better for maintaining very large projects. I highlight this one because it is the database behind the PINES Evergreen projects.

Final Thoughts + It’s Not All Mutually Exclusive!:

This intention of this article has been to identify the ways data can be stored and to explain a little about why the Relational Database Model is so important.

That said, I neglected one important thing. Almost any major web application will have a combination of all these types of data storage methods! For example, a website may store its data in a MySQL database, but offer part of that database in an XML file that can be accessed by other websites. This is called a “Web Service” or, more popularly, an API. Configuration information for servers and computer languages like PHP is almost always stored in a plain text file. In short, each data storage method has its own benefits and sacrifices — and that’s why it’s almost always better to diversify your data storage method.

Finally, for those who are interested in learning a little bit more about how a relational database works in the real work, I recommend the SQLZOO tutorials. They cover SQL in pretty straight-forward ways, covering a wide range of services from Oracle to SQLlite.

7 thoughts on “What is a Database, really? Data Storage for Librarians”

[…] 23 10 2008 Ryan Deschamps (”The Other Librarian”) posted [https://otherlibrarian.wordpress.com/2008/10/23/what-is-a-database-really-data-storage-for-librarians…] an interesting discussion on librarians and their understanding of the term […]

[…] very well the way that the term “database” is used in information services in his post What is a Database, really? Data Storage for Librarians: When a librarian talks about a “database” he or she usually means a website that will let you […]