Kurt Cagle

Kurt Cagle, managing editor/Drupal lead/contributor, has worked in the data science sphere for the last fifteen years, with experience in data systems architecture, data modeling, governance, ETL, semantic web technologies, data analytics, the data lifecycle and large-scale data integration projects. Kurt has also worked as a technology evangelist and writer, with nearly twenty books on related web and semantic technologies and hundreds of articles.

Why 'Ontology' Will Be A Big Word In Your Company's Future

Your business has its own language. If you sell cars, then you need not only make, model and year, but also MSRP, leather bucket seats and dealer incentives. If you are a dentist, knowing about bicuspids, prostheses and various forms of anesthesia is a must. Media companies have producers and writers, actors and grips, distribution networks and video masters. Business language is code.

This language is critical not only for communicating with others in your organization and with your customers; it also influences how programmers and data scientists identify the things their data systems track, accept and analyze. How things relate to one another makes its way into programs and data designs, influences IT purchases, and ultimately dictates whether the data you collect is actually of value to your organization or simply a waste of a database farm.

From Business Language to Machine Language

While there are a number of different ways you can describe this language, one of the more useful is called an ontology, which literally means the study of what exists. The idea behind ontologies is relatively simple, though it has some profound implications. In effect, imagine all the resources that your company sells, buys or uses to process stuff as classes of resources. In your car dealership, automobiles, dealerships, sales-people, contracts and so forth would be resources, while in a publishing company, books, authors, editors, publishers, printing machines, printers and so forth are also considered resources.

Resources in turn either have associated sets of values (such as an author's name, or a car's mileage) or have some relationship to other things or categories. Thus, an individual car will be sold from a given dealership (a relationship between different resource entities) or may be gas-powered, diesel or electric (a category). Moreover, each resource has a unique identifier as well as potentially multiple externally issued ones (such as a car's VIN). The categories can be thought of as adjectives, while the relationships can be thought of as prepositional phrases (such as "has VIN 31EAF54915" or "carried by dealership X").

The concepts and relationships/properties are collectively referred to as an ontology. In effect, the ontology describes the things, the sets of descriptions and the relationships, while the data fills in the particulars. A good analogy is a spreadsheet, where each sheet represents a class of resources, each row a given thing in that class, and each column a property for that class, with each cell being either a value or a link to another resource in a different sheet. The set of rules that describes what properties are in each sheet and how the sheets are related is the ontology, while the set of all cells makes up the data (or triples), forming a dataset or triple store in that ontology.
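
To make the spreadsheet analogy concrete, here is a minimal sketch in Python of a handful of triples and a lookup over them. The resource names, property names and values are purely illustrative, not drawn from any real vocabulary.

```python
# Each triple is a (subject, property, value) row; the ontology is the
# set of rules about which properties apply to which class. All names
# here are illustrative.
triples = [
    ("car:123", "rdf:type",         "Automobile"),
    ("car:123", "vin",              "31EAF54915"),
    ("car:123", "powerSource",      "electric"),   # a category
    ("car:123", "soldByDealership", "dealer:X"),   # a relationship
    ("dealer:X", "rdf:type",        "Dealership"),
    ("dealer:X", "name",            "Northgate Motors"),
]

def describe(subject):
    """Collect all property/value pairs for one resource."""
    return {p: v for s, p, v in triples if s == subject}

print(describe("car:123"))
```

Following a relationship (such as soldByDealership) from one resource to another is exactly the "link to another sheet" from the analogy above.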

Collectively, the rules, structures and query languages for getting data from this dataset are known as semantics, with the whole of the dataset known as a knowledge base. In other words, the goal of semantics is to make your business language machine-readable.

Semantics: Better Search, Better Relevance, Better SEO

When you read a newspaper article or a web page, a magical transformation takes place where you parse through words and layout, building up new ideas from each sentence, mentally building a summary that lets you abstract why the article is important (or dismiss it if it isn't), identifying key people, places and things, and ultimately gaining new insight. Despite a lot of claims made about artificial intelligence, the vast majority of computers are only just beginning to develop the barest rudiments to allow them to do the same thing.

As the web first emerged in the 1990s, one of the things it most needed was an index, where you could look up a keyword, then trace that keyword back to the document(s) that contained it. At first, this process was done manually, culminating in directories, but as the number of documents climbed beyond the first ten thousand or so, it became increasingly evident that the only way to keep up was to automate this process.

The first search engines were intended to do just this, first by reading all the terms in a document, stripping out those that were unnecessary (so-called stop words), then creating a link between the document and its word set. The first to do this was AltaVista, but the first to be successful as a business was Google. These search engines would then use certain sets of criteria to determine, when a set of words was presented as part of a search, which document best fit those terms (and by extension, appeared at the top of the list).
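
A toy version of that early indexing process can be sketched in a few lines of Python: strip the stop words, then build an inverted index mapping each remaining term to the documents that contain it. The documents and the stop-word list here are made up for illustration.

```python
# A minimal inverted index: term -> set of document ids containing it.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

docs = {
    "doc1": "the history of the automobile",
    "doc2": "an index of automobile dealerships",
}

index = {}
for doc_id, text in docs.items():
    for term in text.lower().split():
        if term not in STOP_WORDS:
            index.setdefault(term, set()).add(doc_id)

print(index["automobile"])  # both documents mention "automobile"
```

Real engines layer ranking criteria on top of this structure; the index itself is just the keyword-to-document mapping described above.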

It turned out, from a business sense, that being at the top of that list was far better than being on page 2, which was light years better than being on page 3. Nobody wanted to be on page 3. This launched a grand melee, and firms appeared overnight to help companies optimize their search ranking, in what soon became known as Search Engine Optimization, or SEO. Pretty soon, this became an evolutionary war, with the search engines working to keep their rankings as fair (or at least as optimized for them) as possible, while SEO involved trying to trick the search engines into pushing a given listing upward by a few slots.

One temporary truce in this effort came when Google, Yahoo, Microsoft and other search engine providers established a set of HTML tags that would best represent the keywords most closely matching the topics of the associated article, a process called tagging. These typically involved terms or concepts that the company was most interested in capturing, which in turn were part of the business language for that company. This metadata had higher priority than scanned content, and it in turn made it easier to put articles in the right set of buckets.

Towards a Smart Data Ecosystem

Semantics takes this up to eleven. By marking an article up semantically, you can identify all kinds of interesting things - what things the article (or catalog page) discusses, how much they cost at a certain time, where they are located, what their intended target audience is, what features those things have, what events occurred where these things were significant and even an idea about how reliable the page itself is. You can tie in resources with hashtag keywords used in viral campaigns, you can embed tracking codes to show how popular an item is - in essence, you can create and shape your own tracking categories. This is having a profound influence on SEO, and is shaping how both web apps and mobile apps get written to provide Big Data measures.

Semantics also makes building such web applications easier and more cost-effective. Comprehensive use of ontologies means that you can build full-cycle pipelines from data acquisition to user interfaces to data analysis to dashboards. Google, Bing and other search engines are now accepting semantic data as part of their search strategy, and increasingly, social media sites such as Facebook, Instagram, Pinterest and others are exposing semantically oriented endpoints (programmable web functions) that can tie semantic identifiers through social media, and from there into big data ingesters.

Our company, Semantic SEO Solutions, has been tracking the rise of semantics and ontologies for the last decade, focusing on building the best solutions not only for increasing the visibility of your sites, but also for making your sites more intelligent, easier to build, and far more flexible. Our latest offerings use semantic and machine learning solutions to help your business create smarter catalogs, more dynamic interfaces and more relevant content.

Who's doing this? Seventy-five percent of the Fortune 500 companies have some kind of smart data or semantics program underway, most under the banner of 360° initiatives, comprehensive enterprise data systems, or machine learning/data science projects. Amazon has recently added linked data capabilities to their AWS infrastructure with the Neptune project, and social media giants have built their entire data infrastructure around smart ontological data. Moreover, China, Japan, England, the OECD, and the United States have all moved critical data resources into semantic form, and semantics has become one of the hottest areas for investment banks such as Wells Fargo, Morgan Stanley, Citigroup, Goldman Sachs and others. It even ties into such cutting-edge technologies as Blockchain and the Internet of Things.

Ontology - it's a big word, but it will have an even bigger impact upon your business, today and tomorrow.

Kurt Cagle is a writer, data scientist and futurist focused on the intersection of computer technologies and society. He is the founder of Semantical, LLC, a smart data company.

Rethinking Millennials and Generations Beyond

In 1991, authors and sociologists William Strauss and Neil Howe published Generations (updated in 1997 with The Fourth Turning), in which they argued for Generational Theory — the notion that there were distinct cohorts throughout history that shared characteristics and values. These cohorts, going through different phases of their lives, determined turnings that identified pivotal periods in history, with each cohort spanning 18-20 years and the full cycle of four turnings playing out over roughly 80 years.

One of the central tenets of Generational Theory is actually a sound one — that there is a correlation between birth rate (the number of native births per family) and the economy. Up until the twentieth century in the US, this correlation was driven by agrarian concerns: in good times, you needed kids working in the fields to increase the family's fortunes, so families tended to be large. In bad times, extra mouths became a liability. However, by the 1930s, this pattern changed due to the rise in urban populations, where the cost of raising children was significantly higher but infant mortality declined dramatically. This shift, even with the Depression and World War II, would make the Baby Boom generation the largest to date in history in absolute terms, and remarkable even in relative terms.

Around 1954, the Baby Boom peaked in terms of birth rate, and it then began a slide that would take it back to a level comparable to that of the early Depression era. Several factors were involved, with the introduction of the birth control pill in 1960 and continuing urbanization both reducing the upper number of children a family had. Note that birth rates were still positive — the population itself has been growing continuously since the 1930s, with death rates remaining mostly stable, so birth rates act a lot like compound interest. However, in 1971, the native birth rate fell below the 2.1 children per family that demographers consider the replacement rate, below which a population starts to shrink. This means that most of the growth in the US population since 1971 has been due to immigration, and that without that immigration, the US population would be declining now, according to data gleaned from Google Public Data.

Economic growth is a direct function of population growth. Total economic activity is due primarily to the number of financial transactions that take place (not necessarily the magnitude of those transactions), which increases with more participants up to a limit imposed by the velocity of money in the system. As population growth slows (which happens when the birth rate is below 2.1), the number of transactions also slows, and money has to move faster through the system in order to compensate (e.g., wages need to increase).
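
The compounding point above can be illustrated with a rough back-of-the-envelope sketch in Python. The 2.1 replacement figure comes from the discussion above; the starting population, rates and generation counts are invented purely for illustration.

```python
# Rough illustration of the "compound interest" behavior of birth rates:
# a rate above or below replacement compounds generation over generation.
REPLACEMENT = 2.1  # children per family, per the demographers' figure

def project(population, births_per_family, generations):
    """Scale a population by its ratio to replacement, per generation."""
    for _ in range(generations):
        population *= births_per_family / REPLACEMENT
    return population

# Starting from the same base, three generations later:
print(round(project(100.0, 2.5, 3)))  # above replacement: grows
print(round(project(100.0, 1.8, 3)))  # below replacement: shrinks
```

Exactly at replacement the population holds steady; even a small deviation either way compounds into a large gap within a few generations, which is why the 1971 crossing matters.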

[Figure: Strauss & Howe, GatePost and Inflection Point Generations]

Changing Minds without Twisting Arms

When Strauss & Howe laid out their theory, they went on the assumption that generations were symmetrical, so that if you measured a generation from midpoint to midpoint, you would get a representative sample of people with related interests (indicative of the pattern they had seen in earlier population data, which was distinctly sine-wave in shape). However, because of that, the endpoints between generations (the red vertical lines in the above diagram) don't seem to line up with anything significant. Shift it by a quarter cycle (9 years), and something interesting emerges. The green lines are the shifted (GatePost) "generations", while the blue lines indicate inflection points in the population, where macro-behavior changes significantly. The green and blue lines overlap enough to suggest that there are definite regimes of behavior.

I refer to this shift as inflection point (IP) generations, and rather than trying to view generations as "half-cycles", I argue here that these inflection points make a much more natural generational division than what Strauss and Howe established originally (and in the process hope to cut down on a lot of the more tenuous "turnings" associations that have long been a valid criticism of the generational approach).

So, to summarize, an Inflection Point Generation occurs when there is a clear behavioral trend change, typically a peak or trough, in the birth rate. These correspond (very roughly) with the mid-points of the S&H generations, so one effect of this is to shift the age of each generation upwards by approximately nine years on average. The Boomers, as an example, go from 1946-1964 to 1936-1957. While this serves to place the goalposts at the beginning and end, rather than at the midpoint, it also makes these dates consistent with broader economic, rather than just social, trends.

The rationale behind this is that there is a strong correlation between birth rate and economic activity. When economic activity is rising, people begin to feel more optimistic, and tend to have larger families. When economic activity is falling, people begin to feel less optimistic, and tend to have smaller families. Note that this is a trend that really only applies to industrial and post-industrial societies, and the one notable shift that kicked off the Boomers was the transition in the US from a primarily agrarian to a primarily industrial economy during the 1930s.

In many respects, the idea of generations provides a useful way of assessing economic trends over the next twenty years. 2018 is actually a good point for examining these, as it (nominally) marks the end of one generation (GenZ) and the start of the next (GenAA?). Each group consequently has the following impacts:

The Greatest Generation
Interval (S&H) / Average Age: 1907-1925 / 100
Interval (IP) / Average Age: 1898-1916 / 109
Population Percentage (2017, IP): < 0.1%
Economic Impacts 2018: Minimal current impact upon the economy, save in health care costs.

Silent Generation
Interval (S&H) / Average Age: 1926-1944 / 82
Interval (IP) / Average Age: 1917-1935 / 92
Population Percentage (2017, IP): 3%
Economic Impacts 2018: Healthcare costs predominate. Like GenXers, a small generation. Mostly fixed-income retirees. Inflection point (IP) adjusted, this group is even smaller, and made up the bulk of the World War II generation.

The Baby Boomers
Interval (S&H) / Average Age: 1945-1963 / 64
Interval (IP) / Average Age: 1936-1956 / 72
Population Percentage (2017, IP): 25%
Economic Impacts 2018: IP adjusted, the Boomers were witness to the biggest growth period in US history. The Boomers are mostly well into retirement age at this point. This means that they are now drawing on fixed-income portfolios or social security, are downsizing (and mostly have done so) and are reducing their spending.

GenX
Interval (S&H) / Average Age: 1964-1982 / 45
Interval (IP) / Average Age: 1957-1975 / 53
Population Percentage (2017, IP): 21%
Economic Impacts 2018: IP adjusted, GenXers were the disco generation. The economy ran in reverse for the GenXers, and, in general, they were overshadowed by the Boomers over most of their lives. GenXers largely ended up going into technical fields, and were more introverted and pragmatic than their parents. They more or less created the PC, networking and the infrastructure of the Internet. Because they are a smaller generation, GenXers overall will have a smaller presence in the economy, leading to weakening growth.

Millennials (GenY)
Interval (S&H) / Average Age: 1983-2000 / 26
Interval (IP) / Average Age: 1976-1990 / 35
Population Percentage (2017, IP): 22%
Economic Impacts 2018: IP adjusted, the Millennials created the World Wide Web and the Mobile Web on top of the Internet. Their generation is actually just a bit larger than GenX, and came of age during a period of mild economic growth.

GenZ
Interval (S&H) / Average Age: 2000-2018 / 9
Interval (IP) / Average Age: 1991-2008 / 18
Population Percentage (2017, IP): 26%
Economic Impacts 2018: When people talk of Millennials as kids, they are probably actually thinking of IP-adjusted GenZ. This was a group that grew up with the Internet; they are adept at using it for social media and are easily the most media-savvy. Culturally, Millennials and GenZ are fairly distinct. It is also a group that is as large as the Boomers, and will likely have a dominant role in culture moving forward.

GenAA
Interval (S&H) / Average Age: 2018- / 0
Interval (IP) / Average Age: 2009- / 9
Population Percentage (2017, IP): 13%
Economic Impacts 2018: In 2008, the housing crisis occurred, and the birth rate began to fall significantly for the first time since the GenXers. GenAA is still something of a guess, but given the persistence of the trend it is likely that we are seeing a new generation here. The birth rate is still falling, and should this trend continue, it will hit "bottom" around 1.5 births per family around 2025. This is already affecting elementary and secondary school districts in a way that should be familiar to any student of the 1960s, when expansive school districts consolidated. GenAA will likely end up having a significant impact upon the economy around 2045 or so.


This table actually helps explain a number of anomalies in Strauss & Howe's generations. For instance, music culture experienced its greatest renaissance when young singers (mostly raised in rural settings) became widely known in the late 1950s and through the 60s, but those singers were generally in their twenties, not their teens, when they began performing and making it big. Buddy Holly was born in 1936, Presley in 1935, and they represent the start of the "rock and roll" era. This puts them at the start of the IP Boomer generation. This generation grew up with television, but it didn't enter the average home until the early 1950s — when an S&H Boomer would have been five, but an IP Boomer would have been in her mid-teens (which squares with the historical record).

From my own perspective, as someone born in 1963, I do not have values in accordance with Boomers, but am squarely in the IP GenX camp - and most of my acquaintances within a few years of this age clearly have different values than the Boomers. The Baby Boomers in general are much more inclined to be fairly religious, something very much in keeping with having grown up in a rural or semi-rural environment, while GenXers are much less religious, something that would be expected from growing up in an urban or suburban environment.

On a similar thread, one thing that has long bothered me about S&H is that the Millennial generation seemed both extraordinarily long and remarkably non-uniform, as if there were a Millennium I and a Millennium II generation. However, by looking at inflection points, it became clear that there was a distinct generation spanning the period from 1976 to 1990. The oldest of this generation would have entered the workforce in the early 1990s and would have built much of the World Wide Web, while the youngest would have been involved in building mobile phones, probably as junior programmers to their GenX managers.

By the time GenZ came along, they would have been raised as children on computer CDs, would have learned to navigate social media from AOL and later Facebook, Twitter and Instagram, and would have learned social interactions with cell phones mediating their conversation. When was the last time you saw a GenZ kid actually knock on a door, rather than text their friend inside? This is a generation that has taken to streaming like ducks to water.

This shift upwards in age of the various generations has a number of interesting implications. First, it provides a hard definition for a generation, one based not on current events but on clear demographic markers. It gets rid of pesky labels like Boomers or Millennials, making it harder to ascribe behaviors that are broad age-centric generalizations at best. It also relies upon a fairly stark designator: "Are you comfortable enough about your future that you are willing or even able to raise a large family?"

An inflection point measure also gives marketing campaigns that target specifically on demographics what is likely a more natural benchmark. If the average IP Boomer is seventy-two, not sixty-four, they will have less interest in retirement planning and more interest in vacations. Your typical IP Millennial is no longer living at their parents' house and is likely entering mid-level management now. Your IP GenZ daughter is motivated, politically aware and may be able to vote. Your IP GenAA granddaughter is more interested in the latest iteration of My Little Pony and Marvel movies than in teething rings and blankies, but still loves her GenX grandpa and grandma.

The Semantic Web Comes of Age

How do you describe a business? What about a person, or an intellectual work? There's an interesting little secret that people in IT likely know, but that doesn't always get to the C-Suite. Programming, at its core, is all about creating models. Sometimes those models are of classes of things, sometimes they better describe processes, but it is rare for a piece of software in your organization not to revolve around at least a few dozen critical types of things.

In large enterprises, it's not at all uncommon for an organization to go through a form of fire drill known as "creating the enterprise data model" (in TLA-speak, "EDM"). This particular ritual is initiated by business analysts who talk in hushed tones about data dictionaries, cardinality rules, associations and constraints. Diagrams are almost always drawn, typically entity relationship diagrams with lots of boxes and ovals and arrows, all neatly tied up in hushed debates about whether UML 1 or UML 2 rules apply and whether JSON or XML schema is the better denormalized form for handling streaming. Blood has been known to be drawn in these encounters. The end result, almost invariably, is a big, complex document called a schema, which is then placed in a folder on SharePoint while the programmers merrily ignore everything in it, until their applications don't work with the ones across the hallway and they realize that they need to figure out interoperability.

Creating such schemas can be time-consuming, and the process opens up the potential for different groups within an organization to seek solutions optimized for their own requirements, even if they are inconvenient for others. For people who work with such data dictionaries - business analysts, taxonomists and ontologists - this struggle was an inevitable part of defining an organization's data language, but that didn't mean it was an enjoyable part.

Today, schema.org defines nearly six hundred distinct types across a wide range of areas.

These can get into surprising detail - the Organization type itself includes sixty-one distinct properties (some strings of text or numbers, some other object types), and organizations in turn are referenced by dozens of other resource types.

Beyond this core, the automotive industry, health and life sciences, bibliographic information (where most of Dublin Core now resides) and the Internet of Things all have their own industry extension vocabularies, while insurance, financial services and aerospace players are currently examining adding to such an effort.

This may sound somewhat geeky, yet another fascinating arena of technology that may seem to have little value to businesses, save for one huge aspect - online search. Beginning in 2017, both Google and Bing (Microsoft's search engine) announced that they would be supporting the use of embedded smart snippets in web content. A smart snippet is a bit of JSON (a common web standard for data interchange) that uses schema.org tags to identify what a web page contains. Google (and likely all other major search engines) reads the snippet and creates a much more comprehensive record about that page than is built now for SEO searching. Smart snippets have greater weight in search algorithms, and because such snippets can in fact be fairly complex, it is possible to describe individual resources within these snippets in machine-readable ways.

So, suppose that you have a catalog of products - say, books. Normally a search engine scanning a web page for a particular book will use heuristic algorithms to get an idea of what the page is about. However, unless you have an army of SEO experts, chances are pretty good that the heuristic is pretty basic, relying upon keyword presence and positioning that can be computed as quickly as possible (CPU cycles cost money), and most of this is focused on the web page, not its content.

When a smart snippet is encountered, on the other hand, not only can the search engine get a much better idea of what the web page is about, but it is now able (if the snippets are set up properly) to actually describe the things themselves. When Google reads that snippet, it actually creates a record about a particular book as a book, not simply as content on a web page. If you are looking for books that are in the urban fantasy genre, feature a female doctor protagonist, take about two hours to read, and cost under $3, Google will be able to bring up this particular book. Not only that, but because the book has a unique identifier, different reviewers can weigh in (from potentially multiple platforms) and these reviews can then be linked to this identifier. Other applications can also read this same page and get this same information, and from it add other links that can be picked up by Google or other web applications.

This is called a knowledge graph, and it fulfills one of the basic visions of the Semantic Web as Web creator Tim Berners-Lee first described it publicly in a 2001 Scientific American article. This vision relies upon a common language for describing essential things ... the same language that is now emerging from schema.org. Smart snippets use JSON-LD as a carrier, but they use schema.org terms and relationships to describe not only things but how they relate to one another. The LD here stands for Linked Data, which can be thought of as a way of ascribing contextual information to resources that can be described in the virtual world.
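
As a concrete illustration, here is a minimal smart snippet built in Python, using real schema.org terms (Book, name, genre, offers, Offer, price, priceCurrency). The book, its URL and its price are hypothetical; in a real page, this JSON would be embedded in a script tag of type application/ld+json.

```python
import json

# A minimal schema.org JSON-LD description of the book example above.
# The identifier and values are invented for illustration.
snippet = {
    "@context": "https://schema.org",
    "@type": "Book",
    "@id": "https://example.com/books/urban-fantasy-123",
    "name": "An Example Urban Fantasy Novel",
    "genre": "Urban fantasy",
    "offers": {
        "@type": "Offer",
        "price": "2.99",
        "priceCurrency": "USD",
    },
}

print(json.dumps(snippet, indent=2))
```

The "@id" is the unique identifier discussed above: because it is a URL, reviews and other annotations published elsewhere can link back to this same book.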

In a business context, this same principle can be applied to create and manage corporate knowledge bases. If you are a manufacturer, your catalog of goods also becomes a database. A potential customer (either consumer or business) could read the data from the catalog page directly, rather than having to go through the complex process of setting up a data feed or a repository to extract content; from that JSON-LD they can determine the retail price, the wholesale price and relevant shipping information, and use that to place fifty orders through other channels. A blogging or news site can generate JSON-LD smart snippets that describe specific events, and if those events happen to include a reference to your company or CEO, that relationship can be extracted (along with all other relevant information) and made available to your sales people to capitalize on those events, or to your data analysts to help factor the impact of this news on your company. In effect, this makes it far easier for your own organization to create a dedicated mini-Google for retrieving relevant news, as well as making it far easier for other applications and search engines to identify metadata from your own press channels.
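
The consuming side of this - another application reading JSON-LD back out of a catalog or news page - can be sketched with Python's standard-library HTML parser. The page below is a stand-in for a real catalog page; a production pipeline would fetch live pages and handle multiple snippets per page.

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect every application/ld+json snippet found in a page."""
    def __init__(self):
        super().__init__()
        self.in_snippet = False
        self.buffer = []
        self.snippets = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self.in_snippet = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "script" and self.in_snippet:
            self.in_snippet = False
            self.snippets.append(json.loads("".join(self.buffer)))

    def handle_data(self, data):
        if self.in_snippet:
            self.buffer.append(data)

# A hypothetical catalog page carrying one smart snippet.
page = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Book",
 "name": "An Example Urban Fantasy Novel",
 "offers": {"@type": "Offer", "price": "2.99", "priceCurrency": "USD"}}
</script>
</head><body>Catalog page content...</body></html>"""

parser = JsonLdExtractor()
parser.feed(page)
print(parser.snippets[0]["offers"]["price"])
```

This is the "mini-Google" idea in miniature: once the snippet is parsed, the price, type and name are ordinary structured data that any downstream application can query.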

Similarly, embedding smart snippets into your publishing or digital asset management systems provides a way to future-proof classification - even if you don't currently have a way of doing anything with smart snippets, the ability to add them (either through manual production or entity extraction programs) ensures that you can categorize your content so that it is not only consistent with internal standards, but also globally accessible when the media is published. Indeed, because schema.org also includes a number of rights management features and terminology, this metadata can be used to better ensure that content goes only to the intended markets, and that inappropriate or unavailable content is not inadvertently distributed, preventing liability headaches.

Communication can only take place when a common language exists. Schema.org has the potential to be that common language, and as such schema.org, linked data and JSON-LD should absolutely be on the watchlist for your digital transformation strategies.