Who Is John Smith: Adventures In Genealogical Data Modeling

I’m no physicist, but they say when you look for an atom’s component electrons, the electrons are both everywhere and nowhere at the same time. You can never actually see a single electron within the atom that houses it, yet everywhere you look, there is evidence of the electron at that exact point at that exact time.

In attempting to pinpoint where a person is within genealogical data, one runs into the same phenomenon. Consider the following. Here is some evidence that may be relevant to a fictional John Lester Smith. It is not meant to be an authoritative list but merely illustrates the current issue:

Evidence

A birth certificate from Fayette County, Illinois, which states a male child named John Lester Smith was born at 401 South Third Street, Vandalia, Fayette County, Illinois, on 1 April, 1900 to parents John Smith and Darcy Smith.

An interview with George Rogers Smith in which George refers to his father’s brother, “Uncle John.”

Marriage certificate from Fayette County, Illinois, of John Smith and Darcy Madsen who were married 5 April, 1891 in Fayette County, Illinois.

An entry in the 1910 U.S. Federal Census from Enumeration District 12, Ward 3, Vandalia, Fayette County, Illinois, showing in dwelling 14, household 16, at 401 3rd Street family members John Smith, aged 32, and wife Darcy Smith, aged 30, with son Johnny Smith, aged 10, all born in Illinois.

Now, suppose I “know” there was a person in my family named John Lester Smith. How? Grandma told me. I heard about him all my life. This was a real person, and I identify him in my genealogy database. Maybe I even talked to him during a séance. (If so, it would make for an interesting source citation.) The point is, I have made an analysis and judgment that there is a person named John Lester Smith. I further think that the pieces of evidence above refer to this person I “know” existed, John Lester Smith.

What does the data show, independent of any analysis, that might be relevant to John Lester Smith?

A male named John Lester Smith was born at 401 South Third Street, Vandalia, Fayette County, Illinois, on 1 April, 1900 to parents John Smith and Darcy Smith.

A male named John Smith, distinct from John Lester Smith, his son.

An Uncle John, brother of the father of George Rogers Smith.

A John Smith married Darcy Madsen on 5 April, 1891 in Fayette County, Illinois.

John Smith living in Vandalia in 1910 aged 32.

Johnny Smith, son of John and Darcy Smith, aged 10, living with his mother and father in Vandalia.

OK, great. Now I want to put this data into my computer’s genealogy database.

Within my database, the identification of the person within a piece of evidence is done according to the name (or names) used in that evidence. Thus:

Person-name:

Person-name-number=1

Title=

Firstname=John

Middlename=Lester

Lastname=Smith

Referenced in a birth record

Person-name:

Person-name-number=2

Title=

Firstname=John

Middlename=

Lastname=Smith

Referenced in a birth record

Referenced in a marriage record

Referenced in a census record

Person-name:

Person-name-number=3

Title=Uncle

Firstname=John

Middlename=

Lastname=

Referenced in an interview

Person-name:

Person-name-number=4

Title=

Firstname=Johnny

Middlename=

Lastname=Smith

Referenced in a census record
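The four person-name records above fall out mechanically if every mention is grouped by the name exactly as written. Here is a minimal Python sketch of that grouping. The tuple layout and the function name group_by_exact_name are my own assumptions for illustration and do not come from any particular genealogy program:

```python
from collections import defaultdict

# Hypothetical layout for one mention: (title, first, middle, last, source type).
# The values are taken from the evidence list above.
mentions = [
    ("", "John", "Lester", "Smith", "birth record"),
    ("", "John", "", "Smith", "birth record"),
    ("Uncle", "John", "", "", "interview"),
    ("", "John", "", "Smith", "marriage record"),
    ("", "John", "", "Smith", "census record"),
    ("", "Johnny", "", "Smith", "census record"),
]


def group_by_exact_name(mentions):
    """Group mentions by the name exactly as written; no analysis is applied."""
    groups = defaultdict(list)
    for title, first, middle, last, source in mentions:
        groups[(title, first, middle, last)].append(source)
    return dict(groups)


person_names = group_by_exact_name(mentions)
for number, (name, sources) in enumerate(person_names.items(), start=1):
    print(number, name, sources)
```

Nothing in this step decides who anybody actually is. It only sorts names, so three different records land in the same plain "John Smith" bucket whether or not they describe the same man.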

But, where’s the part that identifies John Lester Smith, the data representation of the whole person? That piece isn’t here. To start describing this actual, real-life person, we need another object type entirely, an object which isn’t always available in genealogy software as a separate entity:

real-person:

PersonID=1

Whole-person-name=

Real-person-firstname=John

Real-person-middlename=Lester

Real-person-lastname=Smith

Ref=person-name:1:A

Ref=person-name:3:A

Ref=person-name:4:A

Also, why are these references to the previous person-name records part of this real-person record? Where can I put that information into my database? Sources and source citations can be added to genealogy programs for the entries in the person-name records, but there is no way to document the evaluation of this information in today’s software.
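To make the gap concrete, here is a rough Python sketch of what I would want each Ref line to carry: the reference itself plus a written reason for it. The class names Link and RealPerson and the rationale texts are hypothetical, invented only to illustrate the idea, and are not a feature of any existing program:

```python
from dataclasses import dataclass, field


@dataclass
class Link:
    """One judgment that a person-name record refers to a real person."""
    person_name_number: int
    rationale: str  # the evaluation itself, which today has no home


@dataclass
class RealPerson:
    person_id: int
    first: str
    middle: str
    last: str
    links: list = field(default_factory=list)

    def link(self, person_name_number, rationale):
        # Refuse a bare reference: every link must say why it exists.
        if not rationale.strip():
            raise ValueError("a link needs a written rationale")
        self.links.append(Link(person_name_number, rationale))


john = RealPerson(1, "John", "Lester", "Smith")
# Illustrative rationales only; a real analysis would be fuller.
john.link(1, "Birth certificate gives the full name John Lester Smith.")
john.link(3, "Uncle John; assumes George's father was another son of John and Darcy.")
john.link(4, "1910 census: Johnny, aged 10, at the address on the birth certificate.")

for entry in john.links:
    print(entry.person_name_number, entry.rationale)
```

The design choice is simply that the analysis lives on the link, not on the source citation and not on the person, so the evidence records stay untouched by my conclusions.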

Now, you might say, “My database does that differently. It isn’t a problem.” Really? I challenge you to examine it carefully. I think you’ll find that somewhere your evidence and your analysis of that evidence (in this case, expressed as your determination of evidence referring to a single person) get all mixed together in inappropriate ways.

I am not trying to debate the old question, “How do we know anything is real?” Rather, I want to point out that a layer of analysis is missing from the tools we currently use to document genealogical information, and the absence of this analysis layer causes our data to be distorted. This isn’t a particular problem for most cases, but when you get into a tricky case of an ancestor who has conflicting evidence or some other need for more detailed analysis, the ability of our current genealogy databases to document these issues breaks down.

This document is far from complete, but it is meant to be a starting point for discussion. It needs further refinement to be sure. However, I hope it is useful in illustrating the need for a better, uniform analysis structure within genealogical software database applications.

Considering that the best-selling genealogy program on the market does not have a good way to document evidence for a parent-child relationship, the lack of something more esoteric is not surprising. But I hope that by raising these questions and challenging genealogy software makers, the products we buy in the next few years will be much more tailored toward serious research instead of simply hanging leaves on the tree.

I agree that current genealogical software does not model both conclusions and “evidence” (or raw “information” that is used as evidence for conclusions). However, I think that what has been missing is the modeling of the evidence, not of the conclusions. I think that the data model used by GEDCOM and most desktop clients models the “real persons” (aka the “conclusion persons”), but gives you nowhere but the “notes” to capture what the original “source data” is. They give you the ability to cite sources, but it is up to the user to drive down to the library and look up the source; or to read through notes where you may have typed that information in. Storing the original source data as structured, computer-understandable data is what has been missing.

In your example, you have two powerful concepts: a “real person” (aka “conclusion person”); and “source persons” (aka “persona”, “appearances”, “mentions”). The “person-name” concept doesn’t seem helpful, though. There are a lot of different real people with the same name, so grouping “source persons” by exact name isn’t especially helpful. Better would be to just have the conclusion person reference directly all of the source persons (along with optional confidence levels and/or explanations of why you think they’re the same real person). Or, if there are some source persons who are very clearly the same person, perhaps group those (regardless of whether their names exactly match).
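A rough Python sketch of that suggestion follows, with the conclusion person pointing straight at source persons and no grouping by name. The class names, the 0.0 to 1.0 confidence scale and the numbers themselves are assumptions picked for illustration, not values derived from the evidence:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourcePerson:
    source: str
    name_as_written: str


@dataclass(frozen=True)
class Assertion:
    source_person: SourcePerson
    confidence: float  # hypothetical scale: 0.0 (guess) to 1.0 (certain)
    explanation: str = ""


birth = SourcePerson("birth certificate", "John Lester Smith")
census = SourcePerson("1910 census", "Johnny Smith")
interview = SourcePerson("interview", "Uncle John")

# The conclusion person references source persons directly.
# Confidence values below are arbitrary examples.
conclusion_person = {
    "name": "John Lester Smith",
    "assertions": [
        Assertion(birth, 0.9, "full name given"),
        Assertion(census, 0.8, "age and address match the birth certificate"),
        Assertion(interview, 0.4, "relationship only; no surname stated"),
    ],
}


def supported_by(person, minimum):
    """Names as written in every source person linked at or above a confidence."""
    return [
        a.source_person.name_as_written
        for a in person["assertions"]
        if a.confidence >= minimum
    ]


print(supported_by(conclusion_person, 0.5))
```

Because the names as written never have to match, "Johnny Smith" and "Uncle John" can both hang off the same conclusion person, each with its own stated level of doubt.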

My weakly-formed illustration does indeed show a person-name concept that is not particularly useful, I agree. I merely meant to show that the evidence-person relied only on the information given within a source, and that information would (presumably) prominently include a name. My illustration meant only to convey that this name should be recorded just as it appeared in the record. However, there could certainly be multiple names, conflicting names or even no name at all. Then what?

My rationale is that in a source citation, one at least has some opportunity to discuss analysis of the citation. When moving from applying this information to an actual person, there is no such opportunity, because there is generally no such step or differentiation in today’s software products.

Your idea of grouping the evidence persons as an optional intermediate step rather than simply referring them to a conclusion/real person is very interesting. The way I think, this adds yet another level of complexity that isn’t now accommodated. I would like to see how more talented folks might be able to achieve this intermediate grouping.

I’ve grown so frustrated with the constraints of my genealogy database software in analyzing data that I am, for the moment, reverting to spreadsheets and word processing programs. I will probably copy my written analyses to the program at some point, but for now I am concentrating on gathering, analyzing and pairing the appropriate evidence to my “real” people. The greatest benefit of the spreadsheets is that they allow me to record the evidence I examine and reject as well as the evidence I accept.

What I’d love is software that linked source persons to the real person in the same way multiple identities are linked in my contacts manager – Facebook, Twitter, cell phone, email, Skype, work phone identities are all connected yet remain distinct.

Yes, the idea of negative evidence is another critical piece that isn’t dealt with very well in our tools of today. I’m going to go start up a discussion at http://bettergedcom.wikispaces.com/ on that subject right now.

Your example of multiple online identities being linked together yet separate is perfect. If you don’t mind, can I use it, with attribution, over at BetterGEDCOM?