Flexible description: series description; dispersed collections.
Cooperative authority control: dispersed collections; the creator of one collection may also be referenced in a collection created by someone else (co-referencing); economic and descriptive benefits.
Integrated access to cultural heritage: the descriptions are essential context for archival records, but they can also provide context for all types of resources.
Archival authority records, like museum authority records, provide historical and biographical data that can enhance identification and understanding (biographical dictionaries; administrative histories).

Remember that we will solicit public evaluation and suggestions on drafts of the public interface, starting in the fall.

saa-2011-snac

1.
<ul><li>EAC-CPF and Social Networks </li></ul><ul><li>Society of American Archivists </li></ul><ul><li>Chicago </li></ul><ul><li>August 2011 </li></ul>Daniel V. Pitti § Institute for Advanced Technology in the Humanities § University of Virginia

3.
Funding and Timeline <ul><li>National Endowment for the Humanities </li></ul><ul><li>A Preservation and Access, Research and Development grant </li></ul><ul><li>Two-year project </li></ul><ul><li>May 2010-April 2012 </li></ul>

4.
Project Team <ul><li>Daniel Pitti (PI) and Worthy Martin (Institute for Advanced Technology in the Humanities, University of Virginia) </li></ul><ul><li>Adrian Turner and Brian Tingle (California Digital Library, University of California) </li></ul><ul><li>Ray Larson (School of Information, University of California, Berkeley) </li></ul>

5.
Project Objectives <ul><li>Archival finding aids currently intermix description of records with description of the creators of records and persons evident in the records </li></ul><ul><li>Further the ongoing process of transforming archival description using advanced technologies </li></ul><ul><li>By facilitating the separation of the description of people from the description of records </li></ul><ul><li>Using EAC-CPF, an international archival authority control standard </li></ul><ul><li>Goal: improve the economy and effectiveness of archival description in order to enhance access and understanding for users of archives, libraries, and museums </li></ul>

8.
Methods and Processing <ul><li>Extract EAC-CPF records from existing EAD-encoded archival descriptions </li></ul><ul><ul><li>Extracting both creators and referenced CPF names </li></ul></ul><ul><li>Match EAC-CPF records against one another and against existing authority records (ULAN, VIAF, LCNAF); merge records for the same entity </li></ul><ul><ul><li>Enhance EAC-CPF by normalizing entries, adding alternative entries, titles (VIAF), and historical data (ULAN) </li></ul></ul><ul><ul><li>Key challenge: two or more people with the same name; two or more names for the same person </li></ul></ul><ul><li>Create a prototype historical resource and access system </li></ul><ul><ul><li>Historical data and social-professional networks </li></ul></ul><ul><ul><li>Links to archive, library, and museum resources (by and about) </li></ul></ul>
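
The extraction step above can be illustrated with a minimal sketch. It assumes a tiny, un-namespaced EAD fragment and uses only the Python standard library; the sample document, the `extract_names` helper, and the dictionary layout are hypothetical and are not the project's actual extraction code. Names inside `<origination>` are treated as creators, all other tagged names as referenced entities.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified EAD fragment (real finding aids are far larger).
SAMPLE_EAD = """
<ead>
  <archdesc>
    <did>
      <origination><persname>Mugarrieta, José Marcos</persname></origination>
    </did>
    <controlaccess>
      <persname>Comonfort, Ignacio</persname>
      <corpname>Example Trading Company</corpname>
    </controlaccess>
  </archdesc>
</ead>
"""

# EAD name elements mapped to the three CPF entity types.
NAME_TAGS = {"persname": "person", "corpname": "corporateBody", "famname": "family"}


def extract_names(ead_xml):
    """Return one dict per tagged name: its text, CPF type, and role."""
    root = ET.fromstring(ead_xml)
    # Remember which name elements sit inside <origination> (the creators).
    creator_ids = {
        id(el)
        for origination in root.iter("origination")
        for el in origination.iter()
        if el.tag in NAME_TAGS
    }
    records = []
    for el in root.iter():
        if el.tag in NAME_TAGS:
            # Collapse internal whitespace in the name text.
            name = " ".join("".join(el.itertext()).split())
            role = "creator" if id(el) in creator_ids else "referenced"
            records.append({"name": name, "type": NAME_TAGS[el.tag], "role": role})
    return records


if __name__ == "__main__":
    for record in extract_names(SAMPLE_EAD):
        print(record)
```

This only picks up names that are already tagged as names; names embedded in untagged prose would need additional processing.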

9.
EAD Source Data <ul><li>Encoded Archival Description </li></ul><ul><ul><li>Intermixes description of creators of records and, at the discretion of the archivists, names associated with the content of the records </li></ul></ul><ul><ul><li>Detailed description of creators of records </li></ul></ul><ul><li>Widely varying quality </li></ul><ul><ul><li>In the number of names identified and encoded </li></ul></ul><ul><ul><li>In the formation of the names (direct or inverted, capitalization, punctuation, and so on) </li></ul></ul><ul><ul><li>In the categorization of names (personal, corporate, or family) </li></ul></ul><ul><li>Many names given but not identified as such </li></ul><ul><li>Most important of these in biographies/histories and in correspondence description </li></ul><ul><li>Extraction has focused on the “low hanging fruit,” that is, the names tagged as names </li></ul><ul><li>Attention shifting to names not identified as such </li></ul>

10.
Archival Records <ul><li>Records are the by-products of people living and working as individuals, in organized groups, in families </li></ul><ul><li>Records document people living and working </li></ul><ul><li>People exist in social-professional contexts, in relation to others </li></ul><ul><li>Records document these relations </li></ul><ul><li>All records created by the same entity are described together (a fonds or collection) </li></ul><ul><ul><li>Creators documented in detail </li></ul></ul><ul><ul><li>Many of the people documented in the records are referenced in the description </li></ul></ul><ul><li>Archival descriptions document interrelations among people and records (documents) </li></ul>

13.
<bioghist> <head>Biographical Sketch</head> <p>José Marcos Mugarrieta, prior to his term as Mexican consul in San Francisco 1857-1863, served in the Mexican army from 1837. He saw action in numerous battles and campaigns – Jamaica, under General Canalizo in 1841; Campeche, 1842-1843; Merida, 1843; Veracruz, 1845; Mexico City, 1846; Angostura and Cerro-gordo, 1847; Guanajuato, 1848, and Sierra-Gorda under Bustamante, 1848-1849; and Matamoros, 1849-1850. […] </p> <p>In April 1857 Mugarrieta received an appointment from the Comonfort government for the consulship in San Francisco. He did not actually begin his new duties until September 1, 1859, due to illness and to the political situation in Mexico. […]</p> </bioghist>

16.
Library and Archive Authority Control <ul><li>Library (or bibliographic) authority control is almost exclusively about the control of names </li></ul><ul><li>Archival authority control involves biographical-historical description of the CPF entity </li></ul><ul><ul><li>Descriptions based on controlled vocabularies or values, for example, occupations, place of birth and death </li></ul></ul><ul><ul><li>But also biographical-historical description </li></ul></ul><ul><ul><ul><li>Prose </li></ul></ul></ul><ul><ul><ul><li>Chronological list </li></ul></ul></ul><ul><li>Archival authority control provides context for understanding records, the context of their creation, the provenance </li></ul>

22.
Year One Results-Extraction <ul><li>EAC-CPF records extracted </li></ul><ul><ul><li>LoC: 43,702 from 1,159 finding aids </li></ul></ul><ul><ul><li>OAC: 91,811 from ~15,400 </li></ul></ul><ul><ul><li>NWDA: 22,609 from 5,160 </li></ul></ul><ul><ul><li>VH: 15,175 from 8,390 </li></ul></ul><ul><ul><li>Total: 173,297 </li></ul></ul><ul><ul><li>Note: a more recent extraction yielded 196,218, but we have not had time to analyze the results </li></ul></ul>

23.
Early Observations-Extraction <ul><li>Depth of analysis and quality of description of CPF entities varies widely in EAD-encoded finding aids </li></ul><ul><ul><li>LoC has many names under authority control </li></ul></ul><ul><ul><li>OAC and NWDA have fewer names, and control varies </li></ul></ul><ul><li>To be fair, the finding aids were created without SNAC processing in mind! </li></ul>

24.
Next on Extraction <ul><li>Refine extraction processing, incorporating some NLP-like processing, for example </li></ul><ul><ul><li>Verifying type of name: corporate body (C), person (P), or family (F) </li></ul></ul><ul><ul><li>Massaging poorly formed names into better formed names </li></ul></ul><ul><ul><li>Identifying names in strings that are names-plus (but name not identified as such) </li></ul></ul><ul><ul><li>Providing context information to enhance matching, for example, date or dates of correspondence, or occupation of creator of records </li></ul></ul>

27.
Social Networks and Archival Context Project: Matching and Merging EAC-CPF Records <ul><li>Ray R. Larson </li></ul><ul><li>Krishna Janakiraman </li></ul><ul><li>University of California, Berkeley </li></ul><ul><li>School of Information </li></ul>SAA 2011 - Chicago 2011-08-27 - SLIDE Thanks to Daniel V. Pitti of the Institute for Advanced Technology in the Humanities, University of Virginia, for many of the slides here

28.
SNAC Project <ul><li>The outlines of the project have been discussed by Daniel Pitti previously </li></ul><ul><li>The primary focus of the Berkeley group for the project is on combining data resources from multiple archives and other information sources </li></ul><ul><li>In this talk I will focus on our current methods used in the prototype (to be described by Brian Tingle later) </li></ul>

32.
Authority Control <ul><li>Identifying creator entities and referenced entities (correspondents, etc.) </li></ul><ul><li>Recording name or names used by and for them </li></ul><ul><li>Rule-based heading or entry formation and control </li></ul>

33.
Controlled Vocabularies <ul><li>Vocabulary control is the attempt to provide a standardized and consistent set of terms (such as subject headings, names, classifications, etc.) with the intent of aiding the searcher in finding information </li></ul><ul><li>That is, it is an attempt to provide a consistent set of descriptions for use in (or as) metadata </li></ul>

34.
The Problem <ul><li>Proliferation of the forms of names </li></ul><ul><ul><li>Different names for the same person </li></ul></ul><ul><ul><li>Different people with the same names </li></ul></ul><ul><li>Examples </li></ul><ul><ul><li>from Books in Print (semi-controlled but not consistent) </li></ul></ul><ul><ul><li>ERIC author index (not controlled) </li></ul></ul>

41.
Connect Exact Matches <ul><li>The EAC-CPF records provide the names without having to parse texts, etc. </li></ul><ul><li>Allows us to use some simple methods like exact matching </li></ul><ul><ul><li>Assume identical name entries mean the same person/corporate body/family </li></ul></ul><ul><ul><li>Enter the full names and record IDs into a database and flag IDs with the same names for merging </li></ul></ul>
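
The exact-matching step above can be sketched as a simple grouping of record IDs by name entry. This is an illustration only: an in-memory dictionary stands in for the database mentioned on the slide, and the record IDs and the `flag_exact_matches` helper are hypothetical.

```python
from collections import defaultdict


def flag_exact_matches(records):
    """Group record IDs by identical name entry.

    `records` is an iterable of (record_id, name_entry) pairs.
    Returns only the names shared by two or more records.
    """
    ids_by_name = defaultdict(list)
    for record_id, name in records:
        ids_by_name[name].append(record_id)
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


if __name__ == "__main__":
    # Hypothetical record IDs; the name strings are illustrative.
    sample = [
        ("loc-1", "Schliemann, Heinrich, 1822-1890"),
        ("oac-7", "Schliemann, Heinrich, 1822-1890"),
        ("oac-9", "Schliemann, Heinrich"),
    ]
    print(flag_exact_matches(sample))
```

Note that "Schliemann, Heinrich" without dates is not flagged: exact matching treats any difference in the string as a different entity, which is why later stages are needed.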

43.
Search Authority Files <ul><li>For each name, formulate a search of the VIAF database using the Cheshire system (SGML/XML retrieval system with probabilistic and Boolean matching) </li></ul><ul><ul><li>Search both the “authoritative” and “non-authoritative” forms </li></ul></ul><ul><ul><li>Consider any name matching a non-authoritative form to be a candidate match for the authoritative form </li></ul></ul><ul><ul><li>Flag EAC records that match the same authority record as potential matches </li></ul></ul>
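
The logic of the authority-file step can be sketched without Cheshire or VIAF themselves. The sketch below is a stand-in, not the project's search code: a small in-memory dictionary plays the role of the authority file, a case-insensitive exact lookup replaces probabilistic retrieval, and the authority ID, variant forms, and helper names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical authority file: id -> (authoritative form, non-authoritative forms).
AUTHORITY = {
    "auth-1": (
        "Schliemann, Heinrich, 1822-1890",
        ["Schliemann, Henry", "Schliemann, H."],
    ),
}


def build_index(authority):
    """Map every form of a name (authoritative or not) to its authority IDs."""
    index = defaultdict(set)
    for auth_id, (preferred, variants) in authority.items():
        for form in [preferred, *variants]:
            index[form.casefold()].add(auth_id)
    return index


def flag_by_authority(eac_names, index):
    """Flag EAC record IDs that resolve to the same authority record.

    `eac_names` maps record_id -> name entry. Returns auth_id -> record IDs
    for authority records matched by two or more EAC records.
    """
    hits = defaultdict(list)
    for record_id, name in eac_names.items():
        for auth_id in index.get(name.casefold(), ()):
            hits[auth_id].append(record_id)
    return {auth_id: sorted(ids) for auth_id, ids in hits.items() if len(ids) > 1}


if __name__ == "__main__":
    index = build_index(AUTHORITY)
    eac = {"e1": "Schliemann, Henry", "e2": "Schliemann, Heinrich, 1822-1890"}
    print(flag_by_authority(eac, index))
```

A match on a non-authoritative form is only a candidate: two EAC records are flagged together because they resolve to the same authority record, not because their own name strings agree.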

45.
Merge Flagged Records <ul><li>For all of the exact matches and authority matches </li></ul><ul><ul><li>Use the authoritative form of the name </li></ul></ul><ul><ul><li>Combine data from each match into a single EAC-CPF record </li></ul></ul><ul><ul><li>Retain all source record IDs and information </li></ul></ul><ul><li>Finally, output the merged EAC-CPF records </li></ul>
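
The merge step above can be sketched on plain dictionaries. Real EAC-CPF records are XML and carry much more data; the field names (`id`, `name`, `relations`) and the `merge_records` helper here are assumptions made for illustration.

```python
def merge_records(records, authoritative_name):
    """Combine flagged records into one, keeping every source ID.

    Each record is a dict with "id", "name", and an optional "relations" list.
    """
    merged = {
        "name": authoritative_name,
        "alternative_names": [],
        "source_ids": [],
        "relations": [],
    }
    for record in records:
        # Retain provenance: every contributing record ID is kept.
        merged["source_ids"].append(record["id"])
        # Name forms other than the authoritative one become alternatives.
        name = record["name"]
        if name != authoritative_name and name not in merged["alternative_names"]:
            merged["alternative_names"].append(name)
        # Union of relations, without duplicates, in first-seen order.
        for relation in record.get("relations", []):
            if relation not in merged["relations"]:
                merged["relations"].append(relation)
    return merged


if __name__ == "__main__":
    flagged = [
        {"id": "e1", "name": "Schliemann, Henry", "relations": ["Zaphiropoulos"]},
        {"id": "e2", "name": "Schliemann, Heinrich, 1822-1890"},
    ]
    print(merge_records(flagged, "Schliemann, Heinrich, 1822-1890"))
```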

52.
Another kind of failure <ul><li>Entry for “Zaphiropoulos” - no dates, no first name: </li></ul><ul><ul><li>The entry from VIAF was for “Zaphiropoulos, Lela, 1941-” </li></ul></ul><ul><ul><li>But the name in EAD came as an attribution for photos: </li></ul></ul><ul><ul><ul><ul><li>Box 113 </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Lot PP13 Zaphiropoulos. [Bas-relief at Troy], 1872. </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Physical Description: 2 photographs </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Scope and Content Note </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Photographs taken for Schliemann. </li></ul></ul></ul></ul><ul><li>Not sure that the Zaphiropoulos indicated is a person, and definitely not one born in 1941. </li></ul>
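
A failure like this one could in principle be caught with a date plausibility check. The following is a minimal sketch of that idea, assuming the year of the described material has already been extracted from the finding aid; the helper names and the heading pattern are assumptions, not part of the project's pipeline.

```python
import re


def birth_year(heading):
    """Pull a four-digit birth year from a heading such as 'Surname, Name, 1941-'."""
    match = re.search(r",\s*(\d{4})-", heading)
    return int(match.group(1)) if match else None


def plausible_match(authority_heading, material_year):
    """Reject a candidate whose birth year is later than the material's date."""
    born = birth_year(authority_heading)
    if born is None or material_year is None:
        return True  # no dates available, so nothing contradicts the match
    return born <= material_year


if __name__ == "__main__":
    # The photographs are dated 1872; the VIAF heading gives a 1941 birth year.
    print(plausible_match("Zaphiropoulos, Lela, 1941-", 1872))
```

The check can only rule candidates out. A candidate that passes is not confirmed, and a name with no dates at all passes by default.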

53.
Addressing the failures <ul><li>First we need to know where things are not working, and why </li></ul><ul><ul><li>We are planning to do a random sample and detailed evaluation of the database to help identify the problems </li></ul></ul><ul><li>Many of the problems we have seen already appear to be solvable using: </li></ul><ul><ul><li>Additional contextual clues from the EAD records </li></ul></ul><ul><ul><li>More sophisticated matching for phonetic variants </li></ul></ul><ul><ul><ul><li>Such as n-grams or phonetic schemes like Phonex </li></ul></ul></ul><ul><ul><li>Additional normalization of names before merging </li></ul></ul><ul><ul><ul><li>For name order, etc. </li></ul></ul></ul><ul><ul><li>Use of advanced matching methods </li></ul></ul>
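
Two of the remedies listed above, name normalization and n-gram matching, can be sketched together. This is one possible approach under stated assumptions, not the project's method: normalization strips accents and punctuation and turns an inverted "Surname, Forenames" entry into direct order, and similarity is the Jaccard overlap of character trigrams. Headings that carry dates or multiple commas are not handled.

```python
import re
import unicodedata


def normalize(name):
    """Lowercase, strip accents and punctuation, and put the name in direct order."""
    decomposed = unicodedata.normalize("NFKD", name)
    plain = "".join(c for c in decomposed if not unicodedata.combining(c))
    if "," in plain:
        # Inverted entry: move the surname after the forenames.
        surname, _, rest = plain.partition(",")
        plain = f"{rest} {surname}"
    cleaned = re.sub(r"[^a-z0-9 ]", " ", plain.lower())
    return " ".join(cleaned.split())


def ngrams(text, n=3):
    """Return the set of character n-grams of a space-padded string."""
    padded = f" {text} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a, b, n=3):
    """Jaccard similarity of the character n-gram sets of two normalized names."""
    grams_a, grams_b = ngrams(normalize(a), n), ngrams(normalize(b), n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


if __name__ == "__main__":
    print(normalize("Mugarrieta, José Marcos"))
    print(similarity("Mugarrieta, José Marcos", "José Marcos Mugarrieta"))
    print(similarity("Mugarrieta, Jose M.", "Mugarrieta, José Marcos"))
```

A score threshold for treating two names as candidates would have to be chosen empirically, for example from the planned random-sample evaluation.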

54.
Testing new merging methods <ul><li>Work done in conjunction with SNAC for an I School Master’s project called Biograph </li></ul><ul><ul><li>Krishna Janakiraman and Sean Marimpietri </li></ul></ul><ul><li>Using SNAC and merging with Freebase and IMDb </li></ul>

63.
Conclusions <ul><li>There will not be a single merging method, but a staged set of approaches that will allow us to go from the simplest exact matches, to (we hope) reliably identifying various variant forms of a name, etc. when corroborated by contextual (date, etc.) information </li></ul><ul><li>Once records are merged, they are passed along to Brian for search and display… </li></ul>

65.
Meet the target users <ul><li>Randy: Graduate student working on a PhD that involves biographies and the study of diplomatic families and networks. Sometimes he comes to the site looking for information on specific people; other times he is looking for information on a specific subject or event. He also TAs an undergraduate history class and sometimes has to help students find topics for papers. </li></ul><ul><li>Connie: Works at an institution that contributed records to the project. Will be asking how this site would be useful to the institution's users. Wants to understand how the institution's records were used and what the added value is. </li></ul><ul><li>Quincy: Library school student working to QA record matching. </li></ul><ul><li>Adele: Person doing authority work during collection processing. </li></ul><ul><li>Lenny: Likes linked data, and wants to be able to mine the links that have been established programmatically. </li></ul>Personas are fictional characters created to represent the different user types within a targeted demographic, attitude and/or behavior set that might use a site, brand or product in a similar way. http://en.wikipedia.org/wiki/Persona_(marketing)