CASPAR Project

CASPAR Framework and
Lessons Learned
David Giaretta
Overview
• CASPAR
• OAIS
• Threats and Solutions
• Validation
CASPAR Project
EU FP6 Integrated Project
Total spend approx. 16MEuro (8.8 MEuro from EU)
http://www.casparpreserves.eu
Digital Preservation
• Ensure that digitally encoded information
is understandable and usable over the
long term
– Long term could start at just a few years
• Easy to make claims
– Difficult to provide proof
• Reference Model for Open Archival
Information System (ISO 14721)
– The basic standard for work in digital preservation
– Defines terminology and compliance criteria
Information Model & Representation Information
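The Information Model is key
• An Information Object is a Data Object interpreted using 1+ Representation Information
• Representation Information is itself interpreted using further Representation Information
• A Data Object is either a Physical Object or a Digital Object; a Digital Object consists of 1+ Bit Sequences
• Recursion ends at the KNOWLEDGEBASE of the DESIGNATED COMMUNITY (this knowledge will change over time and region)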
Basic concept of CASPAR
• Digital preservation had been dominated by
libraries and (state) archives
• However there was a focus there on
“rendered objects” and “metadata”
• Tendency to think data is an “easy” add-on
HOWEVER
• Need to deal with DATA – processed to new
things, not just rendered
• Need to follow OAIS – finer grained view
• Need to test and prove that things work
Preservation Strategies
• Emulation
• Access software
• Migration
– Transformation
• Description techniques
Data…
Level 2 GOME Satellite
instrument data
Contains numbers – need meaning
...to process to this
...or this
...through complex processing schemes
Just Format?
sfqsftfoubujpo jogpsnbujpo svmft
You have a file
JHOVE tells you it is WORD version 7
..with some extra information..
representation information rules
Format Registries – useful but not enough: formats can be
used for multiple purposes e.g. audio files used to store
configuration parameters
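The scrambled line on the "Just Format?" slide appears to be the later phrase with every letter moved one place forward in the alphabet. A minimal Python sketch, assuming that shift-by-one rule is the missing Representation Information (the function name `shift_letters` is illustrative and not part of CASPAR):

```python
def shift_letters(text: str, offset: int) -> str:
    """Shift each lowercase ASCII letter by `offset` places, wrapping around the alphabet."""
    out = []
    for ch in text:
        if "a" <= ch <= "z":
            out.append(chr((ord(ch) - ord("a") + offset) % 26 + ord("a")))
        else:
            out.append(ch)  # leave spaces and other characters untouched
    return "".join(out)

scrambled = "sfqsftfoubujpo jogpsnbujpo svmft"
# Assumed rule: the readable text was shifted forward by one, so shift back by one.
decoded = shift_letters(scrambled, -1)
print(decoded)
```

Knowing the file format does not supply this rule; without it the bits are intact but the information is lost.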
Examples (cont)
• “504b0304140000000800f696….”
• “This is a ZIP file which contains Word
files, each of which contains an
encoded message which needs the key
‘!D$G^AJU*KI’ to decode it using
encryption method SHA7”
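The hex string in the first bullet begins with the bytes 50 4b 03 04, the local file header signature of the ZIP format. A small Python sketch of signature-based identification shows how little that tells you. The `SIGNATURES` table and the `identify` helper are illustrative assumptions, not the JHOVE API:

```python
# Illustrative signature table; real format registries hold far more entries.
SIGNATURES = {
    bytes.fromhex("504b0304"): "ZIP archive",
    b"%PDF": "PDF document",
}

def identify(data: bytes) -> str:
    """Return a format name based only on the leading bytes."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return "unknown"

# First bytes of the example on the slide (the rest is elided there).
sample = bytes.fromhex("504b0304140000000800f696")
print(identify(sample))
```

The result names the container only. The Word files inside it, the key and the decoding method still have to be recorded as Representation Information.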
Examples (cont)
• LaTeX file containing an EPS
(Encapsulated PostScript) version of an
image
• Web page containing Java Applet
generating random numbers
• SWISS-PROT data
• Foreign Language emails
XML enough? – can stare at this and probably understand it
<family>
<father>John</father>
<mother>Mary</mother>
<son>Paul</son>
</family>
..but what about this?
<VOTABLE version="1.1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.ivoa.net/xml/VOTable/v1.1 http://www.ivoa.net/xml/VOTable/v1.1"
xmlns="http://www.ivoa.net/xml/VOTable/v1.1">
<RESOURCE>
<TABLE name="6dfgs_E7_subset" nrows="875">
<PARAM arraysize="*" datatype="char" name="Original Source" value="http://wwwwfau.roe.ac.uk/6dFGS/6dfgs_E7.fld.gz">
<DESCRIPTION>URL of data file used to create this table.</DESCRIPTION>
</PARAM>
<PARAM arraysize="*" datatype="char" name="Comment" value="Cut down 6dfGS dataset for TOPCAT demo usage."/>
<FIELD arraysize="15" datatype="char" name="TARGET">
<DESCRIPTION>Target name</DESCRIPTION>
</FIELD>
<FIELD arraysize="11" datatype="char" name="DEC" unit="DMS">
<DATA>
<FITS>
<STREAM encoding='base64'>
U0lNUExFICA9ICAgICAgICAgICAgICAgICAgICBUIC8gU3RhbmRhcmQgRklUUyBm
b3JtYXQgICAgICAgICAgICAgICAgICAgICAgICAgICBCSVRQSVggID0gICAgICAg
ICAgICAgICAgICAgIDggLyBDaGFyYWN0ZXIgZGF0YSAgICAgICAgICAgICAgICAg
ICAgICAgICAgICAgICAgIE5BWElTICAgPSAgICAgICAgICAgICAgICAgICAgMCAv
IE5vIGltYWdlLCBqdXN0IGV4dGVuc2lvbnMgICAgICAgICAgICAgICAgICAgICAg
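The STREAM content above is declared as base64. Decoding just its first line with Python's standard library shows the start of a FITS header, which in turn needs the FITS standard to interpret. This is a sketch on the one line copied from the slide, not a VOTable parser:

```python
import base64

# First line of the base64 STREAM shown on the slide.
first_line = "U0lNUExFICA9ICAgICAgICAgICAgICAgICAgICBUIC8gU3RhbmRhcmQgRklUUyBm"

header_start = base64.b64decode(first_line)
print(header_start.decode("ascii"))
```

Each layer (XML, VOTable, base64, FITS) needs its own Representation Information before the numbers in the table mean anything.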
Representation Information
The Information Model is key
Recursion ends at the KNOWLEDGEBASE of the DESIGNATED COMMUNITY
(this knowledge will change over time and region)
Representation Information Network
[Diagram labels: RepInfo, Virtualisation, /DISCIPLINE]
Modules and Dependencies: defining the Designated Community
[Diagram: network of Representation Information modules and their dependencies, with these nodes]
• README.txt, TEXT EDITOR, ENGLISH LANGUAGE, WINDOWS XP
• FITS FILE, FITS DICTIONARY, FITS STANDARD, FITS JAVA s/w
• PDF STANDARD, PDF s/w
• DICTIONARY SPECIFICATION, XML SPECIFICATION, UNICODE SPECIFICATION, JAVA VM
• MULTIMEDIA PERFORMANCE DATA: C3D 3D motion data files, 3D scene data files, DirectX, MAX/MSP, motion to music mapping strategy
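[Diagram: the OAIS Archival Information Package]
• The Archival Information Package is described by a Package Description and delimited by Packaging Information
• Content Information is further described by Reference, Provenance, Context, Fixity and Access Rights information
• The Data Object (a Physical Object, or a Digital Object made of Bits) is interpreted using Representation Information (1...*), which includes Structure and Other information and is itself interpreted using further Representation Information
• Relationship labels also shown: derived from, adds meaning to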
Cost sharing
DRM
Preservable
infrastructure
USE DATA
• Use application to find data in
Repository
• Create DIP with enough RepInfo for the
user (via DC profile)
• Obtain more RepInfo from Registry if
necessary
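The steps above can be sketched as a walk over a Representation Information dependency network that stops wherever the Designated Community already has the knowledge. In this Python sketch the module names come from the network slide, but the dependency edges, the `DEPENDS_ON` table, the profile contents and the function name `repinfo_needed` are assumptions made for illustration. They are not the CASPAR Registry or Knowledge Gap Manager interfaces.

```python
# Hypothetical dependency edges between RepInfo modules (assumed for illustration).
DEPENDS_ON = {
    "FITS FILE": ["FITS STANDARD", "FITS DICTIONARY", "FITS JAVA s/w"],
    "FITS STANDARD": ["PDF STANDARD"],
    "PDF STANDARD": ["PDF s/w"],
    "FITS DICTIONARY": ["DICTIONARY SPECIFICATION"],
    "DICTIONARY SPECIFICATION": ["XML SPECIFICATION"],
    "XML SPECIFICATION": ["UNICODE SPECIFICATION"],
    "FITS JAVA s/w": ["JAVA VM"],
}

def repinfo_needed(obj: str, dc_profile: set[str]) -> list[str]:
    """List the RepInfo modules a user needs for `obj`, in discovery order.

    The walk stops at modules already in the Designated Community
    profile, i.e. at that community's knowledge base.
    """
    needed: list[str] = []
    seen: set[str] = set()
    queue = list(DEPENDS_ON.get(obj, []))
    while queue:
        module = queue.pop(0)
        if module in seen or module in dc_profile:
            continue  # already handled, or already understood by the community
        seen.add(module)
        needed.append(module)
        queue.extend(DEPENDS_ON.get(module, []))
    return needed

# Assumed profile: a community that already knows FITS, XML and runs Java.
astronomers = {"FITS STANDARD", "JAVA VM", "XML SPECIFICATION"}
gap = repinfo_needed("FITS FILE", astronomers)
print(gap)
```

A DIP for this user would carry the modules in `gap`; anything not held locally would be requested from the Registry. A community with an empty profile would need every module in the network.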
Threat / Requirement for solution / CASPAR solution
• Threat: Users may be unable to understand or use the data e.g. the semantics, format, processes or algorithms involved
– Requirement for solution: Ability to create and maintain adequate Representation Information
– CASPAR solution: RepInfo toolkit, Packager and Registry – to create and store Representation Information. In addition the Orchestration Manager and Knowledge Gap Manager help to ensure that the RepInfo is adequate.
• Threat: Non-maintainability of essential hardware, software or support environment may make the information inaccessible
– Requirement for solution: Ability to share information about the availability of hardware and software and their replacements/substitutes
– CASPAR solution: Registry and Orchestration Manager to exchange information about the obsolescence of hardware and software, amongst other changes. The Representation Information will include such things as software source code and emulators.
• Threat: The chain of evidence may be lost and there may be lack of certainty of provenance or authenticity
– Requirement for solution: Ability to bring together evidence from diverse sources about the Authenticity of a digital object
– CASPAR solution: Authenticity toolkit will allow one to capture evidence from many sources which may be used to judge Authenticity.
• Threat: Access and use restrictions may make it difficult to reuse data, or alternatively may not be respected in future
– Requirement for solution: Ability to deal with Digital Rights correctly in a changing and evolving environment
– CASPAR solution: Digital Rights and Access Rights tools allow one to virtualise and preserve the DRM and Access Rights information which exist at the time the Content Information is submitted for preservation.
• Threat: Loss of ability to identify the location of data
– Requirement for solution: An ID resolver which is really persistent
– CASPAR solution: Persistent Identifier system: such a system will allow objects to be located over time.
• Threat: The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future
– Requirement for solution: Brokering of organisations to hold data and the ability to package together the information needed to transfer information between organisations ready for long term preservation
– CASPAR solution: Orchestration Manager will, amongst other things, allow the exchange of information about datasets which need to be passed from one curator to another.
• Threat: The ones we trust to look after the digital holdings may let us down
– Requirement for solution: Certification process so that one can have confidence about whom to trust to preserve data holdings over the long term
– CASPAR solution: The Audit and Certification standard to which CASPAR has contributed will allow a certification process to be set up.
Accelerated Lifetime tests
As part of the validation the CASPAR
testbeds simulated the following:
• hardware changes
• software changes
• changes in the environment
(including legal framework)
• changes to the knowledge bases of
the Designated Communities
Test scenarios vs Threats to digital preservation
Test scenarios (columns): STFC, ESA, UNESCO, IRCAM, UnivLeeds, CIANT, INA
Threats (rows):
• Users may be unable to understand or use the data e.g. the semantics, format, processes or algorithms involved
• Non-maintainability of essential hardware, software or support environment may make the information inaccessible
• The chain of evidence may be lost and there may be lack of certainty of provenance or authenticity
• Access and use restrictions may make it difficult to reuse data, or alternatively may not be respected in future
• The current custodian of the data, whether an organisation or project, may cease to exist at some point in the future
[Table cells: tick marks per test scenario, not recoverable]
STFC Testbed – various STP data
ESA testbed
UNESCO testbed
The Villa Livia dataset is a collection of files used within the "virtual museum of
the ancient Via Flaminia" project: a 3D reconstruction of several archaeological
sites along the ancient Via Flaminia, the largest of them being Villa Livia.
This is an elevation grid (height map) of the area where Villa Livia is located.
It is an ASCII file in the ESRI GRID file format.
Contemporary Art Testbed
Performance Viewer: side-by-side comparison and validation of the transformation. From left to
right: 3D visualization in Ogre3D, 3D model of the stage including the virtual dancer in VRML.
Figure 8 Some aspects of acousmatic production
CASPAR Validation
• In all cases members of the Designated
Community, with appropriate changes
to mimic changes over time, verified
that the metadata was adequate for the
use despite simulated changes of
hardware, software, environment and
Designated Community over time.
• Full details are available in the
validation report (CASPAR Validation
report, 2009)
Links
• CASPAR – http://www.casparpreserves.eu
• CASPAR source code – http://sourceforge.net/projects/digitalpreserve/
• OAIS Reference Model – http://public.ccsds.org/publications/archive/650x0b1.pdf
and the updated draft is available from
http://public.ccsds.org/sites/cwe/rids/Lists/CCSDS%206500P11/Overview.aspx
• CASPAR Validation report –
http://www.casparpreserves.eu/Members/cclrc/Deliverables/casparvalidation-evaluation-report/at_download/file
• PARSE.Insight – www.parse-insight.eu
• Alliance for Permanent Access – www.alliancepermanentaccess.eu
• Digital Curation Centre – www.dcc.ac.uk
FUTURE
• Users may be unable to understand or use the data e.g. the
semantics, format, processes or algorithms involved
• Non-maintainability of essential hardware, software or support
environment may make the information inaccessible
• The chain of evidence may be lost and there may be lack of
certainty of provenance or authenticity
• Access and use restrictions may not be respected in the future
• Loss of ability to identify the location of data
• The current custodian of the data, whether an organisation or
project, may cease to exist at some point in the future
• The ones we trust to look after the digital holdings may let us
down
END