Analytical Access to the Domain Dark Archive

Developing new forms of access to a dark archive of UK websites (1996-2010). Funded by the JISC, led by the Institute of Historical Research in partnership with the British Library and the University of Cambridge.

Friday, 9 May 2014

In another of our series of researchers' final reports, I am posting a link to a PDF of a talk given by Martin Gorsky of the London School of Hygiene and Tropical Medicine at the recent European Social Science History Conference in Vienna. Martin goes into plenty of detail here about how he used the search interface to the Dark Archive to research public health in local government in England.

Tuesday, 12 November 2013

This is the third in our series of final reports by the AADDA project researchers, posted with their permission. This one is by Dr Carole Taylor, a researcher at the House of Lords:

I. Research Background and Methodology

My historical expertise lies in early Georgian music, art and politics and was not obviously suited to the Domain Dark Archive focus on UK websites extant between 1996 and 2010. However, my work as Research and Parliamentary Assistant to peers in the House of Lords seemed a more promising fit. I discussed this with colleagues in the Lords who immediately recognised the potential value of the web archive for MPs and Peers with “a range of policy interests which will map onto those of academic researchers”. With the particular encouragement and advice of Dr Elizabeth Hallam Smith (Director of Information Services and Librarian, House of Lords Library) I identified political engagement as an area of obvious interest to parliamentarians, as well as a theme noted by Peter Webster during the 13 June 2012 seminar as a category, among others, that lent itself well to web archive research of this kind, and wrote up a proposal.

I undertook an intensive period of research to familiarise myself with the present state of serious research on political engagement in the UK in order to identify a manageable research exercise to take to the AADDA interface. I was advised by several academic colleagues, particularly two PhD students in the Department of Government at the University of Essex, one of the two main centres (together with the University of Lancaster) of political engagement studies in the UK. I was also assisted in this information gathering exercise by a Senior Researcher at the House of Lords Library where serious efforts are made to understand how parliamentarians are listening to and engaging with the public.

In advance of our access to the AADDA I presented a scaled-down version of my research proposal to the IHR/BL team in March. I suggested a focus on social media forums used by parliamentarians, particularly the House of Lords blog, launched in 2008. The House of Lords was the first parliamentary chamber in the world to set up a bipartisan blog which makes it a compelling example in the history of political engagement. Disappointingly, in the ensuing months leading up to our encounter with the AADDA dataset, I learned that social media sites with the exception of .co.uk would not be included in the dataset, which meant my topic was no longer viable. I re-thought the proposal and decided on a very narrow, entirely new subject that felt manageable to complete within the parameters of the consultation – Heathrow’s Third Runway. In February 2013 I had a meeting with Jane Winters and Jonathan Blaney at the IHR to confirm this third version of the research proposal was acceptable (it was).

Thus I was on track for the purposes of the consultation. However, with such limited access to social media sites the value of this exercise for serious researchers at Parliament was considerably eroded. Even the significance of the results below on the “Third Runway” was questioned, albeit sympathetically, by parliamentarians who cautioned that I appeared to be accessing information that is already well-known to parliamentary researchers. Their interest is obviously about what this resource can offer over and above what they know already. It may be that areas that did not receive such widespread public airing (such as the Third Runway did) will deliver better results.

II. Research Results:

1st session: March 2013

I questioned the interface three times.

“third runway” – 171 items;

“third runway” AND “parliament” – 71 items;

“third runway” AND “heathrow” AND “parliament” – 69 items

Yield:

a lot of travel companies;

.gov.uk (5 items) – entirely predictable;

public suffixes are important for investigating engagement, but I didn’t readily grasp how best to link the left and right sides of the search results page

Questions arising:
How many of the 100 items that were dropped between the first two searches might have included useful information? In this respect, I agree with GM at the 21 March 2013 meeting who said “there needs to be a ‘search within’ option, for when there are many thousands of results.” PW’s response that “in such cases adding more search terms should have the same effect” is helpful to reduce “many thousands” to a couple of hundred; however, at this point I might not want to lose potentially useful information in the course of adding a new search term.

What about people who are undecided or don’t express their views? An obvious but important qualitative question for historical researchers.
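The first question above (which items disappear when a search term is added) can be sketched as a set difference. This is only an illustration: the interface offered no export at the time, so the item identifiers and the function name `dropped_by_refinement` are hypothetical.

```python
def dropped_by_refinement(broad_ids, narrow_ids):
    """Return identifiers present in the broad search but absent from the narrower one."""
    return sorted(set(broad_ids) - set(narrow_ids))

# Hypothetical identifiers standing in for the items returned by two searches.
broad = ["item-a", "item-b", "item-c", "item-d"]   # e.g. "third runway"
narrow = ["item-b", "item-d"]                      # e.g. "third runway" AND "parliament"

dropped = dropped_by_refinement(broad, narrow)
print(dropped)
```

Reviewing the dropped list, rather than only the narrowed one, is one way of not losing potentially useful items when a term is added.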

Suggestions:
It would be a great help if a preview screen were available to the right of each item. Through all my searches (in March and September), I clicked on countless items that were duplicates of what I’d clicked on two or three items earlier. (Titles of items often differ, so titles alone are not a dependable indicator.)

At the March meeting I asked Peter and Andrew how to turn the search data into ngrams; the answer was that AADDA will have a “click to create ngram” function – not there yet: would be a great help

2nd session: September 2013

“third runway” AND “soley” – 122 items: Lord Soley was Chairman of “Future Heathrow”, the pro-expansion group; among the 122 items was an interview (helpful, though it appeared twice); the first 21 items were all the same and most were inaccessible or gobbledygook (cooking recipes); several references had no mention of Soley or the third runway at all, e.g. travel sites.

“third runway” AND “house of lords” – 206 items; and “third runway” AND “aviation” – 2000+. For both of these, I checked the two extreme ends of sentiment analysis (“very positive” and “very negative”). Many of these items failed to link the two search filters in any way. Nearly all of the 206 items in the first set were from the BBC – this was not only a problem of repetition (though there was plenty of this), but these are also widely public documents of little use to parliamentarians (who are already well-equipped with knowledge at this level).

I checked “third runway” AND “Howard Davies” on the off chance he was mentioned in this connection before he became Chairman of the Airports Commission in 2008 – eight items, all identical (a pdf report of the Association of British Insurers that had no mention of “third runway” or Davies) – disappointing!

Also checked “third runway” AND “future of aviation”; “third runway” AND “environment”; “third runway” AND “economy” – no new observations.

Suggestions:
It would be a great help if we could print the page with search results (or somehow export this material).

Questions and Concerns:
Clearly in this September round of questioning the dataset I was encountering problems with the Boolean AND search that didn’t arise in March. At best I seemed in September to be accessing OR rather than AND; at worst there was no connection to either search filter. I corresponded with Richard Deswarte about this and he could not see where the problem lay and I have no idea what the problem was either.
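One way to check whether a result set is behaving like AND or like OR is to classify each returned item by whether its text contains both phrases, only one, or neither. The following is a minimal sketch, assuming the text of each item can be obtained; the `results` list and the function name are hypothetical and nothing here reflects how the AADDA interface itself works.

```python
def classify_hits(texts, phrase_a, phrase_b):
    """Count items containing both phrases, exactly one, or neither (case-insensitive)."""
    counts = {"both": 0, "one": 0, "neither": 0}
    a, b = phrase_a.lower(), phrase_b.lower()
    for text in texts:
        lowered = text.lower()
        found = (a in lowered) + (b in lowered)
        key = "both" if found == 2 else "one" if found == 1 else "neither"
        counts[key] += 1
    return counts

# Hypothetical item texts standing in for search results.
results = [
    "Lord Soley spoke in favour of a third runway at Heathrow.",
    "Cheap flights and hotel deals.",
    "Soley interview on aviation policy.",
]
summary = classify_hits(results, "third runway", "soley")
print(summary)
```

A true AND search should leave the "one" and "neither" counts at zero; a high "one" count would be consistent with OR behaviour.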

Sentiment Analysis, where it hits items to do with the search term(s), was at least consistent and might therefore be of interest in early stages of research.

Repetition: This is my biggest concern about keyword searching: does the repetition of material from one crawl to the next render the totals listed on the search results meaningless for the historian? And will this problem be multiplied by 200 when the entire dataset is available? Peter cautioned users to avoid taking numbers of results for 2009 and 2010 as evidence of patterns in relation to the previous years; does repetition compound this problem for all years?
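To illustrate how repetition inflates totals, the sketch below counts items once per distinct content rather than once per capture, by hashing the normalised text of each item. It is an assumption-laden illustration: the record fields (`title`, `crawl`, `text`) and the sample data are hypothetical, and exact-text hashing will miss pages that changed only slightly between crawls.

```python
import hashlib

def count_unique(items):
    """Return (total, unique), where items with identical normalised text count once."""
    seen = set()
    for item in items:
        # Collapse whitespace and ignore case so trivially different captures match.
        normalised = " ".join(item["text"].split()).lower()
        seen.add(hashlib.sha256(normalised.encode("utf-8")).hexdigest())
    return len(items), len(seen)

# Hypothetical records: the same page captured in three crawls under different titles.
hits = [
    {"title": "Heathrow expansion", "crawl": "2008-07", "text": "Third runway  approved."},
    {"title": "Heathrow news", "crawl": "2008-10", "text": "Third runway approved."},
    {"title": "Runway", "crawl": "2009-01", "text": "third runway approved."},
    {"title": "Opposition", "crawl": "2009-01", "text": "Residents oppose the third runway."},
]
total, unique = count_unique(hits)
print(total, unique)
```

Comparing the two numbers gives a rough sense of how far a headline result count overstates the amount of distinct material.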

III. Concluding Remarks

The Digital History and Archives seminar presented by Peter Webster and Richard Deswarte at the IHR on 23 September 2013 was an invaluable guide to my second round of searches on the interface: http://historyspot.org.uk/podcasts/digital-history/web-archives-new-class-pr - click on “Web Archives: A New Class of Primary Source for Historians?” I’d particularly like to highlight Peter’s observation that the traditional separation of historian and keeper of archives no longer holds in digitized systems of this kind. During the Q&A Tim Hitchcock expanded on this point, remarking that models of society now being digitized – newspapers, etc – were of course not digitized at the time. These changes demand a new skillset now being shaped by and for C21 historians. To this I would add that scholars will have questions about subjects they know well and subjects they are addressing for the first time, and this fact needs also to be built into the process of curating datasets of this kind – particularly in the present, pioneering stage of digital research.

Wednesday, 23 October 2013

This is the second in our series of final reports by the AADDA project researchers, posted with their permission. This one is by Saskia Huc-Hepher:

AADDA Testing Report:

The French Community in London

by Saskia Huc-Hepher

1 – Methodology

The initial purpose of this research was two-fold: firstly, to use the
geo-indexing tool to map out the areas of London with the greatest
concentrations of French inhabitants on the basis of the post-codes associated
with 'French' Web sites / spaces; and, secondly, to identify French community
websites in the Domain Dark Archive (DDA) appropriate for subsequent multimodal
analysis on the basis of their visual and textual meaning potentialities. The
ultimate objective of the former was to triangulate the findings of additional
empirical research conducted within the framework of my PhD, which sought to
ascertain the actual numbers and hot-spots of the London French community,
thereby serving to dispel the exclusively, or at least predominantly, South
Kensington myth. The aim of the latter, meanwhile, was to
scrutinise the visual landscape of the London French over the period of the DDA
data set, as (re)presented through the images – still or moving, in parallel to
the technological advances of the Internet – displayed on the French community
websites found in the DDA. It was envisaged that this historical visual data
would provide the study with greater temporal contextualisation and depth, and,
using social semiotic theory, in particular multimodality, would allow meaning
to be inferred and ethnographic conclusions drawn from the images, on such
subjects as the community's sense of belonging; how they perceive and conceive
London and its inhabitants; how they (re)present and define their own identity
through images; what elements of France and Frenchness they portray and
promote; and whether any of these have changed over time.

Similarly, it was hoped that the geo-indexing analysis would be of historical value, determining
whether or not there was any relationship between the areas most associated
with the London French today and those districts favoured in previous waves of
migration to the capital.

The final objective of the DDA research proposed here was to use the
image-tagging analytical tool to search for photographs or images only, using a
word or combination of words such as 'French' and 'London'. The visual data
could then potentially serve to triangulate the findings of the geo-indexing
investigation, in that the images and spaces associated with key words such as
'London', or specific areas within London, might coincide with the places and
spaces identified as being particularly French through the geo-indexing process
and/or historically. This micro-investigation was therefore to be binary in its
objectives: visual data for ethnosemiotic analysis and geo-indexing data for
triangulation of previous qualitative research.

The methodology outlined above was adopted on several occasions over the course of the AADDA
project time-span: firstly in March 2013, later in August 2013 and September
2013, with a final trial, using the most functional interface and comprehensive
data set, in October 2013. The results, at every stage, however, were
disappointing.

2 – Deep Search Data Testing

March 2013

The first trial session was carried out in the knowledge
that at that point in time the DDA included only a random subset of the entire
cohort of data, but one which was evenly spread over the archive in temporal
terms. Therefore, in theory, trends, developments and patterns should have been
identifiable, despite sentiment analysis and geographic options not being
available at that stage. In practice, however, a number of basic search hurdles
prevented any valuable findings from materialising. These included:

the lack of clarity regarding the need to click on the crawl date to access a
website; choosing the website title would have been more intuitive. Such
functionality was updated at the subsequent meeting (21/03/2013);

the lack of clarity regarding the purpose of the bar charts at the top of the
page; they have since been removed;

the fact that not all web captures functioned at that time – e.g. Le Petit
Parisien restaurant had no images and almost no text (but enabled me
to do a current Google search for the website, only to find out that the
restaurant – and website – is now closed; this is therefore an example of
the potential historical worth of the DDA, had it been operating
correctly, in allowing the analysis of obsolete Websites);

some websites cited in the list of 'hits' subsequently being found to be
unavailable; the links to alternative sites proved to be useful, however;

time being wasted revisiting Websites which had already been scrutinised. Once a
site has been viewed, it would be helpful and more time-efficient if the
visited link appeared in a different colour (e.g. purple, cf. Google) from
the others on the list;

the fact that search tools operated extremely slowly and the interface was not
yet user-friendly. Speeds and appearance have since improved and the
latter is no doubt a work in progress;

http://web.archive.org/web/20080601000000*/http://www.guardian.co.uk/world/2008/jul/12/france.islam
Here, every separate date in the July (burka scandal) peak (as well as all
the other dates in August and October 2008, the two snapshots available
from 2009 and the single one from 2012) showed the same snapshot
from The Guardian (12 July 2008). If the online material is
unchanged in relation to another date, this should be immediately visible
on the list of data (possibly via colour coding, as suggested for the
pre-visited Web pages, or grouping by content & date);

the majority of search results not being particularly useful for my purposes;
they were either not relevant (for instance displaying large numbers of
Websites related to French tourism for English users) or not
French-specific (that is, 'Londres' retrieved results in Portuguese,
Spanish, etc., not French exclusively; while English search words
retrieved sites aimed at Francophiles as opposed to Francophones);

phrase searching using the “double inverted commas” being equally disappointing
(nothing of relevance was found following a search for “French community
London”, or indeed '“French” and “community”', trialled at a later stage);
“French London” was therefore tested, resulting in a list of sites
relating to French teachers & jobs in London.

Conversely, it was useful to have the 'media' / 'pdf' search options at the
bottom of the screen, as this enabled access to images and audio 'texts' (of
relevance to the multimodal methodological / theoretical approach taken in my
research).

Overall, the initial testing was found to be useful in
assessing the lasting impact, or otherwise, of the French community on London,
in a temporally comparative manner. That is, by identifying French restaurants/cafés/businesses
through their retrospective on-line presence before submitting the titles to a
live Google search at the time of testing, I was able to discover if such
enterprises were growing, in decline or defunct. Whilst that limited use was of
potential value to my research in assessing the lasting contribution of French
businesses to London's cultural and economic landscape, I was nevertheless
acutely aware (given my curation of the London French Special Collection for
the UK Web Archive) of the mass of relevant data – such as community websites
and blogs – which had not been detected or listed as featuring in the DDA. It
was hoped at the time that this was due to the incomplete and arbitrary state
of the data set.

August 2013

This trial was more successful than the last as regards the
speed and efficiency of the data search tools, despite there still being only a
five per cent random, if temporally representative, sample of websites
available. Somewhat paradoxically, those searches which pinpointed the early
years of Internet use, namely 1996 and 1997, proved to be the most valuable.
Several different searches were tested on this occasion, as follows:

a) A search for the terms “French
community” was filtered by language, using the “French” option. This
functionality was found to be extremely useful in reducing the large amount of
irrelevant data to a more manageable subset. Again, by filtering further, this
time by year (in this case 1996 and 1997), I was able to focus in on yet more pertinent
Web pages. Thus, when I began to analyse the <Associations Françaises>
site, I noted that the landing page directed the visitor to separate sites, one
for French expatriates and one for Belgians. Not only are these sites an
indication of the relative establishment of the said Francophone communities in
the UK, each warranting an on-line home for the long list of associations set
up in the country of residence, but the fact that a distinction is made between
Belgian and Franco-French populations has implications regarding identity.

Using the same search terms, another site <Les
Grenouilles Cablées>, harvested in 1996, proved worthy of an initial
analysis. Firstly, the landing page pointed the visitor in the direction of
three separate sub-sections: <Grenouilles du monde>, <Grenouilles des
USA> and <Grenouilles de Californie>. These distinctions suggest that
either the French expatriate community was more significant in the USA than
elsewhere at that time (including London, which is no longer the case and
perhaps related to the opening of European borders) or that US residents,
including French ones, were earlier adopters of Internet technology than in the
UK. When examining the site more closely and entering the
<Grenouilles du monde> space, it was telling that the
first choice was then <Nouvelles de France> (before the hyperlink to
Quebec), which suggests that this website is indeed aimed at the French expat
diaspora worldwide, linked together by their shared affinity to France, and
keen to maintain links with the homeland. Further, when choosing the French
news link, the selection of newspapers available was a left-leaning one. Again,
the possible implications of this are two-fold: either the political leanings
of the newspapers featured are an indication of the papers' social commitment,
i.e. making information freely available to all, or they are an indication of
the profile of the diaspora visiting on-line sites at that time, i.e. Libération
and Charlie Hebdo both target a young, left-wing readership. If this is
the case, it is thus a profile at odds with the predominantly right-wing
(particularly at that time) expat community of the South Kensington stereotype,
which serves to substantiate the hypothesis posited at the beginning of this
report. There are also hyperlinks to <Météo France> (suggestive of a need
for a physical sense of proximity to the homeland, despite the geographical
distance separating the community from it) and to <Les dernières nouvelles
d'Alsace> and <Pariscope>, both of which could be indicative of a longing
for insignificant local minutiae in the globalised age, made possible through
the worldwide Web, as well as pointing towards greater emigration from eastern
France (and Belgium, as confirmed by the first website) and the French capital
than other geographical zones.

This site offers links to French audiovisual sites
including radio and TV and, perhaps more importantly for my research, to two
on-line fora, <French Talk> and <Francopolis> which are evidence of
the formation of both Internet and French communities (despite other
empirical evidence suggesting that the French community per se does not exist,
or, if it exists at all, does so in South Kensington alone). Finally, this website creator's
recommended sites are telling in terms of identity (just as a Blog would be
today in its related networks) especially within the theoretical framework of
Pierre Bourdieu's Habitus, with the Vatican, Charlie Hebdo, the RATP (equivalent to TfL in London) and various
French sports sites (football, Formula 1 and rugby) featuring among others.

Another site displayed following this search was the
<Association des Francophones de Cranfield> in which advice is provided
on low-cost means of transport to France and Belgium. This in itself
demonstrates that the target audience are medium- to long-term French residents
of the UK, rather than short-term visitors, and that they have been attracted
to England by its (Higher) education system – a point which, as incongruous as
it may appear, is corroborated by the qualitative data gathered outside the AADDA
project.

b) The second search undertaken in the
August trial was “London French” by “content type”, notably “image”. This was
highly disappointing and of little use given that the few images which were
displayed related to French football or simply contained a set of codes, with
no discernible image.

c) To counter the insufficiency of the
image search above, a “format search” was instead chosen from the AADDA
homepage. This was more successful in terms of number, with some 6,369 items
listed for the “French London + format” search trialled, filtered by year
(2006). However, given that the images were not tagged and stood in complete
isolation, their usefulness was questionable, as many appeared to relate not to
the French community in London but to websites on French property or to
university Webpages.

d) This search attempted to assess the
value of the post-code filter, which initially was again rather disappointing.
Given the lack of pertinence of the majority of the sites identified after the
early years (1996, 1997), their related post-codes were of equal irrelevance.
Furthermore, there were no apparent clusters of London websites, with many
coming from outside London; no
micro-geographical/demographic conclusions could therefore be drawn. A
subsequent search (“French community” filtered by language and year), despite
listing only one Website, revealed two potentially telling post-codes, N7 and
NW5, for 2010, which could have been related to the forthcoming opening of a
new French State school in Kentish Town (NW5) (but the insignificant numbers
involved are again inconclusive).
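The kind of clustering hoped for in item (d) amounts to tallying post-code districts across the retrieved sites. A minimal sketch follows, assuming full UK post-codes are available for each site; the sample values and the helper name `outward_code` are hypothetical and are not drawn from the DDA.

```python
from collections import Counter

def outward_code(postcode):
    """Return the district part of a full UK post-code, e.g. 'NW5' from 'NW5 2AB'."""
    compact = postcode.replace(" ", "").upper()
    # The inward part of a full UK post-code is always its final three characters.
    return compact[:-3]

# Hypothetical post-codes associated with retrieved Websites.
postcodes = ["NW5 2AB", "n7 8xx", "NW52CD", "SW7 2DG"]
tally = Counter(outward_code(p) for p in postcodes)
print(tally.most_common())
```

With only a handful of sites per district, as in the N7 and NW5 case, such a tally remains suggestive rather than conclusive.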

e) A search for “communauté française”, filtered by year (2001) and
language (French) identified a Blog, which would have been of particular
pertinence to my research. However, it transpired that the said Blog was the
work of an English-speaker, practising their written French, rather than a
French Londoner's Blog. The lack of Blogs retrieved by the DDA search engine
was perplexing, as many are known to me within the framework of my UK Web
Archive Special Collection work. The question of whether this is due to the
domains favoured by the London French Bloggers as hosts for their
autobiographical logs is therefore worth consideration, and if so, the
possibility of accessing them through the DDA should also be contemplated.

f) The same search as in item (e), this time written in and filtered by
the English language for the year 2010, found only one Website, the <Ile aux
enfants> school in North London. Despite the unexpected limitedness of the
search results in this case, the “links to host” tool was telling, particularly
in terms of “mapping the field” and Bourdieu's “three-stage analysis” paradigm.
That is, by scrutinising the – predominantly institutional – list of Websites
linked to the <Ile aux enfants>, such as <ambafrance>,
<assemblee-afe>, <bienvenuealondres> and <edufrance>,
socio-cultural assessments were facilitated. Nevertheless, it was frustrating
that these links to the host site were not functioning during the trial,
directing the visitor back to the host page as opposed to opening the linked
Webpage itself. It was not clear, therefore, whether their inclusion was
exclusively for quantitative analysis (the number of visits was in brackets),
as they were of no qualitative worth without access to the content of the
linked Websites.

September 2013

The most notable and satisfying difference between this trial and the
preceding ones was that all the links to related Websites were at least
partially, and in the great majority of cases completely, successful. This
meant that the discovery of one website (from a long list of still relatively
futile others), namely the “Londoscope” reference pages of the
<www.acticours.freeserve.co.uk> site, proved to be invaluable through its
hyperlinks, as opposed to the content of the site itself. Thus, several
pertinent results were attained, as detailed below:

a) The appearance of London French social-networking-type pages, known as
<Londoscope>, is perhaps indicative of the growing numbers of French
Londoners seeking a physical sense of community by means of digital linking and
dissemination mechanisms. Entries such as “Eglise protestante française de
Londres: Soirée anti-stress” and the enumeration of French films on show at the
Ciné Lumière and the NFT, together with other French cultural events at the
Institute of Contemporary Arts, bear witness to the importance of French culture
to London's overall cultural capital and are also evidence of community
belonging in practice.

b) The <Londoscope> pages from 2003 enabled the identification of a
culturally and historically pertinent French amateur dramatics group which has
been performing in London since 1929: Le Cercle dramatique français (CDF). My
research into this amateur theatre company can now be taken forward in an
effort to ascertain whether it is still in existence and, if so, its place in
French community life today.

c) Another link on the same Website, from 2004, referred to the Francophone
television channel TV5 celebrating its 20th anniversary and revealed
some useful viewer figures, including it being watched in 167 million
households in 2003, with some 56 million weekly viewers. This constitutes
further evidence as to the impact of the French language and culture worldwide
and potentially to the growing French diaspora.

d) The final finding of relevance during this trial session was the
<Londoscope> link to the ADFE (Association Démocratique des Français à
l'Etranger), created in 1980 'par des Français qui voulaient, pour les
représenter, une association dynamique et correspondant aux nouvelles réalités
de l'expatriation' (i.e. by French people who sought representation through a
dynamic association in tune with the new realities of expatriation). This
quotation alone is of worth for a number of reasons; firstly the notion of
'representation' itself is key, as it raises the question of 'representation to
whom?', which, reading further, it appears is to the French authorities.
This in turn indicates that the need to be politically represented in France
has its roots much further back historically than the election in 2012 of the
first ever Député for French overseas residents implies, as well as demonstrating an unwillingness to
integrate fully in the London socio-political scene and an attachment to the
homeland. Similarly, the notion of “new realities” suggests a shift from an old
form of migration to a new one, acting as a temporal forerunner to the massive
wave of cross-Channel immigration which began in the early nineties and
continues to this day. The term “dynamique” could also be seen to illustrate
the London “pull factor” for French expats living in the capital; that is, many
are arguably escaping the inertia and complacency of French institutions and
mindsets in their decision to emigrate to London, as exemplified in other forms
of empirical evidence gathered for this research. Here, therefore, the data
gathered from a single Website in the DDA has served to triangulate several key
findings in my PhD.

October 2013

Having exhausted most of the available search options during the
previous trial sessions, this was the shortest and least enlightening of all.
It was necessary, nonetheless, to conduct a final test with the most functional
interface to date and a now complete data set. The search tools also provided
an opportunity for sentiment analysis, unavailable in previous trials.

This experiment involved a phrase search for “London French community” combined with “English language”
and “very negative” sentiment filters. No results were identified. When the
French language was used and chosen as a filter, 240 matches were found, but
these were of little relevance to my research given their pedagogical focus.
One potentially valuable find for historians of the French presence in the UK
was a Website on Augustine monks, in which the flight of monks from France
during and after the French Revolution, and the creation of brotherhoods in
York (1802), Bristol (1818) and Ealing (1897), where a Benedictine monastery
was founded, were reported. However, in view of the contemporary emphasis of my
research, this proved of little relevance, once again.

Further searches, using different phrases/words, content types and language/sentiment filters
were also trialled, to no avail. Furthermore, it was disappointing to note that
the post-code and media filters appeared to have been removed, or were not
readily visible.

Overall, if not the least successful of the trials conducted to date, this was the most frustrating,
given the unfulfilled aspirations of working with the complete data set.

3 – Lessons Learnt

The lessons learnt from this exercise are as follows:

“Think small” – minimising one’s research objectives is perhaps the only way of
navigating the enormity of the data.

Maximise material – as the deep search process is akin to searching for the
proverbial needle in a haystack, any data identified as being
pertinent should be analysed immediately, or saved for subsequent analysis,
due to the apparent randomness of the retrieval process.

Use big data for its quantitative value, but not for drawing representative
conclusions or in an attempt to test large-scale hypotheses, due to the
apparent fallibility of the findings. Therefore, restrict qualitative
research to the micro-findings of those Web sites and Web pages found to
be of value – albeit somewhat arbitrarily – and optimise this data for its
comparative and preservation worth.

4 – Future research and AADDA Recommendations

As regards my own research, I intend to explore the identity / Habitus
evidence found in early Websites (1996 / 1997) in greater detail and compare it
with contemporary Blogs to establish whether the same affiliations are present
and the same sense of group, or otherwise, identity. These findings will also
be compared and triangulated with the qualitative data gathered from one-to-one
interviews with members of the contemporary French population in London. It is
also possible that I will study sample historical Websites / Webpages alongside
their contemporary equivalents, from a multimodal perspective, to gain an
understanding of how technological constraints might influence the making of
meaning to varying degrees over time.

It is unlikely that the post-code filter searches will be used to
inform my research, given the weakness of the findings, but the process was
worthwhile in its disproving of my theory, and some cautious, small-scale
conclusions could be drawn from the associations with the NW5 district.

With respect to the AADDA project looking forward, the following
recommendations have been tentatively made:

(colour?)
coding to indicate both sites already visited and replica Webpages
(identified repeatedly according to the sweep date)

The lasting impression, having carried out several trial sessions using
the DDA data and its current search tools, is that the results can present
islands of valuable resources within a sea of irrelevant material, but that the
likelihood of finding them is dictated by chance rather than design. Throughout
this testing process, I have pondered the reason for the seemingly arbitrary
nature of my AADDA findings and for my failure to access a greater amount of
material relevant to my research; that is, the question of whether my lack of
technological expertise was the cause or whether such outcomes are inherent to
searching this vast set of data has been recurrent and remains unanswered.
Instructions offering clear guidelines on the best ways to use the archive and
acknowledging its limitations would therefore be both helpful and reassuring to
researchers.

Wednesday, 16 October 2013

Our project researchers on AADDA have kindly written up the research they planned to do with the web archive, a summary of how it went and the problems that they encountered. I'll be posting these as blog posts over the next few months. Here is the first, from Helen Taylor:

AADDA Report: Sentiment Analysis and the Reception of the Liverpool
Poets

My project and the AADDA: a
lesson in ‘digging down’

When I proposed my research
project for the Analytical Access to the Domain Dark Archive project, it was
based on a ‘wish list’ of tools that scholars might want to use to access this
resource. The tools my proposed project required were sentiment analysis,
proximity search, and geo-indexing. This latter was not available during this
test period, but the first two were. However, this report is not so much a
record of my findings, but about not making assumptions with the data produced
via these two tools.

I sought to access information
about the reception of the Liverpool Poets (in practice, I focused solely on
Adrian Henri). With the Domain Dark Archive I could find avenues – fan pages,
forums, and the like – which would provide me with information to consider
alongside newspapers, interviews, and archival material. I wanted to see what
labels were attached to the poets, and how they were viewed, in informal
recollections and non-academic contexts. I would then combine and compare this
data with searches for the same terms from newspaper and published works. There
is a marked difference in academic and popular attitudes to the poets, and the
internet archival searches should be able to provide evidence for how the
people who actually received the work viewed their experiences.

Methodology: considerations and consequences

It must be noted that the AADDA
project involved only a slice of the full dataset, and that my results will
almost certainly differ greatly when it goes live. (Just as an example, a
search for “Adrian Henri” on the AADDA browser returns 1,847 results, compared
to over 8,200 current UK hits on Google.) The lack of references is almost certainly due to the smaller dataset,
rather than the data not being there at all (1).

Another issue was that very
search term, “Adrian Henri”. Searching for just ‘Adrian’ or ‘Henri’ rather than
‘Adrian Henri’ is unhelpful in that it throws up results of which the majority
are not relevant: ‘“Henri” NEAR “painter”’ might give you Matisse; ‘“Adrian”
NEAR “poet”’ might give you Mitchell. My own research and interview experience
has been that people are likely to refer to him as ‘Henri’ or as ‘Adrian’, so
the fact that I was only searching for ‘Adrian Henri’ might have excluded some
results. However, articles on online magazines and the like do usually follow
academic and journalistic traditions of referring to the subject by their full
name in the first instance, and then surname, so therefore are caught by the
crawl.

I had to decide what labels to
search for in relation to Henri, and my initial searches – using what terms I
was already aware of – may have excluded other labels and ways of talking about
Henri. I also found that my own academic assumptions were not the standard –
there were 203 results for the label ‘Liverpool poet’, versus only 3 for
‘Merseybeat poet’, the term I am using in my thesis!

Search for ‘“Adrian Henri” AND …’     Number of items returned
“painter and poet”                    5
“poet and painter”                    2
“painter/poet”                        5
“poet/painter”                        10
“performance poet”                    0
“performer”                           10
“entertainer”                         16

Fig 1 – examples of search terms and
results

The five results for both
“painter and poet” and “painter/poet” were all from the Tate Archives.(2) This – with search terms placing the artistic side of his output first – is not
surprising, given that the Tate is an art gallery. It did surprise me that
“performance poet” did not prove a useful search term, although this is perhaps
an academic designation rather than a layman’s term – as evidenced by the
results for “entertainer”. But none of these results can be taken at face
value, as this report shall discuss.

Boolean searching: How near is NEAR?

These initial exploratory
searches bring me to my first problem with the data. Throughout this report
what I refer to as problems are not faults with the dataset or the browser but
rather potential issues for the users interacting with it. Parameters for how close
together the two search terms must be can differ, but I found that the NEAR search was
sometimes not near enough here. I found two issues when reading the actual
results: firstly, that the terms were often not that close together; and
secondly, that the second term was not actually being used to discuss Henri:

Therefore, the results in the
table listed above are not a reliable source for enumerating the most common
labels attached to Henri – one cannot rely on reading only the initial search
results.
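
To make the question of "how near is NEAR?" concrete, here is a minimal sketch in Python of a word-window proximity check. It is not the AADDA interface's own implementation, whose parameters are not described in this report; the function names `tokenize`, `find_phrase` and `near`, the simple tokeniser, and the default window of ten words are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def find_phrase(tokens, phrase):
    """Return the start index of every occurrence of `phrase` in `tokens`."""
    words = tokenize(phrase)
    n = len(words)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == words]


def near(text, first, second, window=10):
    """Return the smallest gap (in words) between the two phrases,
    or None if they never occur within `window` words of each other.

    `window` is a hypothetical parameter: a real search tool may
    measure distance differently.
    """
    tokens = tokenize(text)
    first_len = len(tokenize(first))
    second_len = len(tokenize(second))
    best = None
    for i in find_phrase(tokens, first):
        for j in find_phrase(tokens, second):
            # Count the words lying strictly between the two phrases.
            if j >= i:
                gap = j - (i + first_len)
            else:
                gap = i - (j + second_len)
            gap = max(gap, 0)
            if gap <= window and (best is None or gap < best):
                best = gap
    return best
```

Tightening `window` would address the first issue (terms that are not really close together), but not the second: a label can sit two words from the name and still describe someone else, so the hits would still need to be read.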

Crawl dates: Encountering a display problem

I have already stated that some
results could not be ‘clicked through’ and their content displayed past that
initial search results page, such as the Tate results for “painter and poet”
and “painter/poet”. There is therefore no way of knowing what the pages
actually contained. At other times, there were results which could not be viewed
for a different reason: they did not even appear on the search results page.

This revealed itself to me when
running an exploratory query. After a basic search for “Adrian Henri”, one of
the things that I noticed is that there is a ‘jump’ in the number of hits in
the year 2000. Whilst this is not the highest number (2007 has 345), I thought
that this could be explained by this being the year that he died – obituaries,
tributes, more ‘noise’ around his name.

Fig 3 – showing results for “Adrian
Henri” by crawl year (4)

Clicking through to filter these results by that year – and
hoping to find relevant obituary results – I encountered my first problem. The
242 results shown on the initial search became “Search found 202 items”:

Fig 4 – filtering “Adrian Henri” results
by crawl year “2000” (5)

Furthermore, when clicking
through to the second page of these already shrinking items, the number jumped
down again to 186:

This was repeated elsewhere – for
example, the following year, 2001, went from 53 potential results to 37 search
items being displayed. It was not the case that the items were only those which
could be ‘clicked through’ – as the Tate example above shows, those which the
Wayback Machine could not display were still included in the search items.

One potential explanation for the
discrepancy between the total number of results and number of items which the
“search found” is that the results returned here might omit duplications,
perhaps where a second crawl finds nothing different from the first. I am
unsure whether this is a valid response, as I have found many instances of
crawls where the Wayback Machine’s results are exactly the same from crawl to
crawl. Furthermore, of the 242 results for 2000, 235 were from Amazon.co.uk,
and not related to his death. I would, therefore, propose that the ‘jump’ came
simply from there being more crawls in that year, as it must be remembered that
the dates are dates at which the sites were recorded, not the dates at which
the material was published.(7) Whatever the reason, this shows that the results must be interrogated further
along the line from the initial search, as however innocent the numbers appear,
they cannot be presented without ‘digging down’ to the actual website results
themselves.
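
One way of 'digging down' into a yearly total is to break it up by host before reading anything into it. The following is a minimal sketch in Python, assuming the results can be obtained as (crawl year, URL) pairs; that record shape and the function names `hosts_by_year` and `dominant_host` are assumptions for illustration, not a description of what the AADDA interface exports.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse


def hosts_by_year(results):
    """Group search results by crawl year and count hits per host.

    `results` is an iterable of (crawl_year, url) pairs; this record
    shape is an assumption, not the archive's export format.
    """
    tally = defaultdict(Counter)
    for year, url in results:
        host = urlparse(url).netloc.lower()
        tally[year][host] += 1
    return tally


def dominant_host(tally, year):
    """Return (host, hits, share) for the busiest host in `year`,
    or None if there are no results for that year."""
    counts = tally.get(year)
    if not counts:
        return None
    total = sum(counts.values())
    host, hits = counts.most_common(1)[0]
    return host, hits, hits / total
```

Applied to the figures above, a single host accounting for 235 of 242 results would show a share of roughly 0.97, which points to crawl coverage rather than to a burst of new writing about Henri.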

Sentiment Analysis: Don’t take it at face value

Taking a quick look at the totals
when doing a basic search for “Adrian Henri” reveals mostly neutral results, as
one might expect from an analysis over a large amount of text, but the results
are also far more positive than negative, if a sentiment is found – 136 “very
positive” versus 11 “very negative”. However, this is another lesson in
‘digging down’ and not taking the results at face value.

Fig 6 – showing sentiment totals for the
“Adrian Henri” search (8)

The success of sentiment analysis
relies in part on how positivity or negativity is determined across the whole
search parameters. This quote from a 1998 school newsletter is clearly – and
does indeed appear under the term – very positive:

Many thanks to Stockport Art Gallery staff for the
invitation to bring our Junior children to meet Adrian Henri, the famous artist
and poet, on Wednesday 21 October. Adrian was terrific, telling us the stories
behind many of the pictures currently on exhibition at the Gallery and reading
from his poetry collections. We can really recommend a visit to see his work.
Many thanks to Adrian for a great day with you in Stockport! (9)

However, other results which were
listed as “very positive” must be discounted from this total for the same
reason as the proximity searches above: the positive nature of the whole is not
related to Henri’s part. See, for example, the discussion of Carol Ann Duffy’s The World’s Wife in an AQA English
Literature Examiner’s Report from June 2005:

Once again, The World’s Wife proved highly popular:
more centres study this text than any other on the paper. As last year,
examiners were impressed by the enthusiasm and engagement with which many candidates
approach Duffy’s poetry … Examiners were also concerned that intrusive, and
often irrelevant, biographical material (such as lengthy character
assassinations of Adrian Henri) prevented candidates from meeting the
Assessment Objectives.(10)

Whilst
this, therefore, means one cannot blithely cite all 136 “very positive” results
in Henri’s favour, we also need to revise the total of “very negative” results.
Firstly, of the 11 results, the 6 items which can be displayed are all the same
Peter Finch interview:

And
secondly, in this interview Henri actually appears very favourably:

The Liverpool Scene arrived, and with it the merging
of music and poetry with Roger McGough, Brian Patten, Adrian Henri, and others.
I eventually met Adrian Henri, who was also a painter, and the most
interesting, I thought, of the three. We became frends and he pointed me in
some new directions.(12)
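
The gap between the sentiment of a whole page and the sentiment of what is actually said about Henri can be illustrated with a minimal sketch in Python. It is not the sentiment tool used by the AADDA interface: the word lists are a toy lexicon invented for this example, the sentence splitter is deliberately naive, and the function names are hypothetical.

```python
import re

# A toy lexicon, invented for this example; real tools use far larger ones.
POSITIVE = {"popular", "impressed", "enthusiasm", "terrific", "great", "interesting"}
NEGATIVE = {"concerned", "intrusive", "irrelevant", "assassinations"}


def score(text):
    """Count positive words minus negative words in `text`."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentiment_about(text, name):
    """Return (score of the whole text, score of the sentences naming `name`)."""
    mentions = [s for s in split_sentences(text) if name.lower() in s.lower()]
    return score(text), sum(score(s) for s in mentions)
```

A text can score positively overall while the only sentence that names the subject scores negatively, or the reverse, which is why a page-level label says little about how its subject is treated.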

The
Wayback Machine has 12 captures of this page on this site, from October 2006 to
July 2013. Each crawl obviously takes a snapshot of whatever is on the page at
the time, and the crawl date is clearly indicated in the results, but the 11
apparently different “very negative” results are, in practice, all the exact same
interview, the text of which has not changed (bar the removal of the first line
under the title), although the formatting of the page itself has slightly
changed (see the links beneath the header), as illustrated here:

I have suggested that one reason
for the discrepancy between the total number of results and the items which can
be displayed is that the duplications might not be shown, and the snapshots for
this page do show that there have been changes over time, but what this also
shows is the need to interrogate the results, at the level of those snapshots,
rather than making assumptions based on the initial totals. Whilst this may be
deliberately simplifying the issue, the message to take away here is not to
take the results at face value: there aren’t 11 “very negative” results – there
are none at all!
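
Collapsing repeated captures of the same text before counting them could be sketched as follows. This is a minimal Python illustration, not a description of how the archive or the Wayback Machine handles duplicates; the (crawl date, text) record shape and the function names `fingerprint` and `distinct_captures` are assumptions.

```python
import hashlib
import re


def fingerprint(text):
    """Hash the text after collapsing whitespace and case, so that
    purely cosmetic differences between captures are ignored."""
    normalised = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def distinct_captures(captures):
    """Map each distinct body of text to its earliest crawl date.

    `captures` is an iterable of (crawl_date, text) pairs, a
    hypothetical shape chosen for this example. Dates are ISO
    strings, so sorting them puts the earliest capture first.
    """
    earliest = {}
    for crawl_date, text in sorted(captures):
        earliest.setdefault(fingerprint(text), crawl_date)
    return earliest
```

An exact hash like this only merges captures whose text is identical after normalisation. A page that has lost a single line between crawls, as the interview above did, would still count twice, so a looser near-duplicate comparison, or simply reading the snapshots, would still be needed.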

Brief Conclusions

This report has attempted to
present some of the potential mishaps involved with looking at the Web Archive results
on the surface, at face value. What my exploratory searches have shown is that one
cannot make assumptions based purely on looking at the initial search results –
you have to dig down.

Being involved in the AADDA
project was certainly useful for my own research, as I found sources of
information which I wouldn’t have found otherwise, such as pages which are no
longer live, or places I hadn’t thought to look. It was also fascinating to
read non-academic histories of performance poetry and the 1960s underground,
where Henri and the Merseybeat poets appear as far more important than in
‘official’ criticism.(15) These histories
were also presented as if public knowledge, proving my theory that those
‘ordinary’ people who received the work did have an idea of its importance, and
that the audiences for this kind of poetry were significant, particularly in
terms of recognising the legacy of the Merseybeat poets where academia has
dismissed them. However, what my research experiences have been far more useful
for, I believe, is pointing up some of the potential issues – both with the
interface (display problems) and the users (making assumptions) – before the
Domain Dark Archive goes live.

(1) I am
aware of sites which were not included in the slice available for this initial
project, as well as those without a UK domain suffix which are beyond the scope
of the project, such as www.my-liverpool.co.uk
and www.mudcat.org respectively.

(7) This is
something which we have discussed at AADDA meetings, and I feel that the
interface does make this clear; it is just something which should be stressed
to users in any guidance material, to avoid misunderstanding.

Thursday, 13 June 2013

It is
commonplace to describe something new in relation to something that is known:
think 'motion picture', 'spaceship', 'email' or 'smartphone'. The word
'webpage' is no different. And indeed in a sense many webpages are similar to
the pages found in books or newspapers: they hold static media (text, image);
core elements of them read from top to bottom; their headers, footers,
cut-aways and advertisements orientate, guide and entice the reader; and in
URLs they possess a (relatively) unique system of identifiers. It is hard to
think of another name these digital objects could have been given.

It is also
commonplace for the new thing to - linguistically speaking - replace the old
thing: think 'motion picture' and 'the pictures', 'spaceship' and 'ship',
'email' and 'mail', or 'smartphone' and 'phone'.
The same goes for 'webpage' and 'page'. Here by virtue of this act of
redefinition, the 'page' absorbs features of the webpage not (or less) possible
in book or newspaper pages: features such as dynamic content, user interaction,
and direct links to other pages (or, more precisely, other pages that are not
part of a sequence defined by the author whose work is the main content held by
the page).

All of this
makes the webpage-cum-page appear both familiar and unsettling, conservative
and disruptive, old and new. These elements of lineage are crucial, for they
have allowed us (among other things) to think of preserving the webpage as akin
to preserving the page. Yes, the challenges of novelty and disruption are
discussed and debated (on which I'm not qualified to comment), but at the most
basic level the webpage stuff that is being collected by the Internet
Archive or the UK Web Archive is page level stuff.
(This is not to say I don't think page level stuff should be archived. Far from
it, the fragility of webpages is well known (see Rosenzweig, 2003) and without these
efforts valuable data on our society would be lost.)

Does this
make our nomenclature for what this stuff is problematic? For to call a
webpage a page is potentially to place it into a category for which it is
ill-suited, and to place the techniques for investigating that category under
huge strain. Take a normal news article from the Guardian
website as an example. The page contains a story, framing, context and
advertisements: all very page like. But those adverts are dynamic as opposed to
static, their content quite possibly targeted depending on the IP address
accessing the URL and different each time the page is refreshed. The page also
contains moderated comments, ranked as default by oldest first but malleable to
user preferences. In short, when you visit the website it is unlikely to be the
same as when I visit the website, so an archived version can only be one possible
version of a webpage at a particular historical moment. Not very page like
behaviour. Of course we might (quite rightly for the most part) say that the
'core' of the page, the textual content that historians are likely to be
interested in, will remain the same regardless of these peripheral changes. And
yet as the growth of mainstream live blogs demonstrates (such as those
covering the Taksim Square protests), the web
is moving toward dynamic content over static content as default: embedded
video, maps and text content streams are now commonplace, and are likely to
become more so as the web develops.

The webpage
then is a rapidly evolving beast whose capacity to change whilst still being
called a 'page' complicates how we do research using webpages and how we
preserve the internet. It is a page but not a page as we knew it, a semantic
shift worth keeping in mind as we prepare for an era of born-digital historical
scholarship.