Menu

Tag Archives: eresearchnz2013

Connecting Genetics Researchers to NeSI
James Boocock & David Eyers, University of Otago
Phil Wilcox, Tony Merriman & Mik Black, Virtual Institute of Statistical Genetics (VISG) & University of Otago

Theme of conference is “eResearch as an enabler” – show researchers that eResearch can benefit and enable them. There’s been a genomic data explosion – genomic, microarray, sequencing data. Genetics researchers need to use computers more and more. Computational cost is increasing; need to use shared resources. “Compute first, ask questions later”.

Galaxy aims to be web-based platform for computational biomedical research – accessible, reproducible, transparent. Has a bunch of interfaces. Recommends shared file system and splitting jobs into smaller tasks to take advantage of HPC.

Goal is to create an interface between NeSI and Galaxy. Galaxy job > a job splitter > subtasks performed at NeSI then ‘zipped up’ and returned to Galaxy. Not just file splitting by lines, but by genetic distance, which gives different-sized files.
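A minimal sketch of what a distance-based split might look like (my own illustration – the function name, record shape, and one-chunk-per-window policy are assumptions, not the authors’ code):

```python
def split_by_genetic_distance(records, max_cm):
    """Split (position_cM, line) records into chunks spanning at most max_cm.

    Unlike a fixed line-count split, each chunk can hold very different
    numbers of lines, so the output files differ in size.
    """
    chunks, current, start = [], [], None
    for pos, line in records:
        # Start a new chunk once this window exceeds max_cm of genetic distance.
        if current and pos - start > max_cm:
            chunks.append(current)
            current, start = [], None
        if start is None:
            start = pos
        current.append(line)
    if current:
        chunks.append(current)
    return chunks

records = [(0.0, "a"), (1.0, "b"), (5.5, "c"), (6.0, "d"), (20.0, "e")]
print(split_by_genetic_distance(records, 5.0))  # [['a', 'b'], ['c', 'd'], ['e']]
```

A dense region produces a large file and a sparse region a small one – hence the “different sized files” note above.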

Used git/GitHub to track changes, and Sphinx for Python documentation. Investigating Shibboleth for authentication. Some bugs they’re working on. Further looking at efficiency measures for parallelisation, building a machine-learning approach to doing this.

Myths vs Realities: the truth about open data
Deborah Fitchett & Erin-Talia Skinner, Lincoln University
Our slides and notes are available at the Lincoln University Research Archive.

Some rights reserved: Copyright Licensing on our Scholarly record
Richard Hosking & Mark Gahegan, The University of Auckland

Copyright law has effect on reuse of data. Copyright = bundle of exclusive rights you get for creating work, to prevent others using it. Licensing is legal tool to transfer rights. Variety of licensing approaches, not created equal.

Technical, semantic, and legal challenges. Research aims to capture the semantics of licenses in a machine-readable format to align with, and interpret in the context of, research practice. Need to go beyond natural-language legal text. License metadata: RDF is a useful tool – allows sharing and reasoning over implications. Lets us work out whether you can combine sources.
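As an illustration of the idea – plain Python tuples standing in for real RDF triples; the vocabulary and license properties here are made up for the sketch, not the authors’ model:

```python
# License metadata as subject-predicate-object triples, RDF-style.
triples = {
    ("CC-BY-4.0", "permits", "commercial-use"),
    ("CC-BY-4.0", "permits", "derivatives"),
    ("CC-BY-4.0", "requires", "attribution"),
    ("CC-BY-NC-4.0", "prohibits", "commercial-use"),
    ("CC-BY-NC-4.0", "requires", "attribution"),
}

def ask(license_id, predicate, value):
    """Does the metadata assert this statement about the license?"""
    return (license_id, predicate, value) in triples

def can_combine_commercially(*licenses):
    # A combined work inherits every source's prohibitions.
    return not any(ask(l, "prohibits", "commercial-use") for l in licenses)

print(can_combine_commercially("CC-BY-4.0"))                  # True
print(can_combine_commercially("CC-BY-4.0", "CC-BY-NC-4.0"))  # False
```

Once license terms are statements rather than prose, “can I combine these sources?” becomes a query instead of a legal reading.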

This won’t help if there’s no license, or the license is legally vague, or for novel use cases where we’re waiting for precedent (eg text mining over large corpora).

Compatibility chart of Creative Commons licenses – some very restricted. “Pathological combinations of licenses”. Computing this can help measure combinability of data, degree of openness. Help understanding of propagation of rights and obligations.
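The “computing combinability” point can be sketched as a check over a pairwise compatibility table – a tiny, simplified slice of the CC chart, for illustration only (the real chart covers more licenses and more nuance):

```python
from itertools import combinations

# Which license pairs can coexist in one derived work (simplified).
COMPATIBLE_PAIRS = {
    frozenset({"CC-BY"}),                 # CC-BY with itself
    frozenset({"CC-BY", "CC-BY-SA"}),
    frozenset({"CC-BY", "CC-BY-NC"}),
    frozenset({"CC-BY-SA"}),
    frozenset({"CC-BY-NC"}),
    # Note: no {"CC-BY-SA", "CC-BY-NC"} – a "pathological combination".
}

def combinable(licenses):
    """True if every pair of source licenses can be mixed in one work."""
    return all(frozenset({a, b}) in COMPATIBLE_PAIRS
               for a, b in combinations(licenses, 2))

print(combinable(["CC-BY", "CC-BY-SA"]))     # True
print(combinable(["CC-BY-SA", "CC-BY-NC"]))  # False
```

The share of datasets that survive this check is one crude measure of the “degree of openness” mentioned above.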

Comment: PhD student writing thesis and reusing figures from publications. For anything published by IEEE he legally had to ask for permission to reuse figures he’d created himself. Not just about datasets but anything you put out.

Comment: “Best way to hide data is to publish a PhD thesis”.

Q: Have you started implementing?
A: Yes, but still early on – coding the RDF structure and asking simple questions. Want to dig deeper.

Q: You can get in trouble for practicing law – always told by institution to send questions to IP lawyers etc. Has anyone got mad at you yet?
A: I do want to talk to a lawyer at some point. Can get complex fast, especially pulling in cross-jurisdiction issues.
Comment: This will save time (=$$$) when talking to a lawyer.
A: There are a lot of situations where you don’t need a lawyer – that’s more for fringe cases.

Scientific process getting reduced to database problem – instead of querying the world we download the world and query the database…

UoW eScience Inst aims to be at the forefront of research in eScience techniques/technology, and in fields that depend on them.

3Vs of big data:
* volume – this gets lots of attention, but
* variety – this is the bigger challenge
* velocity

Shows a long-tail image from Carole Goble: lots of data in Excel spreadsheets, lab books, etc, is just lost. Types of data stored – especially datasets and some text. 87% of the time it’s on “my computer”; 66% a hard drive… Mostly people are still in the gigabytes range, or megabytes, less so in terabytes (but a few in petabytes). No obvious relationship between funding and productivity. Need to support small innovators, not just the science stars.

Problem – how much time do you spend handling data as opposed to doing science? General answer is 90%. May be spending a week doing manual copy-paste to match data because not familiar with tools that would allow a simple SQL JOIN query in seconds. Sloan Digital Sky Survey was incredibly productive because they put the data online in database format and thousands of other people could run queries against it.
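The “week of copy-paste vs one JOIN” point can be sketched with Python’s built-in sqlite3 module (tables and data invented for illustration):

```python
import sqlite3

# Two small made-up tables that would otherwise be matched by hand.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE samples (sample_id TEXT, site TEXT);
    CREATE TABLE readings (sample_id TEXT, temp REAL);
    INSERT INTO samples VALUES ('s1', 'north'), ('s2', 'south');
    INSERT INTO readings VALUES ('s1', 12.5), ('s2', 14.0);
""")

# The manual matching, as one line of SQL:
rows = db.execute(
    "SELECT s.site, r.temp FROM samples s JOIN readings r USING (sample_id)"
).fetchall()
print(rows)  # [('north', 12.5), ('south', 14.0)]
```

The same shape of query scales from two rows to millions, which is the Sloan Digital Sky Survey lesson in miniature.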

Metadata
Has been recommending throwing non-clean data up there. Claims that comprehensive metadata standards represent a shared consensus about the world, but at the frontier of research this shared consensus by definition doesn’t exist, or will change frequently, and data found in the wild will typically not conform to standards. So modifies Maslow’s hierarchy of needs:
* Usually: storage > sharing > curation > query > analytics
* Recommends: storage > sharing > query > analytics > curation
Everything can be done in views – cleaning, renaming columns, integrating data from different sources while retaining provenance.
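The “everything can be done in views” idea, sketched in sqlite3 (schema invented for illustration): the raw, non-clean data goes in untouched, and the cleaning and renaming live in a view, so provenance is preserved:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Raw data stored exactly as received (messy IDs, imperial units).
    CREATE TABLE raw_obs (STN_ID TEXT, "Temp (F)" REAL);
    INSERT INTO raw_obs VALUES ('A1', 68.0), ('a1 ', 32.0);

    -- Cleaning happens in the view, not in the data:
    CREATE VIEW obs AS
        SELECT upper(trim(STN_ID)) AS station,
               round(("Temp (F)" - 32) * 5.0 / 9, 1) AS temp_c
        FROM raw_obs;
""")
print(db.execute("SELECT * FROM obs").fetchall())  # [('A1', 20.0), ('A1', 0.0)]
```

If the cleaning rule turns out to be wrong, you redefine the view – the original observations are still there.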

Bring the computation to the data. Don’t want just fetch-and-retrieve – need a rich query service, not a data cemetery. “Share the soup and curate incrementally as a side-effect of using the data”.

Convert scripts to SQL and lots of problems go away. Tested this by sending a postdoc to a meeting and doing “SQL stenography” – real-time analytics as the discussion went on. Not a controlled study – didn’t have someone trying to do it in Python or R at the same time – but would challenge someone to do it as quickly! Quotes (a student?): “Now we can accomplish a 10-minute, 100-line script in 1 line of SQL.” Non-programmers can write very complex queries rather than relying on staff programmers and feeling ‘locked out’.

Data science
Taught an intro to data science MOOC with tens of thousands of students. (Power of the discussion forum to fix a sloppy assignment!)

Lots of students more interested in building things than publishing, and are lost to industry. So working on ‘incubator’ projects, reverse internships pulling people back in from industry.

Q: Have you experimented with auto-generating views to clean up?
A: Yes, but less with cleaning and more with deriving schemas and recommending likely queries people will want. Google tool “Data Wrangler”.

Q: Once again people using this will think of themselves as ‘not programmers’ – isn’t this actually a downside?
A: Originally humans wrote queries, then apps wrote queries, now humans are doing it again and there’s no good support for development in SQL. Risk that we’re giving people power but not teaching programming. But mostly trying to get people more productive right now.

Idea of a virtual laboratory as a container for data (from variety of disciplines) and a number of tools. But many existing tools are like virtual laboratories themselves, often specific to disciplines.

Defined the project as a linked open data project. Humanities data goes into the HuNI triple store (using RDF), embedded in the HuNI virtual lab to create a user interface. Embellishments include providing linked open data via SPARQL and publishing via OAI-PMH; using AAF (Shibboleth) authentication; and using a SOLR search server for the virtual lab.

Have ideas of research use-cases (basic and advanced eg SPARQL queries) and desired features, eg custom analysis tools. The challenge is to get internal bridging relationships between datasets and global interoperability. Aggregating doesn’t solve siloisation.

“Technology-driven projects don’t make for good client outcomes.”

Q: What response from the broader humanities community?
A: Did some user research, not as much as wanted. The impediment is that when building a database you tend to have more contact with people creating collections than people using them. Trying to build the framework/container first; the idea is that researchers will come to them and say “We want this tool” and they’ll build it. Funding set aside for further development.

Q: You compared this to Galaxy, but you’ve built from the ground up where Galaxy is more fluid. A person with command-line skills can create tools in Galaxy, but with HuNI you’d have to do it yourself.
A: Bioinformatics folk tend to be competent with Python – but we’re not sure what competencies our researchers will have; they’re less likely to be able to develop for themselves.

Requirements for a New Zealand Humanities eResearch Infrastructure
James Smithies, University of Canterbury
Vast amounts of cultural heritage are being digitised or being born online. Humanities researchers will never be engineers but need to work through the issues.

International context:
Humanities computing has been around for decades but is still in its infancy. US, UK, even Aus have ongoing strategic conversations, which helps build roadmaps. NZ is quite far behind these (though we have used punchcards where necessary). “Digging into Data Challenge” overseas, but we’re missing out because of lack of infrastructure and lack of awareness.

Fundamentals of humanities eresearch:HuNI provides a good model. Need a shift from thinking of sources as objects to viewing them as data. Big paradigm shift. Not all will work like this. But programmatic access will become more important.

National context:
19th-century ships’ logs, medical records from leper colonies. Hard to read, incomplete, possibly inaccurate. Have traditional methods to deal with these, but problems multiply when ported into digital formats. Big problem is lack of awareness of what opportunities exist. So capabilities and infrastructure are low. Decisions often outsourced to social sciences. At the same time, DigitalNZ, National Digital Heritage Archive, Timeframes archive, AJHR, PapersPast, etc are fantastic resources that could be leveraged if we come up with a central strategy.

Requirements:

Need to develop training schemes

Capability building. Lots of ideas out there but people don’t know where to start. Need to look at peer review, PBRF – how to measure quality and reward it.

International collaboration

Requirements elicitation and definition

Funding for all of the above including experimentation

Q: Data isn’t just data, it’s situated in a context. Being technology-led and using RDF is one thing. But how do we give richness to a collection?
A: A classic example would be a researcher wanting access to an object properly marked up, and to contribute to the conversation by adding scholarly comments, engaging with other marginalia. Eg ancient Greek text corpus (is, I think, describing the Perseus Digital Library). Want both a simple interface and programmatic access.

Q: Need to make explicit the value of an NZ corpus. Have some pieces but need to join up. Need to work with DigitalNZ. Once we have a corpus we can look at tools.
A: Yes, need to get key stakeholders around the table and talk about what we need.

Capturing the flux in Scientific Knowledge
Prashant Gupta & Mark Gahegan, The University of Auckland
Everything changes – whether the physical world itself or our understanding of the world:
* new observation or data
* new understanding
* societal drivers
How can we deal with change and make our tools and systems more dynamic to deal with change?

Ontology evolution – have done lots of work on this. Researchers have updated knowledge structure and incorporated in forms of provenance or change logs. Tells us “knowledge that” eg What is the change, when it happened, who did it, to what, etc. But we still don’t capture “knowledge how” or “knowledge why”.

Life cycle of a category:Processes, context, researchers’ knowledge are involved in birth of a category – but these tend to be lost when the category’s formed. We’re left with the category’s intension, extension, and place in the conceptual hierarchy. Lots of information not captured.

“We focus on products of science and ignore process of science”.

Proposes connecting static categories and the process of science to get a better understanding. Could act as a fourth facet to a category’s representation. Can help address interoperability problem and help track evolution of categories.

How are we doing, and how can we work better with Australia?
* NJ: Have been working closer recently, but big gaps in data especially, and unevenness in various disciplines.
* SC: Working to identify gaps and work across organisations. REANNZ working closer with AARNet than in the past, which is bearing fruit re bandwidth.
* Political overlay – need to be able to say we’ve got the scientific partnership working.
* RF: Fair amount of partnership. But have found that governance separates things. “I don’t believe in uninterpreted data.” Need to figure out the combo of data and tools to get results.
* Plenty of opportunity to work with Australia. Useful to look at infrastructures and what they’ve done right and haven’t done right – lessons to be learned.
* AR: Problems faced here are not unique so you can avoid our mistakes and make your own instead. 🙂

National Science Challenges signal government would like to roll the framework out further. How do researchers engage with this?
* NJ: At many workshops people already know what they want to work on; at others there’s a range of possibilities. Need to build networks so not everyone has to be at the table.
* RF: eResearch and IT aren’t mentioned in the challenges – but these are embedded in everything. If you want to be world-class at X, you need to be good at computer science.

How would you benchmark and measure return on investment?
* AR: Instance where in the early days govt felt that if people wanted to keep investing, it must be valuable. This is changing now that investments are bigger. Hesitant about benchmarking because we don’t really want to be doing the same as anyone else.
* RF: How do you go from 0 to the world’s best supercomputer overnight? No idea how to measure that. It’s a commitment to the advancement of knowledge, but the govt doesn’t have a KPI about that…

NZ had to set up Tuakiri because differences in law meant we couldn’t use Australia’s system. What other things might the two countries have to do to overcome differences in legislation?
* (Other audience member): Yes, there are differences, so have needed to build systems that deal with both privacy acts, and have been successful.
* (Anne Berryman): Have started conversations with counterparts overseas, and the chief science advisors in Aus/NZ have a line of communication. There are platforms and issues we can deal with.

One goal is to achieve self-sustainability, eg user charging, member contributions. What’s the Australian experience in user-pays and sustainability?
* RF: Financial benefits are overwhelming. If we went to a commercial provider it’d cost more and do less. Sustainability needs a constant flow of funds to keep supercomputing running. There is a sustainability cliff. Govt keeps putting money in.
* SC: MBIE have removed the self-sustainability requirement. Charging to make sure researchers have skin in the game does prove that a service is needed; but not everyone who should be able to participate can.

Unlocking the Secrets of 3 Billion Pages: Introducing the HathiTrust Research Center
Keynote from J. Stephen Downie, Associate Dean for Research and a Professor at the Graduate School of Library and Information Science at the University of Illinois at Urbana-Champaign.

Of the 3.4 million volumes in the public domain, about a third are in the public domain only in the US; the rest are public domain worldwide (4% are US govt documents, so public domain from the point of publication).

Services to member unis: * long term preservation * full text search * print on demand * datasets for research

Data:
Bundles have, for each page, a jpg, OCR text, and xml which provides the location of words on the page. METS holds the book together – points to each image/text/xml file. And built into the METS file is structural information, eg table of contents, chapter start, bibliography, title, etc. Public domain data available through web interfaces, APIs, data feeds.

“Public-domain” datasets still require a signed researcher statement. Stuff digitised by Google has copyright asserted over it by Google. And anything from 1872-1923 is still considered potentially under copyright outside of the US. Working on manual rights determination – have a whole taxonomy for what the status is and how they assessed it that way.

Non-consumptive research paradigm – no one action by one user, or set of actions by a group of users, could be used to reconstruct works and publish. So users submit requests, Hathi does the compute, and sends results back to them. [This reminds me of old Dialog sessions where you had to pay per search, so researchers would have to get the librarian to perform the search to find bibliographic data. Kind of clunky but better than nothing I guess…]

Meandre lets researchers set up the processing flow they want to get their results. Includes all the common text-processing tasks, eg Dunning log-likelihood (which can be further improved by removing proper nouns). Doesn’t replace a close reading – answers new questions. Correlation n-gram viewer so you can track use of words across time.
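The Dunning log-likelihood (G²) statistic mentioned above compares a word’s frequency in two corpora via a 2×2 contingency table. A sketch of the standard formula (my own implementation, not Meandre’s):

```python
from math import log

def dunning_g2(a, b, total_a, total_b):
    """Dunning log-likelihood (G2) for a word occurring a times in corpus A
    (size total_a) and b times in corpus B (size total_b).

    G2 = 2 * sum(observed * ln(observed / expected)) over the 2x2 table;
    larger values mean the word is more distinctive of one corpus.
    """
    observed = [a, b, total_a - a, total_b - b]
    n = total_a + total_b
    expected = [
        (a + b) * total_a / n,
        (a + b) * total_b / n,
        (n - a - b) * total_a / n,
        (n - a - b) * total_b / n,
    ]
    return 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)

print(dunning_g2(10, 10, 1000, 1000))  # 0.0 – identical usage, nothing distinctive
```

Proper nouns tend to dominate the top of a G² ranking (names are highly corpus-specific), which is why removing them “improves” the comparison.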

OCR noise is a major limitation.

Downie wants to engage in more collaborative projects, more international partnerships, and to move beyond text and beyond the humanities. Just been awarded a grant for “Work-set Creation for Scholarly Analysis: Prototyping Project”. Non-trivial to find a 10,000-work subset of 10 million works to do research on – the project aims to solve this problem. Also going to be doing some user-needs assessments, and in 2014 will be awarding grants for four sub-projects to create tools. Eg it would be great if there was a tool to find which pages have music on them.

Ongoing challenges:
How do we unlock the potential of this data?
* Need to improve quality of data; improve metadata. Even just to know what’s in what language!
* Need to reconcile various data structure schemes
* May need to accrete metadata (there’s no perfect metadata scheme)
* Overcoming copyright barriers
* Moving beyond text
* Building community

Synergies between sectors, between Australia/New Zealand. Ability to move to researcher-centric rather than infrastructure-centric.

No connections apparent to government systems which are needed by digital humanities.

From experience researchers need lots of help. Australian ideal seems to be it’s all there and easy-to-use on desktop. Nice ideal but how practical?

Data management and data curation are still “dragons in a swamp. We know there’s dragons there, don’t know what they look like, but we’re planning to kill them anyway.”

Need data management policy and a national solution. And if going to invest all this money in research don’t want to delete all the data so need to work on preservation too.

Good to see REANNZ looking at service level and tools. Lots to learn from Australia about where we need to put our efforts.

There is a policy direction from government around access and reuse of data. Challenge is around how to most effectively implement this. Especially re publically funded research (cf commercially sensitive) there’s an expectation that there’d be access to the results and, where possible, the data. But still work to do.

Users who don’t get help can get something out of the system; but users who do get help can do a whole lot more. Hence software carpentry sessions. [Cf this blog post about software carpentry I coincidentally read today.]

Peer instruction becomes very important – need someone who’s doing similar things to come in and teach researchers and students.

Windowing: So send more than one packet at a time and see which stuff arrived and which didn’t. With selective acknowledgements can say “Am missing #4 but got the rest, so don’t resend those.” Send enough data to fill the whole pipe. Keep sending while waiting for acknowledgements. Amount depends on round-trip time. Eg sending at 360Mb/s with 280ms latency, BDP = 360 × 0.280 = 100.8Mb = 100.8/8 MB, so need a 12.6MB window.
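The window arithmetic as a one-liner (a sketch of the bandwidth-delay product calculation, using the talk’s example numbers):

```python
def window_bytes(bandwidth_bits_per_s, rtt_s):
    """TCP window needed to keep a pipe full: the bandwidth-delay product."""
    bdp_bits = bandwidth_bits_per_s * rtt_s
    return bdp_bits / 8  # convert bits to bytes

# The example from the talk: 360 Mb/s at 280 ms round-trip time.
print(round(window_bytes(360e6, 0.280) / 1e6, 1))  # 12.6 (MB)
```

Default OS buffers are far smaller than this, which is why untuned TCP can’t fill a long fat pipe.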

Default OS settings slow you down. To get the best out of REANNZ you need to tune for “elephant” flows (Long Fat Network aka LFN). Tuning your TCP can massively improve performance.

Apply this to lab science, in a linked science setting. Take a “Photons alive” design pattern (using light to visualise biological processes in an animal). See example paper. Can take a sentence re methodology and annotate eg “imaging” as a diagnostic procedure. Using current ontologies this gives you the What but not the Why. Need to tag with a “Force” concept, eg “immobilisation”. Deeper understanding of process – with the role of steps. And can start thinking about what other methods of immobilisation there may be.

So how can we make these patterns? Need to use semantic web methods. A wiki for lab semantics. (Wants to implement this.) Semantic form on wiki – a template. Wiki serves for attribution, peer review, publication – and as an endpoint to an RDF store.

Q: How easy is this to use for a domain expert?A: Semantic modeling is iterative process and not easy. But semantic wiki can hide complexity from enduser so domain expert can just enter data.

Q: We spend lots of time pleading with researchers to fill out webforms. How else can we motivate them, eg to do it during process rather than at end?A: Certain types of people are motivated to use wiki. This is first step, proof of concept. Need a critical mass before self-sustaining.

Q: How much use would this actually be for domain experts? Would people without implicit knowledge gain from it?A: Need to survey this and evaluate. It’s valuable as a democratising process.

Q: What about patent/commercial knowledge?A: Personally taking Open science / linked science approach – intended for research that’s intended to be maximally shared.

Have preferred to do one-to-few applications rather than Google-style one-to-billions. Now changing, because they’re themselves experiencing trouble sending large files. Scraped together their own file transfer system, marketed as CloudStor though it’s not in the cloud and doesn’t store things. Expected a couple of hundred users; got 6838. Why linear growth? “Apparently word of mouth is a linear thing…” Seem to be known by everyone who has file-sharing issues.

FAQs:
* Can we keep files permanently?
* Can I upload multiple files?
* Why is it called CloudStor when it’s really for sending?

“CloudStor+ beta” – looks like Dropbox, so why do this if that already exists? They’re slow (hosted in Singapore or the US). CloudStor+ gets 30MB/s cf 0.75MB/s as a maximum for the other systems. Pricing models aren’t geared towards large datasets. And they’re subject to PRISM etc.

Built on a stack:
* Anycast | AARNet
* ownCloud – best OSS they’ve seen/tested so far – has a plugin system and defined APIs
* MariaDB
* hadoop – but looking at substituting XTREEMFS, which seems to work with latencies

Distributed architecture – can be extended internationally. Would like one in NZ, Europe, US, then scale up.

Bottleneck is from desktop to local node. Only way they can address this is to get as close to researcher as possible – want to build local nodes on campus.

Some thoughts on what eResearch might glean from official statistics
Len Cook
* Research-based info competes with other sources of info people use to make decisions. Politicians are like weathercocks – have to respond to the wind. Sources of info include: official stats, case studies, anecdote, and ideology/policy framework. More likely to hear anecdotes than research. NZ is data-rich but poor at getting access to existing data. Confidentiality issues: “Statisticians spend half the time collecting data and the other half preventing people from accessing it.” Need to shift ideas – recent shifts in legislation are a step towards this.

* Official statistics has evolved over the last few centuries.
19th century: measurement developed to challenge policy. Florence Nightingale wanted to measure wellbeing in military hospitals because it was like taking hundreds of young men, lining them up and shooting them. Mass computation and ingenuity of graphical presentation – all by hand.
20th century: development of sampling, reliability, meshblocks. Common classifications, frameworks.
1990s and beyond: mass monitoring of transactions. Politics of info access/ownership important. Obligations created when data collected. Registers and identifiers now central. Importance of investing in metadata to categorise and integrate information.

* Managing data not just about technology – probably the reverse.

* Structural limitations. Need strong sectoral leadership. Need a chief information officer for a sector, not for government as a whole.

NeSI’s Experience as National Research Infrastructure
Nick Jones, New Zealand eScience Infrastructure
NZ is very good at scientific software. Also significant national investments in data (GeoNet, NZSSDS, StatsNZ, DigitalNZ, cellML, LRIS, LERNZ, CEISMIC, OBIS). But also significant (unintended) siloisation and no investment to break down barriers and integrate. However, do have good capability. NeSI wants to enhance existing capabilities but also help people meet each other. Build up shared mission, collegiality.

CRIs are widespread. So are research universities. All connected by REANNZ (KAREN). Research becoming more highly connected, collaborative. National Science Challenges targeted to building collaboration too. But sector still fragmented and small-scale. “Each project creates, and destroys, its own infrastructure.”

Need government investment to overcome coordination failure. Institutions should support national infrastructure. NeSI to create scalable computing infrastructure; provide middleware and user-support; encourage cooperation; contribute to high quality research outputs. In addition to infrastructure have team of experts to support researchers.

And a trend towards needing lossless networking. Easy to predict capacity for YouTube etc. But when simulating global weather patterns, datasets are giant and unpredictable – big peaks and troughs in traffic. TCP is good at handling loss for small packets, but can be crushed by large packet loss – an 80x reduction in data transfer rates for NZ-type distances. So can’t rely on commercial networks.
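The sensitivity of TCP throughput to loss over long paths can be illustrated with the Mathis et al. approximation – my illustration, not cited in the talk, and the MSS/RTT numbers below are assumed:

```python
from math import sqrt

def mathis_throughput(mss_bytes, rtt_s, loss_rate):
    """Mathis et al. approximation: TCP throughput <= MSS / (RTT * sqrt(p))."""
    return mss_bytes / (rtt_s * sqrt(loss_rate))

# Same long path (1500-byte MSS, 280 ms NZ-scale round trip), two loss rates:
clean = mathis_throughput(1500, 0.280, 1e-6)  # ~5.4 MB/s
lossy = mathis_throughput(1500, 0.280, 1e-2)  # ~54 kB/s
print(round(clean / lossy))  # 100
```

A 10,000x increase in loss costs “only” 100x in throughput under this model, but on a long RTT that is still the difference between usable and unusable – the same order of effect as the 80x figure quoted above.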

Higgs-Boson work example of network as part of the scientific instrument and workflow.

Australian Research Informatics Infrastructure
Rhys Francis, eResearch Coordination Project
Sustained strategic investment over a decade into tools, data, computation, networks, and buildings (for computation). (Personnel hidden in all of these.) Tools are mission-critical, data volumes explode, systems grow to exascale, global bandwidth scales up. High ministerial turnover; each one takes about six months then realises we need this infrastructure. Breaking it down into these areas helps explain it to people.

Hard to explain the whole, and when you chop it up into bits people think “Any university could have done that bit.” But need expertise, and need to share it.

In last 7 years added fibre and super-computing infrastructure. Many software tools and lab integration projects. Hundreds of data and data flow improvement projects. Single sign-on. Data commons for publication/discovery. Recruit overseas but still only so much they can resource.

These things are hard, and it was data slowing it down because didn’t know where collections would physically be. If you’re dealing with petabytes, the only way to move it is by forklift.

eResearch infrastructure brings capabilities to the researcher.
NCI and Pawsey: do computational modeling, data analysis, visualise results
NeCTAR: use new tools, apps, work remotely and collaborate in the cloud
ANDS and RDSI: keep data and observations, describe, collect, share, etc.

Current status (I’m handpicking data-related bullet points):
* 50,000 collections published in the research data commons
* coordination project to work with implementation projects to deploy data and tools as services for key national data holdings