LCG - Goals
(Ian Bird, CERN, Ian.Bird@cern.ch)

The goal of the LCG project is to prototype and deploy the computing environment for the LHC experiments.
Two phases:
– Phase 1: 2002 – 2005
  Build a service prototype, based on existing grid middleware
  Gain experience in running a production grid service
  Produce the TDR for the final system
– Phase 2: 2006 – 2008
  Build and commission the initial LHC computing environment
⇒ LCG is not a development project – it relies on other grid projects for grid middleware development and support.

LCG - Timescale

Why such a rush – the LHC won't start until 2007?
The TDR must be written in mid-2005:
– Approval of the TDR
– One year is then needed to procure, build, test, deploy, and commission the computing fabrics and infrastructure – to be in place by end 2006
In order to write the TDR, it is essential to have at least one year of experience:
– In running a production service
– At a scale that is representative of the final system (50% of one experiment)
– Running data challenges – including analysis, not just simulations
It can easily take 6 months to prepare such a service.
⇒ We must start now – the goal is to have a service in place in July.

Deployment Goals for LCG-1

Production service for the Data Challenges in 2H03 and 2004:
– Initially focused on batch production work
– But the 2004 data challenges include (as yet undefined) interactive analysis
Experience in close collaboration between the Regional Centres:
– Participation must be wide enough to understand the issues
– Learn how to maintain and operate a global grid
Focus on a production-quality service:
– Robustness, fault tolerance, predictability, and supportability take precedence over additional functionality
LCG should be integrated into the sites' physics computing services – it should not be something apart. This requires coordination between participating sites in:
– Policies and collaborative agreements
– Resource planning and scheduling
– Operations and support

Middleware Deployment

LCG-0 was deployed and installed at 10 Tier 1 sites:
– The installation procedure was straightforward and repeatable
– Many local integration issues were addressed
LCG-1 will be deployed to these 10 sites to meet the July milestone:
– Time is short – integrating the middleware components took much longer than anticipated
– Planning is under way to do the deployment quickly once the middleware is packaged
– The LCG team will work directly with these sites during the deployment
– Initially, testing activities to stabilise the service will take priority
– We expect the experiments to start testing the service by mid-August

LCG-1 Distribution

Packaging & configuration:
– Service machines – fully automated installation
  LCFGng – either the full or the light version
– Worker nodes – the aim is to allow sites to use their existing tools as required
  LCFGng – provides automated installation
  Installation scripts provided by us – manual installation
  Instructions allowing system managers to use their existing tools
– User interface
  LCFGng
  Installed on a cluster (e.g. lxplus at CERN)
  Pacman?
Distribution:
– The distribution web site is being set up now (updated from LCG-0)
  Sets of RPMs etc., organised by service and machine type
  User guide, installation guides, release notes, etc., are being written now

Middleware Status

Integration of EDG 2.0 has taken longer than hoped:
– EDG has not quite released version 2.0 – it is imminent
LCG has a working system – able to run jobs:
– Resource Broker: many changes since the previous version; needs significant testing to determine scalability and limitations
– RLS: the initial deployment will be a single instance (per VO) of the LRC/RMC
  The distributed service with many LRCs and indexes is not yet debugged
  Initially we will run the LRC for all VOs at CERN with an Oracle service backend
– Information system: R-GMA is not yet stable
  We will initially use MDS: work to improve stability (bug fixes) and redundancy – based on experience with the EDG testbeds and the Nikhef and NorduGrid work
  We intend to make a direct comparison between MDS and R-GMA on the certification testbed
Waiting for bug fixes – of several components
Still to do before release:
– A reasonable level of testing
– Packaging and preparation for deployment

Certification & Testing

This is the primary tool to stabilise and debug the system:
– The process and testbed have been set up
– This is intended to parallel the production service
Certification testbed:
– A set of 4 clusters at CERN – simulates a grid on a LAN
– External sites that will be part of the certification testbed:
  U. Wisconsin, FNAL – currently
  Moscow, Italy – soon
This testbed is being used to test the release candidate:
– It will be used to reproduce and resolve problems found in the production system, and to do regression testing of updated middleware components before deployment

Infrastructure for initial service - 2

Security issues:
– Agreement on the set of CAs that all LCG sites will accept
  The EDG list of traditional CAs
  The FNAL on-line KCA
– Agreement on a basic registration procedure for users
  An LCG VO where users sign the Acceptable Usage Rules for LCG
  4 experiment VOs – will use the existing EDG services run by Nikhef
  Agreement on the basic set of information to be collected
– All initial registrations will expire in 6 months – we know the procedures will change
– Experiment VO managers will verify the bona fides of users
– Acceptable Use Rules – for now, an adaptation based on EDG policy
– Audit trails – a basic set of tools and log correlations to provide the essential functions

LCG-1 First Launch: Information System Overview

[Diagram: site GIISes (SiteA–SiteD), each aggregating CE and SE GRISes, register with regional GIISes (RegionA1/A2, RegionB1/B2); primary and secondary BDIIs (LDAP) query the regional GIISes and are in turn queried by the RB.]

While serving data from one directory (dataCurrent), the BDII queries the regional GIISes to fill another directory structure (dataNew). When this has finished, the BDII is stopped, the directories are swapped, and the BDII is restarted; the restart takes less than 0.5 seconds. To improve availability during this window it was suggested (David) that the TCP port be switched off so that the TCP protocol takes care of the retry – this has to be tested. Another idea worth testing is to remove the site GIIS and configure the GRISes to register directly with the region GIISes. Note that using multiple BDIIs requires RB changes.
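The dataCurrent/dataNew double-buffering can be sketched as follows. This is a minimal illustration, not the actual BDII code: the directory arguments and the stop/start hooks (which stand in for the real slapd stop/start scripts) are assumptions.

```python
import os

def swap_and_restart(current_dir, new_dir, stop_bdii, start_bdii):
    """Swap the BDII's two LDAP data directories.

    new_dir has already been filled from the regional GIISes while the
    BDII kept answering queries from current_dir; the service is only
    down for the stop/swap/restart window (< 0.5 s per the slide).
    stop_bdii/start_bdii stand in for the real slapd stop/start hooks.
    """
    stop_bdii()
    tmp = current_dir + ".swap"       # three renames swap the two dirs
    os.rename(current_dir, tmp)
    os.rename(new_dir, current_dir)   # freshly filled data goes live
    os.rename(tmp, new_dir)           # old data becomes the next fill target
    start_bdii()
```

On the next refresh cycle the roles have swapped back, so the two directories alternate as live data and fill target; the suggestion of closing the TCP port so that clients simply retry would hide even this sub-second gap.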

LCG-1 First Launch: Information System – Sites and Regions

A region should not contain too many sites, since we have observed problems with MDS when a large number of sites are involved. To allow for future expansion, but without making the system too complex, the suggestion is to start with two regions and, if needed, split them later into smaller regions. The regions are West of 0 degrees longitude and East of it; the idea is to have one large region and one small one and see how they work. At the beginning, two region GIISes should be set up for the West and three for the East:
– West (WEST1, WEST2 region GIISes): RAL, FNAL, BNL
– East (EAST1, EAST2, EAST3 region GIISes): CERN, CNAF, LYON, MOSCOW, FZK, TOKYO, TAIWAN
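The west-of-0-degrees split can be made concrete with a small sketch; the longitude values below are rough illustrative figures, not an official site list.

```python
# Rough longitudes for illustration only (degrees; negative = west of 0).
SITE_LONGITUDE = {
    "RAL": -1.3, "FNAL": -88.3, "BNL": -72.9,
    "CERN": 6.1, "CNAF": 11.3, "LYON": 4.8, "MOSCOW": 37.6,
    "FZK": 8.4, "TOKYO": 139.7, "TAIWAN": 121.5,
}

def region_of(site):
    """Assign a site to the WEST or EAST region by the 0-degree meridian."""
    return "WEST" if SITE_LONGITUDE[site] < 0 else "EAST"
```

Applying the rule to the ten initial sites yields the small West region (RAL, FNAL, BNL) and the large East region described above.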

Plans for 2003 – 2: Middleware functionality

The top priority is problem resolution and issues of stability/scalability.
RLS developments:
– Distributed service – multiple LRCs, and the RLI
– Later: develop a service to replace the client command-line tools
VOMS service:
– To permit user- and role-based authorization
Validation of R-GMA:
– And then deployment of multiple registries – the initial implementation has a singleton registry
Grid File Access Library:
– An LCG development: a POSIX-like I/O layer to provide local file access
Development of SRM/SE interfaces to other MSS:
– Work that must happen at each site with an MSS
Basic upgrades:
– Compiler support
– Move to Globus 2.4 (a release supported through 2004)
The cut-off for functionality improvements is October – in order to have a stable system for 2004.
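The "POSIX-like I/O layer" idea behind the Grid File Access Library can be illustrated with a minimal sketch. The real library is a C API; this Python stand-in, including the dict-based replica catalogue and the `open_lfn` name, is purely hypothetical and only shows the pattern: resolve a logical file name to a local replica, then use ordinary POSIX open().

```python
def open_lfn(lfn, replica_catalog, mode="rb"):
    """POSIX-like open: map a logical file name to a local replica path
    via a catalogue lookup, then use the ordinary open() call.
    Names not in the catalogue are treated as plain local paths."""
    local_path = replica_catalog.get(lfn, lfn)
    return open(local_path, mode)
```

Jobs then use normal read()/close() calls on the returned handle, which is what lets POSIX-style analysis code access grid-resident data without modification.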

Expansion of LCG resources

Adding new sites:
– This will be a continuous process as sites become ready to join the service
– We expect a minimum of 15 sites (15 countries have committed resources for LCG in 1Q04); it is reasonable to expect 18-20 sites by end 2003
– The LCG team will work directly with each Tier 1 (or the primary site in a region)
– Tier 1s will provide first-level support for bringing Tier 2 sites into the service
  Once the Tier 1s are stable this can proceed in parallel in many regions
  The LCG team will provide 2nd-level support for Tier 2s
Increasing the grid resources available at many sites:
– Requires LCG to demonstrate the utility of the service – the experiments, in agreement with site managers, add resources to the LCG service

Operational plans for 2003

Security:
– Develop the full security policy
– Develop longer-term user registration procedures and tools to support them
– Develop an Acceptable Use policy for the longer term – requires legal review
Operations:
– Develop distributed prototype operations centres/services
  Monitoring developments driven by experience
– Provide at least 16 hr/day global coverage – problem response
– A basic level of resource-use accounting – by VO and user
– A minimal level of security incident response and coordination
User Support:
– The development direction depends strongly on experience with the deployed system
– Operations and User Support must address the issue of interchanging problem reports – with each other and with sites, network operations, etc.

Middleware roadmap

Short term (2003):
– Use what exists – try to stabilize, debug, fix problems, etc.
– Exceptions may be needed – WN connectivity, client tools rather than services, user registration, …
Medium term (2004 - ?):
– Same middleware, but develop the missing services and remove the exceptions
– Separate services from WNs – aim for more generic clusters
– Initial tests of re-engineered middleware (service based, with defined interfaces and protocols)
Longer term (2005? - ):
– An LCG service based on service definitions, interfaces, and protocols – the aim is to have interoperating, different implementations of a service

Inter-operability

Since LCG will be VDT + higher-level EDG components:
– Sites running the same VDT version should be able to be part of LCG, or continue to work as now
– LCG (as far as possible) has the goal of appearing as a layer of services in front of a cluster, storage system, etc.
– The current state of the art implies compromises …

Integration Issues

LCG will try to be non-intrusive:
– It will assume the base OS is already installed
– It provides an installation & configuration tool for service nodes
– It provides recipes for the installation of WNs – assuming sites will use their existing tools to manage their clusters
No imposition of a particular batch system:
– As long as your batch system talks to Globus (OK for LSF, PBS, Condor, BQS, FBSng)
There is no longer a requirement for a shared filesystem between the gatekeeper and the WNs:
– This was a problem for AFS, and NFS does not scale to large clusters
Information publishing:
– Define what information a site should provide (accounting, status, etc.), rather than imposing tools
But … there may be some compromises in the short term (2003).

Worker Node connectivity

In general (and eventually) it cannot be assumed that the cluster nodes will have connectivity to remote sites:
– Many clusters are on non-routed networks (for many reasons)
– Security issues
– In any case, the assumption of full connectivity will not scale
BUT… to achieve this, several things are necessary:
– Some tools (e.g. replica management) must become services
– Databases (e.g. the conditions DB) must either be replicated to each site (or equivalent), or accessed via a proxy service, or …
– Analysis models must take this into account
– Again, short-term exceptions are possible (up to a point)
The current additions to LXbatch at CERN have this limitation.

Conclusions

It is essential to start operating a service as soon as possible – we need 6 months to develop this into a reasonably stable service.
Middleware components are late – but we will still deploy a service of reasonable functionality and scale:
– Much work will be necessary on testing and improving the basic service
– Several functional and operational improvements are expected during 3Q03
The expansion of sites and resources foreseen during 2003 should provide adequate resources for the 2004 data challenges.
There are many issues to resolve and a lot of work to do – but this must be done incrementally on the running service.

Conclusions

From the point of view of the LCG plan, we are late in having testable middleware with the functionality that we had hoped for.
We will keep to the July deployment schedule:
– We expect to have the major components – the user's view of the middleware (i.e. via the RB) should not change
– We expect to be able to do less testing and commissioning than planned
– But hopefully, with a suitable process, we will incrementally improve the system and add functionality as it becomes available and is tested