Service dependencies - so that a failure in one component does not generate a plethora of dependent alarms.
False alarms - many generated by network firewall issues - addressed with additional network cards.
Demo - going through Nagios pages.
See new features for extended information for a specific service. Has links into RT page.
Operations to know: schedule a downtime, schedule a test, acknowledge a problem.
Resolve problem in RT and then in Nagios, and force a final test.
Need to formalize Nagios operations - go to tactical overview to acknowledge the problem, take ownership of ticket, resolve, close, etc.
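To illustrate the service-dependency idea noted above, here is a minimal sketch (not Nagios configuration syntax, and not taken from the talk) of how dependent alarms could be suppressed. The service names and the `depends_on` map are hypothetical.

```python
def root_cause_alarms(failed, depends_on):
    """Return the failed services that should still raise an alarm.

    A failed service is suppressed when at least one service it directly
    depends on has also failed, so only the upstream failure is reported.
    """
    failed = set(failed)
    return sorted(
        service
        for service in failed
        if not any(dep in failed for dep in depends_on.get(service, ()))
    )


# Hypothetical dependency map: each service lists what it depends on.
depends_on = {
    "srm": ["network"],
    "gridftp": ["network"],
    "panda-pilot": ["srm"],
}
```

With this map, a network outage that also takes down srm, gridftp, and panda-pilot would raise a single alarm for "network".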
Integration of RSV probes into Nagios
Options - run as a Nagios plugin.
Need feedback from users. What/who/notification policy (email/pager). Use of event handlers - want/need?
Question about automatic assignment of tasks from RT that may not be "owned" by a Tier2 admin, e.g. Tier3 problems. Answer is that Tier3 points of contact must receive the tickets.
Need to provide feedback from Tier2 admins and shift crew.
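As a sketch of the "run as a Nagios plugin" option: a Nagios plugin reports its result through its exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) plus one line of status text. The helper below assumes the RSV probe result has already been reduced to a status word and a detail string; the function name and the message layout are illustrative, not part of RSV or Nagios.

```python
# Standard Nagios plugin exit codes.
NAGIOS_EXIT = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def to_nagios(probe_status, detail):
    """Translate a probe status word into (exit_code, status_line).

    Any status word that is not recognized is reported as UNKNOWN so that
    a malformed probe result never shows up as OK.
    """
    status = probe_status.strip().upper()
    if status not in NAGIOS_EXIT:
        detail = f"unrecognized probe status {probe_status!r}: {detail}"
        status = "UNKNOWN"
    return NAGIOS_EXIT[status], f"RSV {status} - {detail}"
```

A wrapper script would print the returned status line and exit with the returned code, for example `code, line = to_nagios("CRITICAL", "gridftp probe failed")`.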
Network optimization and load tests - Dantong Yu
See slides
Network configuration slide - which way does the traffic flow (T1-T1, T2-T1) - is there a bottleneck introduced by the firewalls?
Load testing goals - a ten-fold increase over current performance.
Looking at details between BNL and AGLT2.
Setting up Tier2 dynamic network links BNL to NET2 using Terapaths. What are the policy issues?
Rich Carlson - explaining policy from Internet2 perspective. There is an infrastructure for setting up and tearing down a reserved slice of the bandwidth.
Would like to work out what we can do for special replications for AOD replication.
Nebraska was able to move data at 9 Gbps and isolate it from their campus traffic.
What about Esnet - Internet2 connections? There is on-going work in this area.
Working through an example using a VDT client to copy a file into dCache
Replica manager allows resilient pools
SRM v2.2 access latency and retention policy - defined w/ WLCG MOU. Mappings for TapeXDiskY. When making a reservation you can specify these attributes.
Link groups are used to capture attribute specifications for space reservations.
Question from Doug: dCache appropriate for a small Tier3? Concerns about equipment overhead for admin services and doors.
SRM on filesystems - Bestman from VDT is available as well.
Question from Wu: what about pool-to-pool transfers - which protocol is used? Answer: dcap.
Question about queues - both Tomcat and SRM have queues. Tomcat for example can open up to 100 threads. Queues in SRM depend on how busy the pools are.
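For reference, the TapeXDiskY mappings mentioned above can be written out as a small lookup. The sketch uses the usual WLCG storage-class definitions (Tape1Disk0, Tape1Disk1, Tape0Disk1); the function name and the dictionary keys it returns are illustrative and do not correspond to any SRM client API.

```python
# WLCG storage class -> (retention policy, access latency) in SRM v2.2 terms.
STORAGE_CLASSES = {
    "Tape1Disk0": ("CUSTODIAL", "NEARLINE"),  # tape copy, disk acts as cache
    "Tape1Disk1": ("CUSTODIAL", "ONLINE"),    # tape copy plus permanent disk copy
    "Tape0Disk1": ("REPLICA", "ONLINE"),      # disk only, no tape copy
}


def reservation_attributes(storage_class):
    """Return the attributes to request when reserving space in this class."""
    retention, latency = STORAGE_CLASSES[storage_class]
    return {"retention_policy": retention, "access_latency": latency}
```

For a disk-only Tier2 the relevant entry would be `reservation_attributes("Tape0Disk1")`, i.e. REPLICA retention with ONLINE latency.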
Optimizing data transfers - Shawn McKee
See slides
Goal is to get 200 MB/s T1-T2 for all sites
Disk-to-disk tests are poor for a variety of reasons.
Start with the network stack, then move on to disk
Will need to examine the details at each site - mostly in terms of their storage services, gridftp doors
Timescale: should be able to do this now at AGLT2.
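To put the 200 MB/s T1-T2 goal in network terms, the arithmetic can be sketched as below. Decimal units are assumed (1 MB = 10^6 bytes, 1 Gbps = 10^9 bits/s), protocol overhead is ignored, and the function names are illustrative.

```python
def mb_per_s_to_gbps(rate_mb_s):
    """Convert a rate in MB/s to Gbps (8 bits per byte, decimal prefixes)."""
    return rate_mb_s * 8 / 1000


def gbps_to_mb_per_s(rate_gbps):
    """Convert a rate in Gbps to MB/s (decimal prefixes)."""
    return rate_gbps * 1000 / 8


def transfer_seconds(size_tb, rate_mb_s):
    """Time in seconds to move size_tb terabytes at a sustained rate_mb_s."""
    return size_tb * 1_000_000 / rate_mb_s
```

Under these assumptions 200 MB/s corresponds to 1.6 Gbps, and moving 1 TB at that sustained rate takes 5000 seconds, a little under an hour and a half.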
OSG Grid Middleware - Robert Gardner
See slides
Questions for follow-up - problems with SE info-provider?
Scalability with the gatekeeper, and failover. There is some work on this at Fermilab.
Glide-in factory to schedule analysis jobs locally to reduce load on gatekeeper.
US ATLAS Services: panda-based release installation - Tadashi Maeno (BNL)
See slides
Someone needs to be identified to 'operate' the system
There may sometimes be problems w/ installs over NFS
Question whether we want to continue this process or use the global ATLAS system. We will still need to put operational manpower behind this system.