pyutils that didn't support xrootd (needs to be checked in Release 15.1.0). And then an old release, 14.4.0. A few failures because the xrootd server failed. Also moving the release install area off the xrootd server.
Yuri's summary from the weekly ADCoS meeting: http://indico.cern.ch/materialDisplay.py?contribId=1&materialId=0&confId=71379
1) 10/15: From Tomasz, regarding recent issues with nagios: It seems that by now we have a partial understanding of what caused the connectivity problems. The probes are back green and in a few moments I will re-enable nagios notifications.
2) 10/15: From Wei, regarding Bestman / CA certs issue: The issue between Bestman and users with Canada's westgrid certificates (see my original email below) is addressed via a workaround provided by the LBL team. The workaround is to replace $VDT_LOCATION/bestman/lib/globus/cry*.jar in bestman 2.2.1.2.i5 or above by the .jars in bestman 2.2.1.2.i3.
3) 10/16: Jobs were failing at MWT2_UC with the error "Input file DBRelease-7.5.1.tar.gz not found." From Sarah: DBRelease file was transferred with __DQ2 extension, which is incompatible with panda/pilot/athena software. I've renamed the file and updated the lfc registration.
4) 10/17: Storage upgrade at SLAC completed -- no major interruptions.
5) 10/19: Job failures at MWT2_UC likely due to disk cleanup removing still-needed files -- from Charles: I think the production job failures at MWT2_UC may have been due to overzealous cleanup of MWT2_UC_PRODDISK triggered by our site almost running out of space over the weekend. ggus 52475.
6) 10/19: Tadashi modified PandaMover to delete redundant files. (Thanks Charles for the feedback.)
7) 10/20: Large number of failed jobs at BNL, with the message "Get error: Too many/too large input files." Ongoing discussions about how to deal with this issue (i.e., in the pilot, split the jobs, etc.)
8) 10/21: Jobs were failing at all U.S. sites because they were attempting to access the Oracle db at BNL. Affected tasks were aborted. (Number of failed jobs >25k.)
Follow-ups from earlier reports:
(i) ATLAS User Analysis Test (UAT) re-scheduled for October 28-30.
Yuri's summary from the weekly ADCoS meeting: http://indico.cern.ch/materialDisplay.py?contribId=1&materialId=0&confId=72960
1) 10/28 - 30: UAT -- a postmortem announcement to follow.
2) 10/29: Jobs failing at MWT2_UC with "Get error: lsm-get failed (51456): 201 Copy command failed." See eLog 6462 for details from Charles. https://prod-grid-logger.cern.ch/elog/ATLAS+Computer+Operations+Logbook/6462. Also RT 14475.
3) 10/29: A new test instance of the RT server at BNL was announced by Jason (message to the usual mail lists). Try it out at: https://rt.racf.bnl.gov/rt3/
4) 10/30: BNL -- 32 TB of storage in MCDISK was offline for a period of time. Resolved -- see: https://prod-grid-logger.cern.ch/elog/ATLAS+Computer+Operations+Logbook/6489
5) Over this past weekend a problem arose with FTS proxy delegation at BNL. Hiro tracked it down to a clock skew, possibly related to the daylight saving time change.
6) 11/3: All-day outage at BNL for major core network upgrades -- completed as of ~8:00 p.m. CST. See: https://prod-grid-logger.cern.ch/elog/ATLAS+Computer+Operations+Logbook/6582
7) 11/3: Maintenance outage at MWT2 sites for a dCache upgrade (1.9.5). Completed, test jobs submitted by site admins, and the queues are back to 'online' as of this morning (11/4).
8) 11/3: Short (~3 hour) maintenance outage at UTD-HEP. Once it was over, test jobs were submitted by the EU shifter, but they were using old releases (v12 & 13). New jobs have been submitted this morning -- waiting for results.
9) 11/4: SLAC set offline to investigate a problem where jobs are failing with "Required CMTCONFIG (i686-slc4-gcc34-opt) incompatible with that of local system." RT 14512, eLog 6610.
Follow-ups from earlier reports:
(i) Reminder -- the next tier 2/3 meeting will be held at UTA 11/10 - 11/12. See: http://indico.cern.ch/conferenceDisplay.py?confId=71428
(ii) Shift summary from last week available at: www-hep.uta.edu/~sosebee/ADCoS/Shift-summary--10_28_09.html
As we agreed, we must consolidate all US T3 DDM related services (DQ2 SS and LFC) at BNL. As the first step, I would like to bring all DQ2 SS to BNL tomorrow. Basically, I need to ask you to turn off DQ2, since BNL's DQ2 SS will serve your sites. If you run DQ2 SS serving the following sites, please stop your DQ2 (or remove them from your configuration dq2.cfg).
UCT3
OUHEP (is this T3?)
WISC XYZ
UTD XYZ
ILLINOIS XYZ (done)
DUKE XYZ (done)
ANL XYZ (done)
If you know of any other sites, please let me know. Please keep your LFCs running. That will be the second step. I would like to do this tomorrow at 12PM US Eastern time. If you have any questions, please let me know.
Hiro
Meeting Notes from USATLAS Throughput Call
------------------------------------------
Attending: Shawn, Dave, Jason, Sarah, John, Horst, Karthik, Hiro, Doug
Discussion about perfSONAR status.
AGLT2 (rc4 version?), BNL, MSU, MWT2_IU (previous version) all lost configurations (apparent disk-problems, read-only disk, rebooting loses the config and data on disk). Possible issue for the KOI hardware? Needs to be debugged.
Issues with perfSONAR-BUOY Regular Testing (Throughput and/or One-Way Latency) changing from "Running" to "Not Running". Happening at AGLT2, BNL, MWT2, OU.
Dave reported running perfSONAR on a box for a few weeks (non-KOI; Intel box) and has had no problems. However, the tests have only been running for a few days.
Internet2 developers will be looking into these two problems, which may be the same problem (I/O errors may cause services to stop and/or data to be lost). They may request files from and/or access to problematic perfSONAR instances.
Next discussion was about using perfSONAR results to find issues:
1) SWT2-UTA is possibly an issue. OU, MWT2_IU and AGLT2 show low throughput. IU was showing lots of packet loss in latency tests. However, BNL->SWT2-UTA and MWT2_UC->SWT2-UTA seem OK. Needs looking into. Perhaps path differences may show where the problem lies.
2) AGLT2-BNL showing asymmetry again. AGLT2->BNL is good (>910Mbits/sec) but BNL->AGLT2 is poor (40-70 Mbits/sec) since October 15th. Need to investigate.
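The asymmetry check described in item 2 can be sketched in Python as follows. This is only an illustration, not a tool discussed on the call: the function name find_asymmetric_paths, the ratio threshold of 5, and the dictionary layout are assumptions, and the sample values simply reuse the figures quoted above (910 Mbits/sec in one direction and the 70 Mbits/sec upper bound in the other).

```python
def find_asymmetric_paths(throughput_mbps, ratio_threshold=5.0):
    """Flag site pairs whose two directions differ by more than ratio_threshold.

    throughput_mbps maps (source, destination) -> measured Mbits/sec.
    The threshold is an arbitrary, hypothetical choice for illustration.
    """
    flagged = []
    for (src, dst), forward in throughput_mbps.items():
        reverse = throughput_mbps.get((dst, src))
        # Skip pairs with only one direction measured, and visit each pair once.
        if reverse is None or src > dst:
            continue
        slow, fast = sorted((forward, reverse))
        if slow <= 0 or fast / slow > ratio_threshold:
            flagged.append((src, dst, forward, reverse))
    return flagged


# Values taken from the notes above; no other measurements are assumed.
measurements = {
    ("AGLT2", "BNL"): 910.0,  # AGLT2->BNL, reported as >910 Mbits/sec
    ("BNL", "AGLT2"): 70.0,   # BNL->AGLT2, upper end of the 40-70 range
}

for src, dst, forward, reverse in find_asymmetric_paths(measurements):
    print(f"{src}->{dst}: {forward} Mbits/sec, {dst}->{src}: {reverse} Mbits/sec")
```

A ratio test is used rather than an absolute difference so that the same threshold applies to both slow and fast links; any real check would need thresholds tuned to the measured paths.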
Discussion about Hiro's new testing. Tier-2 to Tier-2 testing is now in place. Only 10 files are moved in each test, and only DATADISK is used. Select a Tier-2 site (_DATADISK version) from http://www.usatlas.bnl.gov/dq2/throughput and you can see the results (scroll down for graphs).
Slowdown to IU Tier-2 at the end of October: the reason was found. The gridftp2 settings on the FTS channel were lost, preventing door-to-door transfers. Hiro fixed this and throughput was restored.
Discussion about LHC network (LHCOPN CERN<->BNL). BNL/John Bigrow wants to test the new LHCOPN link to BNL for load-balancing. Need to arrange a high-throughput test from CERN <-> BNL (could use iperf). Will work with Hiro/Eduardo/Simone on this sometime next week.
Milestone: BNL->(set of Tier-2s at 1GB/sec for 1 hour). Test this Friday (October 23 at either 10 or 11 AM). Need to get feedback from others on which time is best.
We will NOT be meeting next week unless someone wants to chair the meeting in my absence (I will be on a plane to Shanghai). Next meeting TBD.
--------------------------------------------------------------------------------------------------------
This is a report of installed computing and storage capacity at sites.
For more details about installed capacity and its calculation refer to the installed capacity document at
https://twiki.grid.iu.edu/twiki/pub/Operations/BdiiInstalledCapacityValidation/WLCG_GlueSchemaUsage-1.8.pdf
--------------------------------------------------------------------------------------------------------
* Report date: Tue Sep 29 14:40:07
* ICC: Calculated installed computing capacity in KSI2K
* OSC: Calculated online storage capacity in GB
* UL: Upper Limit; LL: Lower Limit. Note: These values are authoritative and are derived from OIMv2 through MyOSG. That does not
necessarily mean they are correct values. The T2 co-ordinators are responsible for updating those values in OIM and ensuring they
are correct.
* %Diff: % Difference between the calculated values and the UL/LL
-ve %Diff value: Calculated value < Lower limit
+ve %Diff value: Calculated value > Upper limit
~ Indicates possible issues with numbers for a particular site
-----------------------------------------------------------------------------------------------------------------------------
# | SITE | ICC | LL | UL | %Diff | OSC | LL | UL | %Diff |
-----------------------------------------------------------------------------------------------------------------------------
ATLAS sites
1 | AGLT2 | 5,150 | 4,677 | 4,677 | 9 | 645,022 | 542,000 | 542,000 | 15 |
2 | ~ AGLT2_CE_2 | 165 | 136 | 136 | 17 | 10,999 | 0 | 0 | 100 |
3 | ~ BNL_ATLAS_1 | 6,926 | 0 | 0 | 100 | 4,771,823 | 0 | 0 | 100 |
4 | ~ BNL_ATLAS_2 | 6,926 | 0 | 500 | 92 | 4,771,823 | 0 | 0 | 100 |
5 | ~ BU_ATLAS_Tier2 | 1,615 | 1,910 | 1,910 | -18 | 511 | 400,000 | 400,000 | -78,177 |
6 | ~ MWT2_IU | 928 | 3,276 | 3,276 | -252 | 0 | 179,000 | 179,000 | -100 |
7 | ~ MWT2_UC | 0 | 3,276 | 3,276 | -100 | 0 | 179,000 | 179,000 | -100 |
8 | ~ OU_OCHEP_SWT2 | 611 | 464 | 464 | 24 | 11,128 | 16,000 | 120,000 | -43 |
9 | ~ SWT2_CPB | 1,389 | 1,383 | 1,383 | 0 | 5,953 | 235,000 | 235,000 | -3,847 |
10 | ~ UTA_SWT2 | 493 | 493 | 493 | 0 | 13,752 | 15,000 | 15,000 | -9 |
11 | ~ WT2 | 1,377 | 820 | 1,202 | 12 | 0 | 0 | 0 | 0 |
-----------------------------------------------------------------------------------------------------------------------------
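The %Diff column can be approximated with the short Python sketch below. The formula is not stated in the report; it is inferred from the rows above and should be treated as an assumption: the difference from the violated limit is divided by the calculated value, the result is truncated toward zero, and a calculated value of 0 below the lower limit is reported as -100. The function name pct_diff is hypothetical. The sketch reproduces most of the rows shown, but it gives -253 rather than -252 for the MWT2_IU ICC row, possibly because the displayed ICC value is rounded.

```python
def pct_diff(calculated, lower, upper):
    """Inferred %Diff: positive above the upper limit, negative below the lower limit.

    Assumption (not from the report): the difference is taken relative to the
    calculated value and truncated toward zero.
    """
    if calculated > upper:
        limit = upper
    elif calculated < lower:
        limit = lower
    else:
        return 0  # within [lower, upper]
    if calculated == 0:
        return -100  # avoids division by zero; matches the MWT2_UC rows
    return int((calculated - limit) * 100 / calculated)


# Rows taken from the table above (ICC in KSI2K, OSC in GB).
print(pct_diff(5150, 4677, 4677))      # AGLT2 ICC
print(pct_diff(511, 400000, 400000))   # BU_ATLAS_Tier2 OSC
print(pct_diff(0, 3276, 3276))         # MWT2_UC ICC
```

Dividing by the calculated value rather than by the limit explains the very large negative entries: a site publishing 511 GB against a 400,000 GB lower limit is off by many multiples of its own published value.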
Please note that this site is a content mirror of the BNL US ATLAS TWiki.