Begin forwarded message: For LFC: Just yesterday I got it building on one platform and hope to have it building on multiple platforms today. So it's in good shape. For lcg-utils: I upgraded (it was a painful set of changes in my build, but it's done) and Tanya did a test that showed everything working well. But about 30 minutes ago, Brian Bockelman told me that I might need to upgrade again to avoid a bug--I just contacted some folks in EGEE, and they confirmed that I should upgrade to a new version. *sigh* Hopefully I can get this done today as well. All of that said: I can almost certainly give you something for testing this week. -alain
Yuri's weekly summary presented at the Tuesday morning ADCoS meeting: http://www-hep.uta.edu/atlas/World-wide-Panda_ADCoS-report-%28Aug25-31-2009%29.html
[ ESD reprocessing -- if no issues are discovered during the current final testing phase, it will likely start later this week. More details in Yuri's weekly summary. Run 91890 is being used for this final testing -- shifters were requested to ignore errors from these jobs.]
[Production generally running very smoothly this past week -- most tasks have low error rates. ]
1) 8/26: UTD-HEP set 'offline' while security patches were installed. Production was restarted over this past weekend after submitting test jobs, but then yesterday problems were noticed with the site, this time related to a file server in the cluster. Jobs are currently completing successfully, but no word on a resolution of the problem?
2) 8/27: Request to delete BU-DDM entry from DDM listing was posted -- Hiro did the removal. Savannah 54901.
3) 8/28: Sites needed to update their lcg-vomscerts package, in preparation for the update of the host cert for voms.cern.ch on Monday, 8/31. There were a large number of job failures on Monday with the error "Failed to get LFC replicas: -1 (lfc_getreplicas failed with: 2703, Could not secure the connection)" at those sites where the cert update had not yet been done. By Monday evening all sites appeared to have completed the update.
4) 9/1: New pilot version from Paul (v38a): The code has many internal changes (such as code optimizations as suggested by Charles Waldman) but of more general interest are the following new features and fixes:
* Get and put functions in lcgcp/cr site movers are now using the proper timeout options depending on which LCG-Utils version is available locally. The current options are: --connect-timeout=300 --sendreceive-timeout=3600. Transfers on sites using older LCG-Utils versions have -t 3600. Problems were seen during the testing phase at Brunel where the new timeout options caused segmentation violations with lcg-cp. A site-wide reinstallation of the command solved the problem. A similar problem is also seen at UNI-DORTMUND. Due to known difficulties with debugging jobs on this site (stdout logs are not available) Rod has set the site offline until we have solved the problem there.
* Direct access mode is now available for analysis sites using dCache in combination with the following site movers: dCacheSiteMover, BNLdCacheSiteMover, lcgcpSiteMover, lcgcp2SiteMover. Direct access mode was previously only available for xrootd sites. (LocalSiteMover will be updated next).
* Direct access can now be skipped for individual files (RAW files e.g.) via job attribute prodDBlockToken.
* Correction for missing ./ in the trf name for Nordugrid analysis trfs.
5) Follow-ups from earlier reports:
(i) 7/23-7/24 -- Ongoing work by Paul and Xin to debug issues with s/w installation jobs at OU_OSCER_ATLAS. Significant progress, but still a few remaining issues.
(ii) SLC5 upgrades will most likely happen sometime during the month of September at most sites.
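The timeout selection described for pilot v38a in item 4 can be illustrated with a minimal sketch. This is not the pilot's actual code: the function names and the boolean capability flag are hypothetical, and how the pilot detects the locally installed LCG-Utils version is not stated in the report. Only the option strings themselves are taken from the summary.

```python
from typing import List

# Option strings quoted in the pilot v38a release note.
NEW_STYLE_OPTIONS = ["--connect-timeout=300", "--sendreceive-timeout=3600"]
OLD_STYLE_OPTIONS = ["-t", "3600"]


def lcg_timeout_options(supports_new_timeouts: bool) -> List[str]:
    """Return the timeout options for an lcg-cp style transfer.

    `supports_new_timeouts` is a hypothetical flag standing in for whatever
    version check the pilot performs; the report gives no version threshold.
    """
    chosen = NEW_STYLE_OPTIONS if supports_new_timeouts else OLD_STYLE_OPTIONS
    return list(chosen)  # copy so callers cannot mutate the constants


def build_copy_command(source: str, destination: str,
                       supports_new_timeouts: bool) -> List[str]:
    """Assemble an argument list for lcg-cp (illustrative only).

    A real invocation carries further options that are omitted here.
    """
    return (["lcg-cp"]
            + lcg_timeout_options(supports_new_timeouts)
            + [source, destination])
```

Keeping the choice in one function means a site with an older installation falls back to the single -t option without the rest of the mover needing to know which variant is in use.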
USATLAS Throughput Meeting Notes
September 1, 2009
==============================
Attending: Shawn, Rich, Saul, Jay, Karthik
1) perfSONAR discussion – RC3 is available now. Feedback from currently testing sites (you know who you are!) is requested by the end of the week. The hope is RC3 will become the real release assuming no major problems are found (30 bugs were addressed in RC2->RC3).
2) Tier-3/DQ2 testing -- No report this week
3) Data-movement tests -- NET2 results are missing from the summary graphs on the main page. Hiro, can you add them? AGLT2 results have gotten very bad after the dCache upgrade from 1.9.2-5 to 1.9.4-2. A downgrade is being considered to see if it addresses the issue.
4) Circuit issues/discussion. UC testing status? No report this week.
5) Site reports
a. BNL
b. AGLT2 - dCache upgrade may be causing some issues.
c. MWT2 – No report but still need to continue circuit testing as soon as path is instrumented at UC/OmniPoP.
d. NET2 – New Myricom 10GE NICs have been in place for a while. No issues but no real testing yet.
e. SWT2 – Karthik is planning to test the new RC3 of perfSONAR
f. WT2
g. Wisconsin
6) AOB -- Thanks to Rich and Jay for their extremely valuable participation in the working group. Both are moving on to new things. We will very much miss them on our weekly calls.
Please send along any edits or updates to the list. Plan is to meet again at the regular time next week.
Shawn
Notes for US ATLAS Throughput Meeting September 8, 2009
===============================================
Attending: Dave, Doug, Sarah, Shawn, Horst, Karthik, Hiro, Rich
1) perfSONAR status – RC4 out? Some good feedback provided by current RC3 testers. Need to verify status of target release date (September 21?). Karthik reported RC3 worked pretty well but some services are disabled when they shouldn’t be. Aaron will be following up (in RC4). Plan for the near future is to have all US ATLAS Tier-2’s deploy the “released” perfSONAR within 1 week of its release (by the end of September). Then we need to verify that it works as expected and gain enough experience with its configuration and use to be able to make a recommendation within a month (by the end of October). The recommendation would be concerning whether or not Tier-3’s should deploy this version of perfSONAR (presumably 1 box devoted to bandwidth testing) OR await the next perfSONAR version (in case too many issues are still present that prevent wider deployment).
2) Updates on “Tier-3” related testing/throughput - Lots of discussion which touched upon (US)ATLAS policy, DQ2 and related topics. Our group is focused upon throughput and testing and we need to determine what to do for Tier-3’s. Draft idea is that each Tier-3 should be assigned a Tier-2 as a testing partner. Data movement tests will be configured using Hiro’s data transfer testing. Tier-3’s will have only 7-8 test files sent from their associated Tier-2 once per day. This is compared to the Tier-1 to Tier-2 tests which send 20 files, twice per day to two space-token areas. This testing along with a properly configured perfSONAR instance at each Tier-3 should provide sufficient initial monitoring/testing. Also, Shawn is assembling pages on the Twiki Rob has set up which are the beginning of a “how-to/cookbook” for throughput debugging, tuning and testing. **Please send along any URLs or info that would be appropriate to include**. Also, Doug pointed out that Tier-3’s need examples of existing storage systems and configurations in place already to help guide their selection of hardware purchases. **We need to get feedback (also on the Twiki) from each of the Tier-2’s on their hardware and configurations which can be used by the Tier-3’s**.
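To put the proposed Tier-3 test load next to the existing Tier-1 to Tier-2 load, here is a rough sketch of the arithmetic. It rests on two assumptions not confirmed in the notes: that the 20 files go to each of the two space-token areas (rather than 20 in total), and that a Tier-3 receives the upper figure of 8 files into a single destination. The class and constant names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferTestPlan:
    """Daily data-movement test load for one destination site (illustrative)."""
    files_per_run: int   # files sent in one test run to one destination area
    runs_per_day: int    # how many times per day the test is run
    destinations: int    # number of destination areas (e.g. space tokens)

    def files_per_day(self) -> int:
        return self.files_per_run * self.runs_per_day * self.destinations


# Assumption: 20 files to EACH of two space-token areas, twice per day.
TIER1_TO_TIER2 = TransferTestPlan(files_per_run=20, runs_per_day=2, destinations=2)

# Assumption: upper end of the "7-8 files" range, one destination, once per day.
TIER2_TO_TIER3 = TransferTestPlan(files_per_run=8, runs_per_day=1, destinations=1)

if __name__ == "__main__":
    print("Tier-1 -> Tier-2 files/day:", TIER1_TO_TIER2.files_per_day())
    print("Tier-2 -> Tier-3 files/day:", TIER2_TO_TIER3.files_per_day())
```

Under these assumptions the Tier-3 test moves about a tenth of the files that a Tier-2 receives per day, which is consistent with the intent of keeping the Tier-3 load light.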
3) Automated data-movement testing status – Logging of errors is to be added. In addition, Hiro will update the software to allow the “3rd party mode” which will enable tests from any site to any site in the US. Rob had some requests for updates to the plots that Sarah and Hiro will follow up on.
4) Virtual circuit testing status – Still waiting to hear that the UC networking folks have the path instrumented so we can repeat the tests. **Shawn will send another email asking for a status update**. If things are ready we will try to reschedule the test sometime in the next few days.
5) Site reports – Skipped till next meeting
6) AOB – Skipped till next meeting
Please send any corrections or additions to the list. We plan to meet again next week at the regular time.
Shawn
Please note that this site is a content mirror of the BNL US ATLAS TWiki.