Note there is a related issue with Panda-specific reporting - Arron from UTA is looking into this.
BNL_PANDA vs BNL_OSG; BNL_OSG seems to be under-reporting.
Discussion of scaling factors - can we agree on a set of numbers for processor types?
In terms of reporting - there is a way of correcting data for usage that fell outside the accounting system.
Michael: proposes a small working group to agree on a path for determining these numbers. Facility personnel (Tony Chen) at BNL have worked on the scale for SI2K(?). Set up an initial phone conference.
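As a rough illustration of how an agreed set of scaling factors would enter the accounting, the sketch below converts raw CPU hours per processor type into normalized kSI2K-hours. The processor names and factor values are placeholders invented for this example; they are not numbers from BNL or any other site, and the real values are exactly what the proposed working group would need to settle.

```python
# Hypothetical per-processor scaling factors, in SI2K per core.
# Placeholder values for illustration only; not measured or agreed numbers.
SCALING_FACTORS_SI2K = {
    "cpu_type_a": 1000.0,
    "cpu_type_b": 1500.0,
}


def normalized_usage(cpu_hours_by_type, factors=SCALING_FACTORS_SI2K):
    """Return total usage in kSI2K-hours from raw CPU hours per processor type."""
    total_si2k_hours = 0.0
    for cpu_type, hours in cpu_hours_by_type.items():
        if cpu_type not in factors:
            # Refuse to guess: every processor type needs an agreed factor.
            raise KeyError(f"no scaling factor agreed for {cpu_type!r}")
        total_si2k_hours += hours * factors[cpu_type]
    return total_si2k_hours / 1000.0  # SI2K-hours -> kSI2K-hours


# Example: a site reporting 200 CPU hours on type A and 100 on type B.
site_usage = {"cpu_type_a": 200.0, "cpu_type_b": 100.0}
print(normalized_usage(site_usage))
```

Raising an error for an unknown processor type, rather than defaulting to a factor of 1, keeps unnormalized hours from silently entering the totals.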
Operations: Production (Kaushik)
Production summary (Kaushik). Reconstruction production - input files were sent to BNL, but bugs were found yesterday, so the massive reconstruction has been postponed for two weeks. Note that the pre-stager in Panda is working now, so datasets on disk can be accessed. Release? 13.0.20.3 -- problems. 13.0.20.4? 13.0.30? Probably continue with Xin's mechanism for release installation. This week was busy with panda mover and site info database issues. Also working on multi-cloud scheduling for tasks.
Are there dCache problems at BNL since yesterday? Gabriele: there were heavy load problems on the pnfs server, but they were solved this morning.
Some users have submitted very large user analysis tasks which required thousands of RDO files. Tadashi has put in a fairshare mechanism to prevent users from doing this.
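The minutes do not say how Tadashi's fairshare mechanism works, so the following is only a generic sketch of the idea, not Panda's implementation. It hands out free job slots by cycling over users, so that one task with thousands of input files cannot take every slot while other users still have jobs waiting. The function name, user names, and job ids are all hypothetical.

```python
from collections import deque


def fair_share_dispatch(pending, slots):
    """Pick up to `slots` jobs, taking one job per user in turn.

    `pending` maps a user name to a list of job ids (all hypothetical).
    Returns a list of (user, job_id) pairs in dispatch order.
    """
    queues = deque((user, deque(jobs)) for user, jobs in pending.items() if jobs)
    dispatched = []
    while queues and len(dispatched) < slots:
        user, jobs = queues.popleft()
        dispatched.append((user, jobs.popleft()))
        if jobs:
            # This user still has work: send them to the back of the line.
            queues.append((user, jobs))
    return dispatched


pending = {
    "heavy_user": [f"rdo_{i}" for i in range(1000)],
    "user_b": ["b1", "b2"],
    "user_c": ["c1"],
}
picked = fair_share_dispatch(pending, slots=6)
for user, job in picked:
    print(user, job)
```

With six slots, the heavy user gets three of them and the two small users get all three of their jobs through, instead of the large task taking all six.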
Site information is now available through Panda monitoring.
Shifters report and other production issues (Mark) - mostly covered above.
Operations: DDM (Alexei)
Kors asked to postpone functional tests for LCG sites - awaiting more info from Miguel. Believes next week may be a good time to revisit this.
Propose half-day to work together on this (Hiro, Patrick, Miguel) - next Wednesday.
There will be a large number of subscriptions cancelled on the LCG sites. A large number of small files is causing the problems by clogging the system.
From Jay: The MonALISA developers helped me through my problems, and there are now some graphs up on the MonALISA client that you can access. There is dummy test data for last Friday. I wanted to wait until the sites were tuned before starting the live data, since a gigabyte is required to get statistics to some sites, which takes a very long time for sites with a slow transfer rate.
To view the graphs, open the MonALISA client by clicking: http://monalisa.cacr.caltech.edu/ml_client/MonaLisa.jnlp. Once the client is open, click "Groups" on the left. Make sure usatlas is checked in the Groups menu at the top. Expand it in Farms. Click on BNL_ITB_Test1 within the "Farms" section. Expand Loadtest in Clusters. Click on a transfer such as "bnl->bu (MB/s)". Select all the parameters in the "Parameters" box. Then click "history plot" and then the "history 2d plot". Click "View->Plot interval->Relative" in the menu, then choose an end date/time of Friday at 00:00:00 and 1440 (24 hours) for the number of minutes. You should see the comparison of network, gridftp_m2m, and gridftp_d2d over time.
My biggest complaint with the graphs is that I can only see the time on the axis and not the date. There are also other plots besides the "History plot" that you can play with.
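Jay's point that a one-gigabyte test takes a long time to slow sites can be made concrete with a small calculation. The sketch below assumes 1 GB = 1000 MB and uses illustrative transfer rates that are not measurements from any site.

```python
def transfer_seconds(size_mb, rate_mb_per_s):
    """Time in seconds to move `size_mb` megabytes at a steady rate in MB/s."""
    if rate_mb_per_s <= 0:
        raise ValueError("transfer rate must be positive")
    return size_mb / rate_mb_per_s


TEST_SIZE_MB = 1000.0  # one gigabyte, assuming 1 GB = 1000 MB

# Illustrative rates only; not measured values for any site.
for rate in (1.0, 5.0, 50.0):
    minutes = transfer_seconds(TEST_SIZE_MB, rate) / 60.0
    print(f"{rate:5.1f} MB/s -> {minutes:6.1f} min per 1 GB test")
```

At 1 MB/s a single test takes about 17 minutes, so a 24-hour (1440-minute) plot window holds fewer than 90 such tests per link.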
Network Performance and Throughput initiative (Shawn)
MWT2 now done. Initially 40-50 Mbps; now a full Gbps.
What about the issue of jumbo frames, generally? Concern about mixing client-server frame sizes - losing connectivity. Sometimes "path-MTU-discovery" does not work properly. More important for 10G systems.
Rich notes that host, router, switch and VLAN are all affected by MTU size.
Site admins need to investigate these details to reach our scalability targets.
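To put numbers on the jumbo-frame discussion, the sketch below estimates what fraction of the line rate is available to TCP payload at a standard 1500-byte MTU and at a 9000-byte jumbo MTU. It assumes IPv4 and TCP headers without options (20 bytes each) and untagged Ethernet framing (14-byte header, 4-byte FCS, 8-byte preamble and 12-byte inter-frame gap); a VLAN tag would add 4 more bytes per frame.

```python
# Per-frame Ethernet overhead outside the MTU, in bytes:
# header (14) + FCS (4) + preamble/SFD (8) + inter-frame gap (12).
ETH_OVERHEAD = 14 + 4 + 8 + 12

# IPv4 and TCP headers, assuming no options, in bytes.
IP_TCP_HEADERS = 20 + 20


def tcp_payload_efficiency(mtu):
    """Fraction of the wire rate carrying TCP payload for full-size frames."""
    payload = mtu - IP_TCP_HEADERS
    on_wire = mtu + ETH_OVERHEAD
    return payload / on_wire


for mtu in (1500, 9000):
    eff = tcp_payload_efficiency(mtu)
    print(f"MTU {mtu}: {eff:.1%} of line rate, "
          f"about {eff * 1000:.0f} Mbps of payload on a 1 Gbps link")
```

The gain in payload rate is only a few percent. The larger effect is that a 9000-byte MTU needs roughly six times fewer packets for the same data, which reduces per-packet processing on hosts and is one reason the question matters more for 10G systems.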
Update on validation of Panda on OSG ITB 0.7 (UC_ITB). Ran more than 20 full Panda jobs over three days last week (changed UC_ATLAS_MWT2 siteinfo in pilots to point to UC_ITB site). Info on jobs:
Agree we should check off green in the OSG VO validation table.
Nagios monitoring (Tomasz)
Nagios service groups and hierarchy. Carry-over.
Dantong reports progress wrapping RSV probes for Nagios.
Site news and issues (All Sites)
T1: FTS upgrade in progress; Thumpers being installed (now on rack #2). dCache issues have been addressed, and more work is under way to understand the effect.
AGLT2: Expected ship dates for new cores: Oct 2-5 + three days. 300 UM/400 MSU in November. Expect to be operational a week afterwards.
NET2: no report
MWT2: new IU_OSG site progressing; panda mover issue; 10G NIC replacement; UC_ATLAS_MWT2 nearly rebuilt.
SWT2_UTA: debugging autopilot issues; setting up host for ; SWT2 gatekeeper not reporting.
SWT2_OU: expect Dell visit Oct 8 and rocks/osg/dq2 deployment thereafter
WT2: Tier2 meeting - November/December is busy at SLAC, so book hotels early. Addressing the panda mover issue at SLAC. Getting probation to run without gsi-authentication. Backport old job accounting information into Gratia.