No numbers this week due to backlog of transferring jobs.
Will update the plot for next week.
Things are slowly getting back to normal, pending DDM issues. 6000 jobs transferred overnight. 20,000 remain to be transferred.
We don't have enough jobs - again?
Patrick: "we kicked the box". The agents on the bnlpanda site service locked up; a restart was followed by another lock-up and another restart (without recreating the database). Is there a race condition among the agents? At MWT2_IU - Xin increased the number of FTS streams. It looks like the new selection algorithm (datasets-to-files) in DQ2 grabs files differently than before, so jobs take longer to move to "finished". Need better FTS and DQ2 monitoring. DQ2 also retries bad files repeatedly rather than moving good ones first.
Shifters report and other production issues (Mark)
The big issue has been the transferring problem - before that, production had been doing well at ~3000 concurrent jobs.
Dantong has results for NDT vs iperf performance; detailed report next week. (Dantong not available.)
Site news and issues (All Sites)
Follow-up from last week's news:
T1: there will be a major dCache upgrade tomorrow.
AGLT2: working on accounting - trying to get it working for a couple of reasons/projects in a standardized schema. Analysis queues - 4 set up - follow-up next week.
NET2: nothing new - have not run since Monday, waiting for jobs. No known problems at the site, just a job shortage. Problem with Eowyn. Jobs are coming more slowly than usual - seems to be suffering from trying to update job status in the prod DB.
MWT2_IU: Looking at transfer issues with Xin, see plots.
MWT2_UC: disk firmware upgrade went well. UPS upgrade as well.
SWT2-UTA: New cluster being installed this week, Dell onsite. Power outage at SWT2 cluster pushed back to September. New hardware: Dell SC1435 nodes (200 cores), dual dual-core Opterons; 75 TB raw in 10 Dell MD1000s with 500 GB drives.
SWT2-OU: All running okay, OSCER interruption this pm, still waiting for final date for the move (~Labor Day). Will be getting 37 drives of 500 GB each. Adding 23 quad nodes. Will reconfigure the cluster. 4 head nodes.
WT2: all working okay. Still trying to figure out if 30% AOD replication is complete. Power outage on Aug 27. US ATLAS analysis workshop next week, may not attend.
UC Teraport - coming back online after the 64-bit RHEL4 upgrade. Charles notes that setting LD_DEBUG=files makes the dynamic linker report the library files it loads.
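A rough sketch of Charles's LD_DEBUG tip, not something shown at the meeting: it runs a command with LD_DEBUG=files set and lists the objects the loader mentions. It assumes a glibc-based Linux host, where the debug output goes to stderr. The helper names parse_ld_debug_files and loaded_files, and the use of /bin/true as the test command, are made up for illustration.

```python
import os
import re
import subprocess


def parse_ld_debug_files(text):
    """Return the object names found in 'file=' entries, in first-seen order."""
    seen = []
    for match in re.finditer(r"file=(\S+)", text):
        name = match.group(1)
        if name not in seen:
            seen.append(name)
    return seen


def loaded_files(command):
    """Run command with LD_DEBUG=files and return the objects the loader reports."""
    # Only the child process gets LD_DEBUG; this interpreter is unaffected.
    env = dict(os.environ, LD_DEBUG="files")
    result = subprocess.run(command, env=env, capture_output=True, text=True)
    # Assumption: glibc writes LD_DEBUG output to stderr by default.
    return parse_ld_debug_files(result.stderr)


if __name__ == "__main__":
    # /bin/true is only an example; substitute the binary being debugged.
    for name in loaded_files(["/bin/true"]):
        print(name)
```

On a host without glibc the list may simply come back empty, so treat the output as a debugging aid rather than a complete inventory.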
Carryover action items
Encryption to syslog-ng: still to do, carryover.
Install NDT at each site - put in site certification table. Follow-up next week.
RT tickets - Dantong notes we need to get the queues cleaned up. Dantong and team will draft an initial policy.
Dantong has guidelines.
There are a couple of tickets that are not getting attention.