News from ATLAS software week - validation session - there will be a single executor?
Panda server was down last night; resolved, but it resulted in a large number of 'lost heartbeat' jobs.
IU_OSG - still some transfer issues, perhaps related to firewall issues.
BNL FTS issue - there was an Oracle backend problem, resolved.
LUT - will be down for two weeks.
Wensheng: Tape-stage-in issues on Monday, resulting in job starvation. These are evgen input files. Can they be pinned on disk? At least it would be good to have a plan in place. Has this been the main problem with lack of input files? Finding some panda-mover jobs still running after 5 days - is there a problem with the scheduling?
Operations: DDM (Alexei)
M4/M5 replication - see note.
DQ2 0.4 deployment (Hiro, Patrick, Shawn)
See further DQ2SiteServices to capture deployment experience, known issues.
Results from AGLT2
Seems to be running fine.
Installation went smoothly. Were running a recent version of MySQL.
Hiro had fixed some config problems, but they were minor.
No problems to report.
There was a fairshare test made by Alexei - all went well.
Shawn installed both agents and transfer queue on a dual quad core server - loads very low.
Next sites: BU, MWT2, WT2 - starting Monday next week.
One issue: subscription control. Have seen cases of users subscribing datasets without the site admin's knowledge.
Great to have the page, no immediate comments from the group.
Mysql LRC (John)
Some progress over the past week.
Waiting on some repository information from BNL's OS group - need to mirror some CERN repositories. Libraries have to be pulled from BNL, not CERN, for security purposes.
Also waiting on a test dataset from Hiro.
Looking into the code base - deeply coupled with the DQ2 code base. Uses DQ2 web services infrastructure. Looks like it cannot be separated from a DQ2 install.
RSV, Nagios, SAM (WLCG) site availability monitoring program (Tomasz)
Follow-up on: Tomasz will write a generic wrapper for Nagios to RSV probes - plan to release this Friday.
Has sent new wrappers to Arvind and Rob.
Round-up of RED Nagios issues:
gk01.swt2.uta.edu - work going on.
tier2-osg.uchicago.edu - working on it.
OU - LRC back up.
Will provide a second Nagios server on which site admins can be granted admin privileges.
Please alert Tom about false positives.
Site news and issues (All Sites)
T1: ATLAS Panda database migrated last week; addressing firewall problems. Last night the Panda server crashed - out of memory. Oracle FTS problems recovered. Investigating Oracle redundancy.
AGLT2: Shawn had to leave.
NET2: No probs - could use more.
MWT2: production clusters okay - still working on uc-prototype.
SWT2_UTA: still working on Ibrix probs.
SWT2_OU: Installed OSG 0.6, basically ready. Copied LRC back from backup, ran cleanse.py. IPMI baseboard management of head nodes is not working, preventing remote power cycles. Solved last week's rocks-ganglia problems, as well as Condor version problems. Expect to be online tomorrow.
WT2: Production going well. End of the month will turn on 10G network. Making progress running NTP server at SLAC - CD w/ flash stick.