Trying to get final prices for Dell; Intel processors. Sun Thor system will go out today. MSU and UM have received Koi systems.
SWT2 - no update - getting started.
MWT2 - Preliminary pricing from Dell in hand for storage servers and networking; needs a small iteration. Current planning is a 364 TB (usable) procurement.
NET2 - negotiations still in progress with IBM, combining in a large order. No news.
Fully deployed infrastructure at all sites by Sep 30
Operations overview: Production (Kaushik)
MC production on/off.
Jamboree week.
DQ2 going slow.
Follow-up issues:
Job eviction problems - work still going on. Trying out Condor-G fixes.
Subversion server loading - server was heavily loaded; a Squid server was deployed last Thursday, then backed out. Back to default now (no Squid).
Checksum errors - a corrupt dataset was scrapped entirely. The corruption was caused by a file being replaced at CERN. There is a new proposal to handle this case. No developments on checking checksums in data transfers.
PRODDISK integration
Next step: Paul needs to be involved as we go through the sites. Yuri will supervise migrating the sites one by one with Paul and site admins. Start with AGLT2 - put it fully into production.
AGLT2 still in test mode
Fine at Michigan; UC and IU are being tested.
Space tokens
Would like to do an inventory of deployed space tokens site-by-site next week.
USERDISK and pathena analysis jobs.
Need official page for this.
Shift report
Production up and down, no major site issues to report at this time.
A couple of site issues have come up, but they were responded to quickly.
Operations: DDM (Hiro)
Follow-up issues:
Checksums for the US - waiting for Paul to put Adler32 into the pilot (for output files, at registration); meanwhile checking the dCache checksum and the catalog. Not implemented in Bestman, but Wei believes they can implement it. Paul will work on this after the space tokens are complete.
Wei has discussed the issue with Hiro - a list of pros/cons was sent to Jean-Philippe Baud. Has started discussion with FTS and DQ2 folks. There needs to be a discussion on where these checks should be done.
In xrootd itself, you can get any checksum you want; not sure what is needed for Bestman - will discuss w/ Hiro.
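For reference on the Adler32 item, a minimal sketch of how a file checksum could be computed and compared against a catalog value. It uses only Python's standard `zlib.adler32`; the function names, the chunk size, and the 8-digit lowercase hex format for the catalog value are assumptions for illustration, not what the pilot, dCache, or Bestman actually do.

```python
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as an 8-digit lowercase hex string.

    The file is read in chunks so large output files need not fit in memory.
    """
    value = 1  # Adler-32 starts from 1 by definition
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            # Passing the running value continues the checksum across chunks
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")


def matches_catalog(path, catalog_checksum):
    """Hypothetical helper: compare the local checksum with a catalog entry.

    Assumes the catalog stores Adler-32 as a hex string; case and missing
    leading zeros are normalized before comparing.
    """
    expected = catalog_checksum.strip().lower().zfill(8)
    return adler32_of_file(path) == expected
```

A check like this could run wherever the discussion above decides the comparison belongs (pilot, storage element, or transfer service); the sketch takes no position on that.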
Fred notes there have been problems at the CERN level (RSV is reporting correctly), but these should be fixed in the monthly summary report.
Will put up testing interval information.
SE from UTA (gk03) - all is fine. Mark in contact with GOC to rename the common name. Should show up in the reporting at some point.
Site news and issues (all sites)
T1:
last week: WAN connectivity issue last Friday and Saturday - the primary link from CERN to BNL went down and failover didn't work. Policy-based routing was removed from the border router, but Panda services were broken after the change. The primary link came back up and the previous configuration was restored. Discussing how to fix this problem, which only happens at the BNL Tier 1 because of the firewall. High priority to find a solution. Considering moving resources closer to the interface of the OPN; this would probably require at least a day of downtime.
this week: Hiro: dcache gridftp doors almost ready, testing next week. New thumpers will be ready next week (all will be deployed). ~20 thumpers.
AGLT2:
last week: turned back on - waiting for pilots.
this week: been getting autopilots since last week, and analysis queues are working. DQ2 end user tools cannot fetch files from the site. Mario Lassnig aware, ticket open. Probably will require a new release.
NET2:
last week: no problems - still trying to get some new hardware up (Harvard).
this week: all systems go, only one analysis job. Still working on HU networking. Will probably need Panda help soon.
MWT2:
last week: autopilot adjuster has been disabled so as not to interfere with pilot eviction troubleshooting.
this week: no big news. Space token based tests going on.
SWT2 (UTA):
last week: no problems
this week: all is well.
SWT2 (OU):
last week: no report
this week: Nothing much to report for OU, all is well, but the old OSCER topdawg cluster will be decommissioned on Friday, so I just asked the pandashift people to turn off submission. We'll get the new grid gatekeeper for the new sooner cluster up and running soon, so hopefully we can restart production again soon. Everything else is running fine. Thanks, Horst
WT2:
last week: still working on the conditions database access. AGLT2 confirmed similar latency issues for the database access. Will be taking the issue to the 3D meetings at CERN. Is the time required for access significant compared to the total job time? Exception for access to CERN. There is a lot of effort required to set up another stream to a site. Still working on the network monitoring equipment. There is still some concern about the Web100 kernel and the reliability of the hardware.
this week: still working on conditions database.
Carryover issues (any updates?)
Release installation via Pacballs + DDM (Xin, Fred)
Testing pacball downloads, working on getting releases transferred by DQ2. Getting timeouts today (had been working fine before). Will send Savannah report.
Xin: still waiting for Alessandro to finish the last version of the scripts. Will be working with Torre's group to set up the submission system. Expect by end of the week.