In a holding pattern; tokens not deployed, and space-token-enabled Panda not quite ready.
168 TB is for data, which is ATLASDATADISK. This can be kept smaller until real data (only used for FT). 10 TB for now.
MCDISK - for simulations; deploy in chunks. Start with 20 TB.
GROUPDISK - not in use, yet.
PRODDISK, USERDISK - later.
Kaushik will send reminder via email.
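The initial allocations noted above can be tallied in a short sketch. The token names and sizes come from these notes; the dictionary layout, the `None` placeholder for tokens with no size yet, and the helper names are illustrative assumptions, not an actual site configuration format.

```python
# Initial space-token sizes in TB, taken from the notes above.
# None marks tokens with no size assigned yet (hypothetical convention).
INITIAL_ALLOCATION_TB = {
    "ATLASDATADISK": 10,   # kept small until real data; 168 TB is for data
    "MCDISK": 20,          # simulations; to be deployed in chunks
    "GROUPDISK": None,     # not in use yet
    "PRODDISK": None,      # later
    "USERDISK": None,      # later
}

DATADISK_TOTAL_TB = 168


def initial_total_tb(allocation):
    """Sum the sizes of the tokens that already have an allocation."""
    return sum(size for size in allocation.values() if size is not None)


def deferred_datadisk_tb(allocation, total=DATADISK_TOTAL_TB):
    """ATLASDATADISK space not deployed in the first step."""
    return total - allocation["ATLASDATADISK"]
```

With the numbers above, the first step deploys 30 TB in total and defers 158 TB of ATLASDATADISK.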
No scheduled tasks coming in. Complete breakdown in the production effort in ATLAS.
Shall we re-direct our focus on analysis at Tier2s?
Sites are responsible for checking that the analysis queue is functional, SE services, etc.
Support beyond this has to be provided by analysis
Need to ramp-up analysis benchmarks.
Software availability issue - people are waiting on a new release for 10 TeV?
Shifts (Mark)
Production proxy and upgrade to VOMS server. Thought it had been resolved, but the problem appears to be recurring intermittently.
SLAC - local proxy and port for ssl traffic - Wei solved.
SWT2_CPB - stopped autopilot submission - will be offline while integrating new hardware and storage.
File transfer backlog at AGLT2. Under investigation.
NFS outage at BNL, cleared.
Sporadic feed of jobs.
Analysis queues, FDR analysis (Nurcan)
Follow-up:
Regular exercising of analysis queues over data sets, especially when there are new releases. Nurcan is doing this.
Problems w/ FDR2 datasets; all sites successful except SLAC, which was surprising. There is a parameter in the queue definition that needed to be changed (Paul and Wei) - need to set up communication w/ Nurcan for any analysis queue changes. Problem w/ syntax of file URLs - adapted by Tadashi.
Points out need for continuous testing.
Two analysis jobs at MWT2 - change in pilot wrapper script for pilot child timeout processes contributed by Charles.
AGLT2 - user sends job that makes TAG selections; works fine at BNL, but not at Michigan. libdcap.so patch at BNL. There is a dccp client inside the ATLAS release (for dcap linking). Need to mitigate with SIT.
Reconstruction jobs will need to be tried at Tier2s. But there are other job types. Nurcan will be making a list - regular and advanced.
Will saturate analysis queues with jobs.
Mark and Nurcan will meet to discuss some systematic testing at the sites and will re-consider the analysis benchmarks.
Metric - define a standard for time required to process a standard dataset
Consider a site availability monitor that indicates basic functionality and site-readiness; this would help users distinguish "site" problems from "user-code" problems.
Panda monitor still probably the best place for users to analyze jobs.
There is a new version of elog that might be useful.
Request from a user for a link to the Panda monitor giving site status information (downtimes, etc.); should be well advertised to users.
Twiki page to collect problems w/ analysis queues.
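One way to picture the site-availability idea discussed above: run a few basic functionality checks per site and use the results to label a failed job as a likely site problem or a likely user-code problem. This is only a sketch; the check names (`analysis_queue`, `storage_element`, `release_installed`) and the classification rule are assumptions for illustration, not an existing Panda monitor or RSV feature.

```python
from typing import Callable, Dict


def site_ready(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run each basic check; a check that raises counts as failed."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def classify_failure(results: Dict[str, bool]) -> str:
    """Label a failed job given the site's check results."""
    failed = sorted(name for name, ok in results.items() if not ok)
    if failed:
        return "likely site problem: " + ", ".join(failed)
    return "site checks pass: look at user code first"


if __name__ == "__main__":
    # Hypothetical checks; real ones would probe the queue, SE, etc.
    demo = {
        "analysis_queue": lambda: True,
        "storage_element": lambda: False,
        "release_installed": lambda: True,
    }
    print(classify_failure(site_ready(demo)))
```

A monitor along these lines would only indicate basic readiness; a job can still fail for site-specific reasons that the checks do not cover.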
Operations: DDM (Hiro)
All is well?
What about "file exists" problems - this is a bad sign.
There is a problem with a site service with files that came to BNL, but weren't registered, realized only after the service was restarted. Under investigation.
MCDISK - being run through two separate site services. The BNL service.
Kaushik reports this problem has been solved by Miguel at ADC development meeting.
Need to follow-up next week. (Hiro to discuss with Miguel)
Dantong: has set up two front-end nodes w/ backend Oracle cluster.
Hiro will start slow migration today
Panda group will need to use the testbed machine.
There are problems with the wlcg-client having two versions of globus.
LFC pieces are from a binary distribution - solves DQ2 utilities, but introduces probs with other client programs.
Possibility of adding an http interface to LFC - would provide clear separation between the client and service; request from Kaushik to bring up with LFC developers. Dantong will contact LFC developers.
RSV SE & CE probe update status (Fred)
Follow-up from last week:
SRM probes needed for AGLT2, SWT2, NET2
AGLT2 - has 2.0 probes, just not enabled. Will run configure.
BU - has RSV 2.0 running, but not reporting. Saul will follow-up, will install OSG 1.0 by next week.
SW - need SRM probes. Did upgrade, but may not have enabled SRM probe.
BNL - why not reporting? Xin claims it's reporting fine locally. Are they going into Gratia correctly? Fred will follow-up with Xin.
glexec - deployed in production at BNL; required at SLAC, especially for analysis jobs.
glexec needs outbound access.
Has been tested on the ITB at BNL.
Site news and issues (all sites)
T1: Dantong: yesterday had major NFS downtime, 4-5 hours. Autopilot submit host voms proxy certificate issue resolved (Marco).
AGLT2: Currently having issues getting files back to BNL - Wenjing working w/ Hiro, some not available. Pools got filled by resilient dcache mechanism. Urges to use new production space.
NET2: all is well.
MWT2: all is well.
SWT2 (UTA): CPB going down for major overhaul, addition of new machines.
SWT2 (OU): all is well. 3 headnodes now onto 10 G networks. Ready for testing w/ BNL after local tests.
WT2: talking with ATLAS database group to set up conditions database at SLAC (compute nodes have no IP connectivity). Analysis job issue - DQ2 and Panda problems with port number missing from URL. Hits only SLAC since they are doing direct reads from xrootd. Need a consistent convention. Is a single convention used by the pilot? Look at what the LRC interface is providing to see what the pilot is using (storage default location).
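The missing-port issue at WT2 could be handled by a single normalization step applied wherever file URLs are produced or consumed. The sketch below is an assumption about what such a convention might look like, not what the pilot, DQ2, or the LRC interface actually do; the function name, the example host, and the choice of default port are hypothetical (1094 is the usual xrootd port, but a site may use another).

```python
from urllib.parse import urlsplit, urlunsplit


def ensure_port(url: str, default_port: int) -> str:
    """Return url with an explicit port, adding default_port if none is given."""
    parts = urlsplit(url)
    if not parts.netloc:
        raise ValueError("URL has no host part: %r" % url)
    if parts.port is not None:
        return url  # port already explicit; leave the URL untouched
    # Append the port to the host part; path (including a leading '//') is kept.
    return urlunsplit(parts._replace(netloc="%s:%d" % (parts.netloc, default_port)))


if __name__ == "__main__":
    # Hypothetical xrootd URL without a port.
    print(ensure_port("root://xrd.example.org//atlas/file.root", 1094))
```

Applying the same function on both the catalog side and the pilot side would make the two agree regardless of which form was registered.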
Carryover issues (any updates?)
Pilot upgrade for space tokens (Kaushik (Paul))
A bit of development to do. Carry-over
No results yet from tests at AGLT2.
Release installation via Pacballs (Xin)
Follow-up
Progress - this morning to discuss this. Fred - hoping this week to have first set of pacballs installed in DQ2. Will test with some older releases on some test machines.
Need official naming scheme.
Get installed with a special Panda pilot job using the software role. Expect performance to improve.
Expect a couple of weeks of testing.
Goal to bring into production by end of the month (June).
There is a pacball release available which Xin has tested.
Saul: factorize into two problems - pacballs which define the release versus the delivery mechanism.