MinutesMar2
Introduction
Minutes of the Facilities Integration Program meeting, March 2, 2011
- Previous meetings and background : IntegrationProgram
- Coordinates: Wednesdays, 1:00pm Eastern
- 866-740-1260, Access code: 7027475
Audio Details:
Dial-in Number:
U.S. & Canada: 866.740.1260
U.S. Toll: 303.248.0285
Access Code: 7027475; Chair passcode: 8734
Registration Link:
https://cc.readytalk.com/r/bd2w3deu2kkg
Attending
- Meeting attendees: Fred, Aaron, Shawn, Charles, Rob, Karthik, Michael, Dave, Saul, AK, Jason, John DeStefano, Torre, Booker, Tomasz, Sarah, Patrick, Horst, Wei, Bob, Tom, Mark, Alden, Armen, Hiro
- Apologies: Kaushik
Integration program update (Rob, Michael)
- IntegrationPhase16
- Special meetings
- Tuesday (12 noon CDT) : Data management
- Tuesday (2pm CDT): Throughput meetings
- Upcoming related meetings:
- Program notes:
- last week(s)
- Reminder: Next face-to-face facilities meeting co-located with OSG All Hands meeting (March 7-11, Harvard Medical School, Boston), http://ahm.sbgrid.org/. US ATLAS agenda will be here.
- Updates to SiteCertificationP16
- Integration
- Funding is likely to be cut. There will be consequences for operations and planning. Where can we save? Which areas? Provide input to Michael
- Planning is difficult. At the open EB meeting Jim Shank presented resource requirements through 2013 that account for computing model changes. A number of changes, chiefly reducing ESD (the bulk of the volume). Proposal: one raw copy on disk, distributed over all T1s, and only a 10% "rolling buffer" of ESD on T1s (10% time window); 10 copies of AOD over all clouds, with previous versions reduced to 2; 10 DESDs, same amount as AODs, with each region receiving one copy. Going from this model to resource requirements: no change required at T1s (small increase in 2012), and disk requirements reduced by 5%. For the T2s an increase in users is expected, so CPU goes up significantly: 44%, 100 kHS06 at the US T2s, probably already covered by current capacity; 2013 brings a 30% increase. T2 disk: 2012 is 27% over what was predicted previously; 2013 is less than 5%. All of this has to fit into a new funding envelope, which is flat. Currently working on numbers given the new input yesterday.
- this week
- Next week is the US ATLAS Computing Facilities Workshop, co-located with the OSG All Hands, see: OSG Agenda page.
- Discussion regarding normalization:
- CPU is being reported accurately by OSG
- GIP information collected about sub-clusters
- Sent to WLCG as SI2K, unfolded from there
- OSG is adjusting the SI2K value - should be HS06.
- Request should be to get SI2K out of the picture.
- Two other issues - hyperthreading - should be logical CPU.
- WLCG policy is to use normalized CPU time.
- Resource requirements are being intensely discussed within ATLAS, and there will be changes coming, e.g. Tier2-Ds, which are being actively pursued by ADC. Simone is setting up direct connections; most of our Tier 2s are already involved, and we expect new connections from beyond the US cloud. Will need to see how the network cooperates. Resources would be used differently - more dynamically, with less static data - more cache-like.
- Close to data-taking again at LHC. Be prepared for new data.
- Tier2-D - will eventually be part of every T1. Expect to see lots of connections, x5 or x10. Gridftp servers.
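As a rough illustration of the normalization discussion above, the sketch below converts a sub-cluster's SI2K rating to HS06 and computes normalized CPU time per logical CPU. The factor of 250 SI2K per HS06 (4 HS06 per kSI2K) is the commonly quoted WLCG convention and is an assumption here, not something stated in these minutes; the function names and the example rating are hypothetical.

```python
# Assumption: 1 HS06 = 250 SI2K (i.e. 4 HS06 per kSI2K), the commonly quoted
# WLCG conversion. It is not taken from these minutes.
SI2K_PER_HS06 = 250.0


def si2k_to_hs06(si2k: float) -> float:
    """Convert a SpecInt2000 rating to HEP-SPEC06."""
    return si2k / SI2K_PER_HS06


def normalized_cpu_hours(cpu_seconds: float, hs06_per_logical_cpu: float) -> float:
    """Normalized CPU time, in HS06-hours, for work done on one logical CPU."""
    return cpu_seconds / 3600.0 * hs06_per_logical_cpu


if __name__ == "__main__":
    # Hypothetical sub-cluster rated at 2000 SI2K per logical CPU.
    per_cpu = si2k_to_hs06(2000.0)
    print("HS06 per logical CPU:", per_cpu)
    print("HS06-hours for 2 CPU-hours:", normalized_cpu_hours(7200.0, per_cpu))
```

Rating per logical CPU (rather than per physical core) follows the hyperthreading point raised in the discussion.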
Tier 3 Integration Program (Doug Benjamin, Rik Yoshida)
Tier 3 References:
- Links to the ATLAS T3 working group TWikis are here
- T3g Setup guide is here
- Users' guide to T3g is here
last week(s):
- xrootd rpm under test, needs work.
- Arizona coming online.
- Rik migrating out of Tier 3 management - to analysis support, but will stay closely involved since T3 and analysis closely related.
this week:
Operations overview: Production and Analysis (Kaushik)
- Production reference:
- Analysis reference:
- last meeting(s):
- Plenty of jobs .. no major issues
- Charles - reporting on walltime efficiency with onset of pileup jobs from CERN cloud (at MWT2), causing NFS load. A report to Borut and ADC operations list. Kaushik: should forward to software group.
- Brings us to CVMFS - to mitigate the load on NFS.
- this week:
- US cloud has been quiet with few issues; getting low on jobs at many sites overnight, picking up now.
- Significant new pilot release; one of the changes might have broken MWT2 - fix in the works.
- Panda monitor issues over the weekend - understood.
- Lack of defined jobs - taking a bit of time. Is panda-mover moving data quickly enough? Mark will check.
- Saul: noticing a large number of looping analysis jobs; some legitimate jobs were getting removed. The pilot was looking for a log output in the wrong place. People are getting annoyed - will get switched back to the 12 hour time limit.
Data Management & Storage Validation (Kaushik)
- Reference
- last week(s):
- MinutesDataManageFeb22
- MCDISK - cleanup finished at T2 - sites requested to check leftover dark data in pools assigned to MCDISK
- 2PB of data being converted at BNL
- Cleanup at SWT2
- AGLT2 - 200 TB of mc09.
- "Old GROUPDISK" - process of deleting pre-physics group
- this week:
- No meeting this week.
- Armen: all is well.
Shifters report (Mark)
- Reference
- last meeting: Operations summary:
Yuri's summary from the weekly ADCoS meeting:
http://indico.cern.ch/getFile.py/access?contribId=0&resId=0&materialId=slides&confId=128505
1) 2/16: SWT2_CPB - file transfer failures from the site to BNL/other tier-2's. High load on the storage caused the XrootdFS to become unresponsive. Load gradually came down,
SRM service restarted. Monitoring this issue. RT 19475 / ggus 67576 closed.
2) 2/17: From Dave at Illinois - Yesterday afternoon, by accident, the MTU on a 10Gb interface was set incorrectly (reset from 9000 back to 1500). This caused half of the worker
nodes at IllinoisHEP to hang when trying to write back to the SE. I found and fixed the MTU problem this morning, but unfortunately, the jobs that were running on those worker nodes
died in the process. So I assume that sometime in the next few hours many production jobs (120 or so) will show up as failing with lost heartbeat.
3) 2/17: Jobs failures at BU_ATLAS_Tier2o with the error "No space left on device." Issue was with analysis jobs occasionally requesting a large number of input files. Resolved by
setting the panda schedconfigdb parameter 'maxinputsize' from 14 to 7 GB. ggus 67565 closed, eLog 22381.
4) 2/18: New pilot release from Paul (SULU 45e) to address an issue with the storage system at RAL.
5) 2/20: From Alessandra Forti (update to issue with the new ddmadmin cert): New DN without email field has been deployed. Related ggus tickets closed.
6) 2/20: File transfer failures between SLACXRD_PRODDISK to BNL-OSG2_DATADISK with source errors "failed to contact on remote SRM." From Wei - One of the data servers
went down. This is fixed. ggus 67690 closed, eLog 22283. Later, 2/21, from Wei: We may need to take a short power outage on Tuesday to reset the host's AC.
7) 2/21: From Rob at MWT2 - We have had an AC unit fail this morning and as a result have had to shut down a number of worker nodes (storage and head node services are unaffected).
There will be production and some analysis job failures as a result. Later
that day: AC has been restored; bringing nodes back online. eLog 22303/08/17.
8) 2/23: WISC file transfer failures with SRM errors like "DESTINATION + srmls is failing with 'Connection reset'." Issue understood - from Wen: It's fixed now. A cron job had
failed to update the grid CAs, so no grid certificates could be authenticated. ggus 67836 can probably be closed at this point. eLog 22379.
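The MTU incident in item 2 above suggests a simple periodic check. The following is a minimal sketch, assuming Linux worker nodes where the kernel exposes each interface's MTU under /sys/class/net/<iface>/mtu; the interface name "eth2" and the helper names are hypothetical and not taken from the IllinoisHEP setup.

```python
from pathlib import Path
from typing import Dict, Optional, Tuple


def read_mtu(iface: str, sysfs_root: str = "/sys/class/net") -> Optional[int]:
    """Return the MTU Linux reports for an interface, or None if unreadable."""
    try:
        return int((Path(sysfs_root) / iface / "mtu").read_text().strip())
    except OSError:
        return None


def mtu_mismatches(
    expected: Dict[str, int], sysfs_root: str = "/sys/class/net"
) -> Dict[str, Tuple[int, Optional[int]]]:
    """Map interface name -> (expected, actual) for every interface that differs."""
    bad = {}
    for iface, want in expected.items():
        have = read_mtu(iface, sysfs_root)
        if have != want:
            bad[iface] = (want, have)
    return bad


if __name__ == "__main__":
    # Hypothetical: the 10Gb interface "eth2" should be using jumbo frames.
    for iface, (want, have) in mtu_mismatches({"eth2": 9000}).items():
        print(f"{iface}: expected MTU {want}, found {have}")
```

Run from cron or a monitoring probe, a check like this would flag an accidental reset from 9000 to 1500 before jobs start hanging on writes to the SE.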
Follow-ups from earlier reports:
(i) 1/9: AGLT2 - low-level of job failures with the error "Put error: lfc_creatg failed with (2704, Bad magic number)." Site is investigating.
(ii) 1/14: AGLT2_PHYS-SM - file transfer failures due to "[USER_ERROR] source file doesn't exist." ggus 66150 in-progress, https://savannah.cern.ch/bugs/?77036.
Also https://savannah.cern.ch/bugs/index.php?77139.
1/25: Update from Shawn:
I have declared the 48 files as "missing" to the consistency service. See https://twiki.cern.ch/twiki/bin/view/Atlas/DDMOperationProcedures#In_case_some_files_were_confirme and you
can track the "repair" at http://bourricot.cern.ch/dq2/consistency/ Let me know if there are further issues.
Update 1/28: files were declared 'recovered' - Savannah 77036 closed. (77139 dealt with the same issue.) ggus 66150 in-progress.
Update 2/20: Last failed transfers reported have completed successfully. ggus 66150 closed.
(iii) 1/19: BNL - user reported a problem while attempting to download files from the site - for example: "httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: CGSI-gSOAP running on
t301.hep.tau.ac.il reports Error reading token data header: Connection closed." ggus 66298. From Hiro: There is a known issue with users holding Israel CA certificates having problems accessing BNL and
MWT2. This is being actively investigated right now. Until it is completely resolved, users are advised to submit a DaTRI request to transfer datasets to some other site
(LOCALGROUPDISK area) for downloading.
(iv) 1/21: File transfer errors from AGLT2 to MWT2_UC_LOCALGROUPDISK with source errors like "FTS State [Failed] FTS Retries [1] Reason [SOURCE error during
TRANSFER_PREPARATION phase:[USER_ERROR] source file doesn't exist]." https://savannah.cern.ch/bugs/index.php?77251, eLog 21440.
Update 2/23: This issue is apparently resolved. Related/similar problems that were being tracked in other tickets now closed.
(v) 1/27: all U.S. sites received an RT & ggus ticket regarding the issue "WLCG sites not publishing GlueSiteOtherInfo=GRID=WLCG value." Consolidated into a single goc ticket,
https://ticket.grid.iu.edu/goc/viewer?id=9871. Will be resolved in a new OSG release currently being tested in the ITB.
(vi) 2/6: AGLT2_PRODDISK to BNL-OSG2_MCDISK file transfer errors (source) - " [GENERAL_FAILURE] RQueued]." ggus 67081 in-progress, eLog 21935.
Update 2/20: Failed transfers have now completed. ggus 67081 closed.
(vii) 2/10: File transfer errors between BNL & RAL - " [HTTP_TIMEOUT] failed to contact on remote SRM [httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2]." Issue was reported as
solved (intermittent problem on the wide area network link between RAL and BNL), but later recurred (high load on the dCache core servers), and the ticket was re-opened.
ggus 67214 in-progress, eLog 21973.
Update 2/23: Issue considered to be resolved (no recent errors of this type). ggus 67214 closed.
(viii) 2/10: A bug in the most recent OSG software release (1.2.17, released on Monday, February 7th) affects WLCG availability reporting for sites. Sites may go into an UNKNOWN state
one day after updating. Thus it is recommended that sites defer upgrading their OSG installations until a fix is released. See: http://osggoc.blogspot.com/
- Michael: HI recon jobs failing at BNL - report was "out of memory". Investigated - traced to the stack size (ulimit setting). If this is set to unlimited, it results in 400 MB less memory being available. Consulting experts for guidance on setting this parameter. (Un-tuned, it's 8 MB for SL5.) This is a general issue, not only for the T1.
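To see which stack limit a job actually inherits on a worker node, a small probe can be run in the job environment. This is a minimal sketch using Python's standard `resource` module (Unix only); the helper names are hypothetical, and it only reports the limits, it does not reproduce the memory effect described above.

```python
import resource


def describe_limit(value: int) -> str:
    """Render an rlimit value in MB, or 'unlimited'."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value / (1024 * 1024):.0f} MB"


def stack_limits() -> tuple:
    """Return the (soft, hard) stack size limits of this process as strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return describe_limit(soft), describe_limit(hard)


if __name__ == "__main__":
    soft, hard = stack_limits()
    print(f"stack soft limit: {soft}, hard limit: {hard}")
```

Comparing the output across sites would show whether batch systems are passing an unlimited stack size to the payload or the default.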
- this meeting: Operations summary:
Yuri's summary from the weekly ADCoS meeting:
http://indico.cern.ch/getFile.py/access?contribId=1&resId=0&materialId=0&confId=129380
1) 2/24: MWT2_UC - job failures with " lsm-get failed: time out after 5400 seconds" errors. From Aaron: We performed a dcache upgrade yesterday, 3/1
which has improved our stability at the moment. This can probably be closed, as new tickets will be opened if new failures occur. ggus 67887 in-progress
(and will be closed), eLog 22425.
2) 2/25: From Rob at MWT2_UC: Overnight there were dCache failures at MWT2. Experts are investigating. eLog 22437. Later that day, from Sarah: We've passed some
mass-transfer tests, and completed test jobs. We're turning the FTS channels back on, and will continue to monitor the situation.
3) 2/25: UTD-HEP set off-line due to A/C compressor problem. eLog 22454
4) 2/26 - 2/27: shifters reported some problems with the panda monitor (certain pages throwing errors, etc.). Issue eventually went away - Valeri reported the problem
was fixed. https://savannah.cern.ch/bugs/index.php?78770. Also see https://savannah.cern.ch/bugs/index.php?78780 - voatlas20 was down for a period of time. eLog 22530.
5) 2/28: New pilot release from Paul (SULU 46a). See details here: http://www-hep.uta.edu/~sosebee/ADCoS/New-pilot-version-SULU-46a.html
6) 2/28: Xin noticed a large backlog of stale panda pilots for several U.S. sites. They were cleaned out.
7) 3/1: MWT2_UC maintenance outage (update dCache, perform local network tests). Work completed, queues back on-line as of ~3:15 p.m. CST. eLog 22578.
8) 3/2 a.m.: MWT2_UC job failures with errors like:
"Error details: pilot: Get error: Failed to get PoolFileCatalog|Log put error: Could not figure out destination path from dst_se (guid=3920b517-03bb-4ae6-8ddf-d7c298a79a96
lfn=log.261508._039922.job.log.tgz.20): list index out of range." Apparently a problem with the new pilot release (#5 above). A fix is being prepared. ggus 68156 in-progress, eLog 22594.
Follow-ups from earlier reports:
(i) 1/9: AGLT2 - low-level of job failures with the error "Put error: lfc_creatg failed with (2704, Bad magic number)." Site is investigating.
(ii) 1/19: BNL - user reported a problem while attempting to download files from the site - for example: "httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv2: CGSI-gSOAP running on
t301.hep.tau.ac.il reports Error reading token data header: Connection closed." ggus 66298. From Hiro:
There is a known issue with users holding Israel CA certificates having problems accessing BNL and MWT2. This is being actively investigated right now. Until it is completely resolved, users are
advised to submit a DaTRI request to transfer datasets to some other site (LOCALGROUPDISK area) for downloading.
(iii) 1/27: all U.S. sites received an RT & ggus ticket regarding the issue "WLCG sites not publishing GlueSiteOtherInfo=GRID=WLCG value." Consolidated into a single goc ticket,
https://ticket.grid.iu.edu/goc/viewer?id=9871. Will be resolved in a new OSG release currently being tested in the ITB.
(iv) 2/10: A bug in the most recent OSG software release (1.2.17, released on Monday, February 7th) affects WLCG availability reporting for sites. Sites may go into an UNKNOWN
state one day after updating. Thus it is recommended that sites defer upgrading their OSG installations until a fix is released. See: http://osggoc.blogspot.com/
(v) 2/23: WISC file transfer failures with SRM errors like "DESTINATION + srmls is failing with 'Connection reset'." Issue understood - from Wen: It's fixed now. A cron job had
failed to update the grid CAs, so no grid certificates could be authenticated. ggus 67836 can probably be closed at this point. eLog 22379.
Update 2/25: Issue is resolved, ggus 67836 closed. eLog 22402.
DDM Operations (Hiro)
- Reference
- last meeting(s):
- All is well.
- Throughput transfers - monitoring to-from T2-T2
- Discussion about troubleshooting cross-cloud transfers
- this meeting:
- Nothing to report this week
- Complaints from users of DaTRI about the time it takes to get datasets.
Throughput and Networking (Shawn)
- NetworkMonitoring
- https://www.usatlas.bnl.gov/dq2/throughput
- Now there is FTS logging to the DQ2 log page at: http://www.usatlas.bnl.gov/dq2log/dq2log (type in 'fts' and 'id' in the box and search).
- last week:
- Jason: OPN meeting discussed adding new sites to the network. A WG will be released soon. Rely on open exchange points, rather than the Monarc model. What does it mean to connect? Still working on the Nagios monitor.
- this week:
- Had a meeting yesterday - see email for notes.
- Good news - perfsonar plots for throughput and latency nearly green.
- OU, BNL, Illinois issues addressed
- MWT2_IU and AGLT2 path has a unique component - slowing things down, and it's asymmetric. The only issues we've seen so far.
- Action item: all T2s to get another load test in. Sites to contact Hiro and monitor the results. An hour-long test. ASAP.
- More problems in the network likely with the new ATLAS computing model - could our monitoring system be more broadly adopted in ATLAS? Encourage new sites to adopt a perfsonar infrastructure.
- Will ATLAS do something globally? Part of LHCONE, for example.
- Timeframe for 10G monitoring. Testing with a server at UM - dual integrated 10G NICs. Probably with the next hardware purchase. Can a single box run both roles (throughput and latency)?
Federated Xrootd at sites: Tier 3 (Doug), Tier 2 (Charles)
last week(s):
this week:
- Running tests with the current architecture. At MWT2, using the xrootd as a front end to dcache now - more stable than dcap. (we do see a performance hit for the wide-area, requiring tuning in the xrootd client)
- Working on re-architecting with LFC. Removal of DQ2 timestamps - working with Hiro and Simone. Will need to test this against storage.
- Renormalization of paths at MWT2, for the global namespace.
- Working with sites in Europe with DPM backend.
- Will talk on this Tuesday morning.
Site news and issues (all sites)
- T1:
- last week(s): Two major issues - more for the future: 1) Cloud computing initiative at BNL: elastically expand the computing capacity installed at BNL by dynamically adding cloud resources. Configure a worker node and make it available operationally - put it in the cloud. Working with Magellan, and anticipate Amazon. Taking shape. Time horizon - 1-2 months. 2) Expand to a grid site on the cloud - gradually adding functionality. All will be done with R&D activities with ADC. 3) CVMFS - way up on our list: set up a replica server at BNL, syncing from CERN; more testing requiring firewall. 4) Deploying auto-py factory. There was a missing job wrapper - now provided by Jose. 5) Another R&D area: Alexei and Maxim invited to work on non-SQL DB evaluation (Cassandra). Completed install of 3 powerful nodes to be used for benchmarks and evaluation.
- this week: BNL has its own PRODDISK area now. Deployed about 2PB of disk, in production. Will need to remove some of the storage.
- AGLT2:
- last week: All is well. One dcache pool server acting up w/ a NIC problem.
- this week: All is working well. Have had some checksum failures - chasing this down. Users are attempting to get files that were once here, but no longer are. Are user jobs unknowingly removing files under the usatlas1 account? Looking at options to trap the remove command and log these. Want to get the lsm installed here, to instrument IO.
- NET2:
- last week(s): Improvements for the upcoming run - ramp up IO capacity to above 1GB/s; internal rearrangements. Will be ramping HU analysis. Anticipate requiring a second 10G link. Looking at merging two large GPFS volumes. Multiple nodes for lsm mover to HU, multiple nodes for gridftp; evaluating ClusterNSF. Gatekeeper - will be doubling its capacity in CPU and memory. Low-level issues - WLCG reporting verification. pcache related problem at BU.
- this week: Tier3 hardware is on the way, ordering a new rack of worker nodes (looking at R410). Working to get additional 10G links, maybe even a 40G link. DYNES application approved!
Since someone was asking last time...
debris from MCDISK
- MWT2:
- last week(s): running smoothly - doing mostly cross-cloud production; Want to make sure performance and contribution is associated with the US cloud - consult with Valeri. Panglia needs to be checked.
- this week: Downtime yesterday - dCache upgraded to 1.9.5-24. Evaluating CVMFS at MWT2_IU. Migrated monitoring services (Cacti, ganglia, etc.) onto a new machine using kvm. Finishing last plans for new server room at UC - adding additional 30 ton CRAC unit; some construction already complete - raised floor, cooling infrastructure, new transformer and UPS delivered. At IU - we'll have to take a downtime to re-arrange server room, no exact date, but will announce. Hiro notes that there were some additional subscriptions made over the weekend - could have caused the lockup.
- SWT2 (UTA):
- last week: All is well. Storage load-issue caused SRM failures. Load resolved after restart.
- this week: The grid monitor has been getting lost, causing load issues: a cron job was not running correctly and was not deleting gass-cache files. Maintenance yesterday at SWT2_UTA. Network connectivity into the analysis cluster is currently over 1G links; working with networking folks to get a 10G switch. Will be looking to update OSG, hopefully once the new OSG release is out. Periodic failures in SAM testing, probably because one of the storage nodes is getting too busy.
- SWT2 (OU):
- last week: March 20 for Dell install.
- this week: Hiro's throughput test showing 400 MB/s.
- WT2:
- last week(s): Storage node developed a problem... power cycles and resets didn't help. Moving data off - triggering some DDM errors.
- this week: Last week's problem with a Dell storage machine - replaced CPU and memory, though not stressed. Planning 3 major outages - each lasting a day or two: March, April, early May. Will be setting final dates soon.
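The SWT2 (UTA) report above mentions a cron job that was failing to delete gass-cache files. A minimal sketch of such a cleanup follows, assuming the cache is an ordinary directory tree and that age by modification time is an acceptable criterion; the path, the 7-day threshold, and the function name are hypothetical and not the site's actual script.

```python
import os
import time


def remove_stale_files(root: str, max_age_days: float, dry_run: bool = True) -> list:
    """Find files under root older than max_age_days; delete them unless dry_run."""
    cutoff = time.time() - max_age_days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_mtime >= cutoff:
                    continue
                if not dry_run:
                    os.remove(path)
            except OSError:
                # File vanished or is not accessible; skip it.
                continue
            stale.append(path)
    return stale


if __name__ == "__main__":
    # Hypothetical cache location; dry_run=True only lists what would be removed.
    for path in remove_stale_files("/path/to/gass-cache", max_age_days=7):
        print("stale:", path)
```

Defaulting to a dry run and printing the candidates makes it easier to notice from the cron output when the job has stopped doing its work.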
Carryover issues ( any updates?)
Release installation, validation (Xin)
The issue of validating process, completeness of releases on sites, etc. Note:
https://atlas-install.roma1.infn.it/atlas_install/ - site admins can subscribe, and get notified of release installation & validation activity at their site.
- last report(s)
- IU and BU have now migrated.
- 3 sites left: WT2, SWT2-UTA, HU
- Waiting on confirmation from Alessandro; have requested completion by March 1.
- Focusing on WT2 - there is a proxy issue
- No new jobs yet to: SWT2, HU - jobs are timing out, not running.
- There is also Tufts. BDII publishing.
- this meeting:
- One of the problems at SLAC is lack of outbound links, and the new procedure will probably use gridftp. Discussing options with them.
AOB
- last week
- this week
- No meeting again next week
--
RobertGardner - 01 Mar 2011