
MinutesJune24

Introduction

Minutes of the Facilities Integration Program meeting, June 24, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: John De Stefano, Rob, Saul, Sarah, Kaushik, Tom, Michael, Nurcan, Rupam, Mark, Charles, Horst, Wei, Wensheng, Rich, Torre, Jim C, Hiro, Karthik
  • Apologies: none

Integration program update (Rob, Michael)

Operations overview: Production (Kaushik)

  • Reference:
  • last meeting(s):
    • step09 - all went well
    • There will be plans for more user analysis later this week - Jim C
    • Assembling the pieces of the sample into containers (150M events); only completed datasets can be added to containers
    • Rate is currently 10M-15M events/day
    • Size of the analysis container is ~20 TB, i.e. about 25 TB needed in MCDISK (see the rough estimate after this list)
    • Capacity: SWT2 (35 TB); NET2 (30 TB); WT2 (45 TB)
    • Will need space for cosmic data too - not sure of requirements.
    • Michael - notes that the effort required for step09 was quite high, how do we improve? Need input from sites providing feedback on experiences.
    • Saul - still seeing HU running below capacity - not understood? Problem w/ pilots?
    • What is the schedule for analysis stress testing - probably next week. There is a list prepared by Jim - and some jobs have already been tested.
    • 65K directories at the top level in PRODDISK hits a GPFS limit; a solution is needed. Discussion w/ Simone to change the way directory structures are defined by DQ2. Need to follow the same scheme for all PRODDISK areas. Competing effects of deeper nesting versus the number of entries per directory.
  • this week:
    • Production for the user analysis test is about finished.
    • Will build containers today - Wensheng will distribute them - probably by next week
    • Weekend drain
    • KD has defined a number of tasks; UWISC group. 100M fast sim queue filler. ATLAS universe is idle.
    • RAC meeting to be called. JC will gather user needs.
    • Some tasks coming require tape access, at ~5% level in background
    • Defined enough tasks for two weeks.
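
A rough back-of-the-envelope check of the numbers quoted above (150M events at 10-15M events/day, a ~20 TB container, ~25 TB requested in MCDISK). The inputs are the figures from the meeting; the derived per-event size and headroom are illustrative only, not official numbers.

      # Back-of-the-envelope check of the production figures quoted above.
      # Inputs are the numbers from the meeting; derived values are illustrative.
      TARGET_EVENTS = 150e6           # events to collect into the analysis containers
      RATE_PER_DAY = (10e6, 15e6)     # current production rate range, events/day
      CONTAINER_TB = 20.0             # quoted size of the analysis container
      MCDISK_TB = 25.0                # space requested in MCDISK (container + headroom)

      days = sorted(TARGET_EVENTS / r for r in RATE_PER_DAY)
      print("time to produce 150M events: %.0f-%.0f days" % (days[0], days[1]))
      print("implied average event size: ~%.0f kB" % (CONTAINER_TB * 1e12 / TARGET_EVENTS / 1e3))
      print("MCDISK headroom over the container size: %.0f%%" % (100 * (MCDISK_TB / CONTAINER_TB - 1)))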

Shifters report (Mark)

  • Reference
  • last meeting:
    • From Mark:
      ===========================================================================
      1)  Ongoing work to bring Tufts into production.  Issues found with release installations.  https://rt-racf.bnl.gov/rt/index.html?q=13205
      2)  Storage issues apparently resolved at UTD-HEP, and test jobs submitted yesterday succeeded.  Site set back to 'on-line'.
      3)  Minor pilot updates from Paul:
      (v37h) 6/10:
      The exception handling in the pilots' internal time-out function has been updated. The exception handler did not catch one of the thrown exceptions, leading to the reported error "Get error: dccp failed with output: ec 1, output filedescriptor out of range in select()". This fix addresses an issue seen with [at least] a user job at ANALY_MWT2_SHORT where dccp got stuck and was not timed out cleanly.
      (v37i) 6/12:
      (i) File staging query at BNL now using http method instead of dc_check (user analysis jobs).
      (ii) Changed time-out from 1800 s to 3600 s for lcg-cp/cr since 1800 s appears to be too low.
      4)  Last Wednesday Xin reported on an issue that had been at least partly responsible for some of the problems with pilot submissions from gridui11 at BNL:
      For the record, the problem was that condor-g got stuck at gridui11, due to a known bug of the globus client used in the condor-g gahp server. Condor-G developers are aware of this and have a plan to automatically detect and restart it in the future.
      5)  Job recovery by the pilot re-enabled for BNL by Paul (6/11).
      6)  AGLT2 -- Condor schedd process crashed Saturday evening at around 6:00 p.m.  ~1400 resulting "lost heartbeat" jobs.
      7)  Job eviction issue at IllinoisHEP resolved (6/15) -- Condor preemption / suspension had to be disabled.
      8)  Queues at BNL began to be drained Tuesday evening (6/16) in preparation for today's dCache upgrade.
      9)  Maintenance outage at SLAC next week (6/21 - 6/25).  Wei plans to upgrade OS and ZFS on all storage servers during this time.
      10)  Bug in dq2 affected some of the subscriptions for STEP09.  From Simone:
      
      The bug appears as follows: the subscription T0->T1 is working, trying to put files at the T1. In the meantime there is a subscription T1->T2. The latter looks for files at the T1, which is marked as the only possible source in the subscription (--source option), but the files are not yet at the T1. DDM decides that the source (the T1) is broken and gives up on the subscription. This should not happen and will be cured in the next releases.

  • this meeting:
    • Generally production has been running well over this past week. With a few exceptions, job failure rates have been low. Yuri's weekly summary was presented at the Tuesday morning ADCoS meeting.
      1. ) Wei announced that the previously scheduled maintenance outage at SLAC would have to be rescheduled due to the impending cosmic run.
      2. ) Wednesday afternoon (6/17): ~30 minute panda monitor outage -- services restored after problematic code was backed out.
      3. ) Following resolution of the Condor job eviction issue at IllinoisHEP, test jobs succeeded and the site was set back to 'on-line' (6/18).
      4. ) Power cut at CERN late night Wednesday temporarily affected panda servers -- issues cleared up by Thursday morning (6/18).
      5. ) Pilot update from Paul (v37j) -- details:
        • (i) A new error code (1122/"Bad replica entry returned by lfc_getreplicas(): SFN not set in LFC for this guid"/EXEPANDA_NOLFCSFN) is now used to identify stress related problems with the LFC. It has been observed that the lfc_getreplicas() in some cases return empty replica objects for a given guid (especially when the LFC is under heavy load). The problem can occur when the distance between the client and server is large, and/or if several guids are sent with the lfc_getreplicas() call - as was recently introduced in the pilot. Jean-Philippe suggested that this can occur in older LFC server versions (problem partially fixed in v 1.7.0, and fully fixed in 1.7.2).
        • (ii) Added file size info to DQ2 tracing report in all relevant site movers.
        • (iii) The file size test in the dCacheLFC site mover has been dropped since it compares the checksums anyway. Problems have been observed at SARA where the local and remote file size comparison failed due to an unexplained mixup of file sizes of different files (the DBRelease file was seemingly compared to another file). It is not guaranteed that this fix will solve that problem. The problem has only been observed at SARA to my knowledge.
        • (iv) An annoying warning message ("Release XXX was not found in tags file") has been corrected when searching for releases in the tags file. Previously the release number was expected to appear at the end of the string (e.g. VO-atlas-production-13.0.40), so it was missed in tags like VO-atlas-production-14.2.23.2-i686-slc4-gcc34-opt (a minimal matching sketch appears after this list).
      6. ) Slow network performance between UTD-HEP and BNL under investigation.
      7. ) AGLT2 -- the space used by the PostgreSQL DB for dCache ran out -- problem resolved (6/19).
      8. ) dccp timeout errors at BNL on Saturday (6/20) -- from Pedro: "both machines have been restarted. acas0015 has also been restarted. during this period the pilot could have gotten some timeouts copying files but the pools have been restarted and I was able to copy files from them without any problem."
      9. ) MWT2_UC -- network outage affected dCache pools over the weekend. Problem resolved, test jobs succeeded, site set back to 'on-line'.
      10. ) Issue with large files in xrootd systems (file size check fails) -- information from Wei: The problem I see so far exists in the Xrootd Posix preload libs on 64-bit hosts. It could have a broad impact on many commands using the preload lib. The problem has been addressed in later xrootd releases and I am using the xrootd posix preload lib from CVS to work around this problem. So far I verified "cp/xcp", stat/ls, md5sum, adler32 and the gridftp modules. I think that is all panda jobs use, and I hope I am not missing anything. The new 64-bit preload lib is available (as a hot fix) at: http://www.slac.stanford.edu/~yangw/libXrdPosixPreload.so (32-bit hosts don't need a fix).
      11. ) Follow-up from an earlier item: Any updates about bringing Tufts into production? (NET2 site)
    • Difficulties communicating information from central operations shifters to UWISC; how do we improve communications to Tier 3's?
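
Regarding item 5 (iv) above, a minimal sketch (in Python) of the kind of tag matching the fix addresses, assuming the tags file simply lists strings such as VO-atlas-production-14.2.23.2-i686-slc4-gcc34-opt; the pattern below is illustrative and is not the pilot's actual code.

      import re

      # Illustrative only: accept a release number that is followed either by the
      # end of the string or by a platform suffix, instead of requiring it to
      # terminate the tag (which missed e.g. ...-14.2.23.2-i686-slc4-gcc34-opt).
      tags = [
          "VO-atlas-production-13.0.40",
          "VO-atlas-production-14.2.23.2-i686-slc4-gcc34-opt",
      ]

      def has_release(release, tag_list):
          pattern = re.compile(r"^VO-atlas-(?:production|offline)-%s(?:-|$)" % re.escape(release))
          return any(pattern.match(t) for t in tag_list)

      print(has_release("13.0.40", tags))    # True
      print(has_release("14.2.23.2", tags))  # True (missed by an end-of-string match)
      print(has_release("15.1.0", tags))     # False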

Analysis queues (Nurcan)

  • Reference:
  • last meeting:
    • STEP09 ended on June 12th. See analysis job summaries (as of yesterday, June 16) for clouds and US sites below. Statistics are not final yet. Killed jobs on June 12th will be excluded.
    • STEP09 post-mortem is due: July 1 - ATLAS internal, July 9-10 - WLCG. MWT2 signed up to provide feedback: https://twiki.cern.ch/twiki/bin/view/Atlas/Step09Feedback#T2
    • Status of the DB access job: tested the new tag PyUtils-00-03-56-02 that Sebastien provided to process files with root:// and dcap:// URLs; Release 14 jobs worked. However, jobs at SLAC ("not a ROOT file") and SWT2 (looping job) are still unsuccessful; needs further debugging.
  • this meeting:
    • AnalysisStep09PostMortem - ATLAS post mortem meeting is July 1
    • Status of the DB access job at SWT2: a test job failed during a period when SWT2 had a major hiccup with Xrootd storage. The next job, on Friday, failed because it could not download something from CERN's panda server. Patrick will run the job by hand.
    • Status of the DB access job at SLAC: much debugging in the last week by Wei; still not understood why the input file is accessed correctly but later reported as not a ROOT file. Further debugging needed (run the job interactively, etc.).
    • Cosmic job: I got a job from Hong Ma running on cosmic data of the type IDPROJCOMM, IDCOMM, ESD (for instance: data08_cosmag.00091900.physics_IDCosmic.merge.DPD_IDPROJCOMM.o4_r653_p26/). Job is configured to run at BNL. Will test this job at BNL and try to run at Tier2's as well.
    • User analysis issues:
      • A user tried to run an official trf (csc_simul_reco_trf.py) at SWT2: Tadashi reported that this kind of official transformation doesn't support direct access except rfio: and castor:, and the user needs to modify PyJobTransformsCore. A recipe is provided at: https://groups.cern.ch/group/hn-atlas-dist-analysis-help/Lists/Archive/Pathena%20jobs%20failing%20in%20SWT2. Alden is in contact with the trf developers to get the trf modified.
      • Problem with reading input files like AOD.064188._00433.pool.root.1__DQ2-1243719597 at SWT2: the pilot gave the name as AOD.064188._00433.pool.root, so runAthena failed to find it. Tadashi added a protection in the runAthena script.
      • File look-up problem at NET2. From Saul: "The problem is caused because GPFS (our main storage file system) has a hard limit of 65K subdirectories of any single directory. When the ddm system exceeds this limit, 'put' errors occur in our local site mover and panda jobs fail because of that... I gather from Kaushik that a general fix is being prepared. In the mean time, we have avoided the limit by replacing some directories with symlinks so that you can create more datasets at NET2." (A sketch of this kind of symlink fan-out follows this list.)
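
Regarding the GPFS 65K-subdirectory limit above, a minimal Python sketch of the kind of symlink fan-out NET2 describes: dataset directories are created under a small set of hashed subdirectories and exposed at the top level via symlinks, which do not count against the subdirectory limit. The paths and the two-character sharding are illustrative assumptions, not the actual DQ2 or site-mover convention.

      import hashlib
      import os

      # Illustrative paths only; a real site would use its actual storage area.
      TOP = os.environ.get("PRODDISK_TOP", "/tmp/proddisk-demo")
      SHARD_ROOT = os.path.join(TOP, ".shards")

      def dataset_dir(dataset_name):
          """Create the dataset directory under a hashed shard and expose it at the
          top level via a symlink; symlinks are not subdirectories, so the top-level
          directory stays below the GPFS 65K-subdirectory limit."""
          shard = hashlib.md5(dataset_name.encode()).hexdigest()[:2]  # 256 shards
          real_dir = os.path.join(SHARD_ROOT, shard, dataset_name)
          link = os.path.join(TOP, dataset_name)
          os.makedirs(real_dir, exist_ok=True)
          if not os.path.islink(link):
              os.symlink(real_dir, link)
          return link

      # Jobs keep using TOP/<dataset>/... exactly as before.
      print(dataset_dir("mc08.105001.pythia_minbias.simul.HITS.e349_s462_tid064188"))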

DDM Operations (Hiro)

  • Reference
  • last meeting(s):
    • UC cleaning issue understood
    • Working on a problem reported by Wei - the LFC service was being blocked because it appeared to be doing port scanning. A new DOE cross-lab cyber-security program detected SLAC and AGLT2 as port scanning: 30 ports are being accessed on Panda mover servers (not necessarily by Panda mover processes) from remote LFCs. What changed? John Hover is investigating and will provide a report.
    • The AGLT2 FTS delegated proxy has a problem; Bob increased the renewal frequency of the cron job. Should probably check validity.
    • Simone changed the ToA setting to enforce the production role, but it can be overridden.
  • this meeting:
    • Currently all sites are okay
    • UTD, UTA transfers are slow for some reason. Under investigation.
    • Sometimes an individual file in a dataset goes missing, which breaks the subscription. What causes this? Sometimes an over-aggressive clean-up script, or a pilot problem with mis-registration (e.g. the lcg-cr wrapper bug).
    • There was a problem at AGLT2 - Hiro intervened.

Tier 3 issues (Doug)

  • last meeting(s)
    • Torre, Kaushik, and Doug met last week to discuss using Panda jobs to push data to Tier 3s. A Tier 3 would need an SRM and would have to be in the ToA. A new panda client would serve data rather than use the full subscription model.
    • Kaushik is preparing draft for ATLAS review.
    • Does need an LFC someplace. This will be provided by BNL.
    • Few weeks of discussion to follow. Will take a month of development. Client is not too difficult.
    • Rik and Doug have been doing dq2-get performance testing on a dataset with 158 files, O(100 GB).
    • Tested from 3 places (ANL, FNAL, Duke) against most Tier 2's.
    • Rik reported on the results at the stress-test meeting.
    • Has seen copy errors of ~1%. Have also seen checksums not agreeing.
    • Lots of work on VMs. Looking like a good solution for Tier 3. Virtualize headnode systems.
    • No news on BNL srm-xrootd.
    • Would like to get analysis jobs similar to HC to use for Tier 3 validation (local submission).
  • this meeting

Conditions data access from Tier 2, Tier 3 (Fred)

  • last week
    • https://twiki.cern.ch/twiki/bin/view/Atlas/RemoteConditionsDataAccess
    • Needs to be solved quickly - Saha, Richard Hawkings, Elizabeth Gallas, David Front.
    • Jobs are taking up connections - they hold them open for a long time.
    • Squid tests at AGLT2, MWT2, WT2 were successful - reduces load on the backend
    • Fred will contact NE and SW to set up squid caches; validate w/ Fred's test jobs
    • COOL conditions data will be subscribed to sites; owned by Sasha
    • An XML file needs to be maintained at each site using Squid - for RAW, ESD, or even AOD data; this needs to be structured centrally, since all sites will want to do this.
    • Michael has discussed this w/ Massimo; needs to bring this up at the ADC operations meeting.
    • More ATLAS coordination is involved
  • this week
    • UTA squid is set up - just need to test the client - will send to John
    • BU squid - similar status (an illustrative ACL sketch follows this list)
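
For the squid setups above, an illustrative squid.conf fragment showing the kind of access restrictions a site cache typically needs; the network range, destination domain and sizes are assumptions, not the official Frontier-recommended configuration for any of these sites.

      # Illustrative squid.conf fragment (assumed values only).
      # Allow only local worker nodes to use the cache, and only toward the
      # upstream conditions/Frontier servers.
      acl LOCAL_WN src 10.0.0.0/8
      acl FRONTIER dstdomain .cern.ch
      http_access allow LOCAL_WN FRONTIER
      http_access deny all
      # Modest in-memory cache; allow large cached objects for conditions payloads.
      cache_mem 256 MB
      maximum_object_size 1048576 KB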

Data Management & Storage Validation (Kaushik)

Throughput Initiative (Shawn)

  • last week:
    • Note:
       
      **ACTION ITEM**  Each site needs to provide a date before the end of June for their throughput test demonstration.  Send the following information to Shawn McKee and CC Hiro:
      a) Date of test (sometime after June 14 when STEP09 ends and before July 1)
      b) Site person name and contact information.  This person will be responsible for watching their site during the test and documenting the result.
      For each site the goal is a graph or table showing either:
      i) 400 MB/sec (avg) for a 10GE connected site
      ii) Best possible result if you have a bottleneck below 10GE
      Each site should provide this information by close of business Thursday (June 11th).  Otherwise Shawn will assign dates and people!!
    • Major action item is for every site to reserve a time for BW test
    • OSG will provide client tools as part of VDT
    • Rich - raises issue that checks of the plots need to be made
  • this week:
    • Last week BNL --> MWT2_UC throughput testing, see: http://integrationcloud.campfirenow.com/room/192199/transcript/2009/06/18
    • Performance was not as good as hoped, but the 400 MB/s milestone was reached (peak only); a quick unit check appears after this list
      • For some reason the individual file transfer rates were low
    • NET2 - need 10G NIC
    • SLAC - may need additional gridftp servers
    • UTA - CPB: iperf to dcdoor10 tends to vary - 300-400 Mbps at times, mostly 50-75 Mbps; 700-800 Mbps into _UTA (probably the traffic coming back).
    • OU - need new storage.
    • AGLT2 - throughput is directional (asymmetric)
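
A quick unit check relating the 400 MB/s milestone to the link speeds and iperf figures quoted above (purely arithmetic; nominal link capacity, protocol overhead ignored).

      # Relate the 400 MB/s milestone to capacities quoted in bits per second.
      MILESTONE_MB_S = 400.0   # target average for a 10GE-connected site
      LINK_GBPS = 10.0         # nominal 10 Gigabit Ethernet

      milestone_gbps = MILESTONE_MB_S * 8 / 1000.0
      print("400 MB/s = %.1f Gbps, i.e. %.0f%% of a 10GE link"
            % (milestone_gbps, 100 * milestone_gbps / LINK_GBPS))

      # For comparison, the iperf figures quoted for UTA (Mbps -> MB/s):
      for mbps in (50, 75, 300, 400, 800):
          print("%4d Mbps ~ %5.1f MB/s" % (mbps, mbps / 8.0))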

Site news and issues (all sites)

  • T1:
    • last week: The biggest issue was that the Tier 1 didn't get the anticipated share of data exports during the step09 exercise. Issues with the output buffer at CERN - a meeting was held to understand this; will instrument to adjust the FTS configuration, and will meet again to discuss concrete steps. Pedro/dCache: maintenance, upgraded. Jason is working on monitoring SRM probes for ganglia.
    • this week: Another fiber cut in US LHCNet - 10 Gbps down to 5 Gbps on OPNET. (FNAL also affected.) Second time in the last couple of weeks - worrisome. Nexan shipment coming next week: 32 TB usable (Thor extension). PNFS ID implemented in one of the tables, and helping enormously.

  • AGLT2:
    • last week:
    • this week: Things moving along fine.

  • NET2:
    • last week(s): John finishing up the corrupt data issue. Squid installed. Discussion about IO_WAIT observations.
    • this week: Smoothly running.

  • MWT2:
    • last week(s): dCache much more stable - but a few restarts during step09 still being investigated. Only got about 200 MB/s during step09 due to an FTS configuration error. Would like to get FTS 2.0 working for UC to implement passive gridftp for direct transfers. Throughput tests tomorrow.
    • this week: Temperature incident over the weekend. Environmental monitoring. Working w/ Paul on lsm migration.

  • SWT2 (UTA):
    • last week: Will implement Adler32 today. Squid server host prepared; will begin installing.
    • this week: TP test later today; squid; network troubleshooting to BNL; analysis job issues w/ DB access.

  • SWT2 (OU):
    • last week: throughput issue OU-BNL, may be maintenance related.
    • this week: ok

  • WT2:
    • last week: found performance problems w/ thumpers, file-not-found errors during step09. Old Solaris and ZFS versions have difficulty detecting defective drives; will upgrade. Found throughput to BNL reduced while analysis jobs were running. Found registrations to LFC sometimes failing for DQ2 transfers; in discussions w/ DQ2 developers. Hiro believes the data did get into LFC - were they removed? Wei claims registrations, specifically for DATADISK, were stopped while MCDISK registrations were successful. An IN2P3 visitor did I/O profiling and found the largest block size was 300 bytes; also found lots of jumps within a file. Is read-ahead then useful?
    • this week: SLAC is offline at the moment. Bug in the POSIX preload library for 64-bit machines. Preparing storage.

Carryover issues (any updates?)

Squids and Frontier (John DeStefano)

  • Note: this activity involves a number of people from both Tier 1 and Tier 2 facilities. I've put John down to regularly report on updates as he currently chairs the Friday Frontier-Squid meeting, though we expect contributions from Douglas, Shawn, John and others at Tier 2's as developments are made. - rwg
  • last meeting(s):
  • this week:
    • Meeting this Friday

Release installation, validation (Xin)

The issue of validating the presence and completeness of releases at sites.
  • last meeting
    • Transfer of new pacball datasets to BNL is much improved. Tom/UK will handle the subscriptions.
    • Tadashi changed Panda mover to make install jobs highest priority. Should see releases installed very quickly.
    • Six new installation sites added to Panda - some configs need changes.
  • this meeting:

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week
    • Have discussed adding the Adler32 checksum to xrootd. Alex is developing something to calculate it on the fly and expects to release this very soon. Want to supply this to the gridftp server. (A minimal streaming Adler32 sketch follows this list.)
    • Need to communicate w/ CERN regarding how this will work with FTS.
  • this week
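
For reference on the Adler32 item above, a minimal Python sketch of computing the checksum in a streaming fashion with standard zlib; this is not Alex's xrootd implementation, just the conventional way the 8-hex-digit grid-style value is computed.

      import zlib

      def adler32_of_file(path, blocksize=1024 * 1024):
          """Compute the Adler32 checksum of a file incrementally and return the
          zero-padded 8-hex-digit form commonly stored in grid catalogs."""
          value = 1  # Adler32 starts at 1, not 0
          with open(path, "rb") as f:
              for block in iter(lambda: f.read(blocksize), b""):
                  value = zlib.adler32(block, value)
          return "%08x" % (value & 0xFFFFFFFF)

      # Example: compare against the catalog value after a transfer.
      # print(adler32_of_file("AOD.064188._00433.pool.root.1"))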

Local Site Mover

AOB

  • last week
    • Doug reports that we will have Tier 3's opting for the GS type. Is there a Tier 2 "best practices" that can be provided to Tier 3's?
  • this week
    • None


-- RobertGardner - 23 Jun 2009

  • Attached: load_test_UC_2009_06_19_bnlucnetwork.png

Attachments


  • load_test_UC_2009_06_19_bnldcdoors.png (29.5K) - RobertGardner, 24 Jun 2009
  • load_test_UC_2009_06_19_bnlucnetwork.png (19.6K) - RobertGardner, 24 Jun 2009
  • load_test_UC_2009_06_19_ucswitch.png (30.5K) - RobertGardner, 24 Jun 2009
 