
MinutesJul8

Introduction

Minutes of the Facilities Integration Program meeting, July 8, 2009
  • Previous meetings and background : IntegrationProgram
  • Coordinates: Wednesdays, 1:00pm Eastern
    • (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.

Attending

  • Meeting attendees: Charles, Rob, John De Stefano, Sarah, Patrick, Shawn, Booker Bense, Saul, Rich, Doug, Kaushik, Nurcan, Wei, Armin, Rupom, Mark, Karthik, Fred, Michael
  • Apologies: Michael - joining late

Integration program update (Rob, Michael)

  • IntegrationPhase9 - FY09Q3 report being prepared: SummaryReportP9
  • Special meetings
    • Tuesday (9:30am CDT): Facility working group on analysis queue performance: FacilityWGAP suspended for now
    • Tuesday (12 noon CDT) : Data management
    • Tuesday (2pm CDT): Throughput meetings
    • Friday (1pm CDT): Frontier/Squid
  • Upcoming related meetings:
  • US ATLAS persistent chat room http://integrationcloud.campfirenow.com/ (requires account, email Rob), guest (open): http://integrationcloud.campfirenow.com/1391f
  • Program notes:
    • In third month of FY09Q3:
      • Please update FabricUpgradeP9 with any CPU, storage, or infrastructure procurements during this phase.
      • Please update SiteCertificationP9 for the DQ2 site services update (Adler32), fabric upgrades, and Squid server / Frontier client deployment.
    • Squid deployment at Tier 2 - we need:
      • ATLAS to validate the method - this requires a discussion within ADC operations and physics (Jim C), plus a validation job, e.g. one running over cosmics.
      • Sites within the facility to deploy Squid - target date was June 24.
  • Other remarks
    • last week(s)
    • this week
      • We need to better understand the relatively low HammerCloud efficiency and event rate observed in the US cloud during STEP09. See http://indico.cern.ch/conferenceDisplay.py?confId=62853, specifically Dan's talk.
        • ATLAS jobs do a lot of seeking at large offsets when reading the input file. Studied at both SLAC and AGLT2.
      • Expect more (non-robotic) analysis jobs from the (100M) JF35 sample.
      • Cosmics replication - some delays due to T0 reprocessing.
      • Quarterly reports due now!
      • MRI proposal in the works, led by Internet2, to enable DCN for Tier 2/Tier 3 sites, using LHC sites as exemplars. Being discussed (UM and CIT submitting). Would install the required hardware at the campus/regional level - an "instrument upgrade" to enable science, providing protection of bandwidth. Has both hardware and services components. Needs support from the executive board and the US LHC programs.

Operations overview: Production (Kaushik)

  • Reference:
  • last meeting(s):
    • there will be plans for more user analysis later this week - Jim C
    • assembling the pieces of the sample into containers (150M events); only completed datasets can be added to containers (now done)
    • size of analysis container ~20TB, i.e. need about 25 TB in MCDISK.
    • Capacity: SWT2 (35 TB); NET2 (30 TB); WT2 (45 TB)
    • 65K directories at the top level in PRODDISK hit the GPFS limit. Solution under discussion with Simone: change the way directory structures are defined by DQ2. The same scheme needs to be followed for all PRODDISK areas; there are competing effects of nesting depth versus directory count (see the sketch after this list).
    • Will build containers today - Wensheng will distribute them - probably by next week. DONE
    • KD has defined a number of tasks; UWISC group. 100M fast sim queue filler. ATLAS universe is idle.
    • RAC meeting to be called. JC will gather user needs.
    • Some tasks coming require tape access, at ~5% level in background
    • Defined enough tasks for two weeks.
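    • Illustrative sketch of the nesting idea discussed above - a hypothetical two-level hash layout, not the actual DQ2 scheme; paths and the dataset name are made up:
      # Hypothetical sketch: hashing dataset names into two intermediate directory
      # levels keeps any single GPFS directory far below the ~65K-subdirectory limit,
      # at the cost of one extra level of nesting. Not the real DQ2 layout.
      import hashlib
      import os

      def nested_path(proddisk_root, dataset_name):
          h = hashlib.md5(dataset_name.encode()).hexdigest()
          # at most 256 x 256 = 65536 buckets, each holding a small share of datasets
          return os.path.join(proddisk_root, h[0:2], h[2:4], dataset_name)

      def count_subdirs(path):
          # count immediate subdirectories, to check a directory against the limit
          return sum(1 for e in os.scandir(path) if e.is_dir())

      print(nested_path("/proddisk", "mc08.105001.pythia.simul.HITS.e349_s462_tid064188"))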
  • this week:
    • James Catmore - green light for fast reprocessing of cosmic data - to be done at BNL only.
    • 200M queue filler jobs being run now. Things are running stably. Enough jobs for another two weeks.
    • Why are there high failure rates presently? One of the DQ2 central servers has problems. Long discussion in the ADC operations meeting - lots of confusion. Is there a good place to notice these things? (elog? the virtual chat room - is this public, or the best place? do we need a bulletin? - Mark will look into it)

Shifters report (Mark)

  • Reference
  • last meeting:
    • Generally production has been running well over the past week; with a few exceptions, job failure rates have been low. Yuri's weekly summary was presented at the Tuesday morning ADCoS meeting.
      1. ) Wei announced that the previously scheduled maintenance outage at SLAC would have to be rescheduled due to the impending cosmic run.
      2. ) Wednesday afternoon (6/17): ~30 minute panda monitor outage -- services restored after problematic code was backed out.
      3. ) Following resolution of the Condor job eviction issue at IllinoisHEP, test jobs succeeded and the site was set back 'on-line' (6/18).
      4. ) Power cut at CERN late night Wednesday temporarily affected panda servers -- issues cleared up by Thursday morning (6/18).
      5. ) Pilot update from Paul (v37j) -- details:
        • (i) A new error code (1122/"Bad replica entry returned by lfc_getreplicas(): SFN not set in LFC for this guid"/EXEPANDA_NOLFCSFN) is now used to identify stress-related problems with the LFC. It has been observed that lfc_getreplicas() in some cases returns empty replica objects for a given guid (especially when the LFC is under heavy load). The problem can occur when the distance between the client and server is large, and/or if several guids are sent with the lfc_getreplicas() call - as was recently introduced in the pilot. Jean-Philippe suggested that this can occur in older LFC server versions (the problem is partially fixed in v1.7.0 and fully fixed in 1.7.2).
        • (ii) Added file size info to DQ2 tracing report in all relevant site movers.
        • (iii) The file size test in the dCacheLFC site mover has been dropped, since the mover compares checksums anyway. Problems had been observed at SARA where the local and remote file size comparison failed due to an unexplained mix-up of the file sizes of different files (the DBRelease file was seemingly compared to another file). It is not guaranteed that this change will solve that problem, which, as far as is known, has only been observed at SARA.
        • (iv) An annoying warning message ("Release XXX was not found in tags file") that appeared when searching for releases in the tags file has been corrected. Previously the release number was expected to appear at the end of the string (e.g. VO-atlas-production-13.0.40), so it was missed in tags carrying a platform suffix, e.g. VO-atlas-production-14.2.23.2-i686-slc4-gcc34-opt (sketched below).
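        • A minimal illustration of the tag-matching fix in (iv) - the regex and example tags below are assumptions for illustration, not the actual pilot code:
          # Sketch: match the release number anywhere in a VO tag instead of only at
          # the end of the string, so platform-suffixed tags are not missed.
          import re

          tags = ["VO-atlas-production-13.0.40",
                  "VO-atlas-production-14.2.23.2-i686-slc4-gcc34-opt"]

          old = re.compile(r"VO-atlas-production-(\d+(?:\.\d+)+)$")        # end-of-string only
          new = re.compile(r"VO-atlas-production-(\d+(?:\.\d+)+)(?:-|$)")  # allow a platform suffix

          for t in tags:
              print(t, bool(old.search(t)), bool(new.search(t)))
          # the old pattern matches only the first tag; the new one matches both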
      6. ) Slow network performance between UTD-HEP and BNL under investigation.
      7. ) AGLT2 -- ran out of space in the PostgreSQL DB used by dCache -- problem resolved (6/19).
      8. ) dccp timeout errors at BNL on Saturday (6/20) -- from Pedro: "both machines have been restarted. acas0015 has also been restarted. during this period the pilot could have gotten some timeouts copying files but the pools have been restarted and I was able to copy files from them without any problem."
      9. ) MWT2_UC -- network outage affected dCache pools over the weekend. Problem resolved, test jobs succeeded, site set back to 'on-line'.
      10. ) Issue with large files in xrootd systems (file size check fails) -- information from Wei: The problem I see so far exists in the Xrootd POSIX preload libs on 64-bit hosts. It could have a broad impact on many commands using the preload lib. The problem has been addressed in later xrootd releases, and I am using the xrootd POSIX preload lib from CVS to work around this problem. So far I verified "cp/xcp", stat/ls, md5sum, adler32 and the gridftp modules. I think that is all panda jobs use, and I hope I am not missing anything. The new 64-bit preload lib is available (as a hot fix) at: http://www.slac.stanford.edu/~yangw/libXrdPosixPreload.so. 32-bit hosts don't need a fix. (A usage sketch follows this list.)
      11. ) Follow-up from an earlier item: Any updates about bringing Tufts into production? (NET2 site)
    • Difficulties communicating information from central operations shifters to UWISC; how do we improve communications to Tier 3's?
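    • On item 10 above (the 64-bit Xrootd POSIX preload library), a usage sketch - the library path and root:// URL are placeholders, not a prescription:
      # Sketch: point LD_PRELOAD at the hot-fix 64-bit preload library so that ordinary
      # POSIX tools (stat, md5sum, cp, ...) can operate directly on root:// URLs.
      # The install path and URL below are assumptions for illustration only.
      import os
      import subprocess

      env = dict(os.environ)
      env["LD_PRELOAD"] = "/opt/xrootd/hotfix/libXrdPosixPreload.so"  # hypothetical local path

      # e.g. stat a file served by an xrootd redirector (placeholder URL)
      subprocess.check_call(["stat", "root://xrootd.example.org//atlas/some/file.root"], env=env)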

  • this meeting:
    • Yuri's weekly summary was presented at the Tuesday morning ADCoS meeting: http://indico.cern.ch/conferenceDisplay.py?confId=63897
    • One particular topic discussed this week was the procedure for shifters to follow in cases of missing files (i.e., input files not found, etc.). A procedure is outlined in the ADCoS wiki, but there is a question of how long this could take in a situation with a large number of files. Not yet finalized.
    1. ) Update to the pilot (from Graeme while Paul is on vacation): A small change to the pilot code has been made in version 37l. This modifies the athena environment setup to get the correct trigger menus in athena > 14.5 (7/2).
    2. ) SRM restart at BNL (7/2) -- from Pedro: We needed to restart the SRM a few more times and change some settings on the SRM and PnfsManager, plus we had to quickly retire+clean some old pools and re-deploy them in the MCTAPE write pools.
    3. ) (7/2) Stage-in/out errors at OU due to occasional crash of the SE/LFC node -- a system is in place to detect this and perform an auto-restart. (This issue will be resolved once their new storage system is in place.)
    4. ) AGLT2 set back to 'on-line' following resolution of a missing file(s) issue. See: https://rt-racf.bnl.gov/rt/index.html?q=13427 (7/3).
    5. ) Auto-pilot wrapper was downloading the pilot code from BNL (which had a slightly out-of-date version) rather than from CERN. Issue resolved (thanks to Torre & Graeme). See: http://savannah.cern.ch/bugs/?52757 (7/4).
    6. ) Over the weekend transfers to AGLT2_MCDISK were failing due to insufficient space. 20TB added (7/6). See: https://rt-racf.bnl.gov/rt/index.html?q=13454
    7. ) Intermittent stage-in issues at the NET2 sites (BU and HU) -- from Saul: This problem appeared when the top prodsys directory reached 65K subdirectories (a hard GPFS limit). We have gotten around the limit, but don't yet understand the missing files below. We're working on it.
    8. ) Stage-out & DDM transfer problems at AGLT2 Tuesday afternoon (7/7). From Shawn: We had some strange gPlazma issues this afternoon. It required a restart of the dCache admin headnode (and some subsequent gPlazma restarts) to resolve. Things seem to be OK now but we will continue to watch it. See: prod-grid-logger.cern.ch/elog/ATLAS+Computer+Operations+Logbook/4475.
    9. ) Stage-in errors at BNL Tuesday morning (>600 failed jobs). RT 13463. According to Pedro this issue is the same one as previously reported in https://rt-racf.bnl.gov/rt/Ticket/Display.html?id=13418. Will follow up in that thread. Problem seems to be network-related.
    10. ) Maintenance downtime at AGLT2 tomorrow (Thursday, 7/9). From Bob: On Thursday this week, July 9, we will upgrade various components of our cluster. Many nodes will be rebuilt, and dCache and network outages should be expected during this time. The upgrades will commence at 8am EDT on July 9, and we hope to be finished with all work by 6pm. If we are back up earlier than this, or need more time, we will send notification. An OIM outage for this time has been set. New condor jobs will not start after 4pm on July 8. I will set our site offline (AGLT2 and ANALY_AGLT2) shortly before that time so that no Idle jobs will remain when compute nodes stop accepting new jobs.
    11. ) Problem with transfers to WISC resolved (7/7). From Wen: We had a gridftp problem which blocked some transfers this afternoon, so we stopped the SRM server to redefine the gridftp configuration. The problem has already been solved. GGUS ticket 50095.
    12. ) Tuesday evening, 7/7 -- MWT2_UC failed jobs with errors about missing files. From Charles: All of these failures were coming from a single WN - uct2-c194. There was a filesystem problem on this node earlier today which led to a large number of failures.... the problem is fixed now.
    13. ) 7/7: Test jobs submitted to UCITB_EDGE7. The jobs failed, most likely due to a missing value for "cmtConfig" in schedconfigdb. Working on this. See for example: http://panda.cern.ch:25980/server/pandamon/query?job=1014372939.
    • Follow-ups from earlier items:
      • Any updates about bringing Tufts into production? (NET2 site)
      • Slow network performance between UTD-HEP and BNL not fully understood, but the site has been running production stably for at least a week or more. Software release installations are still slow, but this should get resolved later this Summer when a hardware issue with a file server is addressed.

Analysis queues (Nurcan)

  • Reference:
  • last meeting:
    • AnalysisStep09PostMortem - ATLAS post mortem meeting is July 1
    • Status of DB access job at SWT2: A test job had failed during a time when SWT2 had a major hiccup with Xrootd storage. Next job on Friday blew up because it failed to download something from CERN's panda server. Patrick to run the job by hand.
    • Status of DB access job at SLAC: Much debugging in the last week by Wei; still not understood why the input file is accessed correctly at first but is later reported as not being a ROOT file. Further debugging needed (run the job interactively, etc.).
    • Cosmic job: I got a job from Hong Ma running on cosmic data of the type IDPROJCOMM, IDCOMM, ESD (for instance: data08_cosmag.00091900.physics_IDCosmic.merge.DPD_IDPROJCOMM.o4_r653_p26/). Job is configured to run at BNL. Will test this job at BNL and try to run at Tier2's as well.
    • User analysis issues:
      • User tried to run an official trf (csc_simul_reco_trf.py) at SWT2: Tadashi reported that this kind of official transformation doesn't support direct access other than rfio: and castor:, and the user needs to modify PyJobTransformsCore. A recipe is provided at: https://groups.cern.ch/group/hn-atlas-dist-analysis-help/Lists/Archive/Pathena%20jobs%20failing%20in%20SWT2. Alden is in contact with the trf developers to get the trf modified.
      • Problem with reading input files like AOD.064188._00433.pool.root.1__DQ2-1243719597 at SWT2: the pilot gave the name as AOD.064188._00433.pool.root, so runAthena failed to find it. Tadashi added a protection in the runAthena script.
      • File look up problem at NET2: From Saul: "The problem is caused because GPFS (our main storage file system) has a hard limit of 65K subdirectories of any single directory. When the ddm system exceeds this limit, "put" errors occur in our local site mover and panda jobs fail because of that....I gather from Kaushik that a general fix is being prepared. In the mean time, we have avoided the limit by replacing some directories with symlinks so that you can create more datasets at NET2."
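      • A minimal sketch of the symlink workaround described above - GPFS caps a directory at ~65K real subdirectories (its link count), but symlinks do not add to that count, so relocated directories can still be reached through their old paths. All paths below are illustrative assumptions:
        # Sketch: move a batch of dataset directories to an overflow area and leave
        # symlinks behind; old paths keep resolving while the parent's subdirectory
        # (link) count drops. Paths are made up for illustration.
        import os
        import shutil

        top = "/gpfs/atlas/dq2"                   # hypothetical over-full directory
        overflow = "/gpfs/atlas/dq2-overflow"     # hypothetical overflow area
        os.makedirs(overflow, exist_ok=True)

        for name in sorted(os.listdir(top))[:1000]:
            src = os.path.join(top, name)
            if os.path.isdir(src) and not os.path.islink(src):
                dst = os.path.join(overflow, name)
                shutil.move(src, dst)             # relocate the real directory
                os.symlink(dst, src)              # keep the old path working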

  • this meeting:
    • AnalysisStep09PostMortem - Thanks to all sites for providing detailed info here. Could we use some more info from BNL? The ATLAS post-mortem was on July 1st. A summary was presented by Graeme in ATLAS Week yesterday. WLCG post-mortem coming up, July 9-10.
    • Status of DB access job at SWT2: Patrick managed to run the job by hand. He made a change in InstallArea/python/PyUtils/AthFile.py. Now we are waiting for a response from Sebastien Binet on this.
    • Stress testing of DB access jobs at other sites: BNL, AGLT2, MWT2 passed the stress test (200 jobs, >95% success rate). NET2 still has input files with bad checksums. This job is to be put into HammerCloud.
    • Status of cosmic job: Instructions from Hong Ma. I ran this job fine at BNL. It only runs at BNL and not at other sites (tried AGLT2), since BNL provides local POOL files via the poolcond directory at /usatlas/workarea/atlas/PFC/catalogue/poolcond. Hong commented that the regular DB release does not come with these POOL files, so unless something similar is done as at BNL, the job will fail. Need help from experts on this.
    • User analysis challenge in US with step09 samples:
      • Two samples are now replicated to US sites: step09.00000011.jetStream_medcut.recon.AOD.a84/ (estimated total size 14900 GB, 9769 files, 97.69M events), step09.00000011.jetStream_lowcut.recon.AOD.a84/ (estimated total size 3674 GB, 2749 files, 27.49M events)
      • Validation work ongoing. SUSYValidation job is successful, now running on full sample at SWT2.
      • Day of analysis challenge to be announced by Jim C. Sites are scheduling downtimes, AGLT2 (July 9) and SLAC (needs 4 days).
    • User analysis issues:
      • User reports on problems with "R__unzip: error in inflate" at SWT2. Issue discussed at AnalysisStep09PostMortem by Patrick and Wei. From Wei today:
        The problem, if it is as what the xrootd developer identified before, is a
        cache replacement issue of root client, which only happens if a job reads
        from xrootd servers directly. If a file is copied to batch node, this cache
        is not used. According to the developer, the offending file will _likely_
        repeat the problem most of the time.
        
        I am not sure whether upgrading xrootd client package at site will help (it
        may). It is a little risky because the buggy root client is embedded in
        ATLAS releases. Using LD_LIBRARY_PATH at a site will modify a lot of things
        so we must be very careful. To solve the root of the problem, we need xrootd
        developers to work with the ROOT team and ATLAS.
      • Wei: why did this show up during step09? How is it triggered? A ROOT patch needs to go into the ATLAS release. Happening at SWT2 and SLAC; user's jobs are successful at NET2. Perhaps turn off direct reading. Can we reproduce the error? Nurcan claims the error was seen for several types of HC jobs.
        • We need to have a lightweight/simple way to recreate problems (see the sketch after this item).
        • Need to coordinate updates to ATLAS releases and updates to ROOT - contact through David Q.
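        • A sketch of such a lightweight reproducer, assuming direct xrootd reading via PyROOT - the URL and tree name below are placeholders:
          # Sketch: open a suspect file directly over root:// and read every entry of a
          # tree; this forces decompression of each basket, which is where the
          # "R__unzip: error in inflate" failures appear. URL/tree name are assumptions.
          import ROOT

          f = ROOT.TFile.Open("root://xrootd.example.org//atlas/suspect/AOD.pool.root")
          tree = f.Get("CollectionTree")          # assumed tree name
          for i in range(tree.GetEntries()):
              tree.GetEntry(i)                    # read (and decompress) every basket
          print("read", tree.GetEntries(), "entries without an unzip error")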
      • We need to address the problem of distributing the conditions data COOL files and XML files. This is a serious issue that is not being addressed by ADC operations. Fred will follow up with Jim Shank. We should also discuss this as an item at the L2 management meeting.

ICB meeting (Michael)

  • Discussion of step09 - major issue discussed was efficiency.
  • Sites having the most data were the least efficient.

DDM Operations (Hiro)

Tier 3 issues (Doug)

  • last meeting(s)
    • Torre, Kaushik, and Doug met last week about using Panda jobs to push data to Tier 3s. A Tier 3 would need an SRM and to be in the ToA. A new panda client would serve data rather than using the full subscription model.
    • Kaushik is preparing draft for ATLAS review.
    • Does need an LFC someplace. This will be provided by BNL.
    • Few weeks of discussion to follow. Will take a month of development. Client is not too difficult.
    • Rik and Doug have been doing dq2-get performance testing on a dataset with 158 files, O(100 GB).
    • Tested from 3 places (ANL, FNAL, Duke) against most Tier 2's.
    • Rik reported on results to the stress-test.
    • Has seen copy errors of ~1%. Have also seen checksums not agreeing.
    • Lots of work on VMs. Looking like a good solution for Tier 3. Virtualize headnode systems.
    • No news on BNL srm-xrootd.
    • Would like to get analysis jobs similar to HC to use for Tier 3 validation (local submission).
  • this meeting
    • CERN virtualization workshop - discussion regarding head node services.
    • BNL is providing some hardware for virtualizing Tier 3 clusters.
    • Considering rPath.

Conditions data access from Tier 2, Tier 3 (Fred)

  • last week
    • https://twiki.cern.ch/twiki/bin/view/Atlas/RemoteConditionsDataAccess
    • Needs to be solved quickly - Sasha, Richard Hawkings, Elizabeth Gallas, David Front.
    • Jobs are taking up connections - they hold them open for a long time.
    • Squid tests at AGLT2, MWT2, WT2 successful - reduces load on the backend.
    • Fred will contact NE and SW to set up squid caches; validate with Fred's test jobs.
    • COOL conditions data will be subscribed to sites; owned by Sasha
    • An XML file needs to be maintained at each site using Squid - for RAW, ESD, or even AOD data; this needs to be structured centrally, since all sites will want to do this.
    • Michael has discussed this w/ Massimo; needs to bring this up at the ADC operations meeting.
    • More ATLAS coordination is involved
    • UTA squid is setup - just need to test the client - will send John info
    • BU squid - almost ready
  • this week
    • see note above.

Data Management & Storage Validation (Kaushik)

Throughput Initiative (Shawn)

  • last week:
    • Note:
       
      **ACTION ITEM**  Each site needs to provide a date before the end of June for their throughput test demonstration. Send the following information to Shawn McKee and CC Hiro:
      a) Date of test (sometime after June 14 when STEP09 ends and before July 1)
      b) Site person name and contact information. This person will be responsible for watching their site during the test and documenting the result.
      For each site the goal is a graph or table showing either:
      i)  400 MB/sec (avg) for a 10GE-connected site
      ii) Best possible result if you have a bottleneck below 10GE
      Each site should provide this information by close of business Thursday (June 11th). Otherwise Shawn will assign dates and people!!
    • Last week BNL --> MWT2_UC throughput testing, see: http://integrationcloud.campfirenow.com/room/192199/transcript/2009/06/18
    • Performance not as good as hoped, but 400 MB/s milestone reached (peak only)
      • For some reason the individual file transfers were low
    • NET2 - need 10G NIC
    • SLAC - may need additional gridftp servers
    • UTA - CPB - iperf tends to vary to dcdoor10. 300-400 Mbps, mostly 50-75 Mbps; 700-800 Mbps into _UTA (probably is coming back).
    • OU - need new storage.
    • AGLT2 - directional issues
  • this week:
    Topics for today
    1) perfSONAR updates/discussion
    Release candidate will be out for Joint Techs (July 19-23 at IU). Old deadline was June 30th, 2009; now delayed till August 10. Jay has his graphics page available, but there are concerns. Need to contact Jeff Boote about status. ACTION ITEM: Contact Jeff Boote about status for the next release and whether there is a good time to discuss this. Some discussion about Tier-3 adoption and the usefulness of graphs on the front page of the perfSONAR web interface.
    2) Automated data-movement testing status (Hiro)
    New machine now available for test hosting. Hiro will provide an API (via HTTP, so Jay can get results). The test will initially be 10 files of 3.6 GB each. Timescale for initiating the test is one or two weeks. Hiro will provide MIN, MAX, AVG, STD-DEV for the transfer results for each site.
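    An illustrative sketch of the per-site summary described above (MIN, MAX, AVG, STD-DEV over measured rates); the numbers are made up and the real reporting is Hiro's, not this code:
      # Sketch: summarize measured transfer rates (MB/s) for one site.
      import statistics

      rates_mb_s = [310.5, 280.2, 415.0, 120.8, 398.6]   # hypothetical per-file rates

      summary = {"MIN": min(rates_mb_s),
                 "MAX": max(rates_mb_s),
                 "AVG": statistics.mean(rates_mb_s),
                 "STD-DEV": statistics.stdev(rates_mb_s)}
      print(summary)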
    3) Report on GridFTP / Google Summer of Code work. (Future report - no report today.)
    4) Scheduling throughput milestone demonstrations - current status discussion.
    5) Site Reports (include perfSONAR status and milestone status in your report)
    a. BNL - BNL's FTS is set up for GridFTP2 for MWT2 and AGLT2 (BNL->Tier-2 only). Plan to change to use GridFTP2 for Tier-2->BNL during the next dCache downtime. 'srmcp' is not performing well BNL->MWT2_IU; 'gridftp' is working OK BNL->MWT2_IU. Setting up a Tier-3 DQ2 and LFC instance (for all Tier-3s). Will try for the 1 GB/s test at 10 AM tomorrow (7/8/2009). Hiro will be in the ATLAS Campfire site, and it will be useful to have sites join there in case there are problems or questions.
    b. AGLT2 - Will participate in the 1 GB/s test tomorrow and try to finish milestone goals then.
    c. MWT2 - IU -> UC achieved 1.2 GB/s (via srmcp into a space-token area).
    d. NET2 - Installing new (Myricom) 10GE NICs soon. Will schedule milestone testing after they are in place.
    e. SWT2 - No report.
    f. WT2 - Nothing to report.
    g. Wisconsin - Nothing to report.
    6)      AOB
     
    Need further information for running transfers. Being able to identify the hardware involved in specific transfers is critical to determining the cause of poor transfer rates. Hiro observes transfers from BNL->Tier-2s where some files are moving at 10's of MB/sec while others are moving at KB/sec; both source and destination in these cases are the same. Will need to revisit this topic in a future meeting.
     
    An additional buffer to check on 10GE receivers:
     
    root@umt3int01 ~# sysctl -a | grep backlog
    net.core.netdev_max_backlog = 10000
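    # to raise it at runtime (placeholder value): sysctl -w net.core.netdev_max_backlog=<desired value>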

  • 1 GB/s throughput milestone going on now.
  • August 10 target to deploy the new perfSONAR package.
  • We need to begin using these tools during our throughput testing.
  • July 19-20

OSG 1.2 validation (Rob, Xin)

  • Testing in progress on UCITB-EDGE7 - requires a scheddb change.
  • Xin - adding BNL as a test site in Panda

Site news and issues (all sites)

  • T1:
    • last week: Another fiber cut in US LHCNet - 10 Gbps down to 5 Gbps on OPNET (FNAL also affected). Second time in the last couple of weeks; worrisome. Nexsan shipment coming next week, 32 TB usable. PNFS ID implemented in one of the tables, and helping enormously.
    • this week: HPSS maintenance on July 14. 52 units of storage coming to BNL today; expect to have this completed quickly. Have decided against using Thor extension units (expense) - will use the FC-connected Nexsan units instead. Have submitted an order for 120 worker nodes (Nehalem), toward the 3 MSI2k goal, plus 3 Force10 ExaScale network chassis. Observed a couple of bottlenecks during STEP09; will get a 60 Gbps backbone. HPSS inter-mover upgrade to 10 Gbps. Note the ATLAS resource request is still under discussion, not yet approved by the LHCC; the resource request for Tier 2's is at the same level as we've known from before.

  • AGLT2:
    • last week:
    • this week: Downtime planned for tomorrow, 8 am to 6 pm. Need a fix for GLUE schema reporting. Updating BIOS and firmware for storage controllers. Jumbo frames on the public network. There were some outages during the last two weeks, but these are understood. AFS to be upgraded - hope this will stabilize it. gPlazma on the headnode; monitoring of the gPlazma logfile.

  • NET2:
    • last week(s):
    • this week: Working on a number of DDM issues. Squid and Frontier client now installed, waiting for Fred to test. About to install new Myricom 10G cards, then will repeat throughput tests. 130 TB of storage on the way.

  • MWT2:
    • last week(s): local site mover; Cisco 6509 overheated during weekend temperature incident.
    • this week: Have been running smoothly. Saturated 10G link between IU and UC, 14K files transferred. dCache seems to be stable now. No failures. ~5K xfers / hour. Try smaller files to boost up the SRM rate.

  • SWT2 (UTA):
    • last week: TP test later today; squid; network troubleshooting to BNL; analysis job issues w/ DB access.
    • this week: CPB running okay. Working on analysis queue issue discussed above. Ibrix system being upgraded for SWT2 cluster. Still working on squid validation w/ frontier clients.

  • SWT2 (OU):
    • last week:
    • this week: stable here.

  • WT2:
    • last week: SLAC offline at the moment. Bug in the POSIX preload library for 64-bit machines. Preparing storage.
    • this week: preload library problem fixed. Working on a procurement - ~10 thor units. ZFS tuning. Looking for a window to upgrade the Thumpers. Latest fix for xrootd client is in the latest ROOT release.

Carryover issues (any updates?)

Squids and Frontier (John DeStefano)

  • Note: this activity involves a number of people from both Tier 1 and Tier 2 facilities. I've put John down to regularly report on updates as he currently chairs the Friday Frontier-Squid meeting, though we expect contributions from Douglas, Shawn, John, and others at Tier 2's as developments are made. - rwg
  • last meeting(s):
  • this week:
    • SLAC - would like to move a squid.
    • Moving the weekly meeting to 10am Eastern.

Release installation, validation (Xin)

The issue of validating the presence and completeness of releases at sites.
  • last meeting
    • Transfer of new pacball datasets to BNL much improved. Tom/UK will handle subscriptions.
    • Tadashi changed panda mover to make install jobs highest priority. Should see releases installed very quickly.
    • Six new installation sites added to Panda - some configs need changes.
  • this meeting:
    • There was no official announcement for the current

HTTP interface to LFC (Charles)

VDT Bestman, Bestman-Xrootd

  • See BestMan page for more instructions & references
  • last week
    • Have discussed adding an Adler32 checksum to xrootd. Alex is developing something to calculate this on the fly and expects to release it very soon. Want to supply this to the gridftp server (illustrative sketch after this list).
    • Need to communicate w/ CERN regarding how this will work with FTS.
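    • An illustrative sketch of computing an Adler32 checksum on the fly while data is read in chunks - this is not Alex's implementation, just the general technique; the chunk size and hex formatting are arbitrary choices:
      # Sketch: incremental Adler32 over a file read in chunks, as one would while
      # streaming data through a server; returns the usual 8-digit hex form.
      import sys
      import zlib

      def adler32_of(path, chunk_size=1024 * 1024):
          value = 1                                # standard Adler32 seed
          with open(path, "rb") as f:
              while True:
                  chunk = f.read(chunk_size)
                  if not chunk:
                      break
                  value = zlib.adler32(chunk, value)
          return "%08x" % (value & 0xffffffff)

      print(adler32_of(sys.argv[1]))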
  • this week

Local Site Mover

  • Specification: LocalSiteMover
  • code
  • this week if updates:
    • Want to shift to the lsm at UC - there are two parts: the pilot piece and what is provided by the site. There is a problem in the pilot with the lsm-put method. Charles would like to try out a fix to the pilot.

AOB

  • last week
  • this week
    • None


-- RobertGardner - 30 Jun 2009
