
Minutes23Jul2010 - RAC core team minutes, July 23, 2010

Core Team Participation

Richard Mount, Kaushik De, Michael Ernst, (apologies) Jim Cochran

Other Participants

Armen Vartapetian, Torre Wenaus

Report from the CREM (Richard, Armen)

The agenda is available at http://indico.cern.ch/conferenceDisplay.py?confId=102212

The main discussion was on the Tier0 throughput problems experienced over the weekend. The coincidence of a 350 Hz rate out of the data acquisition and intense activity in preparation for ICHEP had resulted in a Tier0 overload requiring emergency action. The export of Raw data to T1 disk had been stopped by Alexei & Co. (export to tape remained). Reducing ESD export was seriously considered.

The Tier0 problems appeared to be related to internal network bottlenecks. There was no indication that WAN links were overloaded, and nobody on the CREM call knew the magnitude of the WAN utilization (in the RAC meeting Michael said that WAN usage to BNL was far from saturation). In the CREM call, Beata Heinemann opined that the extreme conditions were due to the imminence of ICHEP, and so we should not assume they will persist. (The RAC core group also did not think that action beyond the suppression of Raw --> T1Disk distribution was appropriate at this time.)

Hans von der Schmitt asked whether the US had any issue with the changes made to survive the weekend. Richard/Armen could not respond officially, but did not see a reason to object. (The US PS&C Level 2 meeting, immediately after the CREM, had agreed that the actions were OK.)

The second discussion concerned DESD distribution. Since the US and FR clouds no longer wanted automatic DESD distribution, the total number of primary DESD replicas should be reduced to 5 so that the remainder of the collaboration did not have to increase its hosting of replicas. In the meeting it was noted that the German cloud also wanted to stop DESD replication (because nobody was using them) and the Italian cloud was also likely to start using PD2P and would therefore ask to stop automatic replication. These requests would further reduce the need for replication.

Operations Issues (Kaushik)

PD2P has been running fine.

There has been a storage "crisis" at SLAC, where in spite of PD2P "we couldn't find datasets to delete". The causes were: 1) SLAC is the US T2 with the least disk space until the new purchases are operational; 2) the central deletion system has problems with BeStMan/xrootd sites because they don't assign space rigidly to SpaceTokens. The current way to tell the central system that a particular SpaceToken at an xrootd site should be slimmed is to restart SRM with an artificially low value for the SpaceToken. This is almost unworkable. A way to provide the central system with information via a web services interface has been agreed in principle, but not yet implemented. In addition, too much "archival" data was being sent to SLAC as a result of some misunderstanding about its available disk space.
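For illustration only, the following is a minimal sketch (in Python) of the kind of per-SpaceToken usage report that the agreed-in-principle web services interface might publish. The site name, field names, and numbers are hypothetical assumptions, not an agreed specification.

    import json

    # Sketch of a per-SpaceToken usage report that a BeStMan/xrootd site
    # might publish over a web services interface for the central deletion
    # system. The schema and all values are illustrative assumptions; no
    # such interface had been implemented at the time of this meeting.
    def space_usage_report(site, tokens):
        """Build a JSON document listing, for each SpaceToken, the space
        currently used and the occupancy the site would like to get down to."""
        return json.dumps({
            "site": site,
            "spacetokens": [
                {
                    "token": name,
                    "used_bytes": used,
                    "target_bytes": target,  # central deletion would trim used - target
                }
                for name, used, target in tokens
            ],
        }, indent=2)

    if __name__ == "__main__":
        # Hypothetical numbers for a site that wants ~30 TB freed in DATADISK.
        print(space_usage_report("SLACXRD",
                                 [("ATLASDATADISK", 180 * 10**12, 150 * 10**12)]))

Central deletion could poll such a report and slim each SpaceToken toward its target, instead of relying on an SRM restart with an artificially low SpaceToken value.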

Action Items Cleanup

  1. 5/7/2010: Richard, Create a web page summarizing the dataset distribution targets in the US. (Called into question by the abandonment of distribution targets in favor of usage-driven data distribution.) This action item should remain: although many of the static assignments have been voided by PD2P, there is still a need for a single place where the US distribution policy is summarized.
  2. 4/9/2010: Kevin, Create first version of a Twiki guiding US physicists on requesting Additional Production. (Completed) This action item can be removed.
  3. 3/26/2010: All, but especially the core team, Find time during the ADC Workshop next week to identify a point-of-contact for Valid US Regional Production. (On hold pending a better definition of what this task should be, following successful completion of some Exotics Additional Production.) The present approach to Regional/Additional Production seems not to require such a role; this action item can be deleted.

AOB

Michael noted that BNL was having to add scratch space (10 TB recently). The cause was datasets being copied by DQ2 using BNL as intermediate storage. The cleanup of the intermediate storage relied on central deletion, which was not keeping up. As of Friday morning central deletion had resumed and was removing 7000 files per hour, which was just about adequate. In the longer term, it made more sense for such temporary files to be locally managed.

All new CPU resources were now in production at BNL. The 800 old cores available for analysis had now become 2200 mainly new cores (1200 for long jobs and 1000 for short jobs).

Action Items

  1. 5/7/2010: Richard, Create a web page summarizing the dataset distribution policies for the US resources.
