r1 - 07 May 2010 - 19:39:30 - RichardMount

Minutes7May2010 - RAC core team minutes, May 7, 2010 - DRAFT

Core Team Participation

Richard Mount, Jim Cochran, Michael Ernst, Kaushik De

Other Participants

Armen Vartapetian, Kevin Black, Mike Tuts

Report from the CREM Meeting (Richard)

The previous day's CREM meeting topics:
  1. Revisiting the MC ESD distribution issue. Kors suggested stopping pre-emptive distribution to T2s and waiting to see what was really needed. Several arguments were raised against this, such as the delay between observing heavy access at T1s and getting the data to the T2s. Part of Kors's suggestion seemed to be a reaction to the original proposal for 3 copies at US T2s. The consensus seemed to be that one copy pre-emptively distributed to T2s was OK for now.
  2. Small files. This started as a side issue but occupied most of the meeting. The discussion was triggered by problems sending data to Tokyo that were attributed to the ~45 seconds it takes to initiate a file transfer. With the current level of parallelism in the data transfer system (tuned to work for large ESDs), this amounts to completing one file every 8 seconds. If the files are ~10 Mbytes each, the efficiency of network use falls to a percent or so. The direction of the discussion favored more systematic merging of all small files. An extensive email discussion followed the CREM meeting.
  3. Deleting old MC. Strong reactions to the five days' notice of the deletion of some old MC datasets (10 TeV) had resulted in a decision to keep one copy of the MC08 AODs around for two months on MCDisk. It was commented that the 10 TeV MC is special because it is unlikely to be superseded.
  4. Deletion Policy. Discussion was postponed to the next CREM (in 2 weeks) due to lack of time.
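The small-files arithmetic in item 2 can be sketched as follows. The 8-seconds-per-file rate and ~10 Mbyte file size are from the minutes; the 1 Gbps effective per-transfer rate assumed below is illustrative (the minutes do not state the reference bandwidth), chosen because it reproduces the quoted "a percent or so" efficiency.

```python
# Back-of-envelope check of the small-file inefficiency discussed in item 2.
# Figures from the minutes: one file completes roughly every 8 s at the
# current parallelism, files are ~10 Mbytes. The 1 Gbps effective rate is
# an assumption for illustration, not from the minutes.

FILE_SIZE_MB = 10        # typical small file, megabytes
SECONDS_PER_FILE = 8     # one file completes every 8 s
ASSUMED_GBPS = 1.0       # hypothetical effective rate the system is tuned for

achieved_mbps = FILE_SIZE_MB * 8 / SECONDS_PER_FILE   # megabits per second
efficiency = achieved_mbps / (ASSUMED_GBPS * 1000)

print(f"achieved: {achieved_mbps:.0f} Mbps, efficiency: {efficiency:.1%}")
```

At 10 Mbps achieved against a 1 Gbps target, efficiency is 1%, consistent with the "percent or so" quoted in the discussion; merging small files raises the bytes moved per fixed 45-second setup cost.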

Reported disruption of Jet analysis

Ian had reported (at Thursday's PS&C meeting) problems with Jet analysis that seemed to be caused by data distribution issues. The dataset concerned was group10.perf-jets.data10_7TeV.00153159.physics_MinBias.recon.ESD.f249_JetEtMissDPDModifier000006_p1_EXT. Kaushik believed that the problems were caused by the dataset being located at Taiwan and Lyon. Taiwan had serious problems and Lyon was seriously overloaded. BNL now has a full copy of this dataset. It can be further replicated to US T2s if necessary.

This week's US Operations actions for RAC endorsement/comment (Kaushik, Armen)

Kaushik reported an interesting week:

Analysis Overload

On Monday a serious backlog of analysis jobs was noticed at BNL - 600 running jobs and 60k in the queue. Michael and Torre were alerted and BNL added 600 cores to the analysis queue. All 1200 are now in continuous use but the backlog has grown (after an initial small decrease). It seems that strong interest in the data from two days of high(er) luminosity running last week is to blame.

'Why can't people run somewhere else?' was asked. The data are at BNL and Lyon, and both are backlogged. The agreed distribution should have resulted in 3.25 copies at US T2s, but these have not yet arrived. Michael and Kaushik outlined the technical problems with distribution: a long list of MC, reprocessed and real data to be distributed, no prioritization of the distribution queue, and poor performance of some T2 storage systems. Network bandwidth appeared not to be a bottleneck. Some T2 storage systems could accept data at close to wire speed (10 Gbps), but some were at or below 30% of this. Michael advised addressing both the transfer-prioritization and the storage-performance issues. Kaushik noted that heavy data distribution traffic had also slowed data deletions on the overloaded storage.
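To give a feel for why the storage-performance gap matters, the sketch below converts the quoted rates into TB/day. The 10 Gbps wire speed and the 30% figure are from the minutes; the 50 TB backlog used to show the effect on clearing time is purely hypothetical.

```python
# Rough numbers behind the storage-performance concern: ingest at full
# 10 Gbps wire speed versus 30% of it (both figures from the minutes).
# The 50 TB backlog is a hypothetical illustration only.

WIRE_GBPS = 10.0

fast_tb_per_day = WIRE_GBPS / 8 * 86400 / 1e3   # Gbps -> GB/s -> TB/day
slow_tb_per_day = 0.30 * fast_tb_per_day

backlog_tb = 50.0  # hypothetical distribution backlog
print(f"full wire speed: {fast_tb_per_day:.0f} TB/day, "
      f"backlog cleared in {backlog_tb / fast_tb_per_day:.1f} days")
print(f"30% of wire:     {slow_tb_per_day:.0f} TB/day, "
      f"backlog cleared in {backlog_tb / slow_tb_per_day:.1f} days")
```

A site at 30% of wire speed accepts roughly 32 TB/day instead of 108 TB/day, so the same backlog takes more than three times as long to drain, independent of network capacity.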

The RAC recommended that BNL add cores to the analysis queues as a temporary measure until reprocessing started. Richard commented that instances like this, of very hot new data being intensely analyzed, would not be the norm for ATLAS. It was appropriate to address this issue with a one-time fix of this nature.

BNL Scratch Space

BNL staff had added 35 TB to Scratch last week at Armen's request, in preparation for a known-in-advance request. Yesterday an additional, unannounced ~1 TB/h had begun to arrive on BNL Scratch. BNL responded to another request from Armen and added 40 TB while the Operations team tried to find out who or what was responsible for the transfer; the investigation is in progress.
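The headroom bought by the emergency addition is easy to quantify. Both figures below (40 TB added, ~1 TB/h inflow) are from the minutes.

```python
# How long the emergency Scratch addition lasts at the observed inflow.
# 40 TB added and ~1 TB/h unannounced arrival rate, as reported above.

added_tb = 40.0
inflow_tb_per_hour = 1.0

hours_of_headroom = added_tb / inflow_tb_per_hour
print(f"{hours_of_headroom:.0f} hours (~{hours_of_headroom / 24:.1f} days) of headroom")
```

At that rate the 40 TB buys well under two days, which underlines the later point that ad hoc additions are not a scalable solution.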

There is a policy limit of no more than 0.5 TB of unapproved transfers to Scratch. This has obviously been ignored, but the instrumentation and monitoring of the distribution system did not make it easy for the Operations team to track down who was responsible.

Richard commented that, unlike the 'hot data' problem, the Scratch disk problem would not go away, and the (commended) actions of BNL staff and Armen were not a scalable solution. Scratch should be kept small, miscreants should have their files deleted, and a more scalable approach of putting substantial disk space under the management of analysis groups should be considered. Michael insisted that we have to have monitoring tools (without any dissent). Jim asked whether it would help if he found someone to work on this; the answer was 'Yes'.

User request for data at SLAC

This morning Kaushik received a request for 45 TB of data to be transferred to SLAC. The request was from a graduate student. After consultation with the T2, most of the request was denied and a portion was put on hold. The logic was: a) SLAC doesn't have enough space; b) most of the data exist elsewhere in the US; c) for the on-hold portion, the data should be replicated in the US soon, but if this does not happen that portion will be reconsidered. All parties (Operations/SLAC/user) were OK with this. The RAC was happy, but noted that this does not appear to be a very scalable approach.

How to publicize current dataset distribution policies and realities as they affect the US? (Richard)

Richard noted that he had dutifully attached emails describing the current data distribution strategy to the RAC minutes, but it remained very difficult for anyone to find out either what the distribution was supposed to be, or what the current real distribution was. Kaushik said that he found it impossible to get the distribution targets out of anything other than emails and notes. Adequate automated reporting was a long way off. Richard proposed that he try to assemble and maintain the reasonably static information (agreed distribution targets) on a single web page. Unsurprisingly, this was agreed.

Progress/problems with exotics production (Kevin)

There had been delays while the analysis group agreed on the exact parameters for the simulation. Kevin now expected the specification to be in the production cache by Monday. Kaushik said that the next step was for Kevin to send email to Borut so that the production request could be put into Borut's spreadsheet. Borut should then submit the tasks that will get transmitted to Kaushik et al. The RAC will observe this with interest.



Action Items

  1. 5/7/2010: Richard, Create a web page summarizing the dataset distribution targets in the US
  2. 4/9/2010: Kevin, Create first version of a Twiki guiding US physicists on requesting Additional Production. (In progress)
  3. 3/26/2010: All but especially core team, Find time during the ADC Workshop next week to identify a point-of-contact for Valid US Regional Production (on hold pending a better definition of what this task should be following successful completion of some Exotics Additional Production).
