r3 - 12 Dec 2010 - 19:43:02 - RichardMount

RAC Minutes, September 24, 2010

Members (*=present, #=apologies)

*Richard Mount, Kevin Black (Physics Forum Chair), #Jim Cochran (Analysis Support Manager), Alexei Klimentov (ATLAS ADC), *Ian Hinchliffe (Physics Advisor), Rik Yoshida (Tier3 Coordinator), *Michael Ernst (U.S. Facilities Manager), Rob Gardner (Integration Coordinator), *Kaushik De (U.S. Operations Manager), *Armen Vartapetian (U.S. Operations Deputy)

Ex-Officio: *Torre Wenaus, *Stephane Willocq, Mike Tuts, Howard Gordon

Correction/approval of minutes of previous RAC meeting.


Summary of Operational Issues in the Last Month (Kaushik)

The US cloud was working well and completely full. At SLAC (only), in the presence of a queue of production jobs and a queue of analysis jobs, there was no walled-off capacity for production so the higher-priority analysis was taking almost all slots. This issue was being addressed. Care was needed to ensure that ATLAS analysis could still compete effectively for SLAC's large non-ATLAS CPU capacity.

Additional Production Issues

It was now possible to give US analysis jobs (jobs with an OSG certificate) priority access to US beyond-pledge resources. There was no such mechanism to give US production requests priority access to the beyond-pledge resources, because the jobs did not appear in the system with OSG certificates. Kaushik and his team were able to manually raise the priority of US additional production, but this required frequent oversight and intervention.

It was agreed that a mechanism (manual or automatic) to give US-requested additional production priority access to US non-pledge resources was needed. The implementation, and the timetable for moving from the manual system, should be determined by Kaushik and the operations team.
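As an illustration only, the gap described above can be sketched in a few lines of Python. This is not the production system's actual logic: the `Job` fields, the `us_requested` tag, the `US_BOOST` value and the `honour_us_tag` switch are all hypothetical names introduced here to show why a certificate-based rule misses US-requested production, and how an explicit tag could close the gap.

```python
from dataclasses import dataclass

# Hypothetical priority increment applied on beyond-pledge resources.
US_BOOST = 100


@dataclass
class Job:
    name: str
    base_priority: int
    has_osg_certificate: bool = False  # true for US analysis jobs
    us_requested: bool = False         # hypothetical tag for US-requested production


def beyond_pledge_priority(job: Job, honour_us_tag: bool = False) -> int:
    """Return the priority a job would get on US beyond-pledge resources.

    With honour_us_tag=False only the certificate is considered, so
    US-requested production (which carries no OSG certificate) is not boosted.
    With honour_us_tag=True the hypothetical tag is honoured as well.
    """
    boosted = job.has_osg_certificate or (honour_us_tag and job.us_requested)
    return job.base_priority + (US_BOOST if boosted else 0)


analysis = Job("us-analysis", base_priority=10, has_osg_certificate=True)
production = Job("us-additional-production", base_priority=10, us_requested=True)
```

Under these assumptions the analysis job is boosted in both modes, while the production job is boosted only when the tag is honoured. Setting such a tag by hand would correspond to the manual intervention described above; setting it at submission time would correspond to an automatic mechanism.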

The issue of output datasets was discussed. Proposers of additional production should be asked

  1. if they really needed any intermediate datasets (ESD ...) that they proposed to keep
  2. if they should be keeping intermediate datasets that they had not proposed to keep.
The balance between "keep everything you might need" and "don't worry, it could be re-computed if you find you need it" would vary with the changing pressure on disk and CPU resources.

Additional production requests

Minbias events sliced in leading truth jet pT, for the track jet analysis (Ian Hinchliffe, Seth Zenz)

This request for 90M inner detector events entered the system as a cafeteria discussion between Ian Hinchliffe/Seth Zenz and Borut. The RAC got to know about it when Borut asked Kaushik and Richard at what priority it should be run (since the Geant4 Simulation Campaign was taking all production resources). The production would write only a special D2PD - AODs were "totally useless" due to lack of detailed track information.

The RAC approved the request.

The difficulty of running additional production during a major official production campaign gave rise to the first "additional production issue" above.

QCD dijets with D* in jets (Chunhui Chen)

Chunhui's request for 6M events was submitted to the RAC on 9/27. The request was on track to obtain all required physics group approvals, and Richard had encouraged Chunhui to proceed to prepare the spreadsheet for Borut pending final RAC approval. The request was for AOD only.

The RAC approved this request.

Ian suggested that it might be useful to keep the ESDs. Richard agreed to ask Chunhui about this.

Status of Rebrokerage (Torre)

Tadashi was waiting for a fix to DQ2 before he could put rebrokerage into production. The fix was ready, but had not yet made it into a release.

Strategy for support of high-memory jobs (high luminosity, heavy ion etc.)

Kaushik had already reported by email that "the current state of the technology to route large memory jobs to the sites/queues prepared to execute them" (see action item) was that the system was ready, but untested.

Richard reported on Charlie Young's presentation at the Future and Upgrade Computing workshop the previous week. Charlie had pointed out the many obstacles in the way of producing an upgrade LOI. One of these was the difficulty of running large-memory simulation tasks. Richard proposed that a multi-site test of running large-memory jobs on the grid was now needed. This was agreed.

In Charlie's absence it was proposed that he be asked to submit a test large-memory production. A large-memory queue already existed at BNL, and SLAC would create a large-memory queue to exploit its 195 8-core machines with 24GB each. In principle, jobs submitted by Borut should automatically find their way to these queues.
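For orientation, the SLAC figures quoted above imply 3GB per core if every core is used. The short Python sketch below works out how many jobs fit on a node when each job needs more than that. The node count, cores and memory come from the minutes; the 4GB job size in the usage note is an assumed example, not a figure from the meeting, and the function names are illustrative.

```python
# Figures quoted in the minutes for the SLAC machines.
NODES = 195
CORES_PER_NODE = 8
MEMORY_PER_NODE_GB = 24


def slots_per_node(job_memory_gb: float) -> int:
    """Concurrent jobs per node, limited by both core count and memory."""
    by_memory = int(MEMORY_PER_NODE_GB // job_memory_gb)
    return min(CORES_PER_NODE, by_memory)


def site_summary(job_memory_gb: float) -> dict:
    """Summarize capacity for jobs of the given (assumed) memory size."""
    per_node = slots_per_node(job_memory_gb)
    return {
        "memory_per_core_gb": MEMORY_PER_NODE_GB / CORES_PER_NODE,
        "slots_per_node": per_node,
        "total_slots": per_node * NODES,
        "idle_cores_per_node": CORES_PER_NODE - per_node,
    }
```

With an assumed 4GB per job, `site_summary(4.0)` gives 6 slots per node and 1170 slots in total, leaving 2 cores idle on each node; jobs of 3GB or less would fill all 1560 cores.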



Action Items

  1. 9/24/2010: Richard: Check with Chunhui Chen whether the ESD should be saved in the QCD dijets production.
  2. 9/24/2010: Kaushik: Organize the nature and timing of the effort to create an automated mechanism to give US-requested additional production priority access to US non-pledged resources.
  3. 8/27/2010: Kaushik and Torre: Investigate the current state of the technology to route large memory jobs to the sites/queues prepared to execute them. Reported at this meeting to be ready but untested.
  4. 5/7/2010: Richard: Create a web page summarizing the dataset distribution policies for the US resources.


