r3 - 13 Nov 2008 - 14:18:23 - MarcoMambelli

LFCMeetNov12

Introduction

  • Meeting of the sub-committee to address LFC migration in the US cloud.
  • Coordinates:
    • Wednesday, Nov 12, 10am Central/5pm CERN
    • new phone (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.
  • Background: SubCommitteeLFC, FileCatalog
  • Attending: Paul, Charles, John, Marco
  • Apologies: Horst

Discussion of local site mover

See LocalSiteMover for specification. Additional discussion:
  • error description returned in stderr, no separate function
  • different exit codes for different errors
  • timeout handled in the pilot code
  • no need for GUID
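
The points above could look roughly like this from the pilot's side. This is only a sketch, not the LocalSiteMover specification: the exit-code values, the EXIT_MEANINGS table, the run_mover helper, and the FAKE_MOVER stand-in command are all hypothetical names chosen for illustration.

```python
import subprocess
import sys

# Hypothetical exit-code table; the real values belong to the LocalSiteMover spec.
EXIT_MEANINGS = {
    0: "success",
    1: "generic failure",
    2: "source file not found",
    3: "permission denied",
}
TIMEOUT_CODE = -1  # assigned by the pilot itself, never returned by the mover

# Stand-in for a real mover: writes a description to stderr and exits with 2.
FAKE_MOVER = [sys.executable, "-c",
              "import sys; sys.stderr.write('no such file\\n'); sys.exit(2)"]


def run_mover(cmd, timeout_s):
    """Run a site-mover command and return (code, meaning, stderr text)."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The timeout is enforced here, in the pilot, not inside the mover.
        return TIMEOUT_CODE, "timed out in pilot", ""
    meaning = EXIT_MEANINGS.get(proc.returncode, "unknown error")
    # The error description is read from stderr; no separate call is needed.
    return proc.returncode, meaning, proc.stderr.strip()
```

Calling run_mover(FAKE_MOVER, 10) returns the exit code, its meaning from the table, and whatever the command wrote to stderr. Since no GUID is needed, none appears in the interface.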

LFC launch schedule

Oct 1  - LFC subcommittee meeting to discuss the launch plan (changes needed in all services and tools to switch a site) and resolve any VOMS group/role issues, etc.
Oct 7  - AGLT2: main burn-in and validation on a single site,
followed by the rest of the sites:
Oct 14 - UC, UTA
Oct 16 - IU, BU
Oct 20 - BU, SLAC
Oct 21 - OU, BNL
Oct 23 - WISC

BU

  • last week
    • The issue is at the Harvard site: worker nodes are behind a firewall, so LFC communication, which needs only port 5010, is blocked. Looking for solutions; considering putting the LFC on a border machine at Harvard.
    • One question: how was the LRC working?
    • An LFC proxy is another possibility.
    • There may also be lcg-cp communication: port 8443 for Bestman, plus multiple high-range ports for gridftp transfers.
    • Tiny proxy works for scp.
    • Site mover changes? It would have to do a reliable copy and use the space token properly.
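
To confirm from a worker node which of the ports mentioned above are actually blocked, a plain TCP connection test is usually enough. The sketch below is a minimal illustration; port_reachable is a hypothetical helper, and the host names to test are site-specific and not given in these notes.

```python
import socket


def port_reachable(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Ports named in the discussion: 5010 for the LFC, 8443 for Bestman.
PORTS_TO_CHECK = {"LFC": 5010, "Bestman SRM": 8443}
```

Running port_reachable(lfc_host, 5010) from a worker node and again from a border machine would show whether the firewall is the only obstacle. A successful connection says nothing about the high-range gridftp data ports, which would need their own check.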

UTA

  • last week
    • Production running again at SWT2_CPB. Nominally okay.
    • Issue with OU - group ownerships in the catalog itself.
    • The directory in the LFC is /grid/atlas/users/pathena/. It was created by Paul during execution, so his DN owns the directory; the group would be atlas/role=production. Some users have the usatlas role. Was the correct proxy used? And how did the LFC allow this?
    • Note that LFC handles multiple identities, so it can use both roles.

OU

  • last week
    • Work in progress: Paul is testing, and there are some problems. Permission issues are being sorted out.

SLAC

  • starting

BNL

  • Determine schedule

WISC

Broken production credentials w/ DQ2

From Patrick:

Hi,

We seem to be having some issues with our DQ2 system supporting SWT2_CPB_*. It is having problems registering data in the LFC with the DQ2 proxy because of the credentials used by something in the production system, either PandaMover or a job. I strongly suspect the latter (PandaID=18930642) based on what was created.

There have been 3.7 million error messages in the last 24 hours concerning registration problems for ~2400 files.

I suspect that I can fix the problem with an lfc-chgrp -R command, but ultimately there is either a problem with our configuration or there is a potential problem with the production system since it can seemingly start affecting DQ2 subscriptions.

Has anyone else seen something similar?

The details:

The production system created some LFC directories:

/grid/atlas/dq2/mc08_misal1_mc12
/grid/atlas/dq2/mc08_misal1_mc12/AOD
/grid/atlas/dq2/mc08_misal1_mc12/log

These directories are owned by the production cert with group of "/atlas/usatlas/Role=production" as seen in:

[atlasddm@gk06 ~]$ export LFC_HOST=gk02.atlas-swt2.org
[atlasddm@gk06 ~]$ lfc-getacl /grid/atlas/dq2/mc08_misal1_mc12
# file: /grid/atlas/dq2/mc08_misal1_mc12
# owner: /DC=org/DC=doegrids/OU=People/CN=Nurcan Ozturk 18551
# group: atlas/usatlas/Role=production
user::rwx
group::rwx              #effective:rwx
other::r-x
default:user::rwx
default:group::rwx
default:other::r-x

[atlasddm@gk06 ~]$ lfc-ls -ld /grid/atlas/dq2/mc08_misal1_mc12
drwxrwxr-x   2 109      109                       0 Nov 07 08:51 /grid/atlas/dq2/mc08_misal1_mc12

(Where 109 in the userinfo table is the production cert and 109 in the groupinfo table is atlas/usatlas/Role=production).

I did not think that this was supposed to happen, since the production system is supposedly using the "double proxy" configuration, in which "atlas/Role=production" comes first, so that LFC entries would be owned by this group.

Meanwhile the credentials for DQ2 utilize "atlas/Role=production" as seen with:

[atlasddm@gk06 ~]$ voms-proxy-info -all

subject   : /DC=org/DC=doegrids/OU=People/CN=Patrick McGuigan 416226/CN=proxy/CN=proxy
issuer    : /DC=org/DC=doegrids/OU=People/CN=Patrick McGuigan 416226/CN=proxy
identity  : /DC=org/DC=doegrids/OU=People/CN=Patrick McGuigan 416226/CN=proxy
type      : unknown
strength  : 512 bits
path      : /opt/dq2/certs/dq2_proxy.pem
timeleft  : 92:52:27
=== VO atlas extension information ===
VO        : atlas
subject   : /DC=org/DC=doegrids/OU=People/CN=Patrick McGuigan 416226
issuer    : /DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch
attribute : /atlas/Role=production/Capability=NULL
attribute : /atlas/lcg1/Role=NULL/Capability=NULL
attribute : /atlas/Role=NULL/Capability=NULL
attribute : /atlas/usatlas/Role=NULL/Capability=NULL

This causes an incompatibility: the DQ2 credentials cannot create subdirectories in the LFC tree owned by the production credentials.
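
The failure can be reasoned through with ordinary owner/group/other mode bits. The sketch below is a simplification that ignores extended ACL entries, and can_write is a hypothetical helper, not an LFC API. The directory values (mode 775, owner and group ID 109) come from the lfc-ls output above. The requester IDs come from the LFC log excerpt in this message, assuming the first number in parentheses is the user ID and the rest are group IDs.

```python
def can_write(mode, owner_uid, owner_gid, uid, gids):
    """POSIX-style check: pick the owner, group, or other write bit."""
    if uid == owner_uid:
        return bool(mode & 0o200)
    if owner_gid in gids:
        return bool(mode & 0o020)
    return bool(mode & 0o002)


# Directory as listed: drwxrwxr-x, user 109, group 109.
DIR_MODE, DIR_UID, DIR_GID = 0o775, 109, 109

# DQ2 proxy as logged: (101,102,103,101,104), read as uid 101 plus four gids.
DQ2_UID, DQ2_GIDS = 101, {102, 103, 101, 104}

dq2_allowed = can_write(DIR_MODE, DIR_UID, DIR_GID, DQ2_UID, DQ2_GIDS)
```

Under these assumptions dq2_allowed is False: the DQ2 proxy is neither user 109 nor a member of group 109, so it falls under "other", which has r-x only. Changing the directory's group to one the DQ2 proxy holds would let the group write bit apply.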

Failures in the LFC log look like:

11/10 14:22:15 10935,0 Cns_srv_mkdir: NS092 - mkdir request by /DC=org/DC=doegrids/OU=People/CN=Patrick McGuigan 416226 (101,102,103,101,104) from gk06.atlas-swt2.org
11/10 14:22:15 10935,0 Cns_srv_mkdir: NS098 - mkdir /grid/atlas/dq2/mc08_misal1_mc12/AOD/mc08_misal1_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v14022302_tid028350  775 22
11/10 14:22:15 10935,0 Cns_srv_mkdir: returns 13
11/10 14:22:15 10935,0 Cns_srv_addreplica: NS092 - addreplica request by /DC=org/DC=doegrids/OU=People/CN=Patrick McGuigan 416226 (101,102,103,101,104) from gk06.atlas-swt2.org
11/10 14:22:15 10935,0 Cns_srv_addreplica: NS098 - addreplica 2AA9C0DE-E5AE-DD11-8180-00A0D1E7FC70 gk03.atlas-swt2.org srm://gk03.atlas-swt2.org/srm/v2/server?SFN=/xrd/mcdisk/mc08_misal1_mc12/AOD/mc08_misal1_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v14022302_tid028350/AOD.028350._00686.pool.root.1
11/10 14:22:15 10935,0 Cns_srv_addreplica: returns 2
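
With millions of such messages, a tally by call and return code is easier to read than the raw log. The sketch below assumes the number after "returns" follows standard errno numbering, so that 13 is EACCES (permission denied) and 2 is ENOENT (no such entry). That reading is consistent with the ownership problem described, but it is an assumption. summarize_returns and RETURN_RE are hypothetical names.

```python
import errno
import re
from collections import Counter

# Matches lines such as "... Cns_srv_mkdir: returns 13".
RETURN_RE = re.compile(r"(Cns_srv_\w+): returns (\d+)")


def summarize_returns(lines):
    """Count non-zero return codes per LFC server call."""
    counts = Counter()
    for line in lines:
        match = RETURN_RE.search(line)
        if match is None:
            continue
        code = int(match.group(2))
        if code != 0:
            # Translate the number to its symbolic errno name when known.
            name = errno.errorcode.get(code, "UNKNOWN")
            counts[(match.group(1), name)] += 1
    return counts


SAMPLE = [
    "11/10 14:22:15 10935,0 Cns_srv_mkdir: returns 13",
    "11/10 14:22:15 10935,0 Cns_srv_addreplica: returns 2",
]
```

Feeding the server log to summarize_returns line by line would show whether the addreplica failures simply follow from the failed mkdir calls, which is what the excerpt suggests.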

Extraneous information that someone might find interesting:

[atlasddm@gk06 ~]$ lfc-ls -l /grid/atlas/dq2/mc08_misal1_mc12
drwxrwxr-x   1 109      109                       0 Nov 07 08:51 AOD
drwxrwxr-x   1 109      109                       0 Nov 07 08:51 log
[atlasddm@gk06 ~]$ lfc-ls -l /grid/atlas/dq2/mc08_misal1_mc12/AOD
drwxrwxr-x   1 109      109                       0 Nov 07 08:51 mc08_misal1_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v14022302_tid028349_sub02737957
[atlasddm@gk06 ~]$ lfc-ls -l /grid/atlas/dq2/mc08_misal1_mc12/AOD/mc08_misal1_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v14022302_tid028349_sub02737957
-rw-r--r--   1 109      109                71402430 Nov 07 08:51 AOD.028349._00006.pool.root.2

[atlasddm@gk06 ~]$ lfc-ls -l /grid/atlas/dq2/mc08_misal1_mc12/log
drwxrwxr-x   1 109      109                       0 Nov 07 08:51 mc08_misal1_mc12.005200.T1_McAtNlo_Jimmy.recon.log.v14022302_tid028349_sub02737959
[atlasddm@gk06 ~]$ lfc-ls -l /grid/atlas/dq2/mc08_misal1_mc12/log/mc08_misal1_mc12.005200.T1_McAtNlo_Jimmy.recon.log.v14022302_tid028349_sub02737959
-rw-r--r--   1 109      109                  672745 Nov 07 08:51 log.028349._00006.job.log.tgz.2

Any advice on how to fix this, and more importantly prevent this in the future is appreciated,

Patrick

AOB


-- RobertGardner - 11 Nov 2008
