  • Meeting of the sub-committee to address LFC migration in the US cloud.
  • Coordinates:
    • Wednesday, Nov 12, 10am Central/5pm CERN
    • new phone (309) 946-5300, Access code: 735188; Dial *6 to mute/un-mute.
  • Attending: Paul, Charles, John, Marco
  • Apologies: Horst

Discussion of local site mover

See LocalSiteMover for specification. Additional discussion:
  • error description returned in stderr, no separate function
  • different exit codes for different errors
  • timeout handled in the pilot code
  • no need for GUID
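
The points above can be sketched from the pilot's side. This is only an illustration, assuming the mover is an external command that reports failures through its exit code and stderr; the function name `run_site_mover` and the command passed to it are hypothetical and not taken from the LocalSiteMover specification.

```python
import subprocess

def run_site_mover(cmd, timeout_s):
    """Run a site mover command and return (exit_code, error_text).

    exit_code is None when the pilot-side timeout expires.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The timeout is enforced here, in the caller, not in the mover.
        return None, "timed out after %d s" % timeout_s
    # The error description comes from stderr; the exit code tells
    # the pilot which kind of error occurred.
    return proc.returncode, proc.stderr.strip()
```

A pilot would branch on the returned exit code and log the stderr text as the error description; no GUID is passed to the mover.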

LFC launch schedule

Oct 1  - LFC subcommittee meeting to discuss launch plan (changes needed in all services and tools to switch a site) and resolve any VOMS group/role issues, etc.
Oct 7  - AGLT2: main burn-in and validation on a single site, followed by the rest of the sites:
Oct 14 - UC, UTA
Oct 16 - IU, BU
Oct 20 - BU, SLAC
Oct 21 - OU, BNL
Oct 23 - WISC


  • last week
    • The issue is with the Harvard site: worker nodes are behind a firewall, so LFC communication is blocked; port 5010 is the only port involved. Looking for solutions; considering putting the LFC on a border machine at Harvard.
    • One open question is how the LRC was working in this setup.
    • LFC proxy is another possibility.
    • There may also be lcg-cp communication - port 8443 for Bestman. Multiple high-range ports for gridftp transfers.
    • Tiny proxy works for scp.
    • Site mover changes? Would have to do reliable copy, and use the space token properly.
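
A quick way to see which of these ports is actually reachable from a worker node is a plain TCP connect test. The sketch below assumes nothing about the LFC or Bestman protocols; it only checks whether a TCP connection can be opened. The function name and any host passed to it are hypothetical; the port numbers 5010 and 8443 are the ones mentioned above.

```python
import socket

def port_reachable(host, port, timeout_s=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

Run from a worker node against the LFC host on port 5010 and the Bestman host on port 8443, this would show whether the firewall blocks either service.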


  • last week
    • Production running again at SWT2_CPB. Nominally okay.
    • Issue with OU - group ownerships in the catalog itself.
    • The directory in the LFC is /grid/atlas/users/pathena/. It was created by Paul during execution, so his DN owns the directory; the group would be atlas/role=production. Some users have the usatlas role. Was the correct proxy used? And how did the LFC allow this?
    • Note that LFC handles multiple identities, so it can use both roles.


  • last week
    • Work in progress: Paul is testing, and there are some problems. Permission issues are being sorted out.


  • starting


  • Determine schedule


Broken production credentials w/ DQ2

From Patrick: Hi,

We seem to be having some issues with our DQ2 system supporting SWT2_CPB_*. It is having problems registering data in the LFC using the DQ2 proxy because of the credentials used by something in the production system, either PandaMover or a job. I strongly suspect the latter (PandaID=18930642) based on what was created.

There have been 3.7 million error messages in the last 24 hours concerning registration problems for ~2400 files.

I suspect that I can fix the problem with an lfc-chgrp -R command, but ultimately there is either a problem with our configuration or there is a potential problem with the production system since it can seemingly start affecting DQ2 subscriptions.

Has anyone else seen something similar?

The details:

The production system created some LFC directories:


These directories are owned by the production cert with group of "/atlas/usatlas/Role=production" as seen in:

[atlasddm@gk06 ~]$ export LFC_HOST=gk02.atlas-swt2.org
[atlasddm@gk06 ~]$ lfc-getacl /grid/atlas/dq2/mc08_misal1_mc12
# file: /grid/atlas/dq2/mc08_misal1_mc12
# owner: /DC=org/DC=doegrids/OU=People/CN=Nurcan Ozturk 18551
# group: atlas/usatlas/Role=production
group::rwx              #effective:rwx

[atlasddm@gk06 ~]$ lfc-ls -ld /grid/atlas/dq2/mc08_misal1_mc12
drwxrwxr-x   2 109      109                       0 Nov 07 08:51 /grid/atlas/dq2/mc08_misal1_mc12

(Where 109 in the userinfo table is the production cert and 109 in the groupinfo table is atlas/usatlas/Role=production).

I did not think that this was supposed to happen, since the production system is supposedly using the "double proxy" configuration in which "atlas/Role=production" comes first, so that LFC entries would be owned by this group.

Meanwhile the credentials for DQ2 utilize "atlas/Role=production" as seen with:

[atlasddm@gk06 ~]$ voms-proxy-info -all

subject   : /DC=org/DC=doegrids/OU=People/CN=Patrick McGuigan 416226/CN=proxy/CN=proxy
issuer    : /DC=org/DC=doegrids/OU=People/CN=Patrick McGuigan 416226/CN=proxy
identity  : /DC=org/DC=doegrids/OU=People/CN=Patrick McGuigan 416226/CN=proxy
type      : unknown
strength  : 512 bits
path      : /opt/dq2/certs/dq2_proxy.pem
timeleft  : 92:52:27
=== VO atlas extension information ===
VO        : atlas
subject   : /DC=org/DC=doegrids/OU=People/CN=Patrick McGuigan 416226
issuer    : /DC=ch/DC=cern/OU=computers/CN=lcg-voms.cern.ch
attribute : /atlas/Role=production/Capability=NULL
attribute : /atlas/lcg1/Role=NULL/Capability=NULL
attribute : /atlas/Role=NULL/Capability=NULL
attribute : /atlas/usatlas/Role=NULL/Capability=NULL

This causes an incompatibility where the DQ2 credentials cannot create subdirectories in the LFC tree owned by the production credentials.
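
The mismatch can be made explicit by pulling the FQANs out of the `voms-proxy-info -all` output; in VOMS the first attribute listed is the primary FQAN. The sketch below is a minimal parser, assuming the `attribute : ...` line layout shown above; the function name `fqans` is hypothetical.

```python
def fqans(voms_output):
    """Return the FQANs from 'attribute : ...' lines, in listed order."""
    found = []
    for line in voms_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "attribute":
            found.append(value.strip())
    return found

# Attribute lines copied from the proxy listing above.
sample = """\
attribute : /atlas/Role=production/Capability=NULL
attribute : /atlas/lcg1/Role=NULL/Capability=NULL
attribute : /atlas/Role=NULL/Capability=NULL
attribute : /atlas/usatlas/Role=NULL/Capability=NULL
"""

primary = fqans(sample)[0]
```

Here `primary` is the `/atlas/Role=production` FQAN, and no entry in the list carries `/atlas/usatlas/Role=production`, the group that owns the directories.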

Failures in the LFC log look like:

11/10 14:22:15 10935,0 Cns_srv_mkdir: NS092 - mkdir request by /DC=org/DC=doegrids/OU=People/CN=Patrick McGuigan 416226 (101,102,103,101,104) from gk06.atlas-swt2.org
11/10 14:22:15 10935,0 Cns_srv_mkdir: NS098 - mkdir /grid/atlas/dq2/mc08_misal1_mc12/AOD/mc08_misal1_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v14022302_tid028350  775 22
11/10 14:22:15 10935,0 Cns_srv_mkdir: returns 13
11/10 14:22:15 10935,0 Cns_srv_addreplica: NS092 - addreplica request by /DC=org/DC=doegrids/OU=People/CN=Patrick McGuigan 416226 (101,102,103,101,104) from gk06.atlas-swt2.org
11/10 14:22:15 10935,0 Cns_srv_addreplica: NS098 - addreplica 2AA9C0DE-E5AE-DD11-8180-00A0D1E7FC70 gk03.atlas-swt2.org srm://gk03.atlas-sw
11/10 14:22:15 10935,0 Cns_srv_addreplica: returns 2
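
The mkdir failure is what a classic owner/group/other check would predict. The sketch below rests on several assumptions: it ignores any extended ACL entries, it treats the numbers in parentheses in the log as the requester's ids (first the uid, then the gids), and it assumes the return values follow POSIX errno numbering. The function name `can_write` is hypothetical.

```python
import errno

def can_write(mode_str, owner_uid, owner_gid, req_uid, req_gids):
    """Owner/group/other write check on an ls-style mode string."""
    perms = mode_str[1:10]  # drop the leading file-type character
    if req_uid == owner_uid:
        return perms[1] == "w"
    if owner_gid in req_gids:
        return perms[4] == "w"
    return perms[7] == "w"

# Parent directory from the listing above: drwxrwxr-x, uid 109, gid 109.
# Requester ids from the log line: (101,102,103,101,104).
allowed = can_write("drwxrwxr-x", 109, 109, 101, [102, 103, 101, 104])

# Symbolic names for the two return values seen in the log.
mkdir_error = errno.errorcode[13]
addreplica_error = errno.errorcode[2]
```

Under these assumptions the requester falls into the "other" class, which has no write bit, so `allowed` is false; on Linux the values 13 and 2 correspond to EACCES and ENOENT.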

Extraneous information that someone might find interesting:

[atlasddm@gk06 ~]$ lfc-ls -l /grid/atlas/dq2/mc08_misal1_mc12
drwxrwxr-x   1 109      109                       0 Nov 07 08:51 AOD
drwxrwxr-x   1 109      109                       0 Nov 07 08:51 log
[atlasddm@gk06 ~]$ lfc-ls -l /grid/atlas/dq2/mc08_misal1_mc12/AOD
drwxrwxr-x   1 109      109                       0 Nov 07 08:51 mc08_misal1_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v14022302_tid028349_sub02737957
[atlasddm@gk06 ~]$ lfc-ls -l /grid/atlas/dq2/mc08_misal1_mc12/AOD/mc08_misal1_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v14022302_tid028349_sub02737957
-rw-r--r--   1 109      109                71402430 Nov 07 08:51 AOD.028349._00006.pool.root.2

[atlasddm@gk06 ~]$ lfc-ls -l /grid/atlas/dq2/mc08_misal1_mc12/log
drwxrwxr-x   1 109      109                       0 Nov 07 08:51 mc08_misal1_mc12.005200.T1_McAtNlo_Jimmy.recon.log.v14022302_tid028349_sub02737959
[atlasddm@gk06 ~]$ lfc-ls -l /grid/atlas/dq2/mc08_misal1_mc12/log/mc08_misal1_mc12.005200.T1_McAtNlo_Jimmy.recon.log.v14022302_tid028349_sub02737959
-rw-r--r--   1 109      109                  672745 Nov 07 08:51 log.028349._00006.job.log.tgz.2

Any advice on how to fix this and, more importantly, how to prevent it in the future is appreciated.



-- RobertGardner - 11 Nov 2008

