TestingXRootD5

Summary of functional tests performed by LHC VOs on ceph-test-gw691.gridpp.rl.ac.uk and batch farm WNs

 

 

CMS testing

  • A file was written into ECHO

 

SSS authentication between xrootd@proxy and xrootd@ceph, which is in place on the external Echo gateways, caused TPC transfers (and tests) to fail, generating errors of the type:

    TPC job 199: [2021-06-09 09:29:28.429493 +0100][Debug ][XRootDTransport ] [ceph-dev-gw2.gridpp.rl.ac.uk:1095.0] Authentication is required: &P=sss,0.+13:/etc/grid-security/xrootd/sss.keytab.grp
    TPC job 199: [2021-06-09 09:29:28.429498 +0100][Debug ][XRootDTransport ] [ceph-dev-gw2.gridpp.rl.ac.uk:1095.0] Sending authentication data
    TPC job 199: [2021-06-09 09:29:28.431420 +0100][Debug ][XRootDTransport ] [ceph-dev-gw2.gridpp.rl.ac.uk:1095.0] Trying to authenticate using sss
    TPC job 199: [2021-06-09 09:29:28.431687 +0100][Error ][XRootDTransport ] [ceph-dev-gw2.gridpp.rl.ac.uk:1095.0] Authentication with sss failed: Invalid keyname specified.
    TPC job 199: [2021-06-09 09:29:28.431701 +0100][Error ][AsyncSock ] [ceph-dev-gw2.gridpp.rl.ac.uk:1095.0] Socket error while handshaking: [FATAL] Auth failed
    TPC job 199: [2021-06-09 09:29:28.431786 +0100][Error ][PostMaster ] [ceph-dev-gw2.gridpp.rl.ac.uk:1095] Unable to recover: [FATAL] Auth failed.
    TPC job 199: [2021-06-09 09:29:28.431832 +0100][Debug ][XRootD ] [ceph-dev-gw2.gridpp.rl.ac.uk:1095] Handling error while processing kXR_stat (path: alice:/00/01509/a9531020-c52f-11eb-9cd9-6c02e09897e9, flags: none): [FATAL] Auth failed.
    TPC job 199: [2021-06-09 09:29:28.431935 +0100][Debug ][ExDbgMsg ] [ceph-dev-gw2.gridpp.rl.ac.uk:1095] Calling MsgHandler: 0x44c9c700 (message: kXR_stat (path: alice:/00/01509/a9531020-c52f-11eb-9cd9-6c02e09897e9, flags: none) ) with status: [FATAL] Auth failed.
    TPC job 199: [2021-06-09 09:29:28.772792 +0100][Debug ][XRootDTransport ] [storage09.spacescience.ro:1094.0] Authentication is required: &P=unix
    TPC job 199: [2021-06-09 09:29:28.772796 +0100][Debug ][XRootDTransport ] [storage09.spacescience.ro:1094.0] Sending authentication data
    TPC job 199: [2021-06-09 09:29:28.773061 +0100][Debug ][XRootDTransport ] [storage09.spacescience.ro:1094.0] Trying to authenticate using unix
    TPC job 199: [2021-06-09 09:29:28.952324 +0100][Debug ][XRootDTransport ] [storage09.spacescience.ro:1094.0] Authenticated with unix.

Tried to set up a single XRootD process on ceph-dev-gw2 to get around the problem of this “spurious” SSS authorisation in TPC, but this resulted in breaking all ALICE tests. This is probably because, after the removal of all memory proxy directives (pss.XXX) from the config, access to the file catalogue (/etc/xrootd/storage.xml) is lost and the name-to-name mapping is disabled.

Reverted to running xrootd@proxy and xrootd@ceph to restore access to the file catalogue. As a result, the basic ALICE tests were successful again. Tried to remove the TPC script definition, but the file ended up being written to the local file system on the gateway. Subsequently, when all auth directives were removed from the xrootd@ceph config, the file was successfully written to the ALICE Ceph pool, but for some reason XRootD doesn’t complete the transfer and the client hangs without returning to the prompt.

Had a meeting with Costin to discuss the TPC issue. We found that xrdcp-tpc.sh is called by the server, but for some reason control is not returned from the script, which remains in a "zombie" state. Costin suggested feeding the PIDs of the xrootd server and of xrdcp-tpc.sh to gdb to see why this happens.

Updated the TPC script with James’s version running on the Echo gateways and also included the autorm option in the ofs.tpc directive.
As a result, TPC transfers are now working for ALICE EOS sources.

ALICE TPC still doesn’t work for XRootD native and dCache endpoints:

  • In the case of the XRootD native sources the error is
    [3005] [ERROR] Server responded with an error: [3010] tpc authorization expired (source)
  • In the case of dCache sources the error is
    [3005] [ERROR] Server responded with an error: [3010] An authorization token is required for this request

In both cases, we see this error (7 times in the case of Ceph and only once in the case of dCache):

    Message kXR_stat (handle: 0x00000000, flags: none) returned with [ERROR] Server responded with an error: [3011] Unable to get state for alice:/05/30011/802c7880-3014-11ec-b8b3-0242ee26aa8b; no such file or directory
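The XRootD client debug lines quoted in this section share a fixed layout: timestamp, level, topic, endpoint, message. When comparing TPC attempts against several endpoints, it can help to count the Error-level lines per host. The following is a minimal sketch, assuming that layout holds for every line of interest; the function names parse_line and errors_by_host are illustrative and not part of any XRootD tool.

```python
import re
from collections import Counter

# Layout assumed from the log excerpts in this section:
# [timestamp][Level ][Topic ] [host:port(.stream)] message
LINE_RE = re.compile(
    r"\[(?P<timestamp>[^\]]+)\]"    # [2021-06-09 09:29:28.431687 +0100]
    r"\[(?P<level>\w+)\s*\]"        # [Error ]
    r"\[(?P<topic>\w+)\s*\]"        # [XRootDTransport ]
    r"\s+\[(?P<endpoint>[^\]]+)\]"  # [ceph-dev-gw2.gridpp.rl.ac.uk:1095.0]
    r"\s+(?P<message>.*)"
)


def parse_line(line):
    """Return the fields of one debug line as a dict, or None if it does not match."""
    match = LINE_RE.search(line)
    return match.groupdict() if match else None


def errors_by_host(lines):
    """Count Error-level lines per host (port and stream suffix are dropped)."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record and record["level"] == "Error":
            host = record["endpoint"].split(":")[0]
            counts[host] += 1
    return counts
```

Passing the lines of a saved TPC log to errors_by_host gives a quick view of which endpoint the failures are attributed to.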

CephsumScript

Testing of various upload methods, whether the checksum value is recorded, and what happens

XrdCks.adler32 (timeout 1st attempt)
gfal-copy | root | xrootd.echo.stfc.ac.uk | no | gfal-sum | gsiftp | Error returned to client [1]; no checksum stored
gfal-copy | root | xrootd.echo.stfc.ac.uk | no | xrdadler32 | root | yes (slow)
xrdcp | root | xrootd.echo.stfc.ac.uk | no | N/A | N/A | N/A
xrdcp -C adler32 | root | xrootd.echo.stfc.ac.uk | yes | N/A | N/A | N/A
xrdcp -C adler32:print | root | xrootd.echo.stfc.ac.uk | yes | N/A | N/A | N/A
xrdcp -C adler32:source | root | xrootd.echo.stfc.ac.uk | no | N/A | N/A | N/A
xrdcp -C adler32:aabbccdd | root | xrootd.echo.stfc.ac.uk | yes; stores correct checksum, client returns bad checksum error | N/A | N/A | N/A
FTS | ? | xrootd.echo.stfc.ac.uk | ? | ? | ? | ? | N/A | -

Errors
[1]
gfal-sum error: 70 (Communication error on send) - globus_ftp_client: the server responded with an error 500 Cannot find checksum for /dteam:test1/domatest/jwalder/test_cks error: No data available

XrdCks metadata; i.e. this confirms that the checksum is recalculated on request in the current XRootD RAL configuration.

gfal-sum with the gsiFTP:// protocol will fail if no checksum is available (is that the source of the current DOMA TPC checksum 500 errors?)

The XrdCksData class specifies the following format for XrdCks.adler32 type data:

    char      Name[NameSize];   // Checksum algorithm name
    long long fmTime;           // Out: File's mtime when checksum was computed.
    int       csTime;           // Delta from fmTime when checksum was computed.
    short     Rsvd1;            // Reserved field
    char      Rsvd2;            // Reserved field
    char      Length;           // Length, in bytes, of the checksum value
    char      Value[ValuSize];  // The binary checksum value

NameSize = 16 (Name is allowed max size 15)
ValuSize = 64 (for 512bit max size)

Endian format affects just fmTime and csTime (others appear to be byte arranged).
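To make the layout concrete, the sketch below unpacks a raw XrdCks.adler32 record with Python's struct module. It rests on assumptions that the text does not confirm: that fmTime and csTime are stored big-endian (network byte order; pass byte_order="<" if that turns out to be wrong), and that the record has no padding beyond the fields listed, which gives 96 bytes. The function name parse_xrdcks is illustrative, and how the bytes are read from the object's extended attribute is left out.

```python
import struct

NAME_SIZE = 16  # NameSize, as given in the text
VALU_SIZE = 64  # ValuSize, as given in the text


def parse_xrdcks(blob, byte_order=">"):
    """Decode one XrdCksData record from raw bytes.

    byte_order=">" (big-endian) is an assumption; only fmTime and csTime
    depend on it, the remaining fields are plain byte arrays or single bytes.
    """
    # Name, fmTime, csTime, Rsvd1, Rsvd2, Length, Value
    layout = struct.Struct(f"{byte_order}{NAME_SIZE}sqihbB{VALU_SIZE}s")
    if len(blob) < layout.size:
        raise ValueError(f"need {layout.size} bytes, got {len(blob)}")
    name, fm_time, cs_time, _rsvd1, _rsvd2, length, value = layout.unpack(
        blob[: layout.size]
    )
    if length > VALU_SIZE:
        raise ValueError(f"checksum length {length} exceeds {VALU_SIZE}")
    return {
        "name": name.split(b"\0", 1)[0].decode("ascii"),  # NUL-terminated name
        "fm_time": fm_time,                 # file mtime when checksum was computed
        "cs_time": cs_time,                 # delta from fm_time
        "length": length,                   # checksum length in bytes
        "value_hex": value[:length].hex(),  # checksum bytes in stored order
    }
```

For an adler32 record, length should be 4 and value_hex the eight hex digits of the checksum.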

DeletionsTesting

XrootD. This section reports the observations, tests, and proposed solutions to enable such stuck files to be successfully deleted.

XrdCeph:

    int ceph_posix_unlink(XrdOucEnv* env, const char *pathname) {
      logwrapper((char*)"ceph_posix_unlink : %s", pathname);
      // minimal stat : only size and times are filled
      CephFile file = getCephFile(pathname, env);
      libradosstriper::RadosStriper *striper = getRadosStriper(file);
      if (0 == striper) {
        return -EINVAL;
      }
      return striper->remove(file.name);
    }

 

https://github.com/snafus/xrootd-ceph/commit/1bd3040718814e65683d83da94adb26c2d17de5e (Scenario 1): if the removal returns EBUSY (which implies the object is likely to be locked), remove the xattr lock.striper.lock and try the removal again.
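The control flow of Scenario 1 can be sketched as follows. This is not the code from the commit: it is a Python illustration against a hypothetical in-memory stand-in (FakeStriper) whose remove and rmxattr methods return negative errno values in the style of the C++ call above. Which RADOS object actually carries the lock.striper.lock xattr is not modelled here.

```python
import errno

LOCK_XATTR = "lock.striper.lock"  # xattr name taken from the text


def unlink_with_lock_retry(striper, name):
    """Remove `name`; on EBUSY, drop the lock xattr and retry once."""
    rc = striper.remove(name)
    if rc != -errno.EBUSY:
        return rc  # success, or an error unrelated to locking
    rc = striper.rmxattr(name, LOCK_XATTR)
    if rc != 0:
        return rc  # could not clear the lock, report that error
    return striper.remove(name)


class FakeStriper:
    """Hypothetical stand-in for a striper, for illustration only."""

    def __init__(self):
        self.objects = {}  # object name -> set of xattr names

    def remove(self, name):
        if name not in self.objects:
            return -errno.ENOENT
        if LOCK_XATTR in self.objects[name]:
            return -errno.EBUSY  # a stale lock blocks removal
        del self.objects[name]
        return 0

    def rmxattr(self, name, xattr):
        if name not in self.objects or xattr not in self.objects[name]:
            return -errno.ENOENT
        self.objects[name].discard(xattr)
        return 0
```

The retry is attempted only once, so a lock that is re-taken by a live client between the two calls still surfaces as EBUSY rather than looping.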

XrdCephPosix Library

StandaloneXrdCephPosix describes approaches for invoking and testing methods in the XrdCephPosix library independently of the XRootD client and server.