The latest blah blah about BLAHP.

a brief update.

Massimo Mezzadri, Francesco Prelz, David Rebatto

INFN, sezione di Milano

Summary

WTH is BLAH(P) ?

  • The BLAHP is like the GAHP, but it's used to locally transfer a job to a 3rd party batch system and control it.

Why BLAH(P) ?

Update (1)

Update (2)

  • Now, gLite is gone, and the good news is that we survived!

In the meantime

Real life issues - specimen #1 (1)

Real life issues - specimen #1 (2)

  • Use of the BLAH registry can be enabled by configuration and tries to be as transparent as possible.

Real life issues - specimen #2

  • The dilemma: telling a fast, completed job from a lost, disappeared job.
  • 'History' commands are heavy for all batch systems, and there can be disruptive surprises.
  • Most batch systems can be configured to keep terminated jobs in their active job database for a while.
  • Still, for reasons that would deserve some investigation by Platform, LSF's command line tools (e.g. bjobs) can be up to four times slower than obtaining the same information via the lsbatch API!
  • Ulrich Schwickerath, a friend at CERN, maintains a few API-based tools for LSF that can be used in conjunction with BLAH. They are available in the cernops/info.dynamic-scheduler-lsf GitHub package.
  • But one needs to know that such options exist: this knowledge has yet to be gathered in one place.
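The dilemma above can be illustrated with a minimal Python sketch. It is not BLAH's actual code: `classify_job`, `active_lookup` and `history_lookup` are hypothetical names, and the status strings are assumptions. The idea shown is only that the cheap query against the active job database is tried first, and the heavy 'history' query is issued only when the job is no longer found there.

```python
from enum import Enum
from typing import Callable, Optional


class JobState(Enum):
    ACTIVE = "active"
    COMPLETED = "completed"
    LOST = "lost"


def classify_job(
    job_id: str,
    active_lookup: Callable[[str], Optional[str]],
    history_lookup: Callable[[str], bool],
) -> JobState:
    """Tell a fast, completed job from a lost one (illustrative only).

    active_lookup:  hypothetical cheap query; returns a status string
                    ("running", "done", ...) or None if the job is unknown.
    history_lookup: hypothetical expensive query; returns True if the
                    batch system's history records the job.
    """
    status = active_lookup(job_id)
    if status is not None:
        # The batch system still lists the job, possibly because it was
        # configured to keep terminated jobs around for a while.
        return JobState.COMPLETED if status == "done" else JobState.ACTIVE

    # Only now pay the price of the heavy 'history' command.
    if history_lookup(job_id):
        return JobState.COMPLETED
    return JobState.LOST
```

If the batch system keeps terminated jobs in its active database long enough, the history branch is rarely reached, which is the point of the configuration advice above.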

In the pipeline

  • With just a fraction of three people at hand, not much is boiling in terms of new features:
    1. A high-availability mode, where updates to the job registry are shared (optionally via multicast) among a pool of BLAH servers, so that all of them can service requests on the same set of jobs. It is in the code, but it hasn't seen thorough testing yet.
    2. Better, configurable logging (beyond what is exchanged on the command line and/or logged to stdout and stderr).
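As an illustration of the first item, here is a small Python sketch of how a pool of servers could converge on the same registry contents from shared updates. Everything in it is an assumption made for the example: the `RegistryUpdate` fields, the JSON encoding and the "newest timestamp wins" rule are not taken from the BLAH code, and the network transport (multicast or otherwise) is left out entirely.

```python
import json
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RegistryUpdate:
    """Hypothetical registry record shared among servers."""
    job_id: str
    status: str
    timestamp: float


def encode_update(update: RegistryUpdate) -> bytes:
    """Serialize an update for the wire (the format is an assumption)."""
    return json.dumps(
        {
            "job_id": update.job_id,
            "status": update.status,
            "timestamp": update.timestamp,
        }
    ).encode("utf-8")


def decode_update(data: bytes) -> RegistryUpdate:
    """Inverse of encode_update."""
    fields = json.loads(data.decode("utf-8"))
    return RegistryUpdate(
        fields["job_id"], fields["status"], float(fields["timestamp"])
    )


def apply_update(
    registry: Dict[str, RegistryUpdate], update: RegistryUpdate
) -> bool:
    """Apply an update unless the registry already holds a newer or equal one.

    Returns True if the registry changed. Because stale updates are
    ignored, servers receiving the same updates in a different order
    end up with the same registry.
    """
    current = registry.get(update.job_id)
    if current is not None and current.timestamp >= update.timestamp:
        return False
    registry[update.job_id] = update
    return True
```

Ignoring stale updates makes delivery order irrelevant, which matters with a transport that does not guarantee ordering. Lost messages are a separate problem that this sketch does not address.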

Conclusions (1)

  • We could not keep it as simple and dumb as we hoped, but the BLAH(P) is still around, now pushing around the majority of WLCG jobs, and possibly a few Higgs bosons here and there.
  • Simple semantics provided (as usual) the ability to compose new tools based on any real need at hand.
  • However, pushing the scale factor up to WLCG center production needs required measures that are normally not enabled when BLAH(P) is used from within HTCondor.
  • These can be enabled if needed, but knowledge and documentation are still not gathered in a single place, as they mostly come from people with firsthand expertise in managing the various batch systems.
    • Now that we don't depend on decisions and priorities set by upper-tier projects, we should perhaps start populating the GitHub wiki.

Conclusions (2)

  • BLAH doesn't have much of a roadmap ahead, but we are as usual open to suggestions, requests (feature or support), and to directly provide info beyond the scarce documentation:

    blah@mi.infn.it
  • Thank you for your time. Hope it wasn't too BLAH.