CBC Analysis (Offline)¶
To start an offline CBC analysis, you’ll need a configuration file that points at the start/end times to analyze, the input data products (e.g. template bank, mass model), and any other workflow-related configuration needed.
All of the steps below assume a Singularity container with the GstLAL software stack installed. Other installation methods follow a similar procedure, with one caveat: those workflows will not work on the Open Science Grid (OSG).
For a DAG on the OSG IGWN grid, you must use a Singularity container on cvmfs, set the profile in config.yml to osg, and make sure to submit the DAG from an OSG node. Otherwise the workflow is the same.
When running without a Singularity container, the commands below should be modified accordingly (e.g. run gstlal_inspiral_workflow init -c config.yml instead of singularity exec <image> gstlal_inspiral_workflow init -c config.yml).
For the ICDS gstlalcbc shared accounts, the contents of env.sh must be changed, and instead of running $ X509_USER_PROXY=/path/to/x509_proxy ligo-proxy-init -p albert.einstein you should run source env.sh. (Details are below.)
Running Workflows¶
1. Build Singularity image (optional)¶
NOTE: If you are using a reference Singularity container (suitable in most cases), you can skip this step. The <image> throughout this doc refers to the singularity-image specified in the condor section of your configuration.
If not using the reference Singularity container, say for local development, you can specify a path to a local container and use that for the workflow (non-OSG).
To pull a container with gstlal installed, run:
$ singularity build --sandbox --fix-perms <image-name> docker://containers.ligo.org/lscsoft/gstlal:master
To use a branch other than master, replace master in the above command with the name of the desired branch. To use a custom build instead, gstlal will need to be installed into the container from your modified source code; for installation instructions, see the installation page.
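For example, to build from a (hypothetical) branch named my-feature-branch rather than master:
$ singularity build --sandbox --fix-perms <image-name> docker://containers.ligo.org/lscsoft/gstlal:my-feature-branch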
2. Set up workflow¶
First, we create a new analysis directory and switch to it:
$ mkdir <analysis-dir>
$ cd <analysis-dir>
$ mkdir bank mass_model idq dtdphi
Default configuration files and environment (env.sh) for a variety of different banks are contained in the offline-configuration repository. One can run the commands below to grab the configuration files, or clone the repository and copy the files as needed into the analysis directory.
To download data files (mass model, template banks) that may be needed for offline runs, see the README in the offline-configuration repo. Move the template bank(s) into bank and the mass model into mass_model.
For example, to grab all the relevant files for a small BNS dag:
$ curl -O https://git.ligo.org/gstlal/offline-configuration/-/raw/main/configs/bns-small/config.yml
$ curl -O https://git.ligo.org/gstlal/offline-configuration/-/raw/main/env.sh
$ source /cvmfs/oasis.opensciencegrid.org/ligo/sw/conda/etc/profile.d/conda.sh
$ conda activate igwn
$ dcc archive --archive-dir=. --files -i T2200318-v2
$ conda deactivate
Then move the template bank, mass model, idq file, and dtdphi file into their corresponding directories.
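For example, assuming the file names used in the configuration examples below (adjust to match whatever files you actually downloaded):
$ mv gstlal_bank_small.xml.gz bank/
$ mv mass_model_small.h5 mass_model/
$ mv H1L1-IDQ_TIMESERIES-1239641219-692847.h5 idq/
$ mv inspiral_dtdphi_pdf.h5 dtdphi/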
When running an analysis on the ICDS cluster in the gstlalcbc shared account, the contents of env.sh must be changed to what is given below. In addition, where the tutorial says to run ligo-proxy-init -p, instead run source env.sh on the modified env.sh.
When running in non-gstlalcbc shared accounts on ICDS, or when running on other clusters, env.sh does not need to be modified, and ligo-proxy-init -p can be run as in the tutorial.
Now, we’ll need to modify the configuration as needed to run the analysis. At the very least, setting the start/end times and the instruments to run over:
start: 1187000000
stop: 1187100000
instruments: H1L1
Ensure the template bank, mass model, idq file, and dtdphi file are pointed to in the configuration:
data:
  template-bank: bank/gstlal_bank_small.xml.gz

prior:
  mass-model: mass_model/mass_model_small.h5
  idq-timeseries: idq/H1L1-IDQ_TIMESERIES-1239641219-692847.h5
  dtdphi: dtdphi/inspiral_dtdphi_pdf.h5
If you’re creating a summary page for results, you’ll need to point at a location where they are web-viewable:
summary:
  webdir: ~/public_html/
If you’re running on LIGO compute resources and your local username doesn’t match your albert.einstein username, you’ll also need to specify the accounting group user so that condor can track accounting information:
condor:
  accounting-group-user: albert.einstein
In addition, update the singularity-image in the condor section of your configuration if needed:
condor:
  singularity-image: /cvmfs/singularity.opensciencegrid.org/lscsoft/gstlal:master
If not using a reference Singularity image, you can replace this with the full path to a local singularity container <image>.
For more detailed configuration options, take a look at the configuration section below.
If you haven’t installed site-specific profiles yet (per-user), you can run:
$ singularity exec <image> gstlal_grid_profile install
which will install configurations that are site-specific, i.e. ldas and icds.
You can select which profile to use in the condor section:
condor:
  profile: ldas
For an OSG IGWN grid run, use osg.
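For example, the condor section would then contain:
condor:
  profile: osg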
To view which profiles are available, you can run:
$ singularity exec <image> gstlal_grid_profile list
Note that you can install custom profiles as well (see Installing Custom Site Profiles below).
Once you have the configuration, data products, and grid profiles installed, you can set up the Makefile using the configuration, which we’ll then use for everything else, including the data file needed for the workflow, the workflow itself, the summary page, etc.
$ singularity exec <image> gstlal_inspiral_workflow init -c config.yml
By default, this will generate the full workflow. If you want to only run the filtering step, a rerank, or an injection-only workflow, you can instead specify the workflow as well, e.g.
$ singularity exec <image> gstlal_inspiral_workflow init -c config.yml -w injection
for an injection-only workflow.
If you already have a Makefile and need to update it based on an updated configuration, run gstlal_inspiral_workflow with --force.
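For example (assuming the same init invocation as above), to regenerate the Makefile from a modified configuration:
$ singularity exec <image> gstlal_inspiral_workflow init -c config.yml --force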
Next, if you are accessing non-public (i.e. non-GWOSC) data, you’ll need to set up your proxy to ensure you can get access to LIGO data:
$ X509_USER_PROXY=/path/to/x509_proxy ligo-proxy-init -p albert.einstein
Note that we are running this step outside of Singularity, because ligo-proxy-init is not currently installed within the image.
If you are running on the ICDS gstlalcbc shared account, do not run the command
above.
Instead, run:
$ source env.sh
Also update the configuration accordingly (if needed):
source:
  x509-proxy: /path/to/x509_proxy
Finally, set up the rest of the workflow including the DAG for submission:
$ singularity exec -B $TMPDIR <image> make dag
This should create condor DAGs for the workflow. If running on the OSG IGWN grid, make sure to submit the DAGs from an OSG node. Mounting a temporary directory ($TMPDIR) is important, as some of the steps use temporary space to generate files.
If you want to see detailed error messages, add PYTHONUNBUFFERED=1 to the environment line in the submit (*.sub) files by running:
$ sed -i '/^environment = / s/\"$/ PYTHONUNBUFFERED=1\"/' *.sub
3. Launch workflows¶
$ source env.sh
$ make launch
This is simply a thin wrapper around condor_submit_dag launching the DAG in question.
You can monitor the DAG with Condor CLI tools such as condor_q, or by running tail -f full_inspiral_dag.dag.dagman.out.
4. Generate Summary Page¶
After the DAG has completed, you can generate the summary page for the analysis:
$ singularity exec <image> make summary
To make an open-box page after this, run:
$ make unlock
Configuration¶
The top-level configuration consists of the analysis times and detector configuration:
start: 1187000000
stop: 1187100000
instruments: H1L1
min-instruments: 1
These set the start and stop GPS times of the analysis, plus the detectors to use (H1=Hanford, L1=Livingston, V1=Virgo). There is a nice online converter for GPS times here: https://www.gw-openscience.org/gps/. You can also use the program gpstime. Note that these start and stop times have no knowledge about science-quality data; the actual science-quality data that are analyzed are typically a subset of the total time. Information about which detectors were on at different times is available here: https://www.gw-openscience.org/data/.
min-instruments sets the minimum number of instruments we will allow to form an event, e.g. setting it to 1 means the analysis will consider single-detector events, while 2 means we will only consider events that are coincident across at least 2 detectors.
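For example, to only form events that are coincident across two or more detectors, set:
min-instruments: 2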
Section: Data¶
data:
  template-bank: bank/gstlal_bank_small.xml.gz
  analysis-dir: /path/to/analysis/dir
The template-bank option points to the template bank file. These are xml files that follow the LIGOLW (LIGO light weight) schema. The template bank in particular contains a table that lists the parameters of all of the templates; it does not contain the actual waveforms themselves. Metadata such as the waveform approximant and the frequency cutoffs are also listed in this file.
The analysis-dir option is used if the user wishes to point to an existing analysis to perform a rerank or an injection-only workflow. Existing files are grabbed from this directory to seed the rerank/injection workflows.
One can use multiple sub template banks. In this case, the configuration might look like:
data:
  template-bank:
    bns: bank/sub_bank/bns.xml.gz
    nsbh: bank/sub_bank/nsbh.xml.gz
    bbh_1: bank/sub_bank/bbh_low_q.xml.gz
    bbh_2: bank/sub_bank/other_bbh.xml.gz
    imbh: bank/sub_bank/imbh_low_q.xml.gz
Section: Source¶
source:
  data-source: frames
  data-find-server: datafind.gw-openscience.org
  frame-type:
    H1: H1_GWOSC_O2_16KHZ_R1
    L1: L1_GWOSC_O2_16KHZ_R1
  channel-name:
    H1: GWOSC-16KHZ_R1_STRAIN
    L1: GWOSC-16KHZ_R1_STRAIN
  sample-rate: 4096
  frame-segments-file: segments.xml.gz
  frame-segments-name: datasegments
  x509-proxy: x509_proxy
The data-find-server option points to a server that is queried to find the location of frame files. The address shown above is a publicly available server that will return the locations of public frame files on cvmfs. Each frame file has a type that describes the contents of the frame file, and may contain multiple channels of data, hence the channel names must also be specified.
frame-segments-file points to a LIGOLW xml file that describes the actual times to analyze, i.e. it lists the times that science-quality data are available. These files are generalized enough that they could describe different types of data, so frame-segments-name is used to specify which segments to consider. In practice, the segments file we produce will only contain the segments we want. Users will typically not change any of these options once they are set for a given instrument and observing run. x509-proxy is the path to your X.509 proxy.
Section: Segments¶
The segments section specifies how to generate segments and vetoes for the workflow. There are two backends that determine where segments and vetoes are queried from: gwosc (public) and dqsegdb (authenticated).
An example configuration with the gwosc backend looks like:
segments:
  backend: gwosc
  vetoes:
    category: CAT1
Here, the backend is set to gwosc so both segments and vetoes are determined by querying the GWOSC server. There is no additional configuration needed to query segments, but for vetoes, we also need to specify the category used for vetoes. This can be one of CAT1, CAT2, or CAT3. By default, segments are generated by applying CAT1 vetoes, as recommended by the Detector Characterization group.
An example configuration with the dqsegdb backend looks like:
segments:
  backend: dqsegdb
  science:
    H1: DCS-ANALYSIS_READY_C01:1
    L1: DCS-ANALYSIS_READY_C01:1
    V1: ITF_SCIENCE:2
  vetoes:
    category: CAT1
    veto-definer:
      file: H1L1V1-HOFT_C01_V1ONLINE_O3_CBC.xml
      version: O3b_CBC_H1L1V1_C01_v1.2
      epoch: O3
Here, the backend is set to dqsegdb so both segments and vetoes are determined by querying the DQSEGDB server. To query segments, one needs to specify, per instrument, the flag used to query segments. For vetoes, we need to specify the category used for vetoes, as with the gwosc backend. Additionally, a veto definer file is used to determine which flags are used for which veto categories. The veto definer file need not be provided locally; the file, version, and epoch entries fully specify how to access the veto definer file used for generating vetoes.
Section: PSD¶
psd:
  fft-length: 8
  sample-rate: 4096
The PSD estimation method used by GstLAL is a modified median-Welch method that is described in detail in Section IIB of Ref [1]. The FFT length sets the length of each section that is Fourier transformed. The default whitener will use zero-padding of one-fourth the FFT length on either side and will overlap Fourier-transformed segments by one-fourth the FFT length. For example, an fft-length of 8 means that each Fourier-transformed segment used in the PSD estimation (and consequently the whitener) will contain 4 seconds of data with 2 seconds of zero padding on either side, and will overlap the next segment by 2 seconds (i.e. the last two seconds of data in one segment will be the first two seconds of data in the following window).
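As a further (hypothetical) illustration of the same arithmetic, halving the FFT length halves all of these quantities:
psd:
  fft-length: 4    # 2 s of data per segment, 1 s of zero padding on either side, 1 s overlap
  sample-rate: 4096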
Section: SVD¶
svd:
  f-low: 20.0
  num-chi-bins: 1
  sort-by: mchirp
  approximant:
    - 0:1.73:TaylorF2
    - 1.73:1000:SEOBNRv4_ROM
  tolerance: 0.9999
  max-f-final: 1024.0
  num-split-templates: 200
  overlap: 30
  num-banks: 5
  samples-min: 2048
  samples-max-64: 2048
  samples-max-256: 2048
  samples-max: 4096
  autocorrelation-length: 701
  max-duration: 128
  manifest: svd_manifest.json
f-low sets the lower frequency cutoff for the analysis in Hz.
num-chi-bins is a tunable parameter related to the template bank binning procedure; specifically, it sets the number of effective-spin-parameter bins to use in the chirp-mass / effective-spin binning procedure described in Sec. IID and Fig. 6 of [1].
sort-by selects the template sort column. This controls how to bin the bank into sub-banks suitable for the SVD decomposition. It can be mchirp (sorts by chirp mass), mu (sorts by the mu1 and mu2 coordinates), or template_duration (sorts by template duration).
approximant specifies the waveform approximant that should be used, along with the chirp-mass bounds within which to use it. For example, 0:1000:TaylorF2 means use the TaylorF2 approximant for waveforms from systems with chirp masses between 0 and 1000 solar masses. Multiple waveforms and chirp-mass bounds can be provided.
tolerance is a tunable parameter related to the truncation of SVD basis vectors. A tolerance of 0.9999 means the targeted matched-filter inner product of the original waveform and the waveform reconstructed from the SVD is 0.9999.
max-f-final sets the maximum frequency of the templates.
num-split-templates, overlap, and num-banks are tunable parameters related to the SVD process. num-split-templates sets the number of templates to decompose at a time; overlap sets the number of templates from adjacent template bank regions to pad to the region being considered in order to actually compute the SVD (this helps the performance of the SVD, and these pad templates are not reconstructed); num-banks sets the number of sets of decomposed templates to include in a given bin for the analysis. For example, num-split-templates of 200, overlap of 30, and num-banks of 5 means that each SVD bank file will contain 5 decomposed sets of 200 templates, where the SVD was computed using an additional 15 templates on either side of the 200 (as defined by the binning procedure).
samples-min, samples-max-64, samples-max-256, and samples-max are tunable parameters related to the template time-slicing procedure used by GstLAL (described in Sec. IID and Fig. 7 of Ref. [1], and references therein). Templates are sliced in time before the SVD is applied, and only sampled at the rate necessary for the highest frequency in each time slice (rounded up to a power of 2). For example, the low-frequency part of a waveform may only be sampled at 32 Hz, while the high-frequency part may be sampled at 2048 Hz (depending on user settings). samples-min sets the minimum number of samples to use in any time slice. samples-max sets the maximum number of samples to use in any time slice with a sample rate below 64 Hz; samples-max-64 sets the maximum number of samples to use in any time slice with sample rates between 64 Hz and 256 Hz; samples-max-256 sets the maximum number of samples to use in any time slice with a sample rate greater than 256 Hz.
autocorrelation-length sets the number of samples to use when computing the autocorrelation-based test statistic, described in IIIC of Ref [1].
max-duration sets the maximum template duration in seconds; one can choose not to use max-duration.
manifest sets the name of a file that will contain metadata about the template bank bins.
If one uses multiple sub template banks, SVD configurations can be specified for each sub template bank. Reference mario config.
Users will typically not change these options.
Section: Filter¶
filter:
  fir-stride: 1
  min-instruments: 1
  coincidence-threshold: 0.01
  ht-gate-threshold: 0.8:15.0-45.0:100.0
  veto-segments-file: vetoes.xml.gz
  time-slide-file: tisi.xml
  injection-time-slide-file: inj_tisi.xml
  time-slides:
    H1: 0:0:0
    L1: 0.62831:0.62831:0.62831
  injections:
    bns:
      file: bns_injections.xml
      range: 0.01:1000.0
fir-stride is a tunable parameter related to the matched-filter procedure, setting the length in seconds of the output of the matched-filter element.
coincidence-threshold is the time in seconds to add to the light-travel time when searching for coincidences between detectors.
ht-gate-threshold sets the h(t) gate threshold as a function of chirp mass. The h(t) gate threshold is a value above which the output of the whitener, plus some padding, will be set to zero (as described in IIC of Ref. [1]). 0.8:15.0-45.0:100.0 means that a template bank bin that has a maximum chirp-mass template of 0.8 solar masses will use a gate threshold of 15, a bank bin with a maximum chirp mass of 100 will use a threshold of 45, and all other thresholds are described by a linear function between those two points.
veto-segments-file sets the name of a LIGOLW xml file that contains any vetoes used for the analysis, even if there are no vetoes.
time-slide-file and injection-time-slide-file are LIGOLW xml files that describe any time slides used in the analysis. A typical analysis will only analyze injections with the zerolag “time slide” (i.e. the data are not slid in time), and will consider the zerolag and one other time slide for the non-injection analysis. The time slide is used to perform a blind sanity check of the noise model.
injections lists a set of injections, each with their own label. In this example, there is only one injection set, and it is labeled “bns”. file is a relative path to the injection file (a LIGOLW xml file that contains the parameters of the injections, but not the actual waveforms themselves). range sets the chirp-mass range that should be considered when searching for this particular set of injections. Multiple injection files can be provided, each with their own label, file, and range.
The only option here that a user will normally interact with is the injections option.
When using multiple sub template banks, replace bns: under injections: with inj:, as in the sketch below.
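As a rough sketch (reusing the file and range from the single-bank example above), the injections subsection of the filter section might then look like:
filter:
  injections:
    inj:
      file: bns_injections.xml
      range: 0.01:1000.0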
Section: Injections¶
injections:
  sets:
    expected-snr:
      f-low: 15.0
    bns:
      f-low: 14.0
      seed: 72338
      time:
        step: 32
        interval: 1
        shift: 0
      waveform: SpinTaylorT4threePointFivePN
      mass-distr: componentMass
      mass1:
        min: 1.1
        max: 2.8
      mass2:
        min: 1.1
        max: 2.8
      spin1:
        min: 0
        max: 0.05
      spin2:
        min: 0
        max: 0.05
      distance:
        min: 10000
        max: 80000
      spin-aligned: True
      file: bns_injections.xml
The sets subsection is used to create injection sets to be used within the analysis and referenced by name in the filter section. In sets, the injections are grouped by key. In this case, there is one bns injection set, which creates the bns_injections.xml file used in the injections subsection of the filter section.
For multiple injection sets, the block under bns: should be repeated for each injection set. Reference mario config.
Besides creating injection sets, the expected-snr subsection is used for the expected SNR jobs. These settings are used to override defaults as needed.
spin-aligned specifies whether the injections should have (mis)aligned spins (if spin-aligned: True) or precessing spins (if spin-aligned: False).
In the case of multiple injection sets that need to be combined, one can add a few options to create a combined file and reference that within the filter jobs. This can be useful for large banks with a large set of templates. To do this, one can add the following:
injections:
  combine: true
  combined-file: combined_injections.xml
The injections created are generated by the lalapps_inspinj program, with the following mapping between configuration and command line options:
- f-low: --f-lower
- seed: --seed
- time section: --time-step, --time-interval. shift adjusts the start time appropriately.
- waveform: --waveform
- mass-distr: --m-distr
- mass/spin/distance sections: map to options like --min-mass1
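As a rough, hedged illustration only, the bns example above corresponds approximately to an invocation along these lines (the workflow assembles the full command itself, and additional options such as the spin, distance, and GPS-time ranges are also passed but omitted here):
$ lalapps_inspinj \
    --f-lower 14.0 \
    --seed 72338 \
    --time-step 32 \
    --time-interval 1 \
    --waveform SpinTaylorT4threePointFivePN \
    --m-distr componentMass \
    --min-mass1 1.1 --max-mass1 2.8 \
    --min-mass2 1.1 --max-mass2 2.8 \
    --output bns_injections.xml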
Section: Prior¶
prior:
  mass-model: mass_model/mass_model_small.h5
mass-model is a relative path to the file that contains the mass model. This model is used to weight templates appropriately when assigning ranking statistics, based on our understanding of the astrophysical distribution of signals. Users will not typically change this option.
An optional dtdphi-file and idq-timeseries can be provided here. If not given, a default model (included in the standard installation) will be used.
The dtdphi file specifies a probability distribution function for the probability of measuring a given time shift and phase shift between multiple detector observations. It enters into the ranking statistic.
The idq file gives information about the data quality around the time of coalescence.
If specifying idq and dtdphi files, create an idq and a dtdphi directory in the <analysis-dir>, and put the idq and dtdphi files in the respective directories.
Reference mario config.
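For example, pointing at the files used earlier in this tutorial (and using the same keys as the earlier configuration example), the prior section might look like:
prior:
  mass-model: mass_model/mass_model_small.h5
  idq-timeseries: idq/H1L1-IDQ_TIMESERIES-1239641219-692847.h5
  dtdphi: dtdphi/inspiral_dtdphi_pdf.h5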
Section: Rank¶
rank:
  ranking-stat-samples: 4194304
ranking-stat-samples sets the number of samples to draw from the noise model when computing the distribution of log likelihood-ratios (the ranking statistic) under the noise hypothesis. Users will not typically change this option.
Section: Summary¶
summary:
  webdir: /path/to/public_html/folder
webdir sets the path of the output results webpages produced by the analysis. Users will typically change this option for each analysis.
Section: Condor¶
condor:
  profile: osg-public
  accounting-group: ligo.dev.o3.cbc.uber.gstlaloffline
  accounting-group-user: <albert.einstein>
  singularity-image: <image>
profile sets a base level of configuration options for condor.
accounting-group sets accounting group details on LDG resources. Currently the machinery to produce an analysis dag requires this option, but the option is not actually used by analyses running on non-LDG resources.
singularity-image sets the path of the container on cvmfs that the analysis should use. Users will not typically change this option (use /cvmfs/singularity.opensciencegrid.org/lscsoft/gstlal:master).
Installing Custom Site Profiles¶
You can define a site profile as YAML. As an example, we can create a file called custom.yml:
scheduler: condor
requirements:
  - "(IS_GLIDEIN=?=True)"
Both the directives and requirements sections are optional.
To install one so it’s available for use, run:
$ singularity exec <image> gstlal_grid_profile install custom.yml
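Once installed, the custom profile can be selected in the condor section of your configuration; presumably it is referenced by the profile file's base name (an assumption here, not confirmed by this doc):
condor:
  profile: custom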