CBC Analysis (Offline)¶
To start an offline CBC analysis, you’ll need a configuration file that points at the start/end times to analyze, the input data products (e.g. template bank, mass model), and any other workflow-related configuration needed.
All of the steps below assume a Singularity container with the GstLAL software stack installed. Other installation methods follow a similar procedure, with one caveat: those workflows will not work on the Open Science Grid (OSG).
For a DAG on the OSG IGWN grid, you must use a Singularity container on cvmfs, set the profile in config.yml to osg, and make sure to submit the DAG from an OSG node. Otherwise the workflow is the same.
When running without a Singularity container, the commands below should be modified accordingly (e.g. run gstlal_inspiral_workflow init -c config.yml instead of singularity exec <image> gstlal_inspiral_workflow init -c config.yml).
For the ICDS gstlalcbc shared accounts, the contents of env.sh must be changed, and instead of running $ X509_USER_PROXY=/path/to/x509_proxy ligo-proxy-init -p albert.einstein you should run source env.sh. (Details are below.)
Running Workflows¶
1. Build Singularity image (optional)¶
NOTE: If you are using a reference Singularity container (suitable in most cases), you can skip this step. The <image> throughout this doc refers to the singularity-image specified in the condor section of your configuration.
If not using the reference Singularity container, say for local development, you can specify a path to a local container and use that for the workflow (non-OSG).
To pull a container with gstlal installed, run:
$ singularity build --sandbox --fix-perms <image-name> docker://containers.ligo.org/lscsoft/gstlal:master
To use a branch other than master, replace master in the above command with the name of the desired branch. To use a custom build instead, gstlal will need to be installed into the container from your modified source code; for installation instructions, see the installation page.
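For example, to build from a (hypothetical) branch named my-feature-branch rather than master:
$ singularity build --sandbox --fix-perms <image-name> docker://containers.ligo.org/lscsoft/gstlal:my-feature-branch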
2. Set up workflow¶
First, we create a new analysis directory and switch to it:
$ mkdir <analysis-dir>
$ cd <analysis-dir>
$ mkdir bank mass_model idq dtdphi
Default configuration files and environment (env.sh) for a variety of different banks are contained in the offline-configuration repository. One can run the commands below to grab the configuration files, or clone the repository and copy the files as needed into the analysis directory.
To download data files (mass model, template banks) that may be needed for offline runs, see the README in the offline-configuration repo. Move the template bank(s) into bank and the mass model into mass_model.
For example, to grab all the relevant files for a small BNS dag:
$ curl -O https://git.ligo.org/gstlal/offline-configuration/-/raw/main/configs/bns-small/config.yml
$ curl -O https://git.ligo.org/gstlal/offline-configuration/-/raw/main/env.sh
$ source /cvmfs/oasis.opensciencegrid.org/ligo/sw/conda/etc/profile.d/conda.sh
$ conda activate igwn
$ dcc archive --archive-dir=. --files -i T2200318-v2
$ conda deactivate
Then move the template bank, mass model, idq file, and dtdphi file into their corresponding directories.
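For example, assuming the file names used in the configuration examples below (adjust to match whatever files you actually downloaded):
$ mv gstlal_bank_small.xml.gz bank/
$ mv mass_model_small.h5 mass_model/
$ mv H1L1-IDQ_TIMESERIES-1239641219-692847.h5 idq/
$ mv inspiral_dtdphi_pdf.h5 dtdphi/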
When running an analysis on the ICDS cluster in the gstlalcbc shared account, the contents of env.sh must be changed to what is given below. In addition, where the tutorial says to run ligo-proxy-init -p, instead run source env.sh on the modified env.sh.
When running in non-gstlalcbc shared accounts on ICDS, or when running on other clusters, env.sh does not need to be modified, and ligo-proxy-init -p can be run as in the tutorial.
Now, we’ll need to modify the configuration as needed to run the analysis. At the very least, setting the start/end times and the instruments to run over:
start: 1187000000
stop: 1187100000
instruments: H1L1
Ensure the template bank, mass model, idq file, and dtdphi file are pointed to in the configuration:
data:
  template-bank: bank/gstlal_bank_small.xml.gz

prior:
  mass-model: mass_model/mass_model_small.h5
  idq-timeseries: idq/H1L1-IDQ_TIMESERIES-1239641219-692847.h5
  dtdphi: dtdphi/inspiral_dtdphi_pdf.h5
If you’re creating a summary page for results, you’ll need to point at a location where they are web-viewable:
summary:
  webdir: ~/public_html/
If you’re running on LIGO compute resources and your local username doesn’t match your albert.einstein username, you’ll also need to specify the accounting group user so that condor can track accounting information:
condor:
  accounting-group-user: albert.einstein
In addition, update the singularity-image in the condor section of your configuration if needed:
condor:
  singularity-image: /cvmfs/singularity.opensciencegrid.org/lscsoft/gstlal:master
If not using a reference Singularity image, you can replace this with the full path to a local singularity container <image>.
For more detailed configuration options, take a look at the configuration section below.
If you haven’t installed site-specific profiles yet (per-user), you can run:
$ singularity exec <image> gstlal_grid_profile install
which will install configurations that are site-specific, i.e. ldas and icds.
You can select which profile to use in the condor section:
condor:
  profile: ldas
For an OSG IGWN grid run, use osg.
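For example, the condor section would then contain:
condor:
  profile: osg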
To view which profiles are available, you can run:
$ singularity exec <image> gstlal_grid_profile list
Note that you can install custom profiles as well (see Installing Custom Site Profiles below).
Once you have the configuration, data products, and grid profiles installed, you can set up the Makefile using the configuration, which we’ll then use for everything else, including the data file needed for the workflow, the workflow itself, the summary page, etc.
$ singularity exec <image> gstlal_inspiral_workflow init -c config.yml
By default, this will generate the full workflow. If you want to only run the filtering step, a rerank, or an injection-only workflow, you can instead specify the workflow as well, e.g.
$ singularity exec <image> gstlal_inspiral_workflow init -c config.yml -w injection
for an injection-only workflow.
If you already have a Makefile and need to update it based on an updated configuration, run gstlal_inspiral_workflow with --force.
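For example (assuming the same init invocation as above), to regenerate the Makefile from a modified configuration:
$ singularity exec <image> gstlal_inspiral_workflow init -c config.yml --force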
Next, if you are accessing non-public (i.e. non-GWOSC) data, you’ll need to set up your proxy to ensure you can get access to LIGO data:
$ X509_USER_PROXY=/path/to/x509_proxy ligo-proxy-init -p albert.einstein
Note that we are running this step outside of Singularity, because ligo-proxy-init is not currently installed within the image.
If you are running on the ICDS gstlalcbc shared account, do not run the command
above.
Instead, run:
$ source env.sh
Also update the configuration accordingly (if needed):
source:
  x509-proxy: /path/to/x509_proxy
Finally, set up the rest of the workflow including the DAG for submission:
$ singularity exec -B $TMPDIR <image> make dag
This should create condor DAGs for the workflow. If running on the OSG IGWN grid, make sure to submit the DAGs from an OSG node. Mounting a temporary directory ($TMPDIR) is important, as some of the steps use temporary space to generate files.
If you want to see detailed error messages, add PYTHONUNBUFFERED=1 to the environment line in the submit (*.sub) files by running:
$ sed -i '/^environment = / s/\"$/ PYTHONUNBUFFERED=1\"/' *.sub
3. Launch workflows¶
$ source env.sh
$ make launch
This is simply a thin wrapper around condor_submit_dag launching the DAG in question.
You can monitor the DAG with Condor CLI tools such as condor_q, or by running tail -f full_inspiral_dag.dag.dagman.out.
4. Generate Summary Page¶
After the DAG has completed, you can generate the summary page for the analysis:
$ singularity exec <image> make summary
To make an open-box page after this, run:
$ make unlock
Configuration¶
The top-level configuration consists of the analysis times and detector configuration:
start: 1187000000
stop: 1187100000
instruments: H1L1
min-instruments: 1
These set the start and stop GPS times of the analysis, plus the detectors to use (H1=Hanford, L1=Livingston, V1=Virgo). There is a nice online converter for GPS times here: https://www.gw-openscience.org/gps/. You can also use the program gpstime. Note that these start and stop times have no knowledge about science-quality data; the actual science-quality data that are analyzed are typically a subset of the total time. Information about which detectors were on at different times is available here: https://www.gw-openscience.org/data/.
min-instruments sets the minimum number of instruments we will allow to form an event, e.g. setting it to 1 means the analysis will consider single-detector events, while 2 means we will only consider events that are coincident across at least 2 detectors.
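For example, to only form events that are coincident across two or more detectors, set:
min-instruments: 2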
Section: Data¶
data:
  template-bank: bank/gstlal_bank_small.xml.gz
  analysis-dir: /path/to/analysis/dir
The template-bank option points to the template bank file. These are xml files that follow the LIGOLW (LIGO light weight) schema. The template bank in particular contains a table that lists the parameters of all of the templates; it does not contain the actual waveforms themselves. Metadata such as the waveform approximant and the frequency cutoffs are also listed in this file.
The analysis-dir option is used if the user wishes to point to an existing analysis to perform a rerank or an injection-only workflow. Existing files are grabbed from this directory to seed the rerank/injection workflows.
One can use multiple sub template banks. In this case, the configuration might look like:
data:
  template-bank:
    bns: bank/sub_bank/bns.xml.gz
    nsbh: bank/sub_bank/nsbh.xml.gz
    bbh_1: bank/sub_bank/bbh_low_q.xml.gz
    bbh_2: bank/sub_bank/other_bbh.xml.gz
    imbh: bank/sub_bank/imbh_low_q.xml.gz
Section: Source¶
source:
  data-source: frames
  data-find-server: datafind.gw-openscience.org
  frame-type:
    H1: H1_GWOSC_O2_16KHZ_R1
    L1: L1_GWOSC_O2_16KHZ_R1
  channel-name:
    H1: GWOSC-16KHZ_R1_STRAIN
    L1: GWOSC-16KHZ_R1_STRAIN
  sample-rate: 4096
  frame-segments-file: segments.xml.gz
  frame-segments-name: datasegments
  x509-proxy: x509_proxy
The data-find-server option points to a server that is queried to find the location of frame files. The address shown above is a publicly available server that will return the locations of public frame files on cvmfs. Each frame file has a type that describes the contents of the frame file, and may contain multiple channels of data, hence the channel names must also be specified.
frame-segments-file points to a LIGOLW xml file that describes the actual times to analyze, i.e. it lists the times that science-quality data are available. These files are generalized enough that they could describe different types of data, so frame-segments-name is used to specify which segments to consider. In practice, the segments file we produce will only contain the segments we want. Users will typically not change any of these options once they are set for a given instrument and observing run. x509-proxy is the path to your X.509 proxy.
Section: Segments¶
The segments section specifies how to generate segments and vetoes for the workflow. There are two backends that determine where segments and vetoes are queried from: gwosc (public) and dqsegdb (authenticated).
An example configuration with the gwosc backend looks like:
segments:
  backend: gwosc
  vetoes:
    category: CAT1
Here, the backend is set to gwosc so both segments and vetoes are determined by querying the GWOSC server. There is no additional configuration needed to query segments, but for vetoes, we also need to specify the category used for vetoes. This can be one of CAT1, CAT2, or CAT3. By default, segments are generated by applying CAT1 vetoes, as recommended by the Detector Characterization group.
An example configuration with the dqsegdb backend looks like:
segments:
  backend: dqsegdb
  science:
    H1: DCS-ANALYSIS_READY_C01:1
    L1: DCS-ANALYSIS_READY_C01:1
    V1: ITF_SCIENCE:2
  vetoes:
    category: CAT1
    veto-definer:
      file: H1L1V1-HOFT_C01_V1ONLINE_O3_CBC.xml
      version: O3b_CBC_H1L1V1_C01_v1.2
      epoch: O3
Here, the backend is set to dqsegdb so both segments and vetoes are determined by querying the DQSEGDB server. To query segments, one needs to specify, per instrument, the flag used to query segments. For vetoes, we need to specify the category used for vetoes, as with the gwosc backend. Additionally, a veto definer file is used to determine which flags are used for which veto categories. The veto definer file need not be provided locally; the file, version, and epoch entries fully specify how to access the veto definer file used for generating vetoes.
Section: PSD¶
psd:
  fft-length: 8
  sample-rate: 4096
The PSD estimation method used by GstLAL is a modified median-Welch method that is described in detail in Section IIB of Ref [1]. The FFT length sets the length of each section that is Fourier transformed. The default whitener will use zero-padding of one-fourth the FFT length on either side and will overlap Fourier-transformed segments by one-fourth the FFT length. For example, an fft-length of 8 means that each Fourier-transformed segment used in the PSD estimation (and consequently the whitener) will contain 4 seconds of data with 2 seconds of zero padding on either side, and will overlap the next segment by 2 seconds (i.e. the last two seconds of data in one segment will be the first two seconds of data in the following window).
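As a further (hypothetical) illustration of the same arithmetic, halving the FFT length halves all of these quantities:
psd:
  fft-length: 4    # 2 s of data per segment, 1 s of zero padding on either side, 1 s overlap
  sample-rate: 4096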
Section: SVD¶
svd:
  f-low: 20.0
  num-chi-bins: 1
  sort-by: mchirp
  approximant:
    - 0:1.73:TaylorF2
    - 1.73:1000:SEOBNRv4_ROM
  tolerance: 0.9999
  max-f-final: 1024.0
  num-split-templates: 200
  overlap: 30
  num-banks: 5
  samples-min: 2048
  samples-max-64: 2048
  samples-max-256: 2048
  samples-max: 4096
  autocorrelation-length: 701
  max-duration: 128
  manifest: svd_manifest.json
f-low sets the lower frequency cutoff for the analysis in Hz.
num-chi-bins is a tunable parameter related to the template bank binning procedure; specifically, it sets the number of effective-spin-parameter bins to use in the chirp-mass / effective-spin binning procedure described in Sec. IID and Fig. 6 of [1].
sort-by selects the template sort column. This controls how to bin the bank into sub-banks suitable for the SVD decomposition. It can be mchirp (sorts by chirp mass), mu (sorts by the mu1 and mu2 coordinates), or template_duration (sorts by template duration).
approximant specifies the waveform approximant that should be used, along with the chirp-mass bounds within which to use it. For example, 0:1000:TaylorF2 means use the TaylorF2 approximant for waveforms from systems with chirp masses between 0 and 1000 solar masses. Multiple waveforms and chirp-mass bounds can be provided.
tolerance is a tunable parameter related to the truncation of SVD basis vectors. A tolerance of 0.9999 means the targeted matched-filter inner product of the original waveform and the waveform reconstructed from the SVD is 0.9999.
max-f-final sets the maximum frequency of the templates.
num-split-templates, overlap, and num-banks are tunable parameters related to the SVD process. num-split-templates sets the number of templates to decompose at a time; overlap sets the number of templates from adjacent template bank regions to pad to the region being considered in order to actually compute the SVD (this helps the performance of the SVD, and these pad templates are not reconstructed); num-banks sets the number of sets of decomposed templates to include in a given bin for the analysis. For example, num-split-templates of 200, overlap of 30, and num-banks of 5 means that each SVD bank file will contain 5 decomposed sets of 200 templates, where the SVD was computed using an additional 15 templates on either side of the 200 (as defined by the binning procedure).
samples-min, samples-max-64, samples-max-256, and samples-max are tunable parameters related to the template time-slicing procedure used by GstLAL (described in Sec. IID and Fig. 7 of Ref. [1], and references therein). Templates are sliced in time before the SVD is applied, and only sampled at the rate necessary for the highest frequency in each time slice (rounded up to a power of 2). For example, the low-frequency part of a waveform may only be sampled at 32 Hz, while the high-frequency part may be sampled at 2048 Hz (depending on user settings). samples-min sets the minimum number of samples to use in any time slice. samples-max sets the maximum number of samples to use in any time slice with a sample rate below 64 Hz; samples-max-64 sets the maximum number of samples to use in any time slice with sample rates between 64 Hz and 256 Hz; samples-max-256 sets the maximum number of samples to use in any time slice with a sample rate greater than 256 Hz.
autocorrelation-length sets the number of samples to use when computing the autocorrelation-based test statistic, described in IIIC of Ref [1].
max-duration sets the maximum template duration in seconds; one can choose not to use max-duration.
manifest sets the name of a file that will contain metadata about the template bank bins.
If one uses multiple sub template banks, SVD configurations can be specified for each sub template bank. Reference mario config.
Users will typically not change these options.
Section: Filter¶
filter:
  fir-stride: 1
  min-instruments: 1
  coincidence-threshold: 0.01
  ht-gate-threshold: 0.8:15.0-45.0:100.0
  veto-segments-file: vetoes.xml.gz
  time-slide-file: tisi.xml
  injection-time-slide-file: inj_tisi.xml
  time-slides:
    H1: 0:0:0
    L1: 0.62831:0.62831:0.62831
  injections:
    bns:
      file: bns_injections.xml
      range: 0.01:1000.0
fir-stride is a tunable parameter related to the matched-filter procedure, setting the length in seconds of the output of the matched-filter element.
coincidence-threshold is the time in seconds to add to the light-travel time when searching for coincidences between detectors.
ht-gate-threshold sets the h(t) gate threshold as a function of chirp mass. The h(t) gate threshold is a value above which the output of the whitener, plus some padding, will be set to zero (as described in IIC of Ref. [1]). 0.8:15.0-45.0:100.0 means that a template bank bin that has a maximum chirp-mass template of 0.8 solar masses will use a gate threshold of 15, a bank bin with a maximum chirp mass of 100 will use a threshold of 45, and all other thresholds are described by a linear function between those two points.
veto-segments-file sets the name of a LIGOLW xml file that contains any vetoes used for the analysis, even if there are no vetoes.
time-slide-file and injection-time-slide-file are LIGOLW xml files that describe any time slides used in the analysis. A typical analysis will only analyze injections with the zerolag “time slide” (i.e. the data are not slid in time), and will consider the zerolag and one other time slide for the non-injection analysis. The time slide is used to perform a blind sanity check of the noise model.
injections lists a set of injections, each with their own label. In this example, there is only one injection set, and it is labeled “bns”. file is a relative path to the injection file (a LIGOLW xml file that contains the parameters of the injections, but not the actual waveforms themselves). range sets the chirp-mass range that should be considered when searching for this particular set of injections. Multiple injection files can be provided, each with their own label, file, and range.
The only option here that a user will normally interact with is the injections option.
When using multiple sub template banks, replace bns: under injections: with inj:, as in the sketch below.
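As a rough sketch (reusing the file and range from the single-bank example above), the injections subsection of the filter section might then look like:
filter:
  injections:
    inj:
      file: bns_injections.xml
      range: 0.01:1000.0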
Section: Injections¶
injections:
  sets:
    expected-snr:
      f-low: 15.0
    bns:
      f-low: 14.0
      seed: 72338
      time:
        step: 32
        interval: 1
        shift: 0
      waveform: SpinTaylorT4threePointFivePN
      mass-distr: componentMass
      mass1:
        min: 1.1
        max: 2.8
      mass2:
        min: 1.1
        max: 2.8
      spin1:
        min: 0
        max: 0.05
      spin2:
        min: 0
        max: 0.05
      distance:
        min: 10000
        max: 80000
      spin-aligned: True
      file: bns_injections.xml
The sets subsection is used to create injection sets to be used within the analysis and referenced by name in the filter section. In sets, the injections are grouped by key. In this case, there is one bns injection set, which creates the bns_injections.xml file used in the injections subsection of the filter section.
For multiple injection sets, the block under bns: should be repeated for each injection set. Reference mario config.
Besides creating injection sets, the expected-snr subsection is used for the expected SNR jobs. These settings are used to override defaults as needed.
spin-aligned specifies whether the injections should have (mis)aligned spins (if spin-aligned: True) or precessing spins (if spin-aligned: False).
In the case of multiple injection sets that need to be combined, one can add a few options to create a combined file and reference that within the filter jobs. This can be useful for large banks with a large set of templates. To do this, one can add the following:
injections:
  combine: true
  combined-file: combined_injections.xml
The injections created are generated by the lalapps_inspinj program, with the following mapping between configuration and command line options:
- f-low: --f-lower
- seed: --seed
- time section: --time-step, --time-interval. shift adjusts the start time appropriately.
- waveform: --waveform
- mass-distr: --m-distr
- mass/spin/distance sections: map to options like --min-mass1
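As a rough, hedged illustration only, the bns example above corresponds approximately to an invocation along these lines (the workflow assembles the full command itself, and additional options such as the spin, distance, and GPS-time ranges are also passed but omitted here):
$ lalapps_inspinj \
    --f-lower 14.0 \
    --seed 72338 \
    --time-step 32 \
    --time-interval 1 \
    --waveform SpinTaylorT4threePointFivePN \
    --m-distr componentMass \
    --min-mass1 1.1 --max-mass1 2.8 \
    --min-mass2 1.1 --max-mass2 2.8 \
    --output bns_injections.xml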
Section: Prior¶
prior:
  mass-model: mass_model/mass_model_small.h5
mass-model is a relative path to the file that contains the mass model. This model is used to weight templates appropriately when assigning ranking statistics, based on our understanding of the astrophysical distribution of signals. Users will not typically change this option.
An optional dtdphi-file and idq-timeseries can be provided here. If not given, a default model (included in the standard installation) will be used.
The dtdphi file specifies a probability distribution function for the probability of measuring a given time shift and phase shift between multiple detector observations. It enters into the ranking statistic.
The idq file gives information about the data quality around the time of coalescence.
If specifying idq and dtdphi files, create an idq and a dtdphi directory in the <analysis-dir>, and put the idq and dtdphi files in the respective directories.
Reference mario config.
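For example, pointing at the files used earlier in this tutorial (and using the same keys as the earlier configuration example), the prior section might look like:
prior:
  mass-model: mass_model/mass_model_small.h5
  idq-timeseries: idq/H1L1-IDQ_TIMESERIES-1239641219-692847.h5
  dtdphi: dtdphi/inspiral_dtdphi_pdf.h5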
Section: Rank¶
rank:
  ranking-stat-samples: 4194304
ranking-stat-samples sets the number of samples to draw from the noise model when computing the distribution of log likelihood-ratios (the ranking statistic) under the noise hypothesis. Users will not typically change this option.
Section: Summary¶
summary:
  webdir: /path/to/public_html/folder
webdir sets the path of the output results webpages produced by the analysis. Users will typically change this option for each analysis.
Section: Condor¶
condor:
  profile: osg-public
  accounting-group: ligo.dev.o3.cbc.uber.gstlaloffline
  accounting-group-user: <albert.einstein>
  singularity-image: <image>
profile sets a base level of configuration options for condor.
accounting-group sets accounting group details on LDG resources. Currently the machinery to produce an analysis dag requires this option, but the option is not actually used by analyses running on non-LDG resources.
singularity-image sets the path of the container on cvmfs that the analysis should use. Users will not typically change this option (use /cvmfs/singularity.opensciencegrid.org/lscsoft/gstlal:master).
Installing Custom Site Profiles¶
You can define a site profile as YAML. As an example, we can create a file called custom.yml:
scheduler: condor
requirements:
  - "(IS_GLIDEIN=?=True)"
Both the directives and requirements sections are optional.
To install one so it’s available for use, run:
$ singularity exec <image> gstlal_grid_profile install custom.yml
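Once installed, the custom profile can be selected in the condor section of your configuration; presumably it is referenced by the profile file's base name (an assumption here, not confirmed by this doc):
condor:
  profile: custom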