IgPhyML - B cell phylogenetic inference package

IgPhyML is a program designed to build phylogenetic trees and test evolutionary hypotheses regarding B cell affinity maturation.

The biology of B cell somatic hypermutation (SHM) violates important assumptions in most standard phylogenetic substitution models; further, while most phylogenetics programs are designed to analyze single lineages, B cell repertoires typically contain thousands of lineages. IgPhyML addresses both of these issues by implementing substitution models that correct for the context-sensitive nature of SHM, and combines information from multiple lineages to give more precisely estimated repertoire-wide model parameter estimates.

IgPhyML is integrated into the Immcantation framework, meaning users can easily build and visualize B cell lineage trees and test hypotheses about mutation and selection in B cell repertoires.

Overview

This is the instruction guide for using the new, repertoire-wide version of IgPhyML.

The most up-to-date description of IgPhyML can be found here:

Hoehn, KB, Vander Heiden, JA, Zhou JQ, Lunter, G, Pybus, OG, & Kleinstein SH. Repertoire-wide phylogenetic models of B cell molecular evolution reveal evolutionary signatures of aging and vaccination. Proceedings of the National Academy of Sciences, 116(45), 22664-22672. doi: https://doi.org/10.1073/pnas.1906020116

and here:

Hoehn, KB, Lunter, G, & Pybus, OG. A phylogenetic codon substitution model for antibody lineages. Genetics, 206(1), 417-427. doi: http://dx.doi.org/10.1534/genetics.116.196303

Installation

Quick start

IgPhyML is easiest to use when run indirectly through the Change-O program BuildTrees by specifying the --igphyml option. Most of these instructions require Change-O 0.4.6 or higher, Alakazam 0.3.0 or higher, and IgPhyML to be installed, with the executable in your PATH variable. If these aren’t possible, see IgPhyML standalone operation

To view all options for BuildTrees , run the command:

BuildTrees.py --help

The following commands should work as a first pass on many reasonably sized datasets, but if you really want to understand what’s going on or make sure what you’re doing makes sense, please check out the rest of the website.

Build trees and estimate model parameters

Move to the examples subfolder and run:

BuildTrees.py -d example.tsv --outname ex --log ex.log --collapse \
    --sample 3000 --igphyml --clean all --nproc 1

This command processes an AIRR-formatted dataset of BCR sequences that have been clonally clustered with germlines reconstructed. It then quickly builds trees using the GY94 model and, using these fixed topologies, estimates HLP19 model parameters. This can be sped up by increasing the --nproc option. Subsampling using the --sample option in isn’t strictly necessary, but IgPhyML will run slowly when applied to large datasets. Here, the --collapse flag is used to collapse identical sequences. This is highly recommended because identical sequences slow down calculations without affecting likelihood values in IgPhyML.

Visualize results

The output file of the above command can be read using the readIgphyml function of Alakazam. After opening an R session, enter the following commands. Note that when using the Docker container, you’ll need to run dev.off() after plotting the tree to create a pdf plot in the examples directory:

library(alakazam)
library(igraph)

db = readIgphyml("ex_igphyml-pass.tab")

#plot largest lineage tree
plot(db$trees[[1]],layout=layout_as_tree)

#show HLP10 parameters
print(t(db$param[1,]))
CLONE         "REPERTOIRE"
NSEQ          "4"
NSITE         "107"
TREE_LENGTH   "0.286"
LHOOD         "-290.7928"
KAPPA_MLE     "2.266"
OMEGA_FWR_MLE "0.5284"
OMEGA_CDR_MLE "2.3324"
WRC_2_MLE     "4.8019"
GYW_0_MLE     "3.4464"
WA_1_MLE      "5.972"
TW_0_MLE      "0.8131"
SYC_2_MLE     "-0.99"
GRS_0_MLE     "0.2583"
map to buried treasure

Lineage tree of example clone.

Using Dowser

For simply building trees, the easiest way to use IgPhyML is through the R package Dowser. Dowser supports multiple methods for building B cell lineage trees, including IgPhyML.

To build trees with Dowser, follow this tutorial. For building IgPhyML trees specifically, follow this link once formatting steps are complete.

In either case, IgPhyML will need to either have already been compiled from source code or contained within the Docker container.

While BuildTrees currently offers more building and model options, and also masks codons split by V-gene reference alignment, these features also under in development within Dowser and should be available soon.

Using BuildTrees

IgPhyML can be run indirectly through the Change-O program BuildTrees by specifying the --igphyml option. Most of these instructions require Change-O 0.4.6 or higher, Alakazam 0.3.0 or higher, and IgPhyML to be installed, with the executable in your PATH variable. If these aren’t possible, see IgPhyML standalone operation

To view all options for BuildTrees , run the command:

BuildTrees.py --help

Controlling input

The process begins with an AIRR formatted data file, in which each sequence has been clustered into a clonal group, which has subsequently had its unmutated V and J sequence predicted germlines. The following column names are required in the input file: fields: sequence_id, sequence, sequence_alignment, germline_alignment_d_mask, v_call, j_call, and clone_id. productive is recommended.

Use BuildTrees.py to break this file into separate sequence alignment files that can be used with IgPhyML. This program will:

  1. Filter out nonfunctional sequences.
  2. Mask codons split by insertions.
  3. Separate clonal groups into separate alignment files (aligned by IMGT site) and information files
  4. Create the repertoire files for this dataset.

Create IgPhyML input files from examples/example.tsv without running IgPhyML:

cd examples
BuildTrees.py -d example.tsv --outname ex --log ex.log --collapse

Here the --collapse flag is used to collapse identical sequences. This is highly recommended because identical sequences slow down calculations without actually affecting likelihood values in IgPhyML.

Note

IgPhyML requires at least three sequences in a lineage, so in the case that there is only one observed sequence within a clone, that sequence is duplicated. This will not affect the likelihood calculation because these sequences will have a branch length of zero, but it will affect metrics that take sequence frequency into account.

Subsampling AIRR datasets

IgPhyML runs slowly with more than a few thousand sequences. You can subsample your dataset using the --sample and --minseq options, which will subsample your dataset to the specified depth and then remove all clones below the specified size cutoff:

BuildTrees.py -d example.tsv --collapse --sample 5 --minseq 2 --igphyml

Selecting individual clones

Often only particular clones are of interest for lineage tree analysis. To restrict IgPhyML analysis to particular clones, use the --clones option:

#Only use sequences from clone 2:
BuildTrees.py -d example.tsv --collapse --clone 2 --igphyml

#Only use sequences from clones 1 and 2:
BuildTrees.py -d example.tsv --collapse --clone 1 2 --igphyml

Cleaning up intermediates

BuildTrees produces many intermediate files that are usually not useful to the user. To delete them after IgPhyML is run, use --clean all:

BuildTrees.py -d example.tsv --log ex.log --collapse --igphyml --clean all

Removing the CDR3 region

If you plan to analyze model parameters to study things such as SHM and selection, it’s important to remove the CDR3 region to avoid known model biases in estimating \(\omega\). To do this, use --ncdr3:

BuildTrees.py -d example.tsv --log ex.log --collapse --ncdr3 --igphyml

Building lineage trees

If you’re simply interested in getting some tree topologies, the fastest option is to just use the GY94 and not estimate any parameters under HLP19. This is done using the --optimize n option:

BuildTrees.py -d example.tsv --collapse --igphyml --clean all --optimize n

The trees can then be visualized using igraph and Alakazam. Open an R session:

library(alakazam)
library(igraph)

db = readIgphyml("example_igphyml-pass.tab")

#plot largest lineage tree
plot(db$trees[[1]],layout=layout_as_tree)
graph

Graph-formatted lineage tree of example clone 1.

Run dev.off() after plotting if using the Docker image to create a pdf. In these plots, edge labels represent the expected number of substitutions between nodes in the tree. See the Alakazam documentation for plotting this style of trees. Alternatively, more traditional bifurcating tree topologies can be used:

library(alakazam)
library(ape)

db = readIgphyml("example_igphyml-pass.tab",format="phylo")

#plot largest lineage tree
plot(db$trees[[1]])
phylo

Phylo-formatted lineage tree of example clone 1.

Of course, these are quite simple trees. A more interesting tree can be visualized from a different provided dataset:

library(alakazam)
library(ape)

db = readIgphyml("sample1_igphyml-pass.tab",format="phylo")

#plot largest lineage tree
plot(ladderize(db$trees[[1]]),cex=0.7,no.margin=TRUE)
phylo

Phylo-formatted lineage tree of a larger B cell clone.

Alternatively, to estimate ML tree topologies using the HLP19 model, use:

BuildTrees.py -d example.tsv --collapse --igphyml --clean all --optimize tlr

This will be slower than using the GY94 model but does return meaningful HLP19 parameter estimates. These results can be visualized in the same manner using Alakazam.

Evolutionary hypothesis testing

The HLP19 model

The HLP19 model is the heart of IgPhyML and adjusts for features of affinity maturation that violate the assumptions of most other phylogenetic models. It uses four sets of parameters to characterize the types of mutations the occurred over a lineage’s development, and to help build the tree.

\(\omega\): Also called dN/dS, or the ratio of nonsynonymous (amino acid replacement) and synonymous (silent) mutation rates. This parameter generally relates to clonal selection, with totally neutral amino acid evolution having an \(\omega \approx 1\), negative selection indicated by \(\omega < 1\) and diversifying selection indicated by \(\omega > 1\). Generally, we find a lower \(\omega\) for FWRs than CDRs, presumably because FWRs are more structurally constrained.

\(\kappa\): Ratio of transitions (within purines/pyrimidines) to transversions (between purines/pyrimidines). For normal somatic hypermutation this ratio is usually \(\approx 2\).

Motif mutability (e.g. \(h^{WRC}\)): Mutability parameters for specified hot- and coldspot motifs. These estimates are equivalent to the fold-change in mutability for that motif compared to regular motifs, minus one. So, \(h^{WRC} > 0\) indicates at hotspot, \(h^{WRC} < 0\) indicates a coldspot, and \(h^{WRC} = 2\) indicates a 3x increase in WRC substitution rate. The HLP19 model by default estimates six motif mutability parameters: four hotspots (WRC, GYW, WA, and TW) and two coldspots (SYC and GRS).

Specifying parameters

Substitution parameters are specified using the -t for \(\kappa\) (transition/transverion rate), --omega for \(\omega\) (nonsynonymous/synonymous mutation rate), and --motifs and --hotness for specifying the motif mutability models. The default for all of these is to estimate shared parameter values across all lineages, which is also specified by e.

Due to default parameter settings, the following two commands are equivalent:

BuildTrees.py -d example.tsv --collapse --igphyml

BuildTrees.py -d example.tsv --collapse --igphyml -t e --omega e,e \
 --motifs WRC_2:0,GYW_0:1,WA_1:2,TW_0:3,SYC_2:4,GRS_0:5 \
 --hotness e,e,e,e,e,e --optimize lr

Note that here we use --optimize lr, which will keep tree topologies the same and only estimate branch lengths and substitution parameters. This will keep topologies the same as the GY94, but will estimate substitution parameters much more quickly. Using --optimize tlr will also optimize tree topology, using --optimize r will only optimize model parameters, and --optimize n will not optimize topology, branch lengths, or model parameters.

The default setting is to estimate a separate \(\omega\) parameter for FWR and CDR regions. If you want one \(\omega\) for all regions, use:

BuildTrees.py -d example.tsv --collapse --igphyml --omega e

You can also constrain motifs to have the same mutabilities by altering the indexes after the ‘:’ in the --motifs option. For motif mutability, each value in the --hotness option corresponds to the index specified in the --motifs option. For example, to estimate a model in which WRC/GYW, WA/TW, and SYC/GRS motifs are respectively constrained to have the same mutabilities, use:

BuildTrees.py -d example.tsv --collapse --igphyml \
 --motifs WRC_2:0,GYW_0:0,WA_1:1,TW_0:1,SYC_2:2,GRS_0:2 \
 --hotness e,e,e

Confidence interval estimation

It is possible to estimate 95% confidence intervals for any of these parameters by adding a ‘c’ to the parameter specification. For example, to estimate a 95% confidence interval for \(\omega _{CDR}\) but not \(\omega _{FWR}\), use:

BuildTrees.py -d example.tsv --collapse --ncdr3 --clean all --igphyml --omega e,ce

To estimate a 95% confidence interval for \(\omega _{FWR}\) but not \(\omega _{CDR}\), use:

BuildTrees.py -d example.tsv --collapse --ncdr3 --clean all --igphyml --omega ce,e

Any combination of confidence interval specifications can be used for the above parameter options. For instance, to estimate confidence intervals for GYW mutability, use:

BuildTrees.py -d example.tsv --collapse --ncdr3 --clean all --igphyml --hotness e,ce,e,e,e,e

which is equivalent to:

BuildTrees.py -d example.tsv --collapse --ncdr3 --clean all --igphyml \
 --motifs WRC_2:0,GYW_0:1,WA_1:2,TW_0:3,SYC_2:4,GRS_0:5 \
 --hotness e,ce,e,e,e,e

Remember it is important to remove the CDR3 region for this kind of analysis. You can find further explanation of the different options in the commandline help page of BuildTrees, including controlling output directories and file names.

Visualizing results

Model hypothesis testing can be easily accomplished with the Alakazam functions readIgphyml and combineIgphyml. In this example, we first run IgPhyML on an example file and estimate confidence intervals on \(\omega _{CDR}\):

BuildTrees.py -d example.tsv --collapse --nproc 2 --ncdr3 --clean all --igphyml --omega e,ce

Then, open an R session, where we load the example result and two other samples. To compare maximum likelihood parameter estimates for all samples, use (run dev.off() after plotting if using the Docker image to create a pdf):

#!/usr/bin/R
library(alakazam)
library(ggplot2)

#read in three different samples
ex = readIgphyml("example_igphyml-pass.tab",id="EX")
s1 = readIgphyml("sample1_igphyml-pass.tab",id="S1")
s2 = readIgphyml("sample2_igphyml-pass.tab",id="S2")

#print out parameter values
print(ex$param[1,])

#combine objects into a dataframe
comb = combineIgphyml(list(ex,s1,s2),format="long")

ggplot(comb[grepl("MLE",comb$variable),],
   aes(x=ID,y=variable,fill=value)) + geom_tile() +
   theme_bw() + scale_fill_distiller(palette="RdYlBu")
phylo

Maximum likelihood HLP19 parameter estimates for three samples.

Maximum likelihood point estimates of each parameter are specified with “_MLE”, while upper and lower confidence interval bounds of a parameter are specified with “_UCI” and “_LCI” respectively. Which estimates are available is controlled by the model specified and whether confidence intervals were estimated when running IgPhyML.

properly test the hypothesis that \(\omega _{CDR}\) parameter estimates are significantly different among these datasets, use:

#!/usr/bin/R
library(alakazam)
library(ggplot2)

#read in three different samples
ex = readIgphyml("example_igphyml-pass.tab",id="EX")
s1 = readIgphyml("sample1_igphyml-pass.tab",id="S1")
s2 = readIgphyml("sample2_igphyml-pass.tab",id="S2")

#combine objects into a dataframe
comb = combineIgphyml(list(ex,s1,s2),format="wide")

#compare CDR dN/dS for three samples
ggplot(comb,aes(x=ID,y=OMEGA_CDR_MLE, ymin=OMEGA_CDR_LCI,
  ymax=OMEGA_CDR_UCI)) + geom_point() +
  geom_errorbar(width=0.1) + theme_bw()
phylo

95% confidence intervals for \(\omega _{CDR}\) of three samples.

Where we can see that the confidence interval for Sample 1 does not overlap with the confidence interval for Sample 2, meanging we conclude Sample 1 has significantly lower \(\omega _{CDR}\) than Sample 2. However, the confidence intervals for our example file (ex) are too wide to reach any firm conclusion.

Optimizing performance

IgPhyML is a computationally intensive program. There are some ways to make calculations more practical, detailed below.

Data subsampling: IgPhyML runs slowly with more than a few thousand sequences. You can subsample your dataset using the --sample and --minseq options in BuildTrees.py, which will subsample your dataset to the specified depth and then remove all clones below the specified size cutoff (see Subsampling Change-O datasets).

Analyzing specific clones: The --clone option can be used to analyze only the specified clones.

Parallelizing computations: It is possible to parallelize likelihood calculations by splitting computations across multiple cores using the --nproc option. Currently, calculations are parallelized by tree, so there is no use in using more threads than lineages.

Enforcing minimum lineage size: Many repertoires often contain huge numbers of small lineages that can make computations impractical. To limit the size of lineages being analyzed, specify a cutoff with --minseq when running BuildTrees.py.

Standalone operation

While IgPhyML is easiest to use when run indirectly through the Change-O program BuildTrees, this is not always possible or desirable. This section details how IgPhyML may be run directly as a standalone program. As before, this section requires IgPhyML to be installed, with the executable in your PATH variable. This is already done in the Docker image.

Model specification and confidence interval estimation are the same as when running IgPhyML from BuildTrees. The means of specifying input, however, are different.

To view all options for IgPhyML directly, run the command:

igphyml --help

Analyzing a single lineage

In its most basic and flexible operation, IgPhyML operates on single lineage. Its only required input is an in-frame multiple sequence alignment, without any stop codons, in FASTA or PHYLIP format. The model must be specified. The most basic is GY94, which is fast but doesn’t account for SHM context sensitivity. The HLP19 model corrects for SHM context sensitivity but is slower requires the root sequence be specified:

cd examples

#Estimate parameters and topology using GY94 model
igphyml -i example.fasta -m GY --run_id gy

#Estimate parameters and topology using HLP19 model
igphyml -i example.fasta -m HLP --root V4-59 --run_id hlp

It is also possible to use a fixed tree topology:

igphyml -i example.fasta -m HLP --root V4-59 -u example.fasta_igphyml_tree_gy.txt

Or estimate separate \(\omega\) values for CDR and FWR partitions:

igphyml -i example.fasta -m HLP --root V4-59 --partfile part.example.txt

Generally, all the following features of repertoire-wide phylogenetic analysis can also be performed on single lineages by specifying input files in the above manner.

Analyzing repertoires

These commands should work as a first pass on many reasonably sized datasets, but if you really want to understand what’s going on or make sure what you’re doing makes sense, please check out the rest of the manual.

Convert Change-O files into IgPhyML inputs

Move to the examples subfolder and run, in order:

BuildTrees.py -d example.tsv --outname ex --log ex.log --collapse

This will create the directory ex and the file ex_lineages.tsv. Each ex/<clone ID>.fasta contains the IMGT mutliple sequence alignemt for a particular clone, and each ex/<clone ID>.part.txt file contains information about V and J germline assignments, as well as IMGT unique numbering for each site. The file ex.log will contain information about whether or not each sequence was included in the analysis. The file ex_lineages.tsv is the direct input to IgPhyML. Each line represents a clone and shows the multiple sequence alignment, starting tree topology (N if ignored), germline sequence ID in alignment file, and partition file (N if ignored). These repertoire files start with the number of lineages in the repertoire, and lineages are arranged from most to least number of sequences. Here, it is important to not use --igphyml or --clean all to prevent IgPhyML from being run by BuildTrees, and to prevent these intermediate files from being deleted.

Build lineage trees using the GY94 model

This option is fast and makes good starting topologies, but doesn’t correct for SHM mutation biases. Use the --outrep option to make a modified

igphyml --repfile ex_lineages.tsv -m GY --outrep ex_lineages_gy.tsv --run_id gy

Build lineage trees using the HLP19 model with GY94 starting trees

This option is slower but corrects for mutation biases of SHM:

#HLP topology, branch lengths, and parameters (slow)
igphyml --repfile ex_lineages_gy.tsv -m HLP --threads 1

#GY94 topology, HLP19 branch lengths and parameters (faster)
igphyml --repfile ex_lineages_gy.tsv -m HLP --optimize lr --threads 1

Both of these can be parallelized by modifying the --threads <thread count> option. Trees files are listed as ex/<clone id>.fasta_igphyml_tree.txt, and can be viewed with most tree viewers (I recommend FigTree). Parameter estimates are in ex_lineages.tsv_igphyml_stats.txt.

Controlling output format

Alternatively, run using --oformat tab to create input readable by the readIgphyml function of Alakazam.:

#Output can be read using readIgphyml function
igphyml --repfile ex_lineages_gy.tsv -m HLP --optimize lr --threads 1 --oformat tab

Open an R session, and run the following commands. Note the results are the same as in the quickstart example

library(alakazam)
library(igraph)

db = readIgphyml("ex_lineages_gy.tsv_igphyml_stats.tab")

#plot largest lineage tree
plot(db$trees[[1]],layout=layout_as_tree)

#show HLP10 parameters
print(t(db$param[1,]))
CLONE         "REPERTOIRE"
NSEQ          "4"
NSITE         "107"
TREE_LENGTH   "0.286"
LHOOD         "-290.7928"
KAPPA_MLE     "2.266"
OMEGA_FWR_MLE "0.5284"
OMEGA_CDR_MLE "2.3324"
WRC_2_MLE     "4.8019"
GYW_0_MLE     "3.4464"
WA_1_MLE      "5.972"
TW_0_MLE      "0.8131"
SYC_2_MLE     "-0.99"
GRS_0_MLE     "0.2583"
map to buried treasure

Lineage tree of example clone.

Release Notes

Version 2.0.0: Sept. 22, 2023 + Added support for combined light chain tree building, and scaled branch models + Compatible in Dowser 2.0.0

Version 1.1.4: April 13, 2022

  • Added new parsimony reconstruction methods and maximum ambiguity polytomy resolution for Dowser compatability.

Version 1.1.3: May 20, 2020

  • Small bug fixes in preparation for Immcantation 4.0.0 release

Version 1.1.2: May 13, 2020

  • Added basic intermediate sequence reconstruction function using –ASR or –ASRc 0.9

Version 1.1.1: Jan. 27, 2020

  • Added more general maximum parsimony reconstruction methods.

Version 1.1.0: Sept. 16, 2019

  • Added parsimony reconstruction methods for cell subsets. Information for Austin, Buckner et al. (2019) hardcoded. More general methods and documentation to follow.

Version 1.0.7: Feb. 19, 2019

  • Added HLP19 substitution model, specified by HLP or HLP19.
  • Constant B matrix approximation (–cbmat; default for HLP) dramatically improves runtime.
  • Default for –motifs is now FCH (free coldspots and hotspots).
  • FWR/CDR omega optimization is default behavior if partition files included.
  • Output file cleaned up, as well as confidence interval prediction output.

Version 1.0.6

  • Repertoire-wide HLP17 parameter optimization
  • Parallelization by tree using –threads
  • Integration to Change-O through BuildTrees.py

Contact

If you have questions you can email the Immcantation Group.

If you’ve discovered a bug or have a feature request, you can create an issue on Bitbucket using the Issue Tracker.

Citation

To cite IgPhyML in publications please use:

Hoehn, KB, Vander Heiden, JA, Zhou JQ, Lunter, G, Pybus, OG, & Kleinstein SH. Repertoire-wide phylogenetic models of B cell molecular evolution reveal evolutionary signatures of aging and vaccination. Proceedings of the National Academy of Sciences, 116(45), 22664-22672. doi: https://doi.org/10.1073/pnas.1906020116

Hoehn, KB, Lunter, G, & Pybus, OG. A phylogenetic codon substitution model for antibody lineages. Genetics, 206(1), 417-427. doi: http://dx.doi.org/10.1534/genetics.116.196303

References

IgPhyML is build using the source code of codonPhyML:

Gil, M, Zanetti, MS, Zoller, S, Anisimova, M. CodonPhyML: fast maximum likelihood phylogeny estimation under codon substitution models Mol Biol Evol. 2013 Jun;30(6):1270-80. doi: 10.1093/molbev/mst034

which is based off of PhyML:

Guindon, S, Delsuc F, Dufayard JF, Gascuel, O Estimating maximum likelihood phylogenies with PhyML Methods Mol Biol. 2009;537:113-37. doi: 10.1007/978-1-59745-251-9_6

The HLP17 and HLP19 models are based off of the GY94 model:

Nielsen R, Yang, Z Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics. 1998 Mar;148(3):929-36

License

This work is licensed under the GNU General Public License, verion 2 (GNU GLP v2.0).