Deprecated features

The features described on this page are not available in ggCaller v1.4 and later. They have been replaced by the generation of GFF files, which can be used with any modern clustering tool, such as Panaroo.

Annotating genes

ggCaller comes with two default databases for functional annotation of genes:

  • Bacterial and Viral databases from UniProt, used by DIAMOND

  • HMM profiles from Prokka, used by HMMER3

Important

Ensure you are connected to the internet when first running ggCaller as these databases are downloaded automatically. Subsequent runs can be conducted offline.

There are three sensitivity levels for annotation:

  • fast: only DIAMOND in fast mode

  • sensitive: only DIAMOND in sensitive mode

  • ultrasensitive: HMMER3 and DIAMOND in sensitive mode
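
The three levels above can be summarised as a small lookup table. This is only an illustrative sketch: the dictionary and the function name tools_for are hypothetical and are not part of ggCaller.

```python
# Illustrative lookup only (hypothetical names, not ggCaller's own code).
# It restates which tools each --annotation level runs, as listed above.
ANNOTATION_LEVELS = {
    "fast": {"diamond_mode": "fast", "hmmer3": False},
    "sensitive": {"diamond_mode": "sensitive", "hmmer3": False},
    "ultrasensitive": {"diamond_mode": "sensitive", "hmmer3": True},
}


def tools_for(level):
    """Return the search tools used at a given --annotation level."""
    settings = ANNOTATION_LEVELS[level]
    tools = ["DIAMOND (" + settings["diamond_mode"] + ")"]
    if settings["hmmer3"]:
        # ultrasensitive adds HMMER3 on top of sensitive DIAMOND
        tools.insert(0, "HMMER3")
    return tools


print(tools_for("ultrasensitive"))
```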

For example, to annotate using only DIAMOND in fast mode, run:

ggcaller --refs input.txt --annotation fast

By default, these commands annotate using DIAMOND with the Bacteria UniProt database. To use the Viruses database instead, run:

ggcaller --refs input.txt --annotation fast --diamonddb Viruses

Custom databases can also be specified, using --diamonddb for DIAMOND and --hmmdb for HMMER3. DIAMOND databases must be amino-acid FASTA files. HMMER3 databases must be HMM-profile .HAMAP files built using hmmbuild, which is part of the HMMER3 package.

To run with custom DIAMOND and HMMER3 databases:

ggcaller --refs input.txt --annotation ultrasensitive --diamonddb annotation.fasta --hmmdb annotation.HAMAP

Annotation is not on by default. If annotation is specified, ggCaller will additionally generate:

  • GFF files for each input genome in a separate directory GFF

  • Gene call FASTA files with annotations added

Aligning genes

ggCaller also supports generation of within-cluster and core genome alignments using MAFFT.

There are two alignment algorithms implemented:

  • def or default, which uses the standard MAFFT multiple sequence alignment algorithm. This is faster when aligning <=500 sequences in a cluster.

  • ref or reference, which uses reference-guided alignment. This is faster when aligning >500 sequences in a cluster.
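
The guidance above amounts to a simple rule of thumb based on cluster size. The sketch below is a hypothetical helper (choose_aligner is not a ggCaller function) for picking a value to pass to --aligner, assuming you already know roughly how many sequences your largest clusters contain.

```python
def choose_aligner(n_sequences, cutoff=500):
    """Suggest a value for --aligner from the expected cluster size.

    The 500-sequence cutoff is the figure quoted above; the helper itself
    is hypothetical and only restates that guidance.
    """
    return "def" if n_sequences <= cutoff else "ref"


print(choose_aligner(120))
print(choose_aligner(2000))
```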

There are also two modes for alignment:

  • core aligns genes only within core clusters, and generates a concatenated core genome alignment.

  • pan aligns genes within all clusters (pangenome alignment), as well as generating a concatenated core genome alignment.

To generate a core genome alignment using default MAFFT, run:

ggcaller --refs input.txt --aligner def --alignment core

To generate a pangenome alignment using reference-guided MAFFT, run:

ggcaller --refs input.txt --aligner ref --alignment pan

To change the frequency of genes deemed to be core, use --core-threshold (default = 0.95, or 95% frequency). For example, to include only genes found at 100% frequency:

ggcaller --refs input.txt --aligner def --alignment core --core-threshold 1.0
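
As a rough illustration of what the core threshold means, the sketch below labels clusters as core when the fraction of genomes carrying them meets the threshold. The function name, the input dictionary and the use of an inclusive comparison (>=) are assumptions made for illustration, not ggCaller's actual implementation.

```python
def core_clusters(cluster_counts, n_genomes, core_threshold=0.95):
    """Return the names of clusters present in at least core_threshold of genomes.

    cluster_counts maps a cluster name to the number of genomes containing it.
    The inclusive comparison is an assumption of this sketch.
    """
    return sorted(
        name
        for name, count in cluster_counts.items()
        if count / n_genomes >= core_threshold
    )


counts = {"geneA": 100, "geneB": 95, "geneC": 40}  # made-up example counts
print(core_clusters(counts, n_genomes=100))
print(core_clusters(counts, n_genomes=100, core_threshold=1.0))
```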

Alignment is off by default. If specified, ggCaller will additionally generate:

  • Core genome alignment in FASTA format

  • Core genome neighbour-joining tree in Newick (NWK) format

  • Per-cluster alignment files in FASTA format in a separate directory aligned_gene_sequences

  • Per-cluster VCF files generated by SNP-sites in a separate directory VCF

Quality control and clustering

Important

This feature is not available in ggCaller v1.4 and later. It has been replaced by the generation of GFF files, which can be used with any modern clustering tool, such as Panaroo.

ggCaller implements Panaroo to identify spurious clusters that are generated by assembly fragmentation and contamination.

Panaroo identifies spurious clusters as those with <2 edges in the gene graph. Spurious clusters are then removed based on their population frequency, determined by three settings:

  • strict; remove spurious clusters with <5% frequency. Good for datasets >100 genomes where rare plasmids are not expected.

  • moderate; remove spurious clusters with <1% frequency (default). Good for datasets <=100 genomes where rare plasmids are not expected.

  • sensitive; do not remove clusters. Good for datasets where rare plasmids are expected.
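
The effect of the three settings can be sketched as a frequency filter applied only to clusters flagged as spurious. This is a simplified illustration of the description above, not Panaroo's code: the names CLEAN_MODE_MIN_FREQ and keep_cluster are hypothetical, and the real graph cleaning involves further steps.

```python
# Minimum population frequency below which a spurious cluster is removed,
# taken from the three settings described above.
CLEAN_MODE_MIN_FREQ = {"strict": 0.05, "moderate": 0.01, "sensitive": 0.0}


def keep_cluster(n_edges, frequency, clean_mode="moderate"):
    """Return True if a cluster survives cleaning in this simplified model.

    A cluster is treated as spurious when it has fewer than 2 edges in the
    gene graph; spurious clusters are dropped if their frequency is below
    the cutoff for the chosen mode.
    """
    spurious = n_edges < 2
    return not (spurious and frequency < CLEAN_MODE_MIN_FREQ[clean_mode])


print(keep_cluster(n_edges=1, frequency=0.03, clean_mode="strict"))
print(keep_cluster(n_edges=1, frequency=0.03, clean_mode="moderate"))
```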

For example, to run ggCaller in strict mode:

ggcaller --refs input.txt --clean-mode strict

More information can be found here.

If you use the full ggCaller pipeline, please also cite Panaroo.

Deprecated advanced options

Gene clustering options

  • --identity-cutoff: Minimum identity at amino acid level between two ORFs for lowest-level clustering (Default = 0.98)

  • --len-diff-cutoff: Minimum ratio of length between two ORFs for lowest-level clustering (Default = 0.98)

  • --family-threshold: Gene family sequence identity threshold (Default = 0.7)

  • --merge-paralogs: Don’t split paralogs during Panaroo quality control (Default = False)
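
To make the first two cutoffs concrete, the sketch below checks whether a pair of ORFs would pass both. It assumes the length ratio is computed as shorter length divided by longer length and that both comparisons are inclusive; the function name is hypothetical and the real clustering is carried out by the underlying tools.

```python
def passes_cutoffs(len_a, len_b, identity,
                   identity_cutoff=0.98, len_diff_cutoff=0.98):
    """Check a pair of ORFs against --identity-cutoff and --len-diff-cutoff.

    len_a and len_b are amino-acid lengths; identity is a fraction in [0, 1].
    The shorter/longer length ratio is an assumption of this sketch.
    """
    length_ratio = min(len_a, len_b) / max(len_a, len_b)
    return identity >= identity_cutoff and length_ratio >= len_diff_cutoff


print(passes_cutoffs(300, 297, identity=0.99))
print(passes_cutoffs(300, 290, identity=0.99))
```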

Annotation options

  • --evalue: Maximum e-value to return for DIAMOND and HMMER searches during annotation (Default = 0.001)

  • --truncation-threshold: Sequences in a cluster shorter than centroid length * truncation-threshold will be annotated as ‘potential pseudogene’ (Default = 0.8)
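
The truncation rule is a single comparison against the cluster centroid. A minimal sketch, with a hypothetical function name and lengths in whatever unit the centroid is measured in:

```python
def is_potential_pseudogene(seq_len, centroid_len, truncation_threshold=0.8):
    """Flag a sequence shorter than centroid length * truncation-threshold."""
    return seq_len < centroid_len * truncation_threshold


print(is_potential_pseudogene(700, 1000))
print(is_potential_pseudogene(900, 1000))
```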

Gene-refinding options

  • --search-radius: The distance (bp) surrounding the neighbour of an accessory gene in which to search for it (Default = 5000)

  • --refind-prop-match: The proportion of an accessory gene’s length that must be found in order to consider it a match (Default = 0.2)
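
The two refinding options can be pictured as a search window around a neighbouring gene and a minimum matched fraction. The sketch below is an assumption-laden illustration (hypothetical function names, 0-based coordinates, window clipped to the contig, inclusive comparison), not the actual refinding algorithm.

```python
def search_window(neighbour_start, neighbour_end, contig_len, search_radius=5000):
    """Return the (start, end) region searched around a neighbouring gene."""
    start = max(0, neighbour_start - search_radius)
    end = min(contig_len, neighbour_end + search_radius)
    return start, end


def is_refound(matched_len, gene_len, refind_prop_match=0.2):
    """Accept a hit if enough of the accessory gene's length was matched."""
    return matched_len / gene_len >= refind_prop_match


print(search_window(10000, 11000, contig_len=50000))
print(is_refound(matched_len=250, gene_len=1000))
```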

Gene graph correction stringency options (determined by clean-mode)

  • --min-trailing-support: Minimum cluster size to keep a gene called at the end of a contig

  • --trailing-recursive: Number of times to perform recursive trimming of low support nodes near the end of contigs

  • --edge-support-threshold: Minimum support required to keep an edge that has been flagged as a possible mis-assembly

  • --length-outlier-support-proportion: Proportion of genomes supporting a spurious long gene (>1.5x outside the IQR of cluster)

  • --min-edge-support-sv: Minimum edge support required to call structural variants in the presence/absence sv file

  • --no-clean-edges: Turn off edge filtering in the final output graph
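
The phrase ">1.5x outside the IQR" in the length-outlier option can be read as the usual interquartile-range rule. The sketch below shows one such reading, flagging genes longer than Q3 + 1.5 * IQR within a cluster. Both this interpretation and the function name are assumptions; the exact quartile method used in practice may differ.

```python
import statistics


def long_outliers(lengths, factor=1.5):
    """Return gene lengths above Q3 + factor * IQR for one cluster.

    Uses statistics.quantiles with its default method; this is one possible
    reading of the rule, not a confirmed implementation.
    """
    q1, _median, q3 = statistics.quantiles(lengths, n=4)
    upper = q3 + factor * (q3 - q1)
    return [length for length in lengths if length > upper]


cluster_lengths = [900, 950, 1000, 1050, 1100, 3000]  # made-up example
print(long_outliers(cluster_lengths))
```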

Alignment options

  • --no-variants: Do not call variants using SNP-sites after alignment (Default = False)

  • --ignore-pseduogenes: Ignore ORFs annotated as ‘potential pseudogenes’ in alignments (Default = False)

Avoid/include algorithms

  • --no-clustering: Do not cluster ORFs (Default = False)

  • --no-refind: Do not refind missed genes (Default = False)