Deprecated features
===================

These features are not available in ggCaller v1.4+. They have been replaced by the generation of GFF files, which can be used with any modern clustering method, such as Panaroo.

For gene clustering and functional annotation, use `ggCallaroo <https://github.com/samhorsfield96/ggCallaroo>`_, which combines ggCaller, `Panaroo <https://github.com/gtonkinhill/panaroo>`_ and `Bakta <https://github.com/oschwengers/bakta>`_.

Annotating genes
^^^^^^^^^^^^^^^^

ggCaller comes with two default databases for functional annotation of genes:

- Bacterial and viral databases from `UniProt <https://www.uniprot.org/>`_, used by `DIAMOND <https://github.com/bbuchfink/diamond>`_
- HMM profiles from `Prokka <https://github.com/tseemann/prokka>`_, used by `HMMER3 <https://github.com/EddyRivasLab/hmmer>`_

.. important::
    Ensure you are connected to the internet
    when first running ggCaller as these databases
    are downloaded automatically. Subsequent runs
    can be conducted offline.

There are three sensitivity levels for annotation:

- ``fast``: only DIAMOND in fast mode
- ``sensitive``: only DIAMOND in sensitive mode
- ``ultrasensitive``: HMMER3 and DIAMOND in sensitive mode

For example, to run DIAMOND only in fast mode, run::

    ggcaller --refs input.txt --annotation fast

By default, these commands annotate using DIAMOND with the ``Bacteria`` UniProt database.
To change this to the ``Viruses`` database, run::

    ggcaller --refs input.txt --annotation fast --diamonddb Viruses

Custom databases can also be specified for DIAMOND using ``--diamonddb`` and for HMMER3 using ``--hmmdb``.
DIAMOND databases must be amino-acid FASTA files. HMMER3 databases must be HMM-profile ``.HAMAP`` files built using
``hmmbuild``, which is part of the HMMER3 package.

To run with custom DIAMOND and HMMER3 databases::

    ggcaller --refs input.txt --annotation ultrasensitive --diamonddb annotation.fasta --hmmdb annotation.HAMAP
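
When scripting several runs, the flag combinations above can be assembled programmatically. The following is a minimal sketch and not part of ggCaller: the helper name ``build_annotation_command`` is hypothetical, and it only uses the flags described in this section.

.. code-block:: python

    from typing import List, Optional

    # Sensitivity levels described in this section.
    ANNOTATION_LEVELS = ("fast", "sensitive", "ultrasensitive")


    def build_annotation_command(
        refs: str,
        level: str = "fast",
        diamonddb: Optional[str] = None,
        hmmdb: Optional[str] = None,
    ) -> List[str]:
        """Return a ggcaller argument list (hypothetical helper)."""
        if level not in ANNOTATION_LEVELS:
            raise ValueError(f"unknown annotation level: {level!r}")
        cmd = ["ggcaller", "--refs", refs, "--annotation", level]
        if diamonddb is not None:
            # "Bacteria", "Viruses" or a path to an amino-acid FASTA file.
            cmd += ["--diamonddb", diamonddb]
        if hmmdb is not None:
            # Path to an HMM-profile file built with hmmbuild.
            cmd += ["--hmmdb", hmmdb]
        return cmd


    print(" ".join(build_annotation_command("input.txt", "fast", diamonddb="Viruses")))

The returned list can be passed directly to ``subprocess.run`` if ggCaller is installed.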

Annotation is not on by default. If annotation is specified, ggCaller will additionally generate:

- GFF files for each input genome in a separate directory ``GFF``
- Gene call FASTA files with annotations added

Aligning genes
^^^^^^^^^^^^^^

ggCaller also supports generation of within-cluster and core genome alignments using `MAFFT <https://github.com/GSLBiotech/mafft>`_.

There are two alignment algorithms implemented:

- ``def`` or default, which uses the standard MAFFT multiple sequence alignment algorithm. This is faster when aligning <=500 sequences in a cluster.
- ``ref`` or reference, which uses reference-guided alignment. This is faster when aligning >500 sequences in a cluster.
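
The rule of thumb above can be written as a small helper. This is only an illustration: ``suggest_aligner`` is a hypothetical name, and the 500-sequence cut-off is taken from the descriptions above rather than from ggCaller's code.

.. code-block:: python

    def suggest_aligner(n_sequences: int, cutoff: int = 500) -> str:
        """Suggest a value for --aligner from a cluster size (hypothetical helper)."""
        if n_sequences < 1:
            raise ValueError("n_sequences must be positive")
        # Standard MAFFT is described as faster up to the cut-off,
        # reference-guided alignment as faster beyond it.
        return "def" if n_sequences <= cutoff else "ref"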

There are also two modes for alignment:

- ``core`` aligns genes only within core clusters, and generates a concatenated core genome alignment.
- ``pan`` aligns genes within all clusters (pangenome alignment), as well as generating a concatenated core genome alignment.

To generate a core genome alignment using default MAFFT, run::

    ggcaller --refs input.txt --aligner def --alignment core

To generate a pangenome alignment using reference-guided MAFFT, run::

    ggcaller --refs input.txt --aligner ref --alignment pan

To change the frequency at which genes are deemed core, use ``--core-threshold`` (default = 0.95, or 95% frequency).
For example, to include only genes found at 100% frequency::

    ggcaller --refs input.txt --aligner def --alignment core --core-threshold 1.0
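
As a worked example of the threshold, the sketch below treats a gene as core when the fraction of genomes containing it is at least ``--core-threshold``. Whether ggCaller uses an inclusive comparison is an assumption here, and ``is_core`` is a hypothetical helper rather than a ggCaller function.

.. code-block:: python

    def is_core(present_in: int, n_genomes: int, threshold: float = 0.95) -> bool:
        """Return True if a gene's frequency meets the core threshold (assumed inclusive)."""
        if n_genomes < 1 or not 0 <= present_in <= n_genomes:
            raise ValueError("invalid genome counts")
        return present_in / n_genomes >= threshold


    # With 200 genomes and the default threshold, 190 genomes are enough.
    print(is_core(190, 200))
    # With --core-threshold 1.0 the gene must be present in every genome.
    print(is_core(199, 200, threshold=1.0))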

Alignment is off by default. If specified, ggCaller will additionally generate:

- Core genome alignment in FASTA format
- Core genome neighbour-joining tree in NWK format
- Per-cluster alignment files in FASTA format in a separate directory ``aligned_gene_sequences``
- Per-cluster VCF file generated by `SNP-SITES <https://github.com/sanger-pathogens/snp-sites>`_ in a separate directory ``VCF``

Quality control and clustering
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. important::
    This feature is not available in ggCaller v1.4+.
    It has been replaced by the generation of GFF files, which can
    be used with any modern clustering method, such as Panaroo.

ggCaller implements Panaroo to identify spurious clusters that are generated by assembly fragmentation and contamination.

Panaroo identifies spurious clusters as those with <2 edges in the gene graph. Spurious clusters are then removed based
on their population frequency, determined by three settings:

- ``strict``: remove spurious clusters with <5% frequency. Good for datasets >100 genomes where rare plasmids are not expected.
- ``moderate``: remove spurious clusters with <1% frequency (default). Good for datasets <=100 genomes where rare plasmids are not expected.
- ``sensitive``: do not remove clusters. Good for datasets where rare plasmids are expected.

For example, to run ggCaller in strict mode::

    ggcaller --refs input.txt --clean-mode strict
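
The effect of the three settings can be illustrated with a simplified sketch. It only encodes the rules stated above (spurious means <2 edges, then a frequency cut-off per mode); the cleaning performed by Panaroo is more involved, and ``keep_cluster`` is a hypothetical helper.

.. code-block:: python

    # Frequency below which a spurious cluster is removed, per clean mode.
    # None means spurious clusters are never removed.
    CLEAN_MODE_MIN_FREQUENCY = {"strict": 0.05, "moderate": 0.01, "sensitive": None}


    def keep_cluster(n_edges: int, frequency: float, mode: str = "moderate") -> bool:
        """Return True if a cluster survives cleaning (simplified illustration)."""
        if mode not in CLEAN_MODE_MIN_FREQUENCY:
            raise ValueError(f"unknown clean mode: {mode!r}")
        min_frequency = CLEAN_MODE_MIN_FREQUENCY[mode]
        is_spurious = n_edges < 2
        if not is_spurious or min_frequency is None:
            return True
        return frequency >= min_frequency


    # A poorly connected cluster found in 3% of genomes:
    for mode in ("strict", "moderate", "sensitive"):
        print(mode, keep_cluster(n_edges=1, frequency=0.03, mode=mode))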

More information can be found in the `Panaroo documentation <https://gtonkinhill.github.io/panaroo/#/gettingstarted/params>`_.

**If you use the full pipeline of ggCaller, please also cite** `Panaroo <https://doi.org/10.1186/s13059-020-02090-4>`_.


Deprecated advanced options
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Gene clustering options

- ``--identity-cutoff``: Minimum identity at amino acid level between two ORFs for lowest-level clustering (Default = 0.98)
- ``--len-diff-cutoff``: Minimum ratio of length between two ORFs for lowest-level clustering (Default = 0.98)
- ``--family-threshold``: Gene family sequence identity threshold (Default = 0.7)
- ``--merge-paralogs``: Don't split paralogs during Panaroo quality control (Default = False)
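
To make the two lowest-level cut-offs concrete, the sketch below checks a pair of ORFs against both. It assumes the length ratio is the shorter length divided by the longer one and that both comparisons are inclusive; ``passes_lowest_level_cutoffs`` is a hypothetical helper, not ggCaller's implementation.

.. code-block:: python

    def passes_lowest_level_cutoffs(
        identity: float,
        len_a: int,
        len_b: int,
        identity_cutoff: float = 0.98,
        len_diff_cutoff: float = 0.98,
    ) -> bool:
        """Check amino-acid identity and length ratio for two ORFs (illustration only)."""
        if len_a < 1 or len_b < 1:
            raise ValueError("ORF lengths must be positive")
        # Assumed definition: shorter length divided by longer length.
        length_ratio = min(len_a, len_b) / max(len_a, len_b)
        return identity >= identity_cutoff and length_ratio >= len_diff_cutoff


    print(passes_lowest_level_cutoffs(0.99, 1000, 990))
    print(passes_lowest_level_cutoffs(0.99, 1000, 900))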

Annotation options

- ``--evalue``: Maximum e-value to return for DIAMOND and HMMER searches during annotation (Default = 0.001)
- ``--truncation-threshold``: Sequences in a cluster less than ``centroid length * truncation-threshold`` will be annotated as 'potential pseudogene' (Default = 0.8)
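
The truncation rule is simple arithmetic, sketched below with the default threshold. ``is_potential_pseudogene`` is a hypothetical helper written for illustration, and the strict "less than" comparison follows the wording of the option above.

.. code-block:: python

    def is_potential_pseudogene(
        seq_length: int, centroid_length: int, truncation_threshold: float = 0.8
    ) -> bool:
        """Return True if a sequence is shorter than centroid length * threshold."""
        return seq_length < centroid_length * truncation_threshold


    # For a 1000 bp centroid, the default cut-off is 800 bp.
    print(is_potential_pseudogene(790, 1000))
    print(is_potential_pseudogene(800, 1000))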

Gene-refinding options

- ``--search-radius``: The distance (bp) surrounding the neighbour of an accessory gene in which to search for it (Default = 5000)
- ``--refind-prop-match``: The proportion of an accessory gene's length that must be found in order to consider it a match (Default = 0.2)

Gene graph correction stringency options (determined by clean-mode)

- ``--min-trailing-support``: Minimum cluster size to keep a gene called at the end of a contig.
- ``--trailing-recursive``: Number of times to perform recursive trimming of low support nodes near the end of contigs
- ``--edge-support-threshold``: Minimum support required to keep an edge that has been flagged as a possible mis-assembly
- ``--length-outlier-support-proportion``: Proportion of genomes supporting a spurious long gene (>1.5x outside the IQR of cluster)
- ``--min-edge-support-sv``: Minimum edge support required to call structural variants in the presence/absence sv file
- ``--no-clean-edges``: Turn off edge filtering in the final output graph

Alignment options

- ``--no-variants``: Do not call variants using SNP-sites after alignment (Default = False)
- ``--ignore-pseduogenes``: Ignore ORFs annotated as 'potential pseudogenes' in alignments (Default = False)

Avoid/include algorithms

- ``--no-clustering``: Do not cluster ORFs (Default = False)
- ``--no-refind``: Do not refind missed genes (Default = False)