Tutorial

Here we’ll walk through a typical run of ggCaller, including both Gene-calling and Querying.

Example results can be found here.

Important

Results will be consistent, but may not exactly match between your run and the example. This is due to the greedy clustering algorithm used by ggCaller, which can cause small differences in gene counts.

Installation and setup

Follow the guide in Installation for downloading and installing ggCaller.

Working Dataset

We’ll use a dataset from Bentley et al. (2006). This dataset contains 91 sequences of pneumococcal capsular polysaccharide synthesis (CPS) loci. These sequences are structurally diverse, but are only ~20,000 bp in length, so they can be analysed quickly (~5-10 minutes) on a standard laptop or desktop.

Download the files from here and unzip:

tar xvf Bentley_et_al_2006_CPS_sequences.tar.bz2
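
To confirm that the archive extracted as expected, you can count the sequence files. The snippet below is a minimal sketch: count_fa_files is a hypothetical helper (not part of ggCaller), and it assumes the sequences sit directly inside the extracted directory with a .fa extension, one file per sequence.

```shell
# Hypothetical helper: count the .fa files directly inside a directory.
# Assumes a find implementation that supports -maxdepth (GNU and BSD do).
count_fa_files() {
    # tr strips the padding that some wc implementations add
    find "$1" -maxdepth 1 -type f -name '*.fa' | wc -l | tr -d ' '
}
```

For example, count_fa_files Bentley_et_al_2006_CPS_sequences prints the number of .fa files found, which can be compared against the 91 sequences described above.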

We will also provide our own custom annotation database for DIAMOND. This will consist of the manually curated protein sequences from Bentley et al. Download from here and unzip:

tar xvf Bentley_et_al_2006_CPS_protein_sequences.tar.bz2

Gene-calling

First generate an input file for ggCaller. This must be a file containing paths (absolute recommended) to all sequences to be analysed. We recommend running the below command within the unzipped directory to generate this file:

cd Bentley_et_al_2006_CPS_sequences
ls -d -1 "$PWD"/*.fa > input.txt
cd ..

input.txt will now contain absolute paths to all .fa files in the directory Bentley_et_al_2006_CPS_sequences.
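
Before starting a run, it can be worth checking that every path in the input file resolves to a real file. This is a minimal sketch: check_input_list is a hypothetical helper (not part of ggCaller) and it assumes the input file holds one path per line.

```shell
# Hypothetical helper: print the number of listed paths that are missing
# or empty, and report each offending path on stderr.
check_input_list() {
    list_file="$1"
    missing=0
    # The "|| [ -n ... ]" clause handles a final line without a newline
    while IFS= read -r path || [ -n "$path" ]; do
        # Skip blank lines
        if [ -z "$path" ]; then
            continue
        fi
        # -s is true only for files that exist and are non-empty
        if [ ! -s "$path" ]; then
            echo "missing or empty: $path" >&2
            missing=$((missing + 1))
        fi
    done < "$list_file"
    echo "$missing"
}
```

For example, check_input_list Bentley_et_al_2006_CPS_sequences/input.txt should print 0 if all paths are valid.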

Now we will run ggCaller with --save, which writes intermediate data structures to disk and enables sequence querying.

To do this using 4 threads, run:

ggcaller --refs Bentley_et_al_2006_CPS_sequences/input.txt --save --out ggc_Bentley_et_al_CPS --threads 4

You will find the following files in the output directory ggc_Bentley_et_al_CPS:

  • GFF: directory of GFF files for each sample, in GFF3 format.

  • ggc_data: intermediate data structures written to disk, required for querying.

  • ORF_dir: intermediate data structures written to disk, containing gene predictions.

  • Path_dir: intermediate data structures written to disk, containing genome paths through the DBG.

  • gene_calls.faa: fasta file containing all gene calls from all samples, in amino acid format.

  • gene_calls.fna: fasta file containing all gene calls from all samples, in nucleotide format.

Additionally, input.gfa and input.color.bfg, the graph and colour files respectively, will be generated in Bentley_et_al_2006_CPS_sequences, alongside input.txt.
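
As a quick sanity check on the output, you can count the gene calls in each fasta file. This is a sketch only: count_fasta_records is a hypothetical helper (not part of ggCaller) that counts header lines beginning with ">".

```shell
# Hypothetical helper: count the records in a fasta file by counting
# its header lines.
count_fasta_records() {
    # grep -c exits non-zero when the count is 0, so "|| true" keeps the
    # helper usable in scripts that run with "set -e"
    grep -c '^>' "$1" || true
}
```

For example, count_fasta_records ggc_Bentley_et_al_CPS/gene_calls.faa and count_fasta_records ggc_Bentley_et_al_CPS/gene_calls.fna can be compared. Assuming both files describe the same set of gene calls, the two counts should be equal, although the exact number may differ slightly from the example results.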

Querying the graph

We can now query the graph. To do so, run:

ggcaller --query CPS_queries.fasta --graph Bentley_et_al_2006_CPS_sequences/input.gfa --colours Bentley_et_al_2006_CPS_sequences/input.color.bfg --prev-run ggc_Bentley_et_al_CPS --out ggc_Bentley_et_al_CPS --threads 4

Results will be saved in ggc_Bentley_et_al_CPS/matched_queries.fasta.

Details on the output can be found in Interpreting results.

From matched_queries.fasta, we can see that all the genes queried were identified in the graph.
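
One way to inspect the matches is to list the header line of each record. This is a minimal sketch: fasta_headers is a hypothetical helper (not part of ggCaller), and it makes no assumptions about the header layout, which is described in Interpreting results.

```shell
# Hypothetical helper: print the header of every record in a fasta file,
# without the leading ">".
fasta_headers() {
    sed -n 's/^>//p' "$1"
}
```

For example, fasta_headers ggc_Bentley_et_al_CPS/matched_queries.fasta prints one line per matched sequence.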

Because we searched for specific gene variants, the search was too stringent to return orthologues in other genomes.

Important

We recommend searching for partial gene sequences, or lowering --query-id to return more distantly related sequences.
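
One way to build partial gene sequences is to trim each query to its first few residues. The snippet below is a sketch under stated assumptions: truncate_fasta is a hypothetical helper (not part of ggCaller), the choice of prefix length is left to you, and taking a prefix is only one of several ways to produce a partial sequence.

```shell
# Hypothetical helper: keep at most N residues from the start of each
# record in a fasta file. Multi-line sequences are joined before trimming.
#   $1 = input fasta, $2 = maximum number of residues to keep per record
truncate_fasta() {
    awk -v n="$2" '
        # Print the stored record, trimmed to n residues
        function emit() {
            if (header != "") {
                print header
                print substr(seq, 1, n)
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }
        { seq = seq $0 }
        END { emit() }
    ' "$1"
}
```

For example, truncate_fasta CPS_queries.fasta 300 > CPS_queries_partial.fasta writes trimmed queries to a new file (the output name and the length of 300 are illustrative), which can then be passed to --query in place of CPS_queries.fasta.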