Manual¶
Calculate¶
Overview¶
Generates data needed to create plots in preqc-lr-report.
Input¶
- READS file: long read sequences in fasta or fastq format
- PAF file: information on overlaps between reads in READS file
- GFA file: graph assembly that contains contig information
Output¶
- JSON file containing data needed to generate plots in preqclr-report
- Log file summarizing statistics calculated, input, and output
Usage example¶
./preqclr [-h/--help] -r/--reads <fasta|fastq|fasta.gz|fastq.gz> \
-n/--sample_name sample_name \
-p/--paf <PAF> -g/--gfa <GFA> \
--rlen_cutoff INT \
--verbose -v/--version
Argument name(s) | Required | Default value | Description |
---|---|---|---|
-r , --reads |
Y | NA | Fasta, fastq, fasta.gz, or fastq.gz files containing reads. |
-n , --sample_name |
Y | NA | Sample name; you can use the name of species for example. This will be used as output prefix. |
-p , --paf |
N | NA | Minimap2 Pairwise mApping Format (PAF) file. This is produced using minimap2 -x ava-ont sample.fastq sample.fasta . |
-g , --gfa |
N | NA | Miniasm Graph Fragment Assembly (GFA) file. This is produced using miniasm -f reads.fasta overlaps.paf > layout.gfa . This is required only if user wants to generate an NGX plot. If not given, it will NOT CALCULATE NGX STATISTICS. |
--verbose |
N | False | Use to output preqc-lr progress to stdout. |
Report¶
Overview¶
Generates a report with plots describing QC metrics for long read data sets.
Input¶
- JSON file(s) containing data for sample(s) needed to generate plots created in preqclr calculate
Output¶
- PDF file report
Plots:
- Estimated genome size This is a bar plot that shows the estimated genome size for one or more samples. As coverage was inferred from overlap information, we can use this to calculate genome size with Lander-Waterman statistics.
- Read length distribution This is the distribution of read lengths calculated from the READS file. preqclr imposes an x-limit such that 90% of all of the read lengths falls under this limit. This was done to avoid extremely long tails.
- Estimated coverage distribution This shows the distribution of coverage for each read inferred from the overlap information file (PAF).
- Per read GC content distribution In this plot we show the distribution of GC content per read for a sample of 40% of reads. To calculate this for each read, we summed the number of C and G nucleotides then divided by the read length.
- Total number of bases vs minimum read length We show the total number of bases with reads of a minimum length of x.
- NGX This shows the contigiuity of the data. Miniasm produces contigs from your sequencing data. To interpret this let’s look at x=50 and it’s NG(50) value on the y-axis. The contig length on the y-axis describes the length at which 50% of the genome size estimate is capture in contigs with length greater than or equal to the NG(50) value.
Usage example¶
python preqclr-report.py [-h/--help] -i/--input <*.preqclr> \
--save_png --list_plots -o/--output <output_prefix> --plot <list of user specified plots> \
--verbose
Argument name(s) | Required | Default value | Description |
---|---|---|---|
-i , --input |
Y | NA | Output of preqclr calculate step. JSON formatted file with ‘.preqclr’ extension. |
-o , --output |
N | If only one preqclr file given, it will infer from prefix. Else if multiple, prefix will be “preqc-lr-output”. | Prefix for output PDF. |
--plot |
N | NA | Users can specify which plots they want. To do so, use --list_plots and use the names of plots. |
--list_plots |
N | NA | Use this to see which plots are available. Note that NGX plots are also dependent on whether or not it was calculated in preqc-lr-calculate step and this depends on whether or not miniasm’s GFA file was passed as input. |
--save_png |
N | False | Use to save each subplot as a png. |
--verbose |
N | False | Use to print progress to stdout. |