Manual

Calculate

Overview

Generates the data needed to create the plots in preqclr-report.

Input

READS file: long read sequences in fasta or fastq format

PAF file: information on overlaps between reads in READS file

GFA file: graph assembly that contains contig information

Output

JSON file containing data needed to generate plots in preqclr-report

Log file summarizing statistics calculated, input, and output
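
Because the output is plain JSON, it can be inspected before a report is generated. The following is a minimal sketch, not part of preqclr: the helper names `load_preqclr` and `list_sections` are hypothetical, and no particular key names are assumed, since the layout of the file is not described here.

```python
import json
from typing import Any, Dict, List


def load_preqclr(path: str) -> Dict[str, Any]:
    """Load a JSON file written by the calculate step (hypothetical helper)."""
    with open(path) as handle:
        return json.load(handle)


def list_sections(data: Dict[str, Any]) -> List[str]:
    """Return the top-level keys in sorted order, without assuming their names."""
    return sorted(data.keys())
```

Calling `list_sections(load_preqclr(path))` on an output file shows which groups of statistics it contains.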

Usage example

./preqclr [-h/--help] -r/--reads <fasta|fastq|fasta.gz|fastq.gz> \
        -n/--sample_name sample_name \
        -p/--paf <PAF> -g/--gfa <GFA> \
        --rlen_cutoff INT \
        --verbose -v/--version

Argument name(s)	Required	Default value	Description
`-r`, `--reads`	Y	NA	Fasta, fastq, fasta.gz, or fastq.gz files containing reads.
`-n`, `--sample_name`	Y	NA	Sample name; you can use the name of species for example. This will be used as output prefix.
`-p`, `--paf`	N	NA	Minimap2 Pairwise mApping Format (PAF) file. This is produced by overlapping the reads against themselves, for example `minimap2 -x ava-ont reads.fasta reads.fasta > overlaps.paf`.
`-g`, `--gfa`	N	NA	Miniasm Graphical Fragment Assembly (GFA) file. This is produced using `miniasm -f reads.fasta overlaps.paf > layout.gfa`. It is required only for the NGX plot; if it is not given, NGX statistics are not calculated.
`--verbose`	N	False	Use to print preqclr progress to stdout.
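
When the calculate step is driven from a script, the command line can be assembled from the arguments in the table above. This is a minimal sketch under stated assumptions: `build_calculate_cmd` is a hypothetical helper, the executable path `./preqclr` is taken from the usage example, and only flags documented in the table are used.

```python
from typing import List, Optional


def build_calculate_cmd(reads: str, sample_name: str,
                        paf: Optional[str] = None,
                        gfa: Optional[str] = None,
                        verbose: bool = False) -> List[str]:
    """Assemble an argument list for the calculate step (hypothetical helper)."""
    # Required arguments: the reads file and the sample name (output prefix).
    cmd = ["./preqclr", "--reads", reads, "--sample_name", sample_name]
    # Optional overlap file from minimap2.
    if paf is not None:
        cmd += ["--paf", paf]
    # Optional assembly graph from miniasm; only needed for NGX statistics.
    if gfa is not None:
        cmd += ["--gfa", gfa]
    if verbose:
        cmd.append("--verbose")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a single string avoids shell quoting problems with file names.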

Report

Overview

Generates a report with plots describing QC metrics for long read data sets.

Input

JSON file(s), created by preqclr calculate, containing the per-sample data needed to generate the plots

Output

PDF file report

Plots:

Estimated genome size: A bar plot of the estimated genome size for one or more samples. Because coverage is inferred from the overlap information, it can be used to calculate genome size with Lander-Waterman statistics.
Read length distribution: The distribution of read lengths calculated from the READS file. preqclr sets the x-axis limit so that 90% of read lengths fall below it, which keeps extremely long tails from dominating the plot.
Estimated coverage distribution: The distribution of per-read coverage inferred from the overlap information file (PAF).
Per read GC content distribution: The distribution of GC content per read for a sample of 40% of reads. For each read, GC content is the number of C and G nucleotides divided by the read length.
Total number of bases vs minimum read length: The total number of bases contained in reads of length at least x.
NGX: This shows the contiguity of the data. Miniasm produces contigs from your sequencing data. To interpret the plot, look at x=50 and its NG(50) value on the y-axis: NG(50) is the contig length such that contigs of that length or longer together cover 50% of the estimated genome size.
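
Several of the metrics above reduce to short calculations. The sketch below illustrates them in Python; it is not preqclr's implementation, and all function names are hypothetical. The genome size estimate assumes the simplest form of the Lander-Waterman relation (coverage = total bases / genome size), which may differ from what preqclr actually computes.

```python
from typing import List


def gc_fraction(seq: str) -> float:
    """Number of G and C bases divided by read length (0.0 for an empty read)."""
    if not seq:
        return 0.0
    upper = seq.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def total_bases_at_min_length(read_lengths: List[int], min_len: int) -> int:
    """Total bases contained in reads of length >= min_len."""
    return sum(length for length in read_lengths if length >= min_len)


def estimate_genome_size(total_bases: int, coverage: float) -> float:
    """Assumed simplification: genome size = total bases / mean coverage."""
    return total_bases / coverage


def ngx(contig_lengths: List[int], genome_size: float, x: float) -> int:
    """Contig length at which contigs, taken longest first, reach x% of
    genome_size. Returns 0 if the contigs never cover that much."""
    target = genome_size * x / 100.0
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0
```

For example, with contigs of length 400, 300, 200 and 100 and an estimated genome size of 1000, `ngx(..., 50)` is 300: the two longest contigs (400 + 300) are the first to cover at least 500 bases.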

Usage example

python preqclr-report.py [-h/--help] -i/--input <*.preqclr> \
     --save_png --list_plots -o/--output <output_prefix> --plot <list of user specified plots> \
     --verbose

Argument name(s)	Required	Default value	Description
`-i`, `--input`	Y	NA	Output of preqclr calculate step. JSON formatted file with ‘.preqclr’ extension.
`-o`, `--output`	N	If a single preqclr file is given, the prefix is inferred from its file name. If multiple files are given, the prefix is “preqc-lr-output”.	Prefix for output PDF.
`--plot`	N	NA	Plots to include in the report. Run with `--list_plots` to see the available plot names, then pass those names here.
`--list_plots`	N	NA	Use this to see which plots are available. The NGX plot is available only if NGX statistics were calculated in the preqclr calculate step, which requires miniasm’s GFA file as input.
`--save_png`	N	False	Use to save each subplot as a png.
`--verbose`	N	False	Use to print progress to stdout.