BHFtools: Software for detecting histocompatibility genes

Description

This package includes Perl scripts and Java executables (with source code) to facilitate the identification of major and/or minor histocompatibility genes in organisms that undergo a "natural transplantation reaction," whereby members of the same species either fuse or reject. These tools were used to identify BHF, the Botryllus Histocompatibility Factor (Voskoboynik and Newman et al., 2013), and are configured to score candidate fusibility genes based on the rule that two B. schlosseri colonies will fuse if they share at least 1 allele of the allorecognition gene and will reject otherwise. For methodological details, see the paper.

Input

1. A SAM file for each colony containing paired-end RNA-Seq reads mapped to individual genes.

2. A list of SNPs identified in each colony (e.g., called using VarScan 2). (Format) (Example)

3. A matrix specifying known fusion/rejection relationships between colonies (or individuals). (Example)

Note: all RNA-Seq reads from the paper are available via the NCBI Short Read Archive (download).

Pipeline

BHFtools will perform the following operations:

1. Phase RNA-Seq reads into haplotypes.

2. Refine inferred haplotypes based on known fusion/rejection relationships.

3. Determine the best mismatch threshold between alleles in order to stratify colonies by known fusion/rejection outcomes.

The resulting classification error is used to rank each gene to prioritize candidate histocompatibility factors.

Download

Click here. No installation needed. Simply unzip contents (executables and source code) into the same directory.

Requirements

• Perl 5, with the following dependencies: Inline::Java, Proc::Fork, Text::NSP::Measures::2D::Fisher::twotailed

• SAMtools

• 64 bit Java 1.7+

• Unix operating system (tested on Mac OS X)

• >40 GB of RAM and >20 logical CPU cores (recommended, but not strictly required)

Package

*Tool*	*Description*
RunBHFtools.pl	A script to run the entire pipeline.
VarPhaseMultiFork.pl	Wrapper to iterate through SAM files and perform haplotyping in a parallelized fashion.
HapPred.jar	Algorithm for Haplotype Prediction.
PhaseDist.jar	Algorithm to Phase discontinuous haplotypes based on mismatch Distances between colonies with known histocompatibility outcomes.
FusRejClassifier.jar	Optimize stratification of colony pairs for each gene and derive classification error.
FusionRejection.txt	Known fusion/rejection relationships in a symmetrical matrix format (1=fuse; 2=reject; 0=unknown; sample names should be labeled in first row and column). (Example)
Java source code	All relevant source code.

Usage

Run entire pipeline:

>perl RunBHFtools.pl [input directory] [output directory] [optional: 1 (protein), 0 (DNA/RNA)]

Input directory must contain:

(1) SNP files (output of VarScan 2 with pileup2snp setting), each named "samplename.varscan"

(2) Mapped reads with same sample names as SNP files (e.g., "samplename.sam")

(3) Fusion/rejection outcome matrix, "FusionRejection.txt" (*use same sample names in matrix as above).

Output:

• classifier.out (for details of format, see FusRejClassifier.jar output below)

Tool:

VarPhaseMultiFork.pl

Usage:

>perl VarPhaseMultiFork.pl [SNP file directory] [SAM file directory] [optional: 1 (protein), 0 (DNA/RNA)]

Input:

(1) Directory of SNP files (output of VarScan 2 with pileup2snp setting), each named "samplename.varscan"

(2) Directory of mapped reads with same sample names as SNP files (e.g., "samplename.sam")

Output:

(1) For each input sample: "samplename.phased.txt"

<column 1>  Library name (character string)
<column 2>  Haplotype class (integer >0); o.w. homozygous position (=-1)
<column 3>  Sequence coordinate in mRNA (integer)
<column 4>  Amino acid or nucleotide (character)

(2) All phasing results combined: "all_phased.txt"

Tool:

HapPred.jar

Usage:

>java -Xmx5g -Xms5g -jar HapPred.jar [input file] [optional: 1 (protein), 0 (DNA/RNA)]

Input:

<column 1>  Sequence coordinate in mRNA (integer)
<column 2>  Nucleotide base (amino acid also acceptable)
<column 3>  Read id (integer; each sequencing read must be uniquely identified)
<column 4>  Reference sequence identifier (character string)
<column 5>  SNP - IUPAC code (character)

Note: Use 1-based sequence indices for column 1 (i.e., sequence starts at 1, not zero), paired-end reads should be given the same id in column 3, and the '-' character should be used for homozygous positions in column 5

Output:

<column 1> Input file name (character string)

<column 2> Haplotype class (integer >0); o.w. homozygous position (=-1)

<column 3> Sequence coordinate in mRNA (integer)

<column 4> Amino acid or nucleotide (character)

Hard-coded parameters (change by recompiling):

(1) "minHomThresh": Minimum depth required (=3)

(2) "minHomFrac": Minimum fraction for homozygous allele (=0.8)

Tool:

PhaseDist.jar

Usage:

>java -Xmx5g -Xms5g -jar PhaseDist.jar all_phased.txt FusionRejection.txt > all_distances.txt

Input:

(1) "all_phased.txt": output of VarPhaseMultiFork.pl

(2) "FusionRejection.txt": see above.

Output:

(1) "all_distances.txt"

<column 1>  Gene sequence identifier (character string)
<column 2>  Library 1 (character string)
<column 3>  Library 2 (character string)
<column 4>  No. allelic mismatches (integer)
<column 5>  % mismatches (No. mismatches/total sequence compared) (integer)
<column 6>  No. of mismatches in which both sites are homozygous (integer)
<column 7>  No. of mismatches in which 1 site is heterozygous (integer)
<column 8>  No. mismatches in which both sites are heterozygous (integer)
<column 9>  Total number of characters compared (integer)

(2) "all_rephased.txt" (rephased haplotypes for all samples)

<column 1>  Library name (character string)
<column 2>  Haplotype class (integer >0); o.w. homozygous position (=-1)
<column 3>  Sequence coordinate in mRNA (integer)
<column 4>  Amino acid or nucleotide (character)

Tool:

FusRejClassifier.jar

Usage:

>java -Xmx5g -Xms5g -jar FusRejClassifier.jar all_distances.txt FusionRejection.txt > classifier.out

Input:

(1) "all_distances.txt": output of PhaseDist.pl

(2) "FusionRejection.txt": See above.

Output:

For lines beginning with ">>", the format is as follows, ordered from left to right:

• Gene sequence name

• "Acc:" Accuracy of classification (No. correct / all pairs)

• "ROC dist:" Classification error (distance from a perfect result, that is, TPR=1 and FPR=0)

• "TPR:" True Positive Rate

• "FPR:" False Positive Rate

• "Thresh:" % mismatch threshold that optimally stratifies colonies into known fusion/rejection groups

• "Ttest" P-value resulting from a t-test between % mismatches after being stratified by optimal threshold

• "Pairs:" No. of colony pairs analyzed

• "FuseCount:" No. of colony pairs known to fuse

• "RejectCount:" No. of colony pairs known to reject

Raw data for a given gene sequence are found immediately following the line starting with ">>".
Format is as follows:

<column 1>  1 (known fusion), 0 (known rejection)
<column 2>  % mismatches between best-matching alleles for the given pair of colonies
<column 3>  No. mismatches between best-matching alleles for the given pair of colonies
<column 4>  No. of characters compared (residues or nucleotides)
<column 5>  Sequence sample 1 (colony 1)
<column 6>  Sequence sample 2 (colony 2)

Reference

Voskoboynik A*, Newman AM*, Corey DM, Sahoo D, Pushkarev D, Neff NF, Passarelli B, Koh W, Ishizuka KJ, Palmeri KJ, Dimov IK, Keasar C, Fan HC, Mantalas GL, Sinha R, Penland L, Quake SR, Weissman IL (2013) Identification of a colonial chordate histocompatibility gene. Science, 341(6144):384-387 (link)

Questions/Comments?

BHFtools: Software and Documentation

Contents

Description

Download

Requirements

Package

Usage