BHFtools: Software and Documentation




Contents

  • Description
  • Download
  • Requirements
  • Package
  • Usage
  • Reference
  • Contact

  • Description

  • This package includes Perl scripts and Java executables (with source code) to facilitate the identification of major and/or minor histocompatibility genes in organisms that undergo a "natural transplantation reaction," whereby members of the same species either fuse or reject. These tools were used to identify BHF, the Botryllus Histocompatibility Factor (Voskoboynik and Newman et al., 2013), and are configured to score candidate fusibility genes based on the rule that two B. schlosseri colonies will fuse if they share at least 1 allele of the allorecognition gene and will reject otherwise. For methodological details, see the paper.

  • Input

  • 1. A SAM file for each colony containing paired-end RNA-Seq reads mapped to individual genes.
  • 2. A list of SNPs identified in each colony (e.g., called using VarScan 2). (Format) (Example)
  • 3. A matrix specifying known fusion/rejection relationships between colonies (or individuals). (Example)

  • Note: all RNA-Seq reads from the paper are available via the NCBI Short Read Archive (download).

  • Pipeline

    BHFtools will perform the following operations:
  • 1. Phase RNA-Seq reads into haplotypes.
  • 2. Refine inferred haplotypes based on known fusion/rejection relationships.
  • 3. Determine the best mismatch threshold between alleles in order to stratify colonies by known fusion/rejection outcomes.

  • The resulting classification error is used to rank each gene to prioritize candidate histocompatibility factors. 

  • Download

    Click here. No installation needed. Simply unzip contents (executables and source code) into the same directory. 

    Requirements

  • Perl 5, with the following dependencies: Inline::Java, Proc::Fork, Text::NSP::Measures::2D::Fisher::twotailed
  • SAMtools
  • 64 bit Java 1.7+
  • • Unix operating system (tested on Mac OS X)
  • • >40 GB of RAM and >20 logical CPU cores (recommended, but not strictly required)

  • Package

  • Tool
  • Description
  • RunBHFtools.pl
  • A script to run the entire pipeline.
  • VarPhaseMultiFork.pl
  • Wrapper to iterate through SAM files and perform haplotyping in a parallelized fashion.
  • HapPred.jar
  • Algorithm for Haplotype Prediction.
  • PhaseDist.jar
  • Algorithm to Phase discontinuous haplotypes based on mismatch Distances between colonies with known histocompatibility outcomes.
  • FusRejClassifier.jar
  • Optimize stratification of colony pairs for each gene and derive classification error.
  • FusionRejection.txt
  • Known fusion/rejection relationships in a symmetrical matrix format (1=fuse; 2=reject; 0=unknown; sample names should be labeled in first row and column). (Example)
  • Java source code
  • All relevant source code.

  • Usage

  • Run entire pipeline:

  • >perl  RunBHFtools.pl  [input directory]  [output directory]  [optional: 1 (protein), 0 (DNA/RNA)]

  • Input directory must contain:

  • (1) SNP files (output of VarScan 2 with pileup2snp setting), each named "samplename.varscan"

  • (2) Mapped reads with same sample names as SNP files (e.g., "samplename.sam")

  • (3) Fusion/rejection outcome matrix, "FusionRejection.txt" (*use same sample names in matrix as above).

  • Output:

  • • classifier.out (for details of format, see FusRejClassifier.jar output below)



  • Tool:

  • VarPhaseMultiFork.pl

  • Usage:

  • >perl  VarPhaseMultiFork.pl  [SNP file directory]  [SAM file directory]  [optional: 1 (protein), 0 (DNA/RNA)]

  • Input:

  • (1) Directory of SNP files (output of VarScan 2 with pileup2snp setting), each named "samplename.varscan"

  • (2) Directory of mapped reads with same sample names as SNP files (e.g., "samplename.sam")

  • Output:

  • (1) For each input sample: "samplename.phased.txt"

    <column 1>  Library name (character string)
    <column 2>  Haplotype class (integer >0); o.w. homozygous position (=-1)
    <column 3>  Sequence coordinate in mRNA (integer)
    <column 4>  Amino acid or nucleotide (character)

  • (2) All phasing results combined: "all_phased.txt"



  • Tool:

  • HapPred.jar

  • Usage:

  • >java  -Xmx5g  -Xms5g  -jar  HapPred.jar  [input file]  [optional: 1 (protein), 0 (DNA/RNA)]

  • Input:

  • <column 1>  Sequence coordinate in mRNA (integer)
    <column 2>  Nucleotide base (amino acid also acceptable)
    <column 3>  Read id (integer; each sequencing read must be uniquely identified)
    <column 4>  Reference sequence identifier (character string)
    <column 5>  SNP - IUPAC code (character)

  • Note: Use 1-based sequence indices for column 1 (i.e., sequence starts at 1, not zero), paired-end reads should be given the same id in column 3, and the '-' character should be used for homozygous positions in column 5

  • Output:

  • <column 1>  Input file name (character string)
  • <column 2>  Haplotype class (integer >0); o.w. homozygous position (=-1)
  • <column 3>  Sequence coordinate in mRNA (integer)
  • <column 4>  Amino acid or nucleotide (character)

  • Hard-coded parameters (change by recompiling):

  • (1) "minHomThresh": Minimum depth required (=3)

  • (2) "minHomFrac": Minimum fraction for homozygous allele (=0.8)



  • Tool:

  • PhaseDist.jar

  • Usage:

  • >java  -Xmx5g  -Xms5g  -jar  PhaseDist.jar  all_phased.txt  FusionRejection.txt  >  all_distances.txt

  • Input:

  • (1) "all_phased.txt": output of VarPhaseMultiFork.pl

  • (2) "FusionRejection.txt": see above.

  • Output:

  • (1) "all_distances.txt"
  • <column 1>  Gene sequence identifier (character string)
    <column 2>  Library 1 (character string)
    <column 3>  Library 2 (character string)
    <column 4>  No. allelic mismatches (integer)
    <column 5>  % mismatches (No. mismatches/total sequence compared) (integer)
    <column 6>  No. of mismatches in which both sites are homozygous (integer)
    <column 7>  No. of mismatches in which 1 site is heterozygous (integer)
    <column 8>  No. mismatches in which both sites are heterozygous (integer)
    <column 9>  Total number of characters compared (integer)

  • (2) "all_rephased.txt" (rephased haplotypes for all samples)

    <column 1>  Library name (character string)
    <column 2>  Haplotype class (integer >0); o.w. homozygous position (=-1)
    <column 3>  Sequence coordinate in mRNA (integer)
    <column 4>  Amino acid or nucleotide (character)



  • Tool:

  • FusRejClassifier.jar

  • Usage:

  • >java  -Xmx5g  -Xms5g  -jar  FusRejClassifier.jar  all_distances.txt  FusionRejection.txt  >  classifier.out

  • Input:

  • (1) "all_distances.txt": output of PhaseDist.pl

  • (2) "FusionRejection.txt": See above.

  • Output:

  • For lines beginning with ">>", the format is as follows, ordered from left to right:

  • • Gene sequence name
  • • "Acc:" Accuracy of classification (No. correct / all pairs)
  • • "ROC dist:" Classification error (distance from a perfect result, that is, TPR=1 and FPR=0)
  • • "TPR:" True Positive Rate
  • • "FPR:" False Positive Rate
  • • "Thresh:" % mismatch threshold that optimally stratifies colonies into known fusion/rejection groups
  • • "Ttest" P-value resulting from a t-test between % mismatches after being stratified by optimal threshold
  • • "Pairs:" No. of colony pairs analyzed
  • • "FuseCount:" No. of colony pairs known to fuse
  • • "RejectCount:" No. of colony pairs known to reject

  • Raw data for a given gene sequence are found immediately following the line starting with ">>".
    Format is as follows:

  • <column 1>  1 (known fusion), 0 (known rejection)
    <column 2>  % mismatches between best-matching alleles for the given pair of colonies
    <column 3>  No. mismatches between best-matching alleles for the given pair of colonies
    <column 4>  No. of characters compared (residues or nucleotides)
    <column 5>  Sequence sample 1 (colony 1)
    <column 6>  Sequence sample 2 (colony 2)




  • Reference
  • Voskoboynik A*, Newman AM*, Corey DM, Sahoo D, Pushkarev D, Neff NF, Passarelli B, Koh W, Ishizuka KJ, Palmeri KJ, Dimov IK, Keasar C, Fan HC, Mantalas GL, Sinha R, Penland L, Quake SR, Weissman IL (2013) Identification of a colonial chordate histocompatibility gene. Science, 341(6144):384-387 (link)




  • Questions/Comments?


  • Last updated by Aaron Newman: 11-08-13


     © Weissman Lab, Stanford University 2013