This directory contains all the code necessary to run RegVar and to reproduce the figures and results in the RegVar manuscript.

Directories
===========

Annotation_profiles/ -- contains the processed sequential, evolutionary, and epigenetic profiles used as RegVar features.
DNN_models/ -- contains the trained integrated RegVar models for 17 tissues.
Python_scripts/ -- contains the Python scripts to annotate SNP-TSS samples (RegVar_annotate_variants.py), to train RegVar models (RegVar_training.py), and to retrieve RegVar scores for annotated samples (RegVar_prediction.py). It also includes the scripts to annotate the HGMD variants (annotate_variants_hgmd.py), to train the pathogenic model (RegVar_training_hgmd.py), and to retrieve pathogenic scores (RegVar_prediction_hgmd.py).
RegVar_paper/ -- contains all result files and R scripts needed to generate the figures in the RegVar manuscript.
Tmp/ -- initially empty; used by the annotation script for temporary annotation files.
Training_sets/ -- contains all SNP-TSS samples and HGMD variants used in RegVar training, as well as all variants on chromosome 22 and 100,000 SNPs randomly selected across the genome.

Running RegVar
==============

Required
--------

The software requires the following programs and packages, and we recommend a Linux setup. Our own cluster ran Ubuntu 18.04 with an NVIDIA TITAN Xp GPU to speed up training.
-- Python 2.7.* https://www.python.org/
-- tensorflow https://www.tensorflow.org/
-- pandas https://pandas.pydata.org/
-- scikit-learn https://scikit-learn.org/stable/
-- bedtools https://bedtools.readthedocs.io/en/latest/

Files preparation
-----------------

SNP-TSS samples should be in a 6-column BED file, with the following tab-delimited columns (see the Training_sets/ directory for examples):

Annotation
----------

Go into the Python_scripts/ directory and run the RegVar_annotate_variants.py script to annotate the variant samples (the other Python scripts described in this Readme file should also be run from this directory). In the following command line, option -i is followed by the variant file containing the SNP-TSS samples (eqtls.bed); option -o is followed by the output file that will contain the annotated features of the SNP-TSS samples (annotated.txt); option -t is followed by the tissue in which you want to annotate the samples (tissue). Running the script with the single option -h displays the help information.

python RegVar_annotate_variants.py -i eqtls.bed -o annotated.txt -t tissue

Model training
--------------

RegVar models can be trained with the RegVar_training.py script. In the following command line, options -p and -n are followed by the annotated files of positive (positive_annotation.txt) and negative (negative_annotation.txt) samples, respectively; option -m is followed by the file path where you want to save the trained model (model_path); option -t is followed by the tissue for which you want to train a DNN model (tissue). Training a DNN model takes tens of minutes, depending on the number of training samples. We provide the trained integrated RegVar models for 17 tissues in the DNN_models/ directory (in zip format; unzip them before use). If you want to predict SNP-TSS samples with the integrated RegVar model, you can skip this training step.

python RegVar_training.py -p positive_annotation.txt -n negative_annotation.txt -m model_path -t tissue

Prediction
----------

Prediction for a specific annotated file can be run with the RegVar_prediction.py script. In the following command line, option -i is followed by the annotated file of your SNP-TSS samples (annotated.txt); option -m is followed by the file path where the trained model was saved (model_path) (for the integrated models, the default model_path is ../DNN_model/); option -t is followed by the tissue in which you want to predict the samples (tissue).

python RegVar_prediction.py -i annotated.txt -m model_path -t tissue

The scripts that process the HGMD variants are used in generally the same way as described above, except that no tissue parameter is needed. Their help information is likewise displayed with the single option -h.

Reproducing figures from the manuscript
=======================================

All figures in the RegVar manuscript can be reproduced from the files and R scripts in the RegVar_paper/ directory. Go into the corresponding directory and run its R script to get the corresponding figure. Example command lines:

cd Figure1/
Rscript Figure1.R
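
To rebuild several figures in one go, the manual cd/Rscript steps above can be scripted. The following is a minimal sketch and is not part of the RegVar distribution: the helper names find_r_scripts and run_r_scripts are hypothetical, and it assumes that each R script sits in its own subdirectory of RegVar_paper/ (only Figure1/Figure1.R is named in this Readme) and that Rscript is on your PATH. By default it only lists the commands it would run.

```python
import os
import subprocess


def find_r_scripts(paper_dir):
    """Return sorted (directory, script_name) pairs for .R files one level below paper_dir."""
    jobs = []
    if not os.path.isdir(paper_dir):
        return jobs
    for name in sorted(os.listdir(paper_dir)):
        subdir = os.path.join(paper_dir, name)
        if not os.path.isdir(subdir):
            continue  # skip files that sit directly in paper_dir
        for fname in sorted(os.listdir(subdir)):
            if fname.endswith(".R"):
                jobs.append((subdir, fname))
    return jobs


def run_r_scripts(paper_dir, dry_run=True):
    """List each Rscript command; when dry_run is False, also execute it."""
    commands = []
    for subdir, fname in find_r_scripts(paper_dir):
        cmd = ["Rscript", fname]
        commands.append((subdir, cmd))
        if not dry_run:
            # cwd mirrors the manual "cd Figure1/" step, so relative
            # paths inside the R script resolve as they would by hand.
            subprocess.check_call(cmd, cwd=subdir)
    return commands


if __name__ == "__main__":
    # Assumed to be started from the directory that contains RegVar_paper/.
    for subdir, cmd in run_r_scripts("RegVar_paper", dry_run=True):
        print("%s: %s" % (subdir, " ".join(cmd)))
```

Calling run_r_scripts("RegVar_paper", dry_run=False) would then execute each script in turn from inside its own directory, stopping at the first script that exits with an error.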