Projects
Projects at UT Southwestern Medical Center, 2016/01 - present
A New Method to Identify Somatic Copy Number Alterations in Tumors
Somatic copy number alterations (SCNAs) are frequently observed in tumors, and SCNA profiling plays an important role in studying the mechanisms of tumor development and in guiding therapeutics. Although several bioinformatics methods have been developed for SCNA detection using exome-sequencing data, these methods rely on a strong assumption that the median or mean copy number across the genome is two (normal status). This assumption is generally used to estimate the depth or depth ratio that represents the normal status. The assumption usually holds for germline copy number variations, because most parts of the genome do not have copy number changes. However, it may not hold in tumor cells, where large-scale copy number alterations are frequently observed. Existing methods may therefore perform poorly when calling SCNAs in unstable tumor genomes. To solve this problem, we developed a new method, DEFOR, to detect SCNAs in tumor samples from exome-sequencing data. DEFOR supports the estimation of copy numbers in six different statuses and adopts a model that considers allele frequency, depth and purity. In our evaluation, DEFOR performed well, even for unstable tumor genomes with a large proportion of SCNAs.
Genome-wide Mutation Landscape in POLEP286R Engineered Mice
DNA polymerase ε (POLE) plays a key role in leading strand DNA synthesis. POLEP286R is the most common recurrent substitution observed in endometrial cancer, and results in the strongest mutator phenotype. To study the mutation rate and pattern caused by POLEP286R, a conditional POLEP286R allele was engineered in mice in the lab of my collaborator, Dr. Diego Castrillon. I designed a whole genome sequencing study of MEF cell lines at different passages (P15, P30 and P45) and of samples from lung adenocarcinomas and T cell lymphomas. I contributed key ideas about sample selection, and completed all of the data analysis. We found that the average mutation rate in POLEP286R/+ MEFs was around 1.6 substitutions per megabase per cell division, at least three orders of magnitude higher than the normal DNA replication error rate. POLEP286R/+ cancers and MEFs showed a high incidence of C>A and C>T substitutions and rare C>G and T>A substitutions. [PMID:30124468]
Ribosome profiling of eIF5A cell lines in human and yeast
The eukaryotic translation factor eIF5A plays a key role in translation elongation and termination. In collaboration with Dr. Joshua Mendell, we performed ribosome profiling in normal human cell lines and eIF5A-depleted cells. I developed a pipeline to analyze the ribo-seq data and analyzed all of the data for this project. We found that more ribosomes accumulate on the 5’UTR in eIF5Ad cells than in normal cells, and identified motifs that cause ribosome pausing in transcripts.
Genetics of Kidney Cancer
Kidney cancer is among the ten most common cancers in the United States. To better understand the genetic mechanisms of the different types of kidney cancer, the Kidney Cancer Project (KCP) was initiated by Dr. James Brugarolas. I am responsible for processing and analyzing all exome-sequencing and RNA-sequencing data of 697 samples from 219 patients with kidney cancer. To study the possible genetic mechanisms underlying the invasion of tumor cells from the primary tumor site, samples from the thrombus and the primary tumor site were compared to identify differences in somatic mutations and gene expression levels. Although there is no significant difference between thrombus and primary tumor in terms of somatic mutations, we observed several candidate genes with significantly lower expression levels in the thrombus, including known tumor suppressor genes (e.g. PCDH10 and EYA4). This suggests that down-regulation of tumor suppressor genes may play an important role during the invasion of tumor cells.
Projects at University of Michigan, 2012/04 – 2016/01
PUUMA myocardial infarction (MI) and plasma lipid association study
The goal of this project is to identify variants associated with MI and plasma lipid levels in a Chinese population. 10,030 samples from China were genotyped using the Illumina Human Exome-chip. I participated in the design of this study and led the analysis. We identified four novel Asian-specific variants associated with plasma lipid values, two of which have very low frequency in the East Asian population. [PMID:26690388]
HUNT MI and plasma lipid association study
The goal of this project is to identify variants associated with MI and plasma lipid levels in Norwegians. 6,000 samples from Norway were genotyped using the Illumina Human Exome-chip. I led the analysis of this project, including genotype calling, quality control and statistical analysis. We identified a new causal coding variant in TM6SF2 associated with total cholesterol, and our collaborators verified its function in a mouse model. [PMID:24633158] [PMID:24728188]
NHLBI Exome sequencing project (ESP)
The goal of the NHLBI GO Exome Sequencing Project (ESP) is to discover novel genes and mechanisms contributing to heart, lung and blood disorders by pioneering the application of next-generation sequencing of the protein-coding regions of the human genome. I performed gene-based burden tests for LDL-cholesterol based on whole exome sequencing data of 5,500 samples. We identified several low-frequency variants with large effects on LDL-cholesterol. [PMID:24507775]
GLGC meta-analysis of plasma lipid association study
This project involved more than 300,000 samples from 92 studies. I participated in the ancestry inference, variant summaries and association tests. [PMID:29083408] [PMID:29083407]
HUNT Whole Genome Sequencing Project
The purpose of this ongoing project is to perform whole genome sequencing of 2,202 samples (1,100 MI cases and 1,102 controls) from Norway, and to provide a reference panel for association studies, population genetics studies and imputation. I participated in the analysis of this project, including read mapping, genotype calling, quality control and association tests. [PMID:28861891]
GeneZoom
GeneZoom is a tool for visualizing association test results at both the variant and gene levels. GeneZoom can be used as a standalone program or through our web interface. I participated in the design, development and testing of GeneZoom. The manuscript is in preparation.
PhD candidate at Peking University, 2006/09 – 2011/07
Phylogeny guided genome alignment method (PGGA)
I developed a method to construct genome alignments and identify duplications and rearrangements in genomes. This method uses a dynamic programming algorithm, and is implemented in C and Perl.
Plant transcription factors database (PlantTFDB)
As one of the major contributors, I participated in the development of a method to identify transcription factors from plant genomes, and constructed PlantTFDB 1.0 and 2.0. PlantTFDB is a widely used database in this area. [PMID:17933783] [PMID:21097470] [PMID:24174544]
Rice-Map
Rice-Map is a web-based browser used to visualize the genome sequence, gene structure, expression, homology and many other features of the rice genome. Users can browse the genome as if browsing a map. I contributed ideas to the design of Rice-Map and participated in part of its development. [PMID:21450055]
Evolution and function analysis of DNMT3a and DNMT3b
I performed evolutionary analysis of DNMT3a and DNMT3b, and identified a positively selected mutation. Our collaborators verified that this mutation plays a key role in the functional divergence of DNMT3a and DNMT3b. [PMID:20507910]
Expression pattern divergence of duplicated genes in rice
I participated in the analysis of gene expression array data from 45 different tissues and conditions in rice development. We found that the patterns of expression divergence differed among block, tandem and dispersed duplications. [PMID:19534757]