Career Profile

Self-motivated data scientist and machine learning engineer with extensive project experience in deep learning, statistical analysis, modeling, full-stack web development, and database design.

Experience

Data Scientist Intern

Jun 2019 - Sep 2019
Uber Technologies Inc.

Data scientist internship on the Uber for Business team.

  • Developed a CNN-LSTM deep learning model to forecast the lifetime value (LTV) of organization customers.

Software Engineer Intern (Data Mining)

Jun 2018 - Jul 2018
Microsoft Corp.

Software engineer internship on the Microsoft Satori knowledge graph team.

  • Developed a convolutional neural network (CNN) model to extract entity relations from text.
  • Applied a machine translation model to translate English entity information into Chinese, enriching the content of the Chinese knowledge graph.

Graduate Research Assistant

2015 - present
University of Texas Southwestern Medical Center

Graduate student in the Quantitative Biomedical Research Center (QBRC). My main research involves:

  • Developing machine learning and deep learning models to predict anti-cancer drug sensitivity and drug synergy using high-throughput sequencing data.
  • Developing Bayesian statistical models to analyze methylated RNA immunoprecipitation sequencing (MeRIP-Seq) data and understand cellular methylation regulation events.
  • Developing statistical methods to infer gene regulatory networks from high-throughput sequencing data.

Projects

    Machine learning and deep learning

    Deep learning model to forecast customer lifetime value
    Developed a deep learning model to forecast customer lifetime value from historical time series data. Used a CNN/LSTM to extract time series features and combined them with time-independent (geographical, business) features. The model can be extended to different forecast horizons (1 month, 3 months, 12 months).
    Deep Learning, Convolutional neural network, Long short-term memory, Lifetime value, Time series modeling
    Convolutional neural network model for entity relation extraction
    Developed a convolutional neural network model to extract entity relations from a text corpus. Used word embeddings to transform tokens into numeric vectors, then applied convolutional kernels to extract semantic features for prediction.
    Deep Learning, Convolutional neural network, Word embedding, Relation extraction, Information retrieval
    DrivenData competition: predicting water pump functionality using machine learning
    Developed a random forest classification model to predict water pump functionality in Africa. This is an open data challenge hosted by DrivenData.
    Machine Learning, Data challenge, Random forest, XGBoost, Classification
    Deep learning model to predict anticancer drug response
    Developed a deep learning model to predict anticancer drug response in lung cancer. Used a convolutional neural network to extract features from tens of thousands of genomic mutation loci. Combined gene expression profiles and drug information to predict drug sensitivity.
    Convolutional neural network, Deep learning, Drug response

    Statistical methodology and modeling

    Bayesian hierarchical model to predict mRNA methylation sites
    Developed a Bayesian hierarchical model to analyze methylated RNA immunoprecipitation sequencing (MeRIP-Seq) data. Used a zero-inflated negative binomial distribution to model sequencing read counts. Modeled spatial dependency with a hidden Markov model.
    Bayesian hierarchical model, Hidden Markov model, Count data, Negative binomial distribution, Zero inflation
    EM algorithm to identify genes with a tri-modal expression distribution
    Developed an expectation maximization (EM) algorithm to deconvolve gene expression profiles into three Gaussian components (lower than normal, close to normal, and higher than normal). Correlated breast cancer patients' survival with this stratification to search for genes for which both high and low expression are associated with worse survival outcomes; such genes may potentially play a dual role (both oncogene and tumor suppressor) in cancer development.
    Expectation maximization, Gaussian mixture, Survival analysis, Breast Cancer

    Bioinformatics software and web portal

    Large scale profiling of RBP-circRNA interaction using CLIP-Seq data
    Developed a bioinformatics tool named Clirc to identify RBP-bound circRNAs through analysis of CLIP-Seq data.
    circRNA, CLIP-Seq, RNA-binding protein (RBP)
    Ensemble-based method to infer gene network structure
    Developed an ensemble method to aggregate networks constructed by multiple statistical methods (correlation-, mutual information-, Bayesian-, and likelihood-based). Developed a web server for online network inference and visualization.
    Ensemble method, Hub node, Precision matrix, Network visualization, Online tool
    DIGREM: an integrated web-based platform for detecting effective multi-drug combinations
    Developed a computational algorithm to predict the synergistic effect of drug combinations from transcriptomic profiles, drug dose-response curves, and gene regulatory networks. Developed a user-friendly web server to enable online prediction. Implemented the algorithm as an R package.
    Drug-Induced Genomic Response models (DIGREM), Precision matrix, Drug synergy, Online tool

Publication

Skills & Proficiency

    R

    Python

    HTML/CSS

    PHP

    MySQL

    Perl

    Git

    Shell