Nilotpal Sanyal
Assistant Professor
Assistant Director, Data Analytics Lab
Department of Mathematical Sciences, University of Texas at El Paso, 500 W University Ave, El Paso, TX 79968-0514
Office: Bell Hall 328; Phone: (915) 747-6763; E-mail: nsanyal@utep.edu; Personal Webpage
I am an Assistant Professor in the Department of Mathematical Sciences at the University of Texas at El Paso.
I obtained a PhD in Statistics from the University of Missouri-Columbia with a dissertation on Bayesian functional magnetic resonance imaging (fMRI) data analysis and Bayesian optimal design. Following that, I gained extensive postdoctoral research experience at Stanford University, the University of California-San Diego, and Texas A&M University, working on biological data applications.
My research interests are Bayesian statistics, high-dimensional variable selection, nonparametric regression, statistical genetics, computational neuroscience, and survival analysis.
I am truly passionate about teaching and have immense respect for the value of good teaching and good mentoring.
Research
Overview
My current research interests are Bayesian statistics, high-dimensional variable selection, nonparametric regression, statistical genetics, and survival analysis. I work in both the Bayesian and the frequentist frameworks. I enjoy developing computationally efficient statistical/machine learning methods and software. My research has found applications in omics, epidemiology, public health, and neuroscience.
Specific research areas
- High-dimensional variable selection and inference methods. Such methods are extremely useful for sparse data that contain a large number of variables/features (e.g., public health data), often much larger than the number of observations (e.g., GWAS data, or gene expression data), where only a few features have significant effects.
- Multiscalar methods. Such methods are useful for data that may contain information at multiple scales or resolution levels (say, image data or areal data) by virtue of the implicit multiscalar nature of the process (say, fMRI brain activation) and/or availability of information at multiple scales (say, time series data).
- Survival data methods in the presence of competing risks. Such methods help to correctly predict the risk of an event (say, death) due to the primary cause of interest (say, cardiovascular disease) by accounting for other causes (say, accident) that may lead to the same event. [Note that if we simply exclude from the sample the persons who die from an accident, we lose the information that those persons did not die from cardiovascular disease up to the time of their accident. A competing risks model incorporates this information.]
- Survival data methods in the presence of a cure fraction. Such methods help to account explicitly for the presence of possibly cured persons (say, long-time meditators) in the population who may never experience the event of interest (say, depression).
- Gene by environment (GxE) interaction methods. Such methods help to understand how environmental factors (say, pollution) and lifestyle factors (say, smoking) may modify the effect of genetic factors on a trait or disease (say, lung cancer).
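As a rough illustration of the note in the competing risks item above, the following Python sketch compares two estimates of the risk of the primary cause on a small made-up cohort. The function name `cumulative_incidence`, the cause codes (0 = censored, 1 = cardiovascular death, 2 = accident), and the data are all hypothetical and chosen only for illustration; the estimator is the standard Aalen-Johansen cumulative incidence, not a method taken from the packages listed on this page.

```python
def cumulative_incidence(times, causes, cause_of_interest):
    """Aalen-Johansen estimate of the cumulative incidence of one cause
    at the end of follow-up. Cause code 0 means censored."""
    data = sorted(zip(times, causes))
    n_at_risk = len(data)
    event_free = 1.0  # estimated probability of no event of any cause so far
    cif = 0.0
    i = 0
    while i < len(data):
        # Collect all subjects sharing the current follow-up time.
        t = data[i][0]
        j = i
        while j < len(data) and data[j][0] == t:
            j += 1
        tied = [c for _, c in data[i:j]]
        d_interest = sum(1 for c in tied if c == cause_of_interest)
        d_any = sum(1 for c in tied if c != 0)
        # Subjects with a competing event stay in the risk set until
        # their own event time, so their event-free time is not lost.
        cif += event_free * d_interest / n_at_risk
        event_free *= 1.0 - d_any / n_at_risk
        n_at_risk -= len(tied)
        i = j
    return cif


# Hypothetical cohort: follow-up times and cause codes.
times = [1, 2, 3, 4, 5, 6, 7, 8]
causes = [2, 1, 2, 1, 0, 2, 1, 0]

# Estimate that accounts for the competing cause.
with_competing = cumulative_incidence(times, causes, 1)

# Naive estimate: drop everyone who died from an accident first.
kept = [(t, c) for t, c in zip(times, causes) if c != 2]
naive = cumulative_incidence([t for t, _ in kept], [c for _, c in kept], 1)

print(f"accounting for competing risks: {with_competing:.3f}")
print(f"after excluding accident deaths: {naive:.3f}")
```

In this toy cohort the naive estimate comes out larger, because dropping the accident cases discards the time during which they were known not to have died from cardiovascular disease.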
I have developed/co-developed the R software packages GWASinlps, CGEN, and BHMSMAfMRI based on my research. See the Software tab for more details about them.
Publications
- Google Scholar
- ResearchGate
- ORCiD
Software
Here are some software packages that I have developed/co-developed based on my research.
- BHMSMAfMRI: This is an R software package that performs Bayesian hierarchical multi-subject multiscale analysis of functional MRI (fMRI) data, or other multiscale data, as described in Sanyal & Ferreira (2012). It uses a wavelet-based prior that borrows strength across subjects, and it provides posterior smooth estimates of the effect sizes and samples from their posterior distribution. Description and download instructions are available at the package webpage at https://nilotpalsanyal.github.io/BHMSMAfMRI/.
- GWASinlps: This is an R software package that performs Bayesian non-local prior based iterative variable selection for data from genome-wide association studies (GWAS), or other high-dimensional data, as described in Sanyal et al. (2019). Description and download instructions are available at the package webpage at https://nilotpalsanyal.github.io/GWASinlps/.
- CGEN: This is an R software package that analyzes case-control data in genetic epidemiology. It provides a set of statistical methods for evaluating gene x environment (or gene x gene) interactions under multiplicative and additive risk models (Sanyal et al., 2021; Rochemonteix et al., 2021), with or without assuming gene-environment (or gene-gene) independence in the underlying population. Description and download instructions are available at the package webpage at https://www.bioconductor.org/packages/release/bioc/html/CGEN.html. A tutorial for the additive gene x environment interaction tests under the trend effect of genotypes, proposed in the above references, is available at https://github.com/thehanlab/AdditiveGxEtrendtest.
- SPLC-RAT: My past colleagues at Stanford University have developed this Shiny app based on our joint work on the development and validation of the first risk prediction tool for second primary lung cancer that incorporates comprehensive risk factors including smoking information, medical history, treatment, and tumor characteristics using large population-based data. It is available at https://splc-risk-prediction.shinyapps.io/SPLC-RiskAssessmentTool/.
- Univariate probability distribution viewer: A Shiny app to visualize various univariate probability distributions. Feel free to use it for non-commercial classroom teaching.
References:
Teaching
I ardently love to teach and have immense respect for the value of good teaching and good mentoring.
Current courses (Spring 2024)
- STAT 6329 - Statistical Programming, UTEP
- DS 6390 - DS Research Collaborative, UTEP
Past Courses
- STAT 6370 - Special Topics (Advanced Regression Analysis), UTEP.
- Statistical Data Analysis (with project supervision for 12 students), International Statistical Education Center, ISI, Kolkata, 2022-23.
- Statistical Methods, International Statistical Education Center, ISI, Kolkata, 2022-23.
- Descriptive Statistics, International Statistical Education Center, ISI, Kolkata, 2022-23.
Workshop teaching
- Special Lecture on Survival Analysis, Maulana Azad College, Kolkata, April 2023.
- R Sessions for CoxBoost modeling, Virtual workshop, Stanford University Quantitative Science Unit, January 2021.
- Random Forest for Competing Risk Data, Virtual workshop, Stanford University Quantitative Science Unit, December 2020.
- Predictive Modeling of Competing Risk Data Using Penalized Regression, Virtual workshop, Stanford University Quantitative Science Unit, November 2020.
- Time Series Analysis, Winter School on Statistical Data Analysis Methods, Indian Statistical Institute, Kolkata, February 2015.
- Introduction to R, Winter School on Statistical Data Analysis Methods, Indian Statistical Institute, Kolkata, February 2015.
- Descriptive Statistics, Winter School on Statistical Data Analysis Methods, Indian Statistical Institute, Kolkata, February 2015.
- Time Series Analysis, Short-term Course on Statistical Methods, Arya Vidyapeeth College, Guwahati, Assam, India, November 2014.
- Introduction to R, Short-term Course on Statistical Methods, Arya Vidyapeeth College, Guwahati, Assam, India, November 2014.
- Design of Experiments, Workshop on Techniques of Data Analysis, Dimapur Govt. College, Nagaland, India, September 2014.
- Time Series Analysis, Workshop on Techniques of Data Analysis, Dimapur Govt. College, Nagaland, India, September 2014.
- R for Time Series, Workshop on Techniques of Data Analysis, Dimapur Govt. College, Nagaland, India, September 2014.
Some materials from past teaching/workshops:
- Penalized regression analysis for competing risks data: Concepts and data analysis.
- Random forest analysis for competing risks data: Concepts and data analysis.
- Boosting for competing risks data: Concepts and data analysis.
- Competing risks data simulation
- Introductory time series analysis: Concepts and data analysis.
- Introduction to R and exercises.
Service
Learn
Here are some self-made concise guides for quick learning.
Links
Here are miscellaneous useful links for research.
Probability / Statistics / Linear algebra
- A History of the Central Limit Theorem - Hans Fischer
- A Geometrical Understanding of Matrices: Gregory Gundersen blog
- Affine transformations: Arcane Algorithm Archive
- What's So Special About Logit?: Statistical Horizons
Data Science
- Data Science Blog by Matthias Döring. A blog about everything related to data science and programming.
Computer
- Mac keyboard shortcuts
- Sublime Text Regular Expression Cheat Sheet
- LaTeX accents
- LaTeX Beamer themes
- Draw symbol to get LaTeX command
- Common Math Symbols in HTML, TeX, and Unicode
- Text to HTML
Free datasets / search
- UC Irvine Machine Learning Repository - Machine learning
- Arizona State Univ Datasets
- MIT-BIH Arrhythmia Database
- KDnuggets: Datasets for Data Science, Machine Learning, AI & Analytics
- KDD Cup Archives (the annual Data Mining and Knowledge Discovery competition)
- Kaggle Datasets - Miscellaneous
- Data.gov: The home of the U.S. Government’s open data - Government
- Datahub.io - Miscellaneous
- Gene Expression Omnibus
- Google Dataset Search - Miscellaneous
- NASA Earth Data - Earth observation data
- CERN Open Data - Particle physics
- Global Health Observatory data repository: WHO - Health
- Tableau: Free Public Data Sets - Miscellaneous
- OpenML: A worldwide machine learning lab - Machine learning
- Data.world: 132355 free datasets - Miscellaneous
Free online courses
- MIT OpenCourseWare (OCW): Web based publication of virtually all MIT course content, open and available to the world.
Miscellaneous
- Scimago Journal & Country Rank
- Statistics and Probability Journals
- Convert DOI/ArXiv/ISBN to BibTeX, etc.
- A website of timelines
- Arcane Algorithm Archive: A collaborative effort to create a guide for all important algorithms in all languages. https://www.algorithm-archive.org. The corresponding YouTube channel is here.
- Philosophy of mathematics
- Visa rules for all countries
- External Funding Sources
- Choose graphics by data
Statistics Teaching
Bioinformatics / Biomedical Informatics / Biostatistics
- NCBI (National Center for Biotechnology Information): The National Center for Biotechnology Information advances science and health by providing access to biomedical and genomic information.
- dbGaP (database of Genotypes and Phenotypes): An archive and distribution center for the description and results of studies which investigate the interaction of genotype and phenotype. These studies include genome-wide association (GWAS), medical resequencing, molecular diagnostic assays, as well as association between genotype and non-clinical traits.
- dbVar (Database of Genomic Structural Variation): The dbVar database has been developed to archive information associated with large scale genomic variation, including large insertions, deletions, translocations and inversions. In addition to archiving variation discovery, dbVar also stores associations of defined variants with phenotype information.
- dbSNP (Database of Short Genetic Variations): Includes single nucleotide variations, microsatellites, and small-scale insertions and deletions. dbSNP contains population-specific frequency and genotype data, experimental conditions, molecular context, and mapping information for both neutral variations and clinical mutations.
- GenBank: The NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI. These three organizations exchange data on a daily basis.
- Gene: A searchable database of genes, focusing on genomes that have been completely sequenced and that have an active research community to contribute gene-specific data. Information includes nomenclature, chromosomal localization, gene products and their attributes (e.g., protein interactions), associated markers, phenotypes, interactions, and links to citations, sequences, variation details, maps, expression reports, homologs, protein domain content, and external databases.
- Gene Expression Omnibus (GEO) Database: A public functional genomics data repository supporting MIAME-compliant data submissions. Array- and sequence-based data are accepted and tools are provided to help users query and download experiments and curated gene expression profiles.
- Genome: Contains sequence and map data from the whole genomes of over 1000 organisms. The genomes represent both completely sequenced organisms and those for which sequencing is in progress. All three main domains of life (bacteria, archaea, and eukaryota) are represented, as well as many viruses, phages, viroids, plasmids, and organelles.
- RefSeq (Reference Sequence): A collection of curated, non-redundant genomic DNA, transcript (RNA), and protein sequences produced by NCBI. RefSeqs provide a stable reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis, expression studies, and comparative analyses. The RefSeq collection is accessed through the Nucleotide and Protein databases.
- PubMed: A database of citations and abstracts for biomedical literature from MEDLINE and additional life science journals. Links are provided when full text versions of the articles are available via PubMed Central (described below) or other websites.
- SRA (Sequence Read Archive): The Sequence Read Archive (SRA) stores sequencing data from the next generation of sequencing platforms including Roche 454 GS System®, Illumina Genome Analyzer®, Life Technologies AB SOLiD System®, Helicos Biosciences Heliscope®, Complete Genomics®, and Pacific Biosciences SMRT®.
- SARS CoV: A summary of data for the SARS coronavirus (CoV), including links to the most recent sequence data and publications, links to other SARS related resources, and a pre-computed alignment of genome sequences from various isolates.
- cBioPortal for Cancer Genomics
- GWAS Catalog
- GSEA (Gene Set Enrichment Analysis): Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states (e.g. phenotypes).
- MSigDB (Molecular Signature Database): A collection of annotated gene sets for use with GSEA software.
- NITRC (NeuroImaging Tools & Resources Collaboratory): Award-winning free web-based resource offering comprehensive information on an ever expanding scope of neuroinformatics software and data.
- TCGA (The Cancer Genome Atlas Program): The Cancer Genome Atlas (TCGA), a landmark cancer genomics program, molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. This joint effort between NCI and the National Human Genome Research Institute began in 2006, bringing together researchers from diverse disciplines and multiple institutions. Over the next dozen years, TCGA generated over 2.5 petabytes of genomic, epigenomic, transcriptomic, and proteomic data. The data, which has already led to improvements in our ability to diagnose, treat, and prevent cancer, will remain publicly available for anyone in the research community to use.
- Handbook of Biological Statistics - John H. McDonald
Funding sources
Others
Alongside academic research I have multifarious interests. Feel free to explore some of them here, to comment, and to connect.
UTEP Miscellaneous Links
- UTEP
- UTEP Mathematical Sciences
- MyUTEP / Single Sign On
- UTEP E-mail
- Blackboard
- Tech Support, Service Desk
- Academic Calendar
- On-campus Housing: Miner Village, Miner Canyon, and Miner Heights
- UTEP News
- Center for Faculty Leadership and Development
- Tenure and Promotion
- Holiday Schedule
- Travel Office
- Shuttles
- Pay Dates
- Computing Equipment
- Computer Purchases
- Campus Directory
- Campus Map
- Events Calendar
- Bookstore and Shop
- Library
- Building Addresses
- Blackboard Tutorials
El Paso Miscellaneous Links
- El Paso Official Website: Official website of the city with information on local government services, departments, permits, and regulations.
- 26 Things You Need To Know About El Paso Before You Move There
- What do I need to know before moving to El Paso, TX?
- Visit El Paso: Comprehensive resource for exploring tourism, attractions, events, dining, and recreational opportunities in El Paso.
- Electric
- Water
- Sun Metro: Provides information on public transportation services, routes, schedules, and fares in El Paso.
- County Transportation
- Public Libraries
- Zoo
- Museum of Art
- Museum of Archaeology
- Museum of History
- Symphony Orchestra: Provides details on upcoming concerts, ticket information, and educational programs related to classical music.
- Parks and Recreation
- County Parks & Recreation
- Convention Center
- Craigslist El Paso
- Community College: Provides information on academic programs, admissions, campus locations, and resources for students.
- El Paso International Airport
- Public Health Department: Offers information on healthcare services, immunizations, disease prevention, and community health programs.
- El Paso Times: A local newspaper that covers news, events, and community updates in El Paso.
- Chamber of Commerce: Provides business resources, networking opportunities, and information on local businesses and industries.
- County: Offers resources on government services, departments, taxes, property records, and elections.
- County Clerk: Provides access to various services such as marriage licenses, birth and death certificates, property records, and voting information.
- Community Foundation: Offers information on philanthropic initiatives, grant opportunities, and community programs aimed at improving the quality of life in El Paso.
- Independent School District Police Department: Provides information on school safety, emergency protocols, and resources related to the El Paso Independent School District Police Department.
- County Tax Assessor-Collector: Provides information on property taxes, motor vehicle registration, and other tax-related services.
- 311: A centralized platform where residents can submit service requests, report issues, and seek information on various city services.
- Sun City Driving School West
- Cherokee Driving School