Interdisciplinary Dean's Speaker Series in Data Science

The talk titled "Selected Subgroups in Clinical Trials: Is this Data Snooping?" by Xuming He, Harry Clyde Carver Professor of Statistics and chair of the Department of Statistics at the University of Michigan, originally scheduled for Thursday, April 2, has been canceled due to the coronavirus pandemic.

Amanda Larracuente
Assistant Professor and Stephen Biggar and Elisabeth Asaro Fellow in Data Science
University of Rochester

3:30–4:30 p.m. Friday, Feb. 21, 2020

The topic of her talk is "Intragenomic Conflict in Drosophila: Satellite DNA and Drive."

Conflicts arise within genomes when genetic elements are selfish and fail to play by the rules. Meiotic drivers are selfish genetic elements found in a wide variety of taxa that cheat meiosis to bias their transmission to the next generation. One of the best-studied drive systems is an autosomal male driver found on the 2nd chromosome of Drosophila melanogaster called Segregation Distorter (SD). Males heterozygous for SD and sensitive wild-type chromosomes transmit SD to >95% of their progeny, whereas female heterozygotes transmit SD fairly, to 50% of their progeny. SD is a sperm killer that targets sperm with large blocks of tandem satellite repeats (called Responder) for destruction through a chromatin condensation defect after meiosis. The molecular mechanism of drive is unknown. We combine genomic, cytological, and molecular methods to study the population dynamics of this system and how the driver and the target satellite DNA interact. These interactions provide insight into the regulation of satellite DNAs in spermatogenesis and the mechanisms of meiotic drive.

Arthur Spirling
Department of Politics and Center for Data Science at New York University

10–11:30 a.m. Tuesday, Nov. 19, 2019

The topic of his talk is "Word Embeddings: What works, what doesn't, and how to tell the difference for applied research."

We consider the properties and performance of word embedding techniques in the context of political science research. In particular, we explore key parameter choices (including context window length, embedding vector dimensions and the use of pre-trained vs. locally fit variants) with respect to efficiency and quality of inferences possible with these models. Reassuringly, we show that results are generally robust to such choices for political corpora of various sizes and in various languages. Beyond reporting extensive technical findings, we provide a novel, crowdsourced "Turing test"-style method for examining the relative performance of any two models that produce substantive, text-based outputs. Encouragingly, we show that popular, easily available pre-trained embeddings perform at a level close to, or surpassing, both human coders and more complicated locally fit models. For completeness, we provide best practice advice for cases where local fitting is required.

Spirling is professor of politics and data science at New York University. He is the deputy director and the director of graduate studies (MSDS) at the Center for Data Science, and chair of the executive committee of the Moore-Sloan Data Science Environment. He studies British political development and legislative politics more generally. His particular interests lie in the application of text-as-data/natural language processing, Bayesian statistics, machine learning, item response theory and generalized linear models. His substantive field is comparative politics, and he focuses primarily on the United Kingdom. Spirling received his PhD from the University of Rochester, Department of Political Science, in 2008. From 2008 to 2015, he was an assistant professor and then the John L. Loeb Associate Professor of the Social Sciences in the Department of Government at Harvard University. He is the faculty coordinator for the NYU Text-as-Data speaker series.

The paper Spirling will present can be found online.

An FAQ that may be useful can also be found online.

RSVP online.

For questions, contact David Clark or Xingye Qiao.

Andrew Gordon Wilson
Assistant Professor
Courant Institute of Mathematical Sciences and Center for Data Science at New York University

3:30–4:30 p.m. Friday, Nov. 8, 2019
A reception with refreshments will be held in CW-112 at 4:30 p.m.

The topic of his talk is "How do we build models that learn and generalize?"

To answer scientific questions, and reason about data, we must build models and perform inference within those models. But how should we approach model construction and inference to make the most successful predictions? How do we represent uncertainty and prior knowledge? How flexible should our models be? Should we use a single model, or multiple different models? Should we follow a different procedure depending on how much data are available?

In this talk, he will present a philosophy for model construction, grounded in probability theory. He will exemplify this approach for scalable kernel learning and Gaussian processes, Bayesian deep learning, and understanding human learning.

Andrew Gordon Wilson is a faculty member in the Courant Institute and Center for Data Science at NYU. Before joining NYU, he was an assistant professor at Cornell University from 2016 to 2019. He was a research fellow in the Machine Learning Department at Carnegie Mellon University from 2014 to 2016, and completed his PhD at the University of Cambridge in 2014. His interests include probabilistic modelling, scientific computing, Gaussian processes, Bayesian statistics, and loss surfaces and generalization in deep learning. His webpage is andrewgw.

RSVP online.

For questions, contact Ken Kurtz or Xingye Qiao. Contact Kurtz to request a meeting with the speaker.

Inaugural Speaker

Joseph Hogan
Carole and Lawrence Sirovich Professor of Public Health
Deputy Director of the Data Science Initiative at Brown University

11 a.m. Wednesday, Oct. 9, 2019
AM-189 (Admissions Center)

Hogan will speak on "Using Electronic Health Records Data for Predictive and Causal Inference About the HIV Care Cascade."

The HIV care cascade is a conceptual model describing essential steps in the continuum of HIV care. The cascade framework has been widely applied to define population-level metrics and milestones for monitoring and assessing strategies designed to identify new HIV cases, link individuals to care, initiate antiviral treatment and ultimately suppress viral load. Comprehensive modeling of the entire cascade is challenging because data on key stages of the cascade are sparse. Many approaches rely on simulations of assumed dynamical systems, frequently using data from disparate sources as inputs. However, growing availability of large-scale longitudinal cohorts of individuals in HIV care affords an opportunity to develop and fit coherent statistical models using single sources of data, and to use these models for both predictive and causal inferences. Using data from 90,000 individuals in HIV care in Kenya, we model progression through the cascade using a multistate transition model fitted using Bayesian Additive Regression Trees (BART), which allows considerable flexibility for the predictive component of the model. We show how to use the fitted model for predictive inference about important milestones and causal inference for comparing treatment policies. Connections to agent-based mathematical modeling are made. This is joint work with Yizhen Xu, Tao Liu, Rami Kantor and Ann Mwangi.

Hogan's research concerns development and application of statistical methods for large-scale observational data with emphasis on applications in HIV/AIDS. He is program director for the Moi-Brown Partnership for Biostatistics Training, which focuses on research capacity building at Moi University in Kenya.

For questions, contact Changqing Cheng or Xingye Qiao.

The Interdisciplinary Dean's Speaker Series in Data Science is supported by the:

  • Dean's Office of Harpur College of Arts and Sciences
  • Department of Biological Sciences
  • Department of Mathematical Sciences
  • Department of Political Science
  • Department of Systems Science and Industrial Engineering
  • Data Science Transdisciplinary Area of Excellence

RSVP at this link.