Invited Speakers and Guest Lecturers 2021-2022
Haoda Fu, research fellow and an enterprise lead for machine learning, artificial intelligence
and digital connected care for Eli Lilly and Company
Noon-1 p.m. Tuesday, Sept. 21
This talk is organized by the Mathematical Sciences Department Data Science Seminar.
"Our Recent Development on Cost Constrained Machine Learning Models"
Suppose we can only pay $100 to diagnose a disease subtype for selecting the best treatments. We can either measure 10 cheap biomarkers or 2 expensive ones. How can we pick the optimal combinations to achieve the highest diagnostic accuracy? This is a nontrivial problem. In the special case where each variable costs the same, the total cost constraint reduces to an L0 penalty, which is the best subset selection problem. Until recently, there was no good solution even for this special case. Traditional algorithms can only solve up to ~35 variables for best subset selection. Thanks to algorithmic breakthroughs in the field of optimization research, we have modified and extended a recently developed algorithm to handle our cost constraint problems with thousands of variables. In this talk, we will introduce the background of this problem, methods development, and theoretical results. We will also show an impressive example of dynamic programming. It will tell a story of how algorithms can make a difference in computing. We hope that through this presentation, the audience can get a feel for modern statistics, which combines computer science, statistics, and algorithms.
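To make the budget question in the abstract concrete, here is a minimal sketch of the simplest version of the problem. It is not the speaker's algorithm. It assumes integer costs and, unrealistically, that each biomarker contributes an independent additive score, which turns the selection into a 0/1 knapsack solvable by dynamic programming. The function name, costs and scores are all hypothetical.

```python
def select_biomarkers(costs, scores, budget):
    """0/1 knapsack by dynamic programming.

    Assumes integer costs and additive scores (a simplification; real
    diagnostic accuracy is not additive across biomarkers).
    Returns (best total score, tuple of chosen indices).
    """
    # best[b] holds the best (score, chosen) using total cost <= b
    best = [(0, ())] * (budget + 1)
    for i, (cost, score) in enumerate(zip(costs, scores)):
        # Iterate budgets downward so each biomarker is used at most once
        for b in range(budget, cost - 1, -1):
            candidate = best[b - cost][0] + score
            if candidate > best[b][0]:
                best[b] = (candidate, best[b - cost][1] + (i,))
    return best[budget]


# Hypothetical data: three cheap biomarkers and two expensive ones
costs = [10, 10, 10, 50, 50]
scores = [10, 10, 10, 40, 45]
result = select_biomarkers(costs, scores, budget=100)
print(result)
```

With these made-up numbers the two expensive biomarkers win. Once scores interact, as they do in real diagnostic models, this simple recursion no longer applies, which is part of what makes the problem described in the talk hard.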
Haoda Fu is a research fellow and an enterprise lead for machine learning, artificial intelligence and digital connected care at Eli Lilly and Company. Fu is a Fellow of the American Statistical Association (ASA). He is also an adjunct professor of biostatistics at the University of North Carolina Chapel Hill and Indiana University School of Medicine. He received his PhD in statistics from the University of Wisconsin - Madison in 2007, and joined Lilly after that. Since he joined Lilly, he has been very active in statistics methodology research. He has more than 90 publications in such areas as Bayesian adaptive design, survival analysis, recurrent event modeling, personalized medicine, indirect and mixed treatment comparison, joint modeling, Bayesian decision making and rare events analysis. In recent years, his research has focused on machine learning and artificial intelligence. His research has been published in various top journals including JASA, JRSS, Biometrika, Biometrics, ACM, IEEE, JAMA and Annals of Internal Medicine. He has taught machine learning and AI topics at large industry conferences and FDA workshops. He was on the board of directors for statistics organizations and was the program chair and committee chair of ICSA, ENAR and the ASA Biopharm section.
Invited Speakers and Guest Lecturers 2020-2021
Catherine D'Ignazio, assistant professor in the Department of Urban Studies and Planning,
and director of the Data + Feminism Lab at MIT; and Lauren Klein, associate professor
in the departments of English and Quantitative Theory and Methods, and director of
the Digital Humanities Lab at Emory University
2-4 p.m. Friday, March 12 (presentation and Q&A to end at 3:30 p.m., followed by an
informal coffee hour)
As data are increasingly mobilized in the service of governments and corporations, their unequal conditions of production, their asymmetrical methods of application, and their unequal effects on both individuals and groups have become increasingly difficult for data scientists--and others who rely on data in their work--to ignore. But it is precisely this power that makes it worth asking: "Data science by whom? Data science for whom? Data science with whose interests in mind?" These are some of the questions that emerge from what we call data feminism, a way of thinking about data science and its communication that is informed by the past several decades of intersectional feminist activism and critical thought. Illustrating data feminism in action, this talk will show how challenges to the male/female binary can help to challenge other hierarchical (and empirically wrong) classification systems; it will explain how an understanding of emotion can expand our ideas about effective data visualization; how the concept of invisible labor can expose the significant human efforts required by our automated systems; and why the data never, ever “speak for themselves.” The goal of this talk, as with the project of data feminism, is to model how scholarship can be transformed into action: how feminist thinking can be operationalized in order to imagine more ethical and equitable data practices.
Catherine D’Ignazio is a scholar, artist/designer and hacker mama who focuses on feminist technology, data literacy and civic engagement. She has run reproductive justice hackathons, designed global news recommendation systems, created talking and tweeting water quality sculptures and led walking data visualizations to envision the future of sea level rise. With Rahul Bhargava, she built the platform Databasic.io, a suite of tools and activities to introduce newcomers to data science. Her 2020 book from MIT Press, Data Feminism, co-authored with Lauren Klein, charts a course for more ethical and empowering data science practices. Her research at the intersection of technology, design & social justice has been published in the Journal of Peer Production, the Journal of Community Informatics, and the proceedings of Human Factors in Computing Systems (ACM SIGCHI). Her art and design projects have won awards from the Tanne Foundation, Turbulence.org and the Knight Foundation and exhibited at the Venice Biennial and the ICA Boston. D’Ignazio is an assistant professor of urban science and planning in the Department of Urban Studies and Planning at MIT. She is also Director of the Data + Feminism Lab which uses data and computational methods to work towards gender and racial equity, particularly in relation to space and place.
Lauren Klein is an associate professor in the departments of English and Quantitative Theory & Methods at Emory University, where she also directs the Digital Humanities Lab. She works at the intersection of data science, digital humanities, and early American literature, with a research focus on issues of race and gender. She has designed platforms for exploring the contents of historical newspapers, recreated forgotten visualization schemes with fabric and addressable LEDs, and, with her students, cooked meals from early American recipes — and then visualized the results. In 2017, she was named one of the “rising stars in digital humanities” by Inside Higher Ed. She is the author of An Archive of Taste: Race and Eating in the Early United States (University of Minnesota Press, 2020) and, with Catherine D’Ignazio, Data Feminism (MIT Press, 2020). With Matthew K. Gold, she edits Debates in the Digital Humanities, a hybrid print-digital publication stream that explores debates in the field as they emerge. Her current project, "Data by Design: An Interactive History of Data Visualization, 1786-1900," was recently funded by an NEH-Mellon Fellowship for Digital Publication.
Toby Burrows, senior research fellow at the Oxford e-Research Centre, University of Oxford, and the School of Humanities, University of Western Australia
This talk is in cooperation with the Center for Medieval and Renaissance Studies (CEMERS) at Binghamton University.
3 p.m. Wednesday, March 10
Zoom meeting ID is 970 8903 8153
"Mapping Manuscript Migrations: Tracking the Travels of 220,000 Medieval and Renaissance Manuscripts"
Hundreds of thousands of medieval and Renaissance manuscripts still survive today, and detailed information about their history and provenance is scattered across a large number of databases and Web sites. Combining this kind of data was the focus of the Mapping Manuscript Migrations (MMM) project, which was funded by the Digging into Data program of the Trans-Atlantic Partnership between 2017 and 2020, and brought together manuscript researchers, curators, librarians, and computing specialists from institutions in Oxford, Philadelphia, Paris, and Helsinki. The project combined three large collections of data relating to the history and provenance of more than 220,000 medieval and Renaissance manuscripts.
This talk will discuss the work done by the MMM project, especially its deployment of Semantic Web and Linked Open Data technologies in order to transform, aggregate, and harmonize such a large body of data. It will also examine the various ways in which the data have been published: through a public Web portal, as a searchable Linked Open Data store, and as a downloadable dataset. It will demonstrate some of the ways in which the data can be used to answer research questions, including creating visualizations through the Web portal, and running SPARQL queries against the data store. We will also look at the way in which the project was organized, and how the contributions of specialists from such diverse fields were brought together. We will finish with some thoughts about how the lessons learned from the MMM project can be applied and developed in the future.
Hengchen Dai, assistant professor of management and organizations and behavioral decision
making at the UCLA Anderson School of Management
2-3 p.m. Friday, Feb. 26, 2021
Join the Zoom meeting at https://binghamton.zoom.us/j/92622837791?pwd=RWFReXU4bDVUaFFGWmlDWDU4NXd2UT09
Meeting ID: 926 2283 7791
Passcode: 5-digit zip code for Binghamton University
"The Value of Customer-Related Information on Service Platforms: Evidence From a Large Field Experiment"
As digitization enables service platforms to access users' information, important questions arise about how digital service platforms should disseminate information to improve service capacity and enjoyment. We examine a strategy that involves providing customer-related information to individual service providers at the beginning of a service encounter. We causally evaluate this strategy via a field experiment on a large live-streaming platform that connects viewers and individual broadcasters. When viewers entered shows, we provided viewer-related information to broadcasters who were randomly assigned to the treatment condition (but not to control broadcasters). Our analysis, involving a subsample of 49,998 broadcasters, demonstrates that relative to control broadcasters, treatment broadcasters expanded service capacity by 12.62% by increasing both show frequency (3.31%) and show length (7.10%), thus earning 10.44% more based on our conservative estimate. Moreover, our intervention increased service enjoyment (measured by viewer watch time) by 4.51%. Two surveys and additional analyses provide evidence for two mechanisms and rule out several alternative explanations. Our low-cost, information-based intervention has important implications for digital service platforms that have little control over service providers’ work schedules and service quality.
Hengchen Dai is an assistant professor of management and organizations as well as a faculty member in the behavioral decision making area at Anderson School of Management at UCLA. She received her bachelor's degree from Peking University and her PhD from the University of Pennsylvania.
Her research primarily applies insights from behavioral economics and psychology to motivate people to behave in line with their long-term best interests and pursue their personal and professional goals. Her research also examines how different social forces, incentives, and technology affect users’ judgments and behaviors on online platforms.
She has published in leading academic journals such as Academy of Management Journal, Management Science, The Journal of Applied Psychology, Journal of Consumer Research, Journal of Marketing Research, and Psychological Science. Her research has been covered in major media outlets such as The Financial Times, The Wall Street Journal, Harvard Business Review, The New York Times, The Huffington Post, and The New Yorker.
Dennis Zhang, Associate Professor of Operations and Manufacturing Management at the
Washington University in St. Louis Olin Business School
2-3 p.m. Friday, Oct. 30, 2020
"Customer Choice Models versus Machine-Learning: Finding Optimal Product Displays on Alibaba"
We compare the performance of two approaches for finding the optimal set of products to display to customers landing on Alibaba's two online marketplaces, Tmall and Taobao. Both approaches were placed online simultaneously and tested on real customers for one week. The first approach we test is Alibaba's current practice. This procedure embeds thousands of product and customer features within a sophisticated machine learning algorithm that is used to estimate the purchase probabilities of each product for the customer at hand. The products with the largest expected revenue (revenue * predicted purchase probability) are then made available for purchase. The downside of this approach is that it does not incorporate customer substitution patterns; the estimates of the purchase probabilities are independent of the set of products that eventually are displayed. Our second approach uses a featurized multinomial logit (MNL) model to predict purchase probabilities for each arriving customer. In this way we use less sophisticated machinery to estimate purchase probabilities, but we employ a model that was built to capture customer purchasing behavior and, more specifically, substitution patterns. We use historical sales data to fit the MNL model and then, for each arriving customer, we solve the cardinality-constrained assortment optimization problem under the MNL model online to find the optimal set of products to display. Our experiments show that despite the lower prediction power of our MNL-based approach, it generates significantly higher revenue per visit compared to the current machine learning algorithm with the same set of features. We also conduct various heterogeneous-treatment-effect analyses to demonstrate that the current MNL approach performs best for sellers whose customers generally only make a single purchase.
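The contrast between the two approaches can be illustrated with a small sketch. It is not the authors' implementation. It assumes the standard MNL form in which product i in displayed set S is bought with probability v_i / (1 + sum of v_j over S), with the no-purchase weight normalized to 1. The preference weights, revenues and function names are hypothetical, and a brute-force search stands in for a proper cardinality-constrained solver.

```python
from itertools import combinations


def mnl_revenue(assortment, revenue, weight):
    """Expected revenue per visit under MNL: sum(r_i * v_i) / (1 + sum(v_i))."""
    denom = 1.0 + sum(weight[i] for i in assortment)
    return sum(revenue[i] * weight[i] for i in assortment) / denom


def best_assortment(revenue, weight, k):
    """Brute-force search over all non-empty assortments of at most k products."""
    items = range(len(revenue))
    candidates = (s for size in range(1, k + 1) for s in combinations(items, size))
    return max(candidates, key=lambda s: mnl_revenue(s, revenue, weight))


def top_k_by_expected_revenue(revenue, weight, k):
    """Rank products by r_i * p_i, where p_i = v_i / (1 + v_i) ignores the displayed set."""
    def score(i):
        return revenue[i] * weight[i] / (1.0 + weight[i])
    ranked = sorted(range(len(revenue)), key=score, reverse=True)
    return tuple(sorted(ranked[:k]))


# Hypothetical products: product 2 is cheap but very attractive
revenue = [10.0, 6.0, 4.0]
weight = [1.0, 1.0, 4.0]

independent = top_k_by_expected_revenue(revenue, weight, k=2)
mnl_choice = best_assortment(revenue, weight, k=2)
print(independent, mnl_revenue(independent, revenue, weight))
print(mnl_choice, mnl_revenue(mnl_choice, revenue, weight))
```

In this toy case the independent ranking displays the cheap, attractive product, which draws purchases away from the high-revenue one. The MNL search leaves it out and earns more per visit, which is the substitution effect the abstract refers to.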
Dennis Zhang is a tenured associate professor of operations and manufacturing management at the Olin Business School. His research focuses on data-driven operations in the digital economy and on platforms. He implements field experiments and uses observational data to improve operations.
Join the Zoom meeting at https://binghamton.zoom.us/j/92622837791?pwd=RWFReXU4bDVUaFFGWmlDWDU4NXd2UT09
Meeting ID: 926 2283 7791
Passcode: 5-digit zip code for Binghamton University
Interdisciplinary Dean's Speaker Series in Data Science, 2019-2020
The Interdisciplinary Dean's Speaker Series in Data Science was in place for the 2019-2020 academic year and brought in the following speakers:
Amanda Larracuente, Assistant Professor and Stephen Biggar and Elisabeth Asaro Fellow
in Data Science at the University of Rochester
Feb. 21, 2020
"Intragenomic Conflict in Drosophila: Satellite DNA and Drive"
Conflicts arise within genomes when genetic elements are selfish and fail to play by the rules. Meiotic drivers are selfish genetic elements found in a wide variety of taxa that cheat meiosis to bias their transmission to the next generation. One of the best-studied drive systems is an autosomal male driver found on the 2nd chromosome of Drosophila melanogaster called Segregation Distorter (SD). Males heterozygous for SD and sensitive wild type chromosomes transmit SD to >95% of their progeny, whereas female heterozygotes transmit SD fairly, to 50% of their progeny. SD is a sperm killer that targets sperm with large blocks of tandem satellite repeats (called Responder) for destruction through a chromatin condensation defect after meiosis. The molecular mechanism of drive is unknown. We combine genomic, cytological, and molecular methods to study the population dynamics of this system and how the driver and the target satellite DNA interact. These interactions provide insight into the regulation of satellite DNAs in spermatogenesis and the mechanisms of meiotic drive.
Arthur Spirling, Professor in the Department of Politics and Center for Data Science
at New York University
Nov. 19, 2019
"Word Embeddings: What works, what doesn't, and how to tell the difference for applied research"
We consider the properties and performance of word embeddings techniques in the context of political science research. In particular, we explore key parameter choices — including context window length, embedding vector dimensions and the use of pre-trained vs locally fit variants — with respect to efficiency and quality of inferences possible with these models. Reassuringly, we show that results are generally robust to such choices for political corpora of various sizes and in various languages. Beyond reporting extensive technical findings, we provide a novel, crowdsourced “Turing test”-style method for examining the relative performance of any two models that produce substantive, text-based outputs. Encouragingly, we show that popular, easily available pre-trained embeddings perform at a level close to — or surpassing — both human coders and more complicated locally-fit models. For completeness, we provide best practice advice for cases where local fitting is required.
Spirling is professor of politics and data science at New York University. He is the deputy director and the director of graduate studies (MSDS) at the Center for Data Science, and chair of the executive committee of the Moore-Sloan Data Science Environment. He studies British political development and legislative politics more generally. His particular interests lie in the application of text-as-data/natural language processing, Bayesian statistics, machine learning, item response theory and generalized linear models. His substantive field is comparative politics, and he focuses primarily on the United Kingdom. Spirling received his PhD from the University of Rochester, Department of Political Science, in 2008. From 2008 to 2015, he was an assistant professor and then the John L. Loeb Associate Professor of the Social Sciences in the Department of Government at Harvard University. He is the faculty coordinator for the NYU Text-as-Data speaker series.
Andrew Gordon Wilson, Assistant Professor at the Courant Institute of Mathematical
Sciences and Center for Data Science at New York University
Nov. 8, 2019
"How do we build models that learn and generalize?"
To answer scientific questions, and reason about data, we must build models and perform inference within those models. But how should we approach model construction and inference to make the most successful predictions? How do we represent uncertainty and prior knowledge? How flexible should our models be? Should we use a single model, or multiple different models? Should we follow a different procedure depending on how much data are available?
In this talk, he will present a philosophy for model construction, grounded in probability theory. He will exemplify this approach for scalable kernel learning and Gaussian processes, Bayesian deep learning, and understanding human learning.
Andrew Gordon Wilson is faculty in the Courant Institute and Center for Data Science at NYU. Before joining NYU, he was an assistant professor at Cornell University from 2016-2019. He was a research fellow in the Machine Learning Department at Carnegie Mellon University from 2014-2016, and completed his PhD at the University of Cambridge in 2014. His interests include probabilistic modelling, scientific computing, Gaussian processes, Bayesian statistics, and loss surfaces and generalization in deep learning. His webpage is https://cims.nyu.edu/~andrewgw.
Joseph Hogan, Carole and Lawrence Sirovich Professor of Public Health and Deputy Director
of the Data Science Initiative at Brown University
Oct. 9, 2019
"Using Electronic Health Records Data for Predictive and Causal Inference About the HIV Care Cascade"
The HIV care cascade is a conceptual model describing essential steps in the continuum of HIV care. The cascade framework has been widely applied to define population-level metrics and milestones for monitoring and assessing strategies designed to identify new HIV cases, link individuals to care, initiate antiviral treatment and ultimately suppress viral load. Comprehensive modeling of the entire cascade is challenging because data on key stages of the cascade are sparse. Many approaches rely on simulations of assumed dynamical systems, frequently using data from disparate sources as inputs. However, growing availability of large-scale longitudinal cohorts of individuals in HIV care affords an opportunity to develop and fit coherent statistical models using single sources of data, and to use these models for both predictive and causal inferences. Using data from 90,000 individuals in HIV care in Kenya, we model progression through the cascade using a multistate transition model fitted using Bayesian Additive Regression Trees (BART), which allows considerable flexibility for the predictive component of the model. We show how to use the fitted model for predictive inference about important milestones and causal inference for comparing treatment policies. Connections to agent-based mathematical modeling are made. This is joint work with Yizhen Xu, Tao Liu, Rami Kantor and Ann Mwangi.
Hogan's research concerns development and application of statistical methods for large-scale observational data with emphasis on applications in HIV/AIDS. He is program director for the Moi-Brown Partnership for Biostatistics Training, which focuses on research capacity building at Moi University in Kenya.
Invited speakers from prior years
Ivo D. Dinov
Associate Director, Michigan Institute for Data Science
Director, Statistics Online Computational Resource
Professor, Computational Medicine and Bioinformatics, Human Behavior and Biological
Sciences of the University of Michigan
Ivo Dinov is an expert in mathematical modeling, statistical analysis, computational processing and visualization of Big Data. He is involved in longitudinal morphometric studies of human development (e.g., Autism, Schizophrenia), maturation (e.g., depression, pain) and aging (e.g., Alzheimer's and Parkinson's diseases). Dinov is developing, validating and disseminating novel technology-enhanced pedagogical approaches for scientific education and active learning.
Dinov will give two talks.
Michigan Institute of Data Science – Organization, Education Challenges and Research
April 24, 2018
I will present the Michigan Institute of Data Science (MIDAS), a trans-collegiate Institute at the University of Michigan. I will start by describing the multidisciplinary activities in data science at the University of Michigan. Then I will cover some of the scientific pursuits (development of concepts, methods, and technology) for data collection, management, analysis, and interpretation as well as their innovative use to address important problems in science, engineering, business, and other areas. We will end with an open-ended discussion of educational challenges, research opportunities and infrastructure demands in data science.
Compressive Big Data Analytics
April 24, 2018
I will start by showing examples of specific Big Data driving biomedical and health challenges. These will help us identify the common characteristics of Big Biomedical Data. We will also provide working definitions for "Data Science" and "Predictive Analytics". The core of the talk will be the mathematical foundation for analytically representing multisource, complex, incongruent, and multi-scale information as computable data objects. Specifically, I will describe the Compressive Big Data Analytics (CBDA) technique. Several applications to neurodegenerative disorders will be presented as case studies.
J. S. Marron
Amos Hawley Distinguished Professor of Statistics and Operations Research and Professor
University of North Carolina at Chapel Hill
J.S. Marron is widely recognized as a world research leader in the statistical disciplines of high-dimensional, functional and object-oriented data analysis, as well as data visualization. He has made broad major contributions ranging from the invention of innovative new statistical methods, through software development and on to statistical and mathematical theory. His research continues with a number of ongoing deep, interdisciplinary research collaborations with colleagues in computer science, genetics, medicine, mathematics and biology. A special strength is his strong record of mentoring graduate students, postdocs and junior faculty, in both statistics and related disciplinary fields.
Data Integration by JIVE: Joint and Individual Variation Explained
March 15, 2018
Abstract: A major challenge in the age of Big Data is the integration of disparate data types into a data analysis. That is tackled here in the context of data blocks measured on a common set of experimental subjects. This data structure motivates the simultaneous exploration of the joint and individual variation within each data block. This is done here in a way that scales well to large data sets (with blocks of wildly disparate size), using principal angle analysis, careful formulation of the underlying linear algebra, and differing outputs depending on the analytical goals. Ideas are illustrated using mortality, cancer and neuroimaging data sets.
OODA of Tree Structured Data Objects Using Persistent Homology
March 15, 2018
The field of Object Oriented Data Analysis has made a lot of progress on the statistical analysis of the variation in populations of complex objects. A particularly challenging example of this type is populations of tree-structured objects. Deep challenges arise, whose solutions involve a marriage of ideas from statistics, geometry, and numerical analysis, because the space of trees is strongly non-Euclidean in nature. Here these challenges are addressed using the approach of persistent homology from topological data analysis. The benefits of this data object representation are illustrated using a real data set, where each data point is the tree of blood arteries in one person's brain. Persistent homology gives much better results than those obtained in previous studies.
Object Oriented Data Analysis
March 16, 2018
Object Oriented Data Analysis is the statistical analysis of populations of complex objects. In the special case of Functional Data Analysis, these data objects are curves, where standard Euclidean approaches, such as principal components analysis, have been very successful. Challenges in modern medical image analysis motivate the statistical analysis of populations of more complex data objects which are elements of mildly non-Euclidean spaces, such as Lie Groups and Symmetric Spaces, or of strongly non-Euclidean spaces, such as spaces of tree-structured data objects. These new contexts for Object Oriented Data Analysis create several potentially large new interfaces between mathematics and statistics. The notion of Object Oriented Data Analysis also impacts data analysis, through providing a language for discussion of the many choices needed in many modern complex data analyses.
Henry Kautz
Robin & Tim Wentworth Director of the Goergen Institute for Data Science and Professor
Department of Computer Science, University of Rochester
Henry Kautz has served as department head at AT&T Bell Labs in Murray Hill, N.J., and as a full professor at the University of Washington, Seattle. In 2010 he was elected president of the Association for the Advancement of Artificial Intelligence (AAAI), and in 2016 was elected chair of the AAAS Section on Information, Computing and Communication. His research in artificial intelligence, pervasive computing and healthcare applications has led him to be honored as a Fellow of the American Association for the Advancement of Science, Fellow of the Association for …
Kautz will visit Binghamton University on Nov. 2-3, 2017, and give two talks. The first is a technical talk and the second is an overview presentation aimed at a general audience.
Mining Social Media to Improve Public Health
Abstract: People posting to social media on smartphones can be viewed as an organic sensor network for public health data, picking up information about the spread of disease, lifestyle factors that influence health, and pinpointing sources of disease. We show how a faint but actionable signal can be detected in vast amounts of social media data using statistical natural language and social network models. We present case studies of predicting influenza transmission and per-city rates, discovering patterns of alcohol consumption in different neighborhoods, and tracking down the sources of foodborne illness.
Data Science: Foundation for the Future of Science, Healthcare, Business, and Education
Abstract: Data science is the synthesis of computer science and statistics that is driving fundamental changes in essentially all aspects of society. While the applications of data science are incredibly broad, the discipline has a surprisingly small and coherent intellectual core, based on principles of statistical prediction and information management. In 2013, the University of Rochester adopted data science as the unifying theme for its five-year strategic plan, and created the Goergen Institute for Data Science. The Institute has created undergraduate and graduate degree programs in data science, helped hire faculty engaged in interdisciplinary research, seeded new research efforts, and grown partnerships with industry. As the University works on its 2018 five-year strategic plan, data science remains a key priority.
Joshua White, Defense Consultant
Joshua White is VP of engineering for Rsignia Inc., engaged primarily in data science-related activities as they relate to terrorism studies, social media/social sciences, high-performance computing, and high-speed network and protocol analysis research for both the defense and intelligence communities. He has been an adjunct professor at the State University of New York Polytechnic Institute Utica/Rome campus in the Network Computer Security Department, at Utica College in the Social Data Science Program, and at MVCC in the Data Analytics Micro Credential Program since 2014.
Social Networks and Big Data Analysis Techniques
Dec. 1, 2017
Abstract: "The #bluewhalechallenge is presented as a test analysis case for various social network and big data analysis techniques. In this presentation, we present the techniques used for collecting, indexing, and analyzing billions of documents in an attempt to discover who was controlling the challenge and who is participating in the challenge. Various techniques are not suitable for true large-scale analysis given the time and resources required. We identify those techniques that require no more than a reasonable amount of time and resources to compute while still resulting in reasonable results."