Seed grants are awarded with funding provided by the Binghamton University Road Map through the Provost's Office and the Division of Research.
The goal of these seed grants is to encourage faculty to develop collaborative projects that advance new ideas and build Binghamton University's expertise toward a national reputation in the broad area of data science. This competitive, peer-reviewed program provides initial support for proposed long-term programs of collaborative research that have strong potential to attract external funding.
The call for proposals for seed grant funding for the 2021–2022 academic year, including an overview, an explanation of the process and eligibility, a proposal cover page and a proposal budget page, is available on this website. The deadline to apply for a Data Science TAE seed grant is March 1, 2021.
For the 2020–2021 academic year, the following seed grants were awarded:
Physics-Guided Machine Learning for Quantum Mechanical Problems
Lead scientists: Wei-Cheng Lee, physics, and Kenneth Chiu, computer science
The objective of this project is to develop a fundamentally new paradigm of machine learning, namely physics-guided machine learning (PGML), and to employ this paradigm to assist theoretical physics research in which data are limited because of high demands on computational resources. In recent years, machine learning methods have been applied with great success to a number of important commercial problems, such as image and voice recognition, but the impact of machine learning on scientific discovery remains limited. The key reason is that traditional ML methods rely solely on data and ignore existing scientific knowledge, which makes them susceptible to overfitting and may slow down learning. Overfitting means learning spurious relationships that do not generalize well beyond the training data, and the problem worsens when data are scarce. To overcome these challenges, PGML is designed to incorporate guidance based on physical constraints from scientific knowledge into the learning process, enabling novel ML methods that give more accurate predictions and generalize better even when data are scarce. The PGML framework developed in this project can be applied to other research areas in which the physical constraint can be written as a well-defined equation.
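As a purely illustrative sketch of the general idea, and not the project's PGML method, a physical constraint can be folded into a training objective as a penalty term. The toy example below fits a line to two noisy samples while softly enforcing a hypothetical constraint that the prediction vanishes at x = 0. The data, the constraint, and the names `fit_line` and `lam` are all assumptions made for illustration.

```python
def fit_line(xs, ys, lam=0.0):
    """Fit y = a*x + b by least squares with a soft constraint penalty.

    Minimizes sum((a*x + b - y)**2) + lam * b**2. The penalty encodes the
    assumed constraint f(0) = 0, a hypothetical stand-in for real physics.
    """
    n = len(xs)
    sxx = sum(x * x for x in xs)
    sx = sum(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sy = sum(ys)
    # Normal equations of the penalized objective:
    #   sxx * a + sx * b        = sxy
    #   sx  * a + (n + lam) * b = sy
    det = sxx * (n + lam) - sx * sx
    a = (sxy * (n + lam) - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return a, b


# Two noisy samples of an assumed true law y = 2x (illustrative data only).
xs = [1.0, 2.0]
ys = [2.3, 3.8]

a_plain, b_plain = fit_line(xs, ys, lam=0.0)      # data only
a_guided, b_guided = fit_line(xs, ys, lam=1000.0)  # data plus constraint
```

With so little data, the unpenalized fit passes exactly through the noisy points, while the penalized fit is pulled toward a line through the origin and lands closer to the assumed true slope of 2.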
Data Science Modeling for Shape and Genome Data
Lead scientists: Guifang Fu, mathematical sciences, and Lijun Yin, computer science
Shape variation reflects both a response to, and a source of, natural selection. As shape has been reported to have high heritability, unraveling the genetic mysteries of shape has attracted a lot of attention in various disciplines. The newest genotyping technologies have dramatically changed the landscape of contemporary data science research. With the number of single nucleotide polymorphisms (SNPs) increasing from thousands to millions, Shape-GWAS represents one of the newest but largely unexplored research directions because of a series of “double big data” challenges that have not yet been fully addressed in the existing literature. Prevailing approaches either describe shape by a limited number of landmark points or model each SNP individually, in isolation from the others, which greatly limits their potential for new findings. The overarching objective of the proposed research is to develop novel statistical models that will provide ground-breaking methodological support for Shape-GWAS data analyses. The novelty includes new methodologies, new applications, newly collected human face datasets, and a new collaboration team. Successful accomplishment of this research will bridge the gap between theory and application and ensure that data analysis strategies keep pace with the high-end technologies that generate the datasets, while boosting the progress of multidisciplinary collaborations.
Understanding and Predicting Chronic Absenteeism from School in Autism Spectrum Disorder: A Data-Driven Approach
Lead scientists: Daehan Won, systems science and industrial engineering, and Jennifer Gillis, psychology
Absenteeism from school is a serious public health issue for educators, mental health professionals, and families. Chronic absenteeism (CA), defined as missing 10% or more of school days for any reason, makes it hard for a student to keep pace with school. Although CA at the middle and high school levels receives much attention, recent studies indicate that it is even more severe in preschool and elementary school. Children with developmental disabilities are also far more likely to miss many school days, with a chronic absenteeism rate more than twice that of typically developing (TD) children. However, school absenteeism among children with autism spectrum disorder (ASD) has received less attention, and the small number of existing studies examine only older children with ASD and higher cognitive abilities. The proposed project will be the first study to analyze school absenteeism in children with lower cognitive abilities while assessing their time-dependent behaviors and school performance, an area that remains unexplored. Through data-driven approaches, including artificial intelligence, this work aims to advance knowledge of how students with disabilities attend school and how attendance relates to in-school activities, and to introduce predictive modeling that supports early intervention to improve attendance and educational outcomes.
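The 10% definition of chronic absenteeism given above can be written as a simple check. This is a minimal sketch: the function names and the 180-day school year in the usage lines are assumptions for illustration, not details of the project's data.

```python
def absence_rate(days_absent, days_enrolled):
    """Fraction of enrolled school days that the student missed."""
    if days_enrolled <= 0:
        raise ValueError("days_enrolled must be positive")
    return days_absent / days_enrolled


def is_chronically_absent(days_absent, days_enrolled, threshold=0.10):
    """True when the student missed `threshold` (10% by default) or more of school days."""
    return absence_rate(days_absent, days_enrolled) >= threshold


# Usage with an assumed 180-day school year.
just_under = is_chronically_absent(17, 180)
at_threshold = is_chronically_absent(18, 180)
```

Under that assumed 180-day year, 18 missed days is the point at which a student meets the definition.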
Learning-aided Distributed Anomaly Detection in Internet-of-Things
Lead scientists: Jian Li, electrical and computer engineering, and Ping Yang, computer science
The Internet of Things (IoT) refers to the networked interconnection of everyday objects such as physical devices, sensors, and home appliances, which are often equipped with ubiquitous intelligence. The advent of smart environments and the massive number of devices connected through IoT have led to an unprecedented volume of generated data. The environment evolves rapidly, and the capability of IoT systems may degrade as a result. Therefore, an IoT system must be able to quickly identify environmental changes and actively adapt to avoid any disruption in its function and performance, so that it is resilient to adversarial perturbations and robust to uncertainty. Detecting an abrupt change in data collected from nodes in IoT systems is a fundamental problem in various applications, such as fraud detection and environmental monitoring. Sensor data in IoT systems are observed sequentially over time, and an anomaly may occur at any time. State-of-the-art systems use a centralized approach in which a fusion center gathers information from all nodes for decision making, which is inefficient and hard to implement. In addition, the fusion center is a single point of failure for the entire system, which makes the centralized approach even less feasible. This proposal aims to address these issues by developing real-time, learning-aided distributed anomaly detection algorithms that are mathematically well-founded and robust in an adversarial environment.
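To make the idea of sequential change detection concrete, the sketch below shows a classical one-sided CUSUM detector running on a single sensor stream. It is a textbook baseline chosen for illustration, not the distributed, learning-aided algorithms this proposal aims to develop. The function name, the parameter values, and the synthetic data are assumptions.

```python
def cusum_detect(samples, baseline_mean, drift, threshold):
    """Return the index at which an upward mean shift is first flagged, or None.

    Samples are processed one at a time, as they would arrive from a sensor.
    """
    score = 0.0
    for i, x in enumerate(samples):
        # Accumulate evidence that x exceeds the baseline by more than `drift`;
        # the score is clipped at zero so normal readings do not build up credit.
        score = max(0.0, score + (x - baseline_mean - drift))
        if score > threshold:
            return i
    return None


# Synthetic stream: the mean jumps from 0.0 to 1.0 at index 50.
stream = [0.0] * 50 + [1.0] * 50
alarm_index = cusum_detect(stream, baseline_mean=0.0, drift=0.5, threshold=2.0)

# A stream with no change should raise no alarm.
quiet_index = cusum_detect([0.0] * 100, baseline_mean=0.0, drift=0.5, threshold=2.0)
```

The `threshold` parameter trades detection delay against false alarms: a higher value waits for more accumulated evidence before flagging a change.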
For the 2018–2019 academic year, the following seed grant was awarded:
Using Data Science to Decipher Processing-Structure-Property-Performance Relationships of Additively Manufactured Metals
Lead scientist: Congrui Jin, mechanical engineering
During the last decade, various additive manufacturing techniques have been developed for the processing of complex metallic components. However, our understanding of the Processing-Structure-Property-Performance (PSPP) relationships of additively manufactured metals has not kept pace with the proliferation of the systems put into service. In particular, for additively manufactured high-temperature components, accurate prediction of mechanical properties such as creep rupture and fatigue strength is a fundamentally significant issue. The overarching goal of the proposed research is to explore data science techniques to decipher PSPP relationships of additively manufactured metals, especially to predict the creep rupture and fatigue strength of additively manufactured high-temperature components based on processing parameters and material microstructures. The proposed project will be the first application of data science techniques to study additively manufactured materials. Successful accomplishment of this research will result in highly reliable causal linkages among processing parameters, material microstructures, and mechanical properties, which can be used to provide multiple optimal solutions for a specific application. This interdisciplinary effort couples the expertise of Congrui Jin and Pu Zhang in additive manufacturing with the expertise of Sanjeena Dang in data science. This work will provide the preliminary results necessary to aggressively seek external grants.
For the 2017–2018 academic year, the following seed grants were awarded:
Adaptive Network Modeling of Real-World Temporal Social Networks
Lead scientist: Hiroki Sayama, systems science and industrial engineering
The objective of this proposal is to develop algorithms and software that can overcome the challenges identified in existing temporal network analysis methods and effectively produce mechanistic, dynamical models from real-world temporal social network data. The data can involve temporally varying network size and state-topology coevolution, which are not captured by existing analytical methods.
Modeling and analysis of temporal social networks have attracted a lot of attention in various disciplines. A number of research methods have been proposed for temporal network analysis, but they are limited in capturing certain temporal dynamics, such as addition or removal of nodes, changes of node states, transitions of mesoscopic structures, and state-topology coevolution. An illustrative example is a network of customers: new customers may join, some old customers may leave, their preferences may change because of social influence, and their social ties may also change based on their preferences. These temporal social network dynamics are essential to understanding customers' behaviors, but they are not fully captured by existing methods. What is currently missing is a modeling and analysis tool for generating more detailed, more mechanistic dynamical models that can describe those nontrivial temporal social network dynamics in a uniform, tractable way.
To meet the aforementioned need, the PIs have adopted a unique, unconventional approach to model temporal network dynamics as a "computational" process, represented by repeated extraction and replacement of subgraphs. Prototype versions of algorithms and software have demonstrated promising results for small-scale, simulated network data, yet there are still algorithmic challenges: How can one handle a high volume of noise and temporal sparseness of real-world temporal social network data, and how can one automatically discover nontrivial dynamical models beyond user-provided ones and generalize them to unobserved situations? The proposed project aims to address these challenges.
Automated Generation of Urban Land Use Data by Integrating Remote Sensing and Social Sensing
Lead scientist: Chengbin Deng, geography
Land use and land cover (LULC) data provide invaluable spatially explicit and functional information about urban lands transformed by human beings. A heterogeneous urban environment contains a large number of detailed land use types, including single-family, multi-family, commercial, industrial, transportation, and civic land. Such information is helpful to city administrators, scholars and researchers, public health officials, and especially urban planners for a variety of purposes. Detailed land use data have served as an important input in socioeconomic studies and planning practices. Today, detailed urban land use information relies heavily on manual digitizing, local knowledge from field surveys, and other data sources (e.g., building permit records, appraisal materials, census information). Rapid urban expansion requires frequent updates of urban land use data, which are always time consuming and labor intensive. Public information such as tax payment or tax status is also updated and included in the latest databases. It is still very difficult, and almost impossible, to implement automated urban land use updates. Therefore, generating accurate and timely urban land use products in a more manageable time frame can provide a more intelligent approach for a variety of applied practices and urban studies.
In this proposal, we propose a new method to address the major gaps in traditional urban land use acquisition. It uses state-of-the-art statistical learning methods, including random forests, to integrate and analyze geospatial and social big data. On the one hand, remote sensing data provide information about the urban physical environment. On the other hand, social media data provide information about human activities. Our long-term goal is to automatically generate and update land use products by integrating such open geospatial datasets. This will significantly improve the efficiency of LULC mapping to support sustainable urban planning and other practices.
Development of an Intelligent Mental Disease Prediction System Prototype based on Dietary Pattern Analysis: a Pilot Study
Lead scientist: Lina Begdache, health and wellness studies
Nutrition and mental health research is an emerging interdisciplinary field. Nutrition is one of the modifiable risk factors for mental health. Traditionally, studies on the association of diet and mental distress have focused on single nutrients; however, current trends in nutritional epidemiology research are leaning toward assessing dietary patterns in relation to comorbidities. This rationale considers the complexity of nutrient interaction and the daily variation in diet. The human brain changes continuously during development and with age. Therefore, dietary changes may become necessary with age. Our lab has established a prototype that describes the relationship between a healthy diet, exercise, healthy practices and mental wellbeing. Eating healthy may promote healthy habits and mental wellbeing by elevating dopamine levels in the brain. Mental wellbeing then acts as positive reinforcement for further healthy diet, healthy practices and exercise to improve health. This loop can become a virtuous cycle that optimizes mental health. When healthy diet, exercise or healthy practices are absent, lower dopamine levels depress mood, which in turn reduces healthy diet, exercise and healthy practices, resulting in a vicious cycle that reflects the multidimensional nature of mental distress. In addition, individuals have genetic variations, so the "one size fits all" approach is losing ground, as medications often don't work effectively. Personalized therapy is at the forefront of Precision Medicine, an emerging approach to disease treatment. The significance of this research is that it will support the development of targeted nutritional interventions to improve mood, which will increase the precision of other therapies.
Multitask Transfer Learning Enhanced Rare Event Detection using Sensing Data
Lead scientist: Changqing Cheng, systems science and industrial engineering
Rare events are those that occur at low frequency but with catastrophic consequences, e.g., seismic activity, stock market flash crashes, and terrorist attacks. While most such events are not preventable, accurate and timely detection will enable prompt actions that significantly reduce the severity of the effect and the associated cost. Recently, the widespread adoption of wireless sensors and smart devices has offered an unprecedented opportunity to monitor various complex systems, from manufacturing to healthcare. Remarkably, time series sensing data contain considerable causal information about the underlying dynamics and enable us to harness fundamental patterns for diagnosis, prognosis and decision making. Thus, the objective of this study is to design an integrated platform for process monitoring, particularly rare event detection, using sensing data. Nonetheless, the inherent nonlinearity and nonstationarity of sensing data have become a persistent challenge for sensing-driven process monitoring. Therefore, we propose to design a multitask transfer learning approach that fuses information from multiple sensing sources to enhance the monitoring resolution.
In particular, abrupt changes in ultra-precision machining exemplify an immense challenge faced by modern advanced manufacturing. As shown in the Figure, the machining process experiences a scratch on the workpiece surface at time index 10,000. Offline measurement indicates that the surface roughness deteriorates from 35 nm before the scratch occurs to 82 nm afterward. Timely detection of such events from the in situ vibration signals will enable corrective actions to avoid escalating cost.
The (Data) Science of Gerrymandering
Lead scientist: Daniel B. Magleby, political science
In early October 2017, the Supreme Court of the United States heard oral arguments in the case of Gill v. Whitford. The case calls into question the constitutionality of partisan gerrymandering, the practice of drawing district boundaries in such a way that one political party receives an unfair advantage. The plaintiffs in the case argued that data scientists brought to bear a set of tools that allowed Wisconsin's legislature and governor to draw and enact maps that favored the Republican Party. By every estimation, the Republicans' strategy was extremely effective. For example, in the 2012 elections, Republican candidates received just 48% of the vote while managing to carry 60% of the districts in the state.
Just as data science can be used to build an unfair advantage, it can be used to identify and remedy these unfair redistricting practices in Wisconsin and elsewhere. With the seed grant from the Binghamton University Data Science Transdisciplinary Working Group, the team will implement an algorithm that PI Magleby developed with Daniel Mosesson on a high-performance computing cluster. In a forthcoming paper, the team shows that the algorithm produces maps without any indication of bias. Moreover, the method the team proposes is vastly more efficient than alternatives. Access to a cluster will allow the team to use the algorithm to draw hundreds of millions of hypothetical maps. That large number of counterfactual maps will allow the team to make inferences about the impact of certain redistricting criteria that mapmakers have used as a defense of partisan outcomes. In particular, it will allow the team to understand the ways that considerations like race, communities of interest, and other political jurisdictions interact with the biased, partisan outcomes that analysts have observed in recent decades.