Event Detail

Doctoral Dissertation Oral Proposal, Graham Roberts

Tuesday, April 29, 2025 10:00–11:00 AM
  • Description
    Abstract: Machine learning (ML) has shown great promise in a variety of applications. One space where ML applications are particularly desired is the analysis of scientific data. Many of these applications differ from the more widely discussed uses of ML in that data are harder to acquire. A wealth of written works, photographs, and data on topics such as browsing habits exists and has produced many of the most powerful ML models ever created. In many scientific applications, by contrast, data collected from a particular technique are scarce, and interpreting a single datum may require years of expertise and hours of work. ML solutions are therefore highly desirable but difficult to achieve. Here I present several works that create domain-specific ML applications under these limitations.

    The first application is a hierarchical classifier for small-angle scattering curves that identifies the structures of nanoparticles. The hierarchical model creates an interpretable series of decision boundaries that align with physical differences. Building it involves both selecting the optimal tree structure and tuning each decision along the hierarchy; each decision is independently trained as a standalone binary classifier. We propose and justify a rebalancing of k-fold cross-validation that, by using more data for validation than for training at each fold, tunes models that generalize better and minimizes the risk of overfitting during model selection.

    A second project is symbolic regression for implicit equations. In many scientific applications, it can be more informative to learn an actual description of the relationship between two quantities than a tool for mapping input to output. Symbolic regression is an ML technique that learns an actual symbolic expression of relationships between data.
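    The rebalanced cross-validation idea can be pictured with a minimal sketch. This is an illustration, not the proposal's actual implementation: it assumes a simple "inverted" k-fold scheme in which each fold trains on one k-th of the data and validates on the remaining k-1 folds, the reverse of ordinary k-fold splitting.

```python
import numpy as np

def inverted_kfold_splits(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs where each fold TRAINS on one
    k-th of the data and VALIDATES on the remaining k-1 folds, so more
    data is used for validation than for training at every fold."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        train_idx = folds[i]  # the small training share
        val_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Each of the 5 folds trains on 20 samples and validates on 80.
splits = list(inverted_kfold_splits(100, k=5))
```

    Scoring candidate binary classifiers on the larger validation share makes the selection estimate less noisy, which is the stated motivation for the rebalancing.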
    Implicit expressions, however, pose a uniquely challenging problem, because each datum carries only a single label and the relationships between variables are often non-invertible. We create a probabilistic featurization of implicit equations using importance sampling. This allows many tools for symbolic regression to generalize to implicit equations and to the discovery of conserved quantities in systems of differential equations. Symbolic regression is, in essence, a special case of model selection. We use additional tools, including weighted bootstrap sampling and complexity-focused checkpoints, to find the symbolic expression with the fewest operators while carefully avoiding false answers such as trivial models. Through careful model selection and careful construction of features and models, we can create ML tools using small amounts of data.
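    The trivial-model pitfall can be made concrete with a hypothetical scoring sketch (not the proposal's actual featurization): contrast a candidate implicit expression's residuals on the observed data against its residuals on background points sampled over the data's bounding box. A genuine implicit relation is near zero on the data but not elsewhere; the trivial model f = 0 vanishes everywhere and so scores poorly.

```python
import numpy as np

def implicit_score(f, data, n_background=2000, seed=0):
    """Score a candidate implicit expression f(x, y) ~ 0.  Higher is
    better: the candidate should be near zero on the observed data but
    clearly nonzero on random background points, which penalizes the
    trivial model f = 0 that vanishes everywhere."""
    rng = np.random.default_rng(seed)
    x, y = data[:, 0], data[:, 1]
    lo, hi = data.min(axis=0), data.max(axis=0)
    bg = rng.uniform(lo, hi, size=(n_background, 2))
    on_data = np.abs(f(x, y)).mean()          # should be ~0
    on_bg = np.abs(f(bg[:, 0], bg[:, 1])).mean()  # should be large
    return on_bg / (on_data + 1e-12)

# Points on the unit circle satisfy x^2 + y^2 - 1 = 0.
theta = np.linspace(0, 2 * np.pi, 200)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
good = implicit_score(lambda x, y: x**2 + y**2 - 1, circle)
trivial = implicit_score(lambda x, y: 0 * x, circle)  # scores 0
```

    The actual work replaces this ad-hoc ratio with a probabilistic featurization built on importance sampling, but the contrast between on-data and off-data behavior is the same basic idea needed to rule out trivial answers.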
  • Website
    https://events.uconn.edu/event/1066555-doctoral-dissertation-oral-proposal-graham-roberts
  • Categories
    Conferences & Speakers
