Data Quality-aware Graph Machine Learning at CIKM’24

Abstract

Recent years have witnessed the revolutionary success (e.g., large foundational models such as GPT and Stable Diffusion in language and image generation) of shifting from the model-centric perspective, which focuses on developing top-performing models, to the data-centric perspective, which emphasizes the quality of the data. Concurrently, architecture design in graph machine learning (GML) has made significant progress with the emergence of Graph Neural Networks (GNNs) and, more recently, Graph Foundational Models (GFMs). Despite these advances in model-centric architecture design, especially for GNNs and GFMs, their heavy reliance on node features and graph structures makes them vulnerable to multiple data quality issues from a data-centric perspective.

Graph-structured data, like many other data modalities, suffer from conventional data-quality issues, e.g., imbalance, where uneven training data across classes leads to inconsistent model performance. Even worse, incorporating graph structures can exacerbate these issues and even introduce new ones. For example, an imbalance that initially appears at the quantity level (uneven numbers of labeled nodes per class) can also arise at the structural level (uneven connectivity around those nodes); a brief diagnostic sketch after the contribution list below illustrates this distinction. These issues not only compromise the practical applications of GML but also present diagnostic challenges due to the complex and abstract nature of graph data, obstructing the progress of data-centric AI in GML. Given the revolutionary success achieved by data-centric AI in other fields, the ubiquity of GML in numerous real-world applications, and the challenges of applying data-centric AI to GML, our tutorial aims to make a timely contribution by sensitizing the graph ML community to graph data-quality issues through the following contributions:

  • Introducing the background, basics, and more recent foundational models in Graph Machine Learning

  • Providing a comprehensive overview of how to discover, investigate, and mitigate these graph data-quality issues through both model-centric and data-centric solutions

  • Discussing data-centric research in recent Graph Foundational Models and crucial challenges for future work
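
As a concrete illustration of the quantity-level vs. structural-level imbalance mentioned above, the minimal sketch below contrasts per-class label counts with a simple structural proxy (average degree per class). It is only an illustrative diagnostic under assumed inputs (a networkx graph whose nodes carry a class attribute), not a method prescribed by the tutorial.

    import networkx as nx
    from collections import Counter

    # Toy example: Zachary's karate club, whose nodes carry a "club" label.
    G = nx.karate_club_graph()
    labels = nx.get_node_attributes(G, "club")

    # Quantity-level view: how many nodes belong to each class?
    print(Counter(labels.values()))

    # Structural-level view: average degree per class. Classes with similar
    # counts can still differ sharply in how well-connected their nodes are.
    for cls in set(labels.values()):
        nodes = [n for n, c in labels.items() if c == cls]
        avg_deg = sum(d for _, d in G.degree(nodes)) / len(nodes)
        print(cls, round(avg_deg, 2))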

Time and Location

Time: TBD
Location: Boise Centre, Boise, Idaho, USA

Tutorial Outline

  • Introduction and Background
    • Graph-structured data, Graph Machine Learning and GNNs
    • Summary of Graph Data Quality Issues
  • Topology Issues
    • Global Positional Issues
    • Local Topology Issues
    • Missing Graph Issues
    • Future Directions
  • Imbalance Issues
    • Node-level Imbalance
    • Edge-level Imbalance
    • Graph-level Imbalance
    • Future Directions
  • Bias Issues
    • Group Fairness on Graphs
    • Individual Fairness on Graphs
    • Degree Fairness on Graphs
    • Future Directions
  • Limited Labeled Data Issues
    • Graph Data Augmentations
    • Self-supervised Learning on Graphs
    • Future Directions
  • Abnormal Graph Data Issues
    • Missing and Noisy Features
    • Adversarially Attacked Data
    • Future Directions

Summary

Coming soon!

Slides

Coming soon!

Speaker Bios

Yu Wang is an incoming Assistant Professor in the Department of Computer Science at the University of Oregon. Before joining the University of Oregon, he obtained his Ph.D. degree in Computer Science at Vanderbilt University in 2024 under the supervision of Prof. Tyler Derr. His research interests include Graph Data Mining and Data-centric Machine Learning. He has received numerous honors, including Best Paper Awards at the 2020 Smoky Mountains Data Challenge (organized by ORNL) and the GLFrontiers Workshop at NeurIPS’23, as well as the Best Doctoral Forum Poster Runner-up at SDM’24. He actively contributes to the community, both publishing in and serving as a PC member/reviewer/organizer for venues such as ICLR, NeurIPS, AAAI, KDD, WWW, CIKM, WSDM, TKDD, and TIST. He has helped organize workshops at WSDM’22/’24 and serves as a Student Travel Award Chair at CIKM’24.

Kaize Ding is an Assistant Professor in the Department of Statistics and Data Science at Northwestern University. Before joining Northwestern, he obtained his Ph.D. degree in Computer Science at Arizona State University in 2023 under the supervision of Prof. Huan Liu. His research interests are generally in data mining, machine learning, and natural language processing, with a particular focus on GML, data-efficient learning, and reliable AI. His work has been published in top-tier conferences and journals (e.g., AAAI, EMNLP, IJCAI, KDD, NeurIPS, TheWebConf, and TNNLS) and has been recognized with several prestigious awards, including the AAAI New Faculty Highlights, the SDM Best Poster Award, and the Best Paper Award at the Trustworthy Learning on Graphs workshop.

Xiaorui Liu is an Assistant Professor in the Computer Science Department at North Carolina State University. He received his Ph.D. degree in Computer Science from Michigan State University in 2022. His research interests include graph machine learning, large-scale machine learning, and trustworthy artificial intelligence. He has published innovative works in top-tier conferences such as NeurIPS, ICML, ICLR, KDD, AISTATS, and SIGIR. He was awarded the ACM SIGKDD Outstanding Dissertation Award (Runner-up) in 2023, the Amazon Research Award in 2022, the Chinese Government Award for Outstanding Students Abroad in 2022, and the Best Paper Honorable Mention Award at ICHI’19. He has organized and co-presented two tutorials related to Graph Machine Learning at KDD’21/’23.

Jian Kang is an Assistant Professor in the Department of Computer Science at the University of Rochester. His research aims to develop data mining and machine learning techniques on graphs that are trustworthy and can advance scientific discovery. He was recognized as a Rising Star in Data Science by the University of Chicago, a Mavis Future Faculty Fellow by the University of Illinois Urbana-Champaign, and a top reviewer at multiple conferences (ICML’20, ICLR’21, CIKM’21, NeurIPS’22, LOG’22). He received his Ph.D. in Computer Science from the University of Illinois Urbana-Champaign. He has presented four tutorials on fair graph learning (CIKM’21, KDD’22, SDM’23, SDM’24) and organized two workshops on trustworthy graph learning (CIKM’22, WWW’24).

Ryan Rossi is a Senior Research Scientist at Adobe Research. His research focuses on machine learning and spans the theory, algorithms, and applications of large complex graph data. He has authored over 100 papers published in top-tier conferences and journals such as NeurIPS, ICML, AAAI, KDD, IJCAI, ICLR, COLT, WWW, WSDM, and JMLR. Before joining Adobe Research, he worked at many industrial, government, and academic research labs, including the Palo Alto Research Center (Xerox PARC), Lawrence Livermore National Laboratory, the Naval Research Laboratory, the NASA Jet Propulsion Laboratory/Caltech, and the University of Massachusetts Amherst. Ryan earned his Ph.D./M.S. in Computer Science at Purdue University. He was a recipient of the National Science Foundation Graduate Research Fellowship, the National Defense Science and Engineering Graduate Fellowship, the Purdue Frederick N. Andrews Fellowship, and the Bilsland Dissertation Fellowship. He has co-organized many workshops and challenges, including the Scientific Figure Captioning Challenge at IJCAI’24.

Tyler Derr is an Assistant Professor in the Department of Computer Science at Vanderbilt University, where he directs the Network and Data Science (NDS) lab. The lab conducts research in data mining and machine learning, with emphasis on (1) social network analysis and recommender systems, (2) deep learning on graphs, (3) responsible and trustworthy AI, and (4) interdisciplinary social good applications. He is actively involved in top conferences in his field, both publishing and serving as a PC/SPC member, and has received recognition such as the Best Student Poster Award at SDM’19 and Best Reviewer Awards at ICWSM’19/’21 and WSDM’22. He has contributed to the organization of numerous international conferences and workshops, including serving on the organizing committees of KDD (2021-2024), DSAA (2024), and WSDM (2022, 2024), and co-founding the Machine Learning on Graphs (MLoG) Workshop at WSDM (2022-2024) and ICDM (2022-2023). Being passionate about sharing knowledge, he has delivered tutorials on Graph Neural Networks at KDD’20, AAAI’21, and SDM’24. He serves as an Associate Editor for four journals, including Tsinghua Science and Technology and IEEE Transactions on Big Data. Tyler has received several awards, including Vanderbilt’s Fall 2020 Teaching Innovation Award from the School of Engineering and the NSF CAREER Award in 2023.