Programming and Correctness Support for Large-Scale Data Processing
Mathematics of Data & Decisions
| Speaker: | Caleb Stanford, UC Davis (CS) |
| Related Webpage: | https://web.cs.ucdavis.edu/~cdstanford/ |
| Location: | 1025 PDSB |
| Start time: | Tue, Oct 28 2025, 3:10PM |
Modern data science and computing workloads require massively parallel computations over large distributed datasets, sometimes operating in real time, using popular systems and frameworks such as Apache Spark and MapReduce. Bugs in such workloads can be difficult to detect due to the scale of the data involved, difficult-to-reproduce distributed executions, and semantic discrepancies between the program that is written and the one that is executed in parallel on individual machines. I will give an overview of some of my PhD work and ongoing directions in providing programming language and correctness support to ensure the safe and correct execution of parallel and distributed workloads. Particular questions include: (1) What is an appropriate semantics for programs that operate over large-scale distributed data streams, one that captures parallelism and distribution requirements? (2) How can we ensure that large-scale data computations are correct by providing formal guarantees such as type safety? (3) How can we obtain formal bounds on the performance and data ingestion requirements of data processing operators? I will discuss a selection of these topics, as well as some potential areas for future work.
