Synthetic Data Generation for Supply Chain

Yunbo Long. Tags: Synthetic Data, Generative AI, Benchmarking

Synthetic data generation has become a key enabler of supply chain AI research because it addresses three persistent challenges: data scarcity, confidentiality, and the difficulty of constructing reproducible benchmarks. Real-world supply chain data is typically proprietary, fragmented across trading partners, and subject to strict regulatory and contractual constraints. This makes it hard for academic researchers to obtain data and to compare AI methods fairly.

Modern synthetic data techniques include generative adversarial networks (GANs), variational autoencoders, diffusion models, conditional generative models, and agent-based simulation. They can produce artificial datasets that approximate the statistical distributions, temporal dynamics, and network topology of real supply chains. Researchers can then train and evaluate models for demand forecasting, delay prediction, supplier link prediction, and risk analysis without directly exposing sensitive operational data. Synthetic data generation can also be combined with differential privacy (for example, by training the generator under a differential-privacy guarantee, as in PATE-GAN) and with secure multiparty computation, which makes it a natural complement to privacy-preserving learning paradigms such as federated learning.
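To make "preserving the statistical distributions" concrete, the following minimal sketch fits a multivariate Gaussian to the logarithm of a toy two-column shipment table and samples new records from the fitted model. It is a deliberately simple stand-in for the generative models listed above, not the method of any cited paper. The columns (order quantity, lead time in days), the toy "real" data, and the function names are all hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_log_gaussian(table: np.ndarray):
    """Fit a multivariate Gaussian to log(table); rows are records, columns are features."""
    log_table = np.log(table)
    mean = log_table.mean(axis=0)
    cov = np.cov(log_table, rowvar=False)
    return mean, cov


def sample_synthetic(mean, cov, n_records, rng):
    """Draw synthetic records from the fitted model and map them back from log space."""
    return np.exp(rng.multivariate_normal(mean, cov, size=n_records))


def log_corr(table: np.ndarray) -> float:
    """Correlation between the two columns, measured in log space."""
    return float(np.corrcoef(np.log(table), rowvar=False)[0, 1])


rng = np.random.default_rng(0)

# Toy stand-in for confidential data (an assumption, not real records):
# column 0 = order quantity, column 1 = lead time in days, positively correlated.
n = 2000
log_qty = rng.normal(4.0, 0.5, size=n)
log_lead = 0.5 + 0.3 * log_qty + rng.normal(0.0, 0.2, size=n)
real = np.exp(np.column_stack([log_qty, log_lead]))

mean, cov = fit_log_gaussian(real)
synthetic = sample_synthetic(mean, cov, n_records=n, rng=rng)

# A basic fidelity check: the synthetic table should reproduce the
# dependence between order quantity and lead time seen in the real table.
print(f"real log-correlation:      {log_corr(real):.3f}")
print(f"synthetic log-correlation: {log_corr(synthetic):.3f}")
```

Only the fitted mean and covariance are needed to generate new records, so the synthetic table can be shared without the original rows. This alone gives no formal privacy guarantee, and a Gaussian model cannot capture the temporal or network structure that the richer models target.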

Research from the Supply Chain AI Lab at the University of Cambridge has contributed to this field through the release of open synthetic supply chain datasets and benchmarks—including networks for link prediction, shipment records for delay forecasting, and simulated procurement environments for agent-based research. These efforts are part of a broader community movement towards open, reproducible, and privacy-respecting supply chain AI.

We invite you to explore the curated collection of key publications below, offering insights into the methods and applications of synthetic data generation for supply chains.

List of Publications

  1. Xu, L., Proselkov, Y., Brintrup, A. and Long, Y., 2024. Synthetic supply chain datasets for benchmarking AI methods. IFAC-PapersOnLine, 58(19), pp.807-812. [PDF]
  2. Xu, L., Skoularidou, M., Cuesta-Infante, A. and Veeramachaneni, K., 2019. Modeling tabular data using conditional GAN. Advances in Neural Information Processing Systems, 32. [PDF]
  3. Jordon, J., Yoon, J. and van der Schaar, M., 2019. PATE-GAN: Generating synthetic data with differential privacy guarantees. International Conference on Learning Representations (ICLR). [PDF]
  4. Assefa, S.A., Dervovic, D., Mahfouz, M., Tillman, R.E., Reddy, P. and Veloso, M., 2020. Generating synthetic data in finance: Opportunities, challenges and pitfalls. Proceedings of the First ACM International Conference on AI in Finance, pp.1-8. [PDF]
  5. Lu, Y., Shen, M., Wang, H. and Wang, X., 2023. Machine learning for synthetic data generation: A review. arXiv preprint arXiv:2302.04062. [PDF]
  6. Figueira, A. and Vaz, B., 2022. Survey on synthetic data generation, evaluation methods and GANs. Mathematics, 10(15), p.2733. [PDF]
  7. Hernandez, M., Epelde, G., Alberdi, A., Cilla, R. and Rankin, D., 2022. Synthetic data generation for tabular health records: A systematic review. Neurocomputing, 493, pp.28-45. [PDF]