Zero-Trust Data Pipelines for AI Systems: A Framework for Secure, Verifiable, and Auditable Data Engineering

Authors

  • Sunil Kumar Mudusu, Lead AI Data Engineer, Church Mutual Insurance Company, S.I., USA
  • Sunil Gentyala, Lead Cybersecurity and AI Security Consultant, HCLTech (HCL America Inc.), Dallas, TX, USA

DOI:

https://doi.org/10.70589/JRTCSE.2026.14.2.2

Keywords:

Zero-trust architecture, AI pipeline security, data provenance, cryptographic integrity, anomaly detection, ECDSA-P384, Open Policy Agent, trustworthy AI

Abstract

Training data is the primary attack surface that most AI security strategies leave unguarded. Pipelines built on perimeter trust treat every connected component as reliable after initial authentication, a design that worked tolerably in closed, homogeneous environments but fails badly in the distributed, multi-vendor architectures where modern AI systems operate. An adversary with access to a pipeline ingestion layer, whether through credential theft, supply chain compromise, or insider privilege, can silently corrupt training data at a scale that no downstream evaluation benchmark is designed to detect. This paper introduces the Zero-Trust Data Pipeline (ZTDP) framework, which applies the never-trust-always-verify principle to data artifacts directly rather than to the network channels that carry them. The framework is organized across five coordinated security planes: a Data Plane that enforces cryptographic authentication on every artifact, a Control Plane that maintains continuous identity verification, a Provenance Plane that builds a tamper-evident lineage graph through ECDSA-P384-signed JWT tokens, a Policy Plane that enforces least-privilege access through co-located Open Policy Agent instances, and an Observability Plane that combines statistical distribution monitoring, Isolation Forest scoring, and Hidden Markov Model sequence analysis for real-time anomaly detection. Evaluated against a 500-million-record financial AI training workload, the full ZTDP configuration achieves 100% detection of data tampering, insider injection, and component impersonation, and 96% detection of behavioral anomalies, while adding 9.1% throughput overhead and 0.8 ms per-stage latency. The results establish that rigorous data-layer security and production-grade throughput are not in fundamental tension.
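The abstract's Provenance Plane builds a tamper-evident lineage graph from ECDSA-P384-signed tokens. The underlying tamper-evidence property can be illustrated with a plain SHA-384 hash chain, a simplified stand-in that omits signatures and JWT encoding entirely; all field names (`stage`, `artifact_sha384`, `parent`, `link`) are hypothetical, not the paper's schema:

```python
import hashlib
import json

def stage_record(parent_hash: str, stage: str, artifact: bytes) -> dict:
    """Append one pipeline stage to a tamper-evident lineage chain.

    Each record binds the artifact's SHA-384 digest to the hash of the
    previous record, so altering any upstream artifact or record
    invalidates every downstream link.
    """
    record = {
        "stage": stage,
        "artifact_sha384": hashlib.sha384(artifact).hexdigest(),
        "parent": parent_hash,
    }
    # Hash the canonical JSON form of the record body to produce its link.
    record["link"] = hashlib.sha384(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    return record

def verify_chain(chain: list) -> bool:
    """Recompute every link and parent pointer; False on any tampering."""
    prev = "genesis"
    for rec in chain:
        body = {k: v for k, v in rec.items() if k != "link"}
        if rec["parent"] != prev:
            return False
        digest = hashlib.sha384(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if digest != rec["link"]:
            return False
        prev = rec["link"]
    return True

# Build a three-stage lineage, then tamper with the middle artifact digest.
chain, prev = [], "genesis"
for stage, blob in [("ingest", b"raw"), ("clean", b"cleaned"), ("train", b"features")]:
    rec = stage_record(prev, stage, blob)
    chain.append(rec)
    prev = rec["link"]

assert verify_chain(chain)          # intact chain verifies
chain[1]["artifact_sha384"] = hashlib.sha384(b"poisoned").hexdigest()
assert not verify_chain(chain)      # any upstream edit breaks verification
```

In the full framework each link would additionally be signed with the producing component's ECDSA-P384 key, so a verifier can attribute every record to an authenticated identity rather than merely detect that the chain was altered.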

References

Barocas, S., Hardt, M., & Narayanan, A. (2023). Fairness and machine learning: Limitations and opportunities. MIT Press. https://fairmlbook.org/

Bates, A., Tian, D., Butler, K. R. B., & Moyer, T. (2015). Trustworthy whole-system provenance for the Linux kernel. In Proceedings of the 24th USENIX Security Symposium (pp. 319-334). USENIX Association. https://www.usenix.org/conference/usenixsecurity15/technical-sessions/presentation/bates

Bertino, E., & Ferrari, E. (2018). Big data security and privacy. In S. Flesca, S. Greco, E. Masciari, & D. Sacca (Eds.), A comprehensive guide through the Italian database research over the last 25 years (pp. 425-439). Springer. https://doi.org/10.1007/978-3-319-61893-7_25

Buneman, P., Khanna, S., & Tan, W.-C. (2001). Why and where: A characterization of data provenance. In J. Van den Bussche & V. Vianu (Eds.), Proceedings of the 8th International Conference on Database Theory (ICDT 2001) (pp. 316-330). Springer. https://doi.org/10.1007/3-540-44503-X_20

Carminati, B., Ferrari, E., & Viviani, P. C. (2013). Security and trust in online social networks. Morgan & Claypool. https://doi.org/10.2200/S00549ED1V01Y201311SPT008

Cybersecurity Insiders. (2024). 2024 insider threat report. Gurucul. https://gurucul.com/2024-insider-threat-report/

European Parliament and Council of the European Union. (2024). Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). Official Journal of the European Union, L 2024/1689. https://eur-lex.europa.eu/eli/reg/2024/1689/oj

Gentyala, S. (2025). ContextGuard: Zero-trust middleware for MCP supply chain security [Computer software]. GitHub. https://github.com/sunilgentyala/contextguard

Goldblum, M., Tsipras, D., Xie, C., Chen, X., Schwarzschild, A., Song, D., Madry, A., Li, B., & Goldstein, T. (2023). Dataset security for machine learning: Data poisoning, backdoor attacks, and defenses. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2), 1563-1580. https://doi.org/10.1109/TPAMI.2022.3162397

Gu, T., Liu, K., Dolan-Gavitt, B., & Garg, S. (2019). BadNets: Evaluating backdooring attacks on deep neural networks. IEEE Access, 7, 47230-47244. https://doi.org/10.1109/ACCESS.2019.2909068

IBM Security. (2024). Cost of a data breach report 2024. IBM Corporation. https://www.ibm.com/reports/data-breach

Jagielski, M., Oprea, A., Biggio, B., Liu, C., Nita-Rotaru, C., & Li, B. (2018). Manipulating machine learning: Poisoning attacks and countermeasures for regression learning. In Proceedings of the 2018 IEEE Symposium on Security and Privacy (pp. 19-35). IEEE. https://doi.org/10.1109/SP.2018.00057

Kindervag, J. (2010). Build security into your network's DNA: The zero trust network architecture. Forrester Research. https://www.forrester.com/report/build-security-into-your-networks-dna-the-zero-trust-network-architecture/-/E-RES56682

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436-444. https://doi.org/10.1038/nature14539

Mehraj, S., & Banday, M. T. (2020). Establishing a zero trust strategy in cloud computing environment. In Proceedings of the 2020 IEEE International Conference on Intelligent Computing and Control Systems (pp. 562-566). IEEE. https://doi.org/10.1109/ICICCS48265.2020.9121018

Merkle, R. C. (1987). A digital signature based on a conventional encryption function. In C. Pomerance (Ed.), Advances in cryptology: CRYPTO 1987 (pp. 369-378). Springer. https://doi.org/10.1007/3-540-48184-2_32

Moreau, L., & Missier, P. (Eds.). (2013). PROV-DM: The PROV data model (W3C Recommendation). World Wide Web Consortium. https://www.w3.org/TR/prov-dm/

National Institute of Standards and Technology. (2023). Artificial intelligence risk management framework (NIST AI 100-1). U.S. Department of Commerce. https://doi.org/10.6028/NIST.AI.100-1

Ramachandran, A., & Kantarcioglu, M. (2017). Using blockchain and smart contracts for secure data provenance management (arXiv:1709.10000). arXiv. https://arxiv.org/abs/1709.10000

Rose, S., Borchert, O., Mitchell, S., & Connelly, S. (2020). Zero trust architecture (NIST Special Publication 800-207). National Institute of Standards and Technology. https://doi.org/10.6028/NIST.SP.800-207

Ruan, P., Chen, G., Dinh, T. T. A., Lin, Q., Ooi, B. C., & Zhang, M. (2019). Fine-grained, secure and efficient data provenance on blockchain systems. Proceedings of the VLDB Endowment, 12(9), 975-988. https://doi.org/10.14778/3329772.3329775

Schelter, S., Rukat, T., & Biessmann, F. (2020). Learning to validate the predictions of black box classifiers on unseen data. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (pp. 1289-1299). ACM. https://doi.org/10.1145/3318464.3380604

Schlegel, M., & Sattler, K.-U. (2024). Capturing end-to-end provenance for machine learning pipelines. Information Systems, 121, Article 102470. https://doi.org/10.1016/j.is.2024.102470

Shafahi, A., Huang, W. R., Najibi, M., Suciu, O., Studer, C., Dumitras, T., & Goldstein, T. (2018). Poison frogs! Targeted clean-label poisoning attacks on neural networks. In Advances in neural information processing systems 31 (NeurIPS 2018) (pp. 6103-6116). Curran Associates. https://proceedings.neurips.cc/paper/2018/hash/22722a343513ed45f14905eb07621686-Abstract.html

Shetty, S., Kamhoua, C., & Njilla, L. (Eds.). (2019). Blockchain for distributed systems security. Wiley-IEEE Press. https://doi.org/10.1002/9781119519621

Souza, R., Azevedo, L. G., Lourenco, V., Soares, E., Thiago, R., Brandao, R., Civitarese, D., Brazil, E. V., Moreno, M., Valduriez, P., Mattoso, M., Cerqueira, R., & Netto, M. A. S. (2022). Workflow provenance in the lifecycle of scientific machine learning. Concurrency and Computation: Practice and Experience, 34(14), Article e6544. https://doi.org/10.1002/cpe.6544

Syed, N. F., Shah, S. W., Shaghaghi, A., Anwar, A., Baig, Z., & Doss, R. (2022). Zero trust architecture (ZTA): A comprehensive survey. IEEE Access, 10, 57143-57179. https://doi.org/10.1109/ACCESS.2022.3174679

Vassilev, A., Oprea, A., Fordyce, A., Anderson, H., Davies, X., & Hamin, M. (2025). Adversarial machine learning: A taxonomy and terminology of attacks and mitigations (NIST AI 100-2e2025). National Institute of Standards and Technology. https://doi.org/10.6028/NIST.AI.100-2e2025

How to Cite

Sunil Kumar Mudusu, & Sunil Gentyala. (2026). Zero-Trust Data Pipelines for AI Systems: A Framework for Secure, Verifiable, and Auditable Data Engineering. Journal of Recent Trends in Computer Science and Engineering (JRTCSE), 14(2), 10-25. https://doi.org/10.70589/JRTCSE.2026.14.2.2