Zero-Trust Data Pipelines for AI Systems: A Framework for Secure, Verifiable, and Auditable Data Engineering
DOI: https://doi.org/10.70589/JRTCSE.2026.14.2.2

Keywords: Zero-trust architecture, AI pipeline security, data provenance, cryptographic integrity, anomaly detection, ECDSA-P384, Open Policy Agent, trustworthy AI

Abstract
Training data is the primary attack surface that most AI security strategies leave unguarded. Pipelines built on perimeter trust treat every connected component as reliable after initial authentication, a design that worked tolerably in closed, homogeneous environments but fails badly in the distributed, multi-vendor architectures where modern AI systems operate. An adversary with access to a pipeline's ingestion layer, whether through credential theft, supply chain compromise, or insider privilege, can silently corrupt training data at a scale that no downstream evaluation benchmark is designed to detect. This paper introduces the Zero-Trust Data Pipeline (ZTDP) framework, which applies the never-trust-always-verify principle to data artifacts directly rather than to the network channels that carry them. The framework is organized across five coordinated security planes: a Data Plane that enforces cryptographic authentication on every artifact, a Control Plane that maintains continuous identity verification, a Provenance Plane that builds a tamper-evident lineage graph through ECDSA-P384-signed JWT tokens, a Policy Plane that enforces least-privilege access through co-located Open Policy Agent instances, and an Observability Plane that combines statistical distribution monitoring, Isolation Forest scoring, and Hidden Markov Model sequence analysis for real-time anomaly detection. Evaluated against a 500-million-record financial AI training workload, the full ZTDP configuration achieves 100% detection of data tampering, insider injection, and component impersonation, and 96% detection of behavioral anomalies, while adding 9.1% throughput overhead and 0.8 ms per-stage latency. The results establish that rigorous data-layer security and production-grade throughput are not in fundamental tension.
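
The Provenance Plane's lineage graph rests on chains of ECDSA-P384-signed JWT tokens. The following minimal sketch illustrates that mechanism using PyJWT's ES384 algorithm (ECDSA over NIST P-384) with the cryptography package; the claim names (stage, artifact_sha256, parent) and the issue_lineage_token helper are hypothetical illustrations, not the paper's actual token schema.

    # Minimal sketch (not the paper's implementation) of signed lineage tokens,
    # assuming PyJWT installed with its 'cryptography' extra.
    import hashlib
    import jwt
    from cryptography.hazmat.primitives.asymmetric import ec

    # Each pipeline stage would hold its own signing key; SECP384R1 is NIST P-384.
    stage_key = ec.generate_private_key(ec.SECP384R1())

    def issue_lineage_token(artifact: bytes, stage_id: str, parent: str | None) -> str:
        """Bind this stage to the artifact's hash and to its predecessor token."""
        claims = {
            "stage": stage_id,                                  # hypothetical claim names
            "artifact_sha256": hashlib.sha256(artifact).hexdigest(),
            "parent": parent,                                   # chains tokens into a lineage graph
        }
        return jwt.encode(claims, stage_key, algorithm="ES384")  # ECDSA over P-384

    # A downstream consumer verifies with the issuing stage's public key; a bad
    # signature or a hash mismatch makes tampering evident.
    token = issue_lineage_token(b"training-batch-0001", "ingest", None)
    verified = jwt.decode(token, stage_key.public_key(), algorithms=["ES384"])
    assert verified["artifact_sha256"] == hashlib.sha256(b"training-batch-0001").hexdigest()

Because each token embeds the hash of its parent token, altering any upstream artifact or claim invalidates every downstream signature, which is what makes the lineage graph tamper-evident rather than merely tamper-resistant.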
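The Policy Plane enforces least privilege through Open Policy Agent instances co-located with each stage. A sidecar deployment of this kind is typically queried over OPA's documented REST data API (POST /v1/data/<package>/<rule>). In the sketch below, the package path ztdp/allow, the sidecar address, and the input fields are assumptions for illustration; an undefined rule is treated as deny.

    # Hypothetical sketch of a Policy Plane check against a co-located OPA sidecar.
    import requests

    OPA_URL = "http://localhost:8181/v1/data/ztdp/allow"  # assumed sidecar address and package path

    def stage_may_read(stage_id: str, dataset: str) -> bool:
        """Ask the local OPA instance whether this stage may access the dataset."""
        resp = requests.post(
            OPA_URL,
            json={"input": {"stage": stage_id, "dataset": dataset}},
            timeout=2,
        )
        resp.raise_for_status()
        # OPA omits the "result" key when the rule is undefined; default to deny.
        return resp.json().get("result", False) is True

Co-locating the agent keeps each authorization decision a local call rather than a network round-trip, which is consistent with the sub-millisecond per-stage latency the evaluation reports.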
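The Observability Plane combines distribution monitoring, Isolation Forest scoring, and Hidden Markov Model sequence analysis. The fragment below sketches only the Isolation Forest component with scikit-learn on synthetic per-batch feature vectors; the feature count, the 1% contamination prior, and the simulated injection are illustrative assumptions, not the paper's configuration.

    # Illustrative sketch of Isolation Forest scoring for per-batch pipeline features.
    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.default_rng(0)
    baseline = rng.normal(0.0, 1.0, size=(10_000, 4))  # stand-in for historical batch features
    incoming = rng.normal(0.0, 1.0, size=(100, 4))     # stand-in for a live window
    incoming[:5] += 6.0                                # simulated injected anomalies

    # Fit on trusted baseline behavior; score each incoming batch in real time.
    detector = IsolationForest(contamination=0.01, random_state=0).fit(baseline)
    flags = detector.predict(incoming)                 # -1 marks an anomalous batch
    print(f"flagged {np.sum(flags == -1)} of {len(incoming)} batches")

In the full framework this point-anomaly score would be fused with distributional drift statistics and an HMM over stage-event sequences, so that behavioral attacks that look normal batch-by-batch can still be caught as improbable sequences.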
License
Copyright (c) 2026 Sunil Kumar Mudusu, Sunil Gentyala (Author)

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
