Generative AI in Data Management: Enhancing Metadata Discovery and Policy Enforcement via LLMs

Ajay Kumar Punia; Arun Chaudhary

doi:10.70589/JRTCSE.2024.5.10

Authors

Ajay Kumar Punia, Citizens Bank, Phoenix, AZ, USA
Arun Chaudhary, Credit One Bank, Las Vegas, Nevada, USA

Keywords:

Generative AI, Large Language Models (LLMs), Metadata Discovery, Data Governance, Policy Enforcement, Data Lineage, Semantic Enrichment, Enterprise Data Management, AI-driven Compliance, Responsible AI

Abstract

The rapid growth of enterprise data ecosystems has intensified challenges in metadata discovery, governance, and policy enforcement. Traditional rule-based and manual approaches to data management struggle to scale across heterogeneous, distributed, and semi-structured data environments. Generative Artificial Intelligence (GenAI), particularly Large Language Models (LLMs), can automate metadata extraction, semantic enrichment, lineage inference, and policy compliance monitoring. This paper investigates how LLMs enhance metadata discovery and enable intelligent policy enforcement within modern data platforms. We analyze pre-2024 literature on AI-driven data governance, metadata management frameworks, and natural language processing applications in enterprise systems. We propose an LLM-augmented architecture for automated metadata lifecycle management, present workflow and sequence models, and evaluate its benefits and limitations with respect to scalability, explainability, bias, and compliance risk. The study finds that generative AI improves semantic metadata resolution, regulatory alignment (e.g., GDPR, HIPAA), and real-time policy auditing, while introducing new governance challenges that require hybrid human-AI oversight models.
