On 21 November at 14:00, Henrik Bongertmann will defend his Master's thesis, ‘Application of Large Language Models for Structuring and Integrating Heterogeneous Data in Relational Database Systems’, in room 242 (Konrad-Zuse-Haus). The defence will be held in hybrid format; for dial-in details, please contact benjamin.nast@uni-rostock.de in advance. Henrik is a Master's student in the Information Systems programme and was supervised by Benjamin Nast and Leon Griesch (both Chair of Information Systems) and Henry Rotzoll (DVZ).
Abstract
In this thesis, a proof of concept was developed to investigate the potential of large language models (LLMs) for the automated structuring of heterogeneous data. As part of a case study, an LLM-based ETL process was implemented in which over 5,000 unstructured documents were loaded into a relational database using an open-source LLM. The results show that a high data structuring success rate can be achieved through targeted optimisations such as prompt engineering, post-processing pipelines and correction loops. The thesis discusses the limitations of the approach, particularly with regard to model selection and the generalisability of the results. In addition, it gives recommendations for future use cases in which LLMs can be used to process unstructured data. The results of the case study provide a basis for the further use of LLMs in ETL processes and for future research in the field of LLM-based data structuring.
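To illustrate how prompt engineering, post-processing and a correction loop can fit together in an LLM-based ETL step, here is a minimal Python sketch. It is not the implementation from the thesis: the target schema (`title`, `year`), the table name `documents`, the helper names and the stand-in `fake_llm` function are all hypothetical and chosen for illustration only. A real pipeline would replace `fake_llm` with a call to an actual model.

```python
import json
import sqlite3
from typing import Callable, Optional, Tuple

# Hypothetical target schema: field name -> expected Python type.
REQUIRED_FIELDS = {"title": str, "year": int}


def validate(raw: str) -> Tuple[Optional[dict], str]:
    """Post-processing: parse the model output and check it against the schema."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(record, dict):
        return None, "expected a JSON object"
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(record.get(field), expected):
            return None, f"field '{field}' missing or not {expected.__name__}"
    return record, ""


def extract(document: str, llm: Callable[[str], str], max_retries: int = 2) -> Optional[dict]:
    """Correction loop: on a rejected answer, feed the error back and retry."""
    prompt = (
        "Return JSON with the keys title (string) and year (integer).\n\n"
        + document
    )
    for _ in range(max_retries + 1):
        record, error = validate(llm(prompt))
        if record is not None:
            return record
        prompt += (
            f"\n\nYour previous answer was rejected ({error}). "
            "Return corrected JSON only."
        )
    return None  # give up after the allowed number of attempts


def load(documents, llm: Callable[[str], str], conn: sqlite3.Connection) -> int:
    """Load every successfully structured document into a relational table."""
    conn.execute("CREATE TABLE IF NOT EXISTS documents (title TEXT, year INTEGER)")
    loaded = 0
    for doc in documents:
        record = extract(doc, llm)
        if record is not None:
            conn.execute(
                "INSERT INTO documents (title, year) VALUES (?, ?)",
                (record["title"], record["year"]),
            )
            loaded += 1
    conn.commit()
    return loaded


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: wrong type first, corrected after feedback."""
    if "rejected" in prompt:
        return '{"title": "Annual report", "year": 2023}'
    return '{"title": "Annual report", "year": "2023"}'
```

Calling `load(["Annual report 2023 ..."], fake_llm, sqlite3.connect(":memory:"))` exercises one round of the correction loop: the first answer fails validation because `year` is a string, and the retry with the error message appended yields a record that can be inserted.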