Smart System for Semantic Search of Text Data
Student's Name: Kok Petro Andriiovych
Qualification Level: master
Speciality: Information Control Systems and Technologies
Institute: Institute of Computer Science and Information Technologies
Mode of Study: full-time
Academic Year: 2025-2026
Language of Defence: Ukrainian
Abstract: Relevance. The rapid growth of text data in corporate, scientific and educational systems makes keyword search insufficient: it handles synonymy, context, long documents and multilingual content poorly. Vector representations of text and retrieval-augmented generation (RAG) architectures make it possible to search for information by meaning, but putting them into practice requires solving several problems: splitting texts into fragments correctly, choosing embedding models, building an index and keeping response latency acceptable in an interactive chat interface [2]. This makes the development of a smart semantic search system for multilingual user documents relevant. The object of the research is semantic search of text data in an information environment. The subject of the research is the methods and tools for implementing interactive RAG systems that analyze, combine and reproduce text information in a context-sensitive form, enabling effective semantic search of text data in an information environment [2]. The purpose of the research is to develop a prototype of a smart system for semantic search of text data based on the RAG architecture. The prototype lets users interact with their own files through a chat interface, produces summarized answers to their queries, searches for information by meaning within one or more documents, and generates new texts that take the retrieved data into account. Research methods and tools. 
The research used methods of semantic text analysis, vectorization and dense retrieval with hybrid ranking of results that takes structural metadata and query context into account [1]. Search quality was assessed with the Recall@k, Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (nDCG) metrics. The software was implemented with LangChain and LangGraph (query orchestration), Pinecone (vector index), PostgreSQL (metadata and dialogue history), Redis (caching) and LangSmith (monitoring), with a React + NestJS stack for the client-server chat platform. Results and practical significance. A prototype of a smart system was created that supports uploading multilingual documents, automatically splitting them into fragments, vectorizing them, indexing them in Pinecone and semantically searching for relevant fragments [30]. A hybrid search scheme was proposed that combines initial dense retrieval with re-ranking, which increases the relevance of results compared with traditional algorithms such as BM25 [1]. Interactive work is provided through a chat interface that can link to source fragments, summarize information from several documents and generate new texts from the knowledge base. Structure of the work. The master's thesis consists of an introduction, four chapters, conclusions, a list of references and appendices; the chapters cover a review of approaches to semantic search, the conceptual model of the system, the software implementation, and the experimental evaluation of quality and performance. The thesis is 126 pages long, including 76 pages of main text, and contains 13 figures and no tables. The list of references contains 38 items. Keywords: text vectorization; vector embeddings; RAG architecture; vector index; hybrid search model; ranking.
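Illustrative note on the quality metrics. The abstract names Recall@k, MRR and nDCG without defining them. The sketch below shows one common way to compute them for ranked lists of fragment identifiers. It is not the thesis implementation: the function names, the fragment ids and the linear-gain form of nDCG are assumptions made for illustration only.

```python
import math
from typing import Mapping, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Share of the relevant fragments that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant fragment, or 0 if none is retrieved."""
    for rank, fragment_id in enumerate(ranked, start=1):
        if fragment_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over a set of (ranking, relevant set) queries."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked: Sequence[str], gains: Mapping[str, float], k: int) -> float:
    """nDCG@k with linear gains (assumed variant) and a log2 position discount."""
    dcg = sum(
        gains.get(fragment_id, 0.0) / math.log2(pos + 1)
        for pos, fragment_id in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 1) for pos, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example: fragment ids are made up for illustration.
ranked = ["c3", "c1", "c7", "c2"]
relevant = {"c1", "c2"}
print(recall_at_k(ranked, relevant, k=3))
print(mean_reciprocal_rank([(ranked, relevant), (["c1"], {"c1"})]))
print(ndcg_at_k(ranked, {"c1": 1.0, "c2": 1.0}, k=4))
```

Recall@k measures coverage of the relevant fragments, MRR rewards placing the first relevant fragment early, and nDCG accounts for the positions of all relevant fragments, so the three metrics complement one another when comparing dense retrieval with re-ranking against a baseline such as BM25.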