Natural Language Processing and Machine Learning for Semantic Subject Indexing of Google Books Using Annif
Abstract
This study investigates the application of the Annif toolkit to automatic (semantic) subject indexing of large-scale collections of digitized books, using Google Books as a sample test collection. It addresses the limitations of manual subject indexing, namely poor scalability and inconsistency, for collections characterized by large volume and heterogeneous content. The study explores a vocabulary-driven machine learning approach: a curated set of 5,000 Google Books records was processed through an automated indexing pipeline that uses a TF-IDF backend and a controlled subject vocabulary. System performance was evaluated using common information retrieval metrics (such as precision, recall, and F1 score) as well as a ranking-based metric (Normalized Discounted Cumulative Gain, NDCG). The results indicate that the system identifies highly relevant subject concepts across a wide range of book types with very high recall and that it produces high-quality rankings. Precision, however, is moderate because the system was designed and optimized for high recall, which makes it suitable for large-scale digital libraries. Finally, the integrated system highlights the potential of the Annif toolkit to provide real-time subject recommendations, to index content in search systems automatically, and to serve as a decision-support tool that enables efficient, consistent, and semantically enriched organization of knowledge.
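To make the evaluation metrics named in the abstract concrete, the following is a minimal Python sketch of how precision, recall, F1 score, and NDCG could be computed for a single book. It is not Annif's own evaluation code: the function names, the binary relevance assumption (a suggested subject is either in the human-assigned set or not), and the sample subject labels are all illustrative assumptions.

```python
import math


def precision_recall_f1(suggested, gold):
    """Set-based metrics for one document (hypothetical helper)."""
    suggested, gold = set(suggested), set(gold)
    hits = len(suggested & gold)
    precision = hits / len(suggested) if suggested else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def ndcg(ranked, gold):
    """NDCG with binary relevance (assumption): gain is 1 for a correct subject."""
    gold = set(gold)
    # Discounted gain of the actual ranking; rank 1 has discount log2(2) = 1.
    dcg = sum(1.0 / math.log2(i + 2) for i, s in enumerate(ranked) if s in gold)
    # Ideal ranking places all correct subjects first.
    ideal_hits = min(len(gold), len(ranked))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


# Illustrative data only: subjects suggested by the system, best first,
# and the subjects assigned by a human indexer.
ranked_suggestions = ["history", "europe", "war", "cooking"]
gold_subjects = {"history", "war"}

precision, recall, f1 = precision_recall_f1(ranked_suggestions, gold_subjects)
ranking_quality = ndcg(ranked_suggestions, gold_subjects)
print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} ndcg={ranking_quality:.2f}")
```

In this toy case every correct subject is found (recall 1.0) while half of the suggestions are wrong (precision 0.5), and the ranking score stays high because the correct subjects appear near the top. This mirrors the pattern the abstract describes: high recall and good ranking quality alongside moderate precision.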