Authors:
Vivekchowdary Attaluri
Addresses:
Department of Cyber, Identity Access Management, Capital One, Plano, Texas, United States of America.
With the exponential growth of real-time applications, cleaning unstructured text data poses significant challenges, particularly at large volumes. Advanced data cleaning pipelines have therefore become crucial for maintaining the accuracy, completeness, and consistency of data, as well as its smooth integration into analytics and machine learning applications. This paper discusses recent methodologies and frameworks for constructing real-time data cleaning pipelines suited to unstructured text. Our approach combines automated preprocessing steps such as tokenization, noise removal, and text normalization with dynamic anomaly detection and context-aware semantic validation. Latency and scalability challenges are addressed by integrating state-of-the-art NLP models with distributed computing frameworks such as Apache Spark. Experimental results on a large dataset of text from social media and news articles show that the system significantly outperforms existing approaches in processing time and most data quality metrics. The paper also provides a comparative analysis of existing tools, an evaluation of computational trade-offs, and a discussion of challenges such as real-time deployment and domain-specific bias. The proposed pipeline framework delivers robust data quality for diverse real-time applications, including customer sentiment analysis, fraud detection, and personalized recommendation systems.
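To make the preprocessing stages named above concrete, the following is a minimal per-record sketch of noise removal, text normalization, and tokenization in plain Python. The regex patterns and the `clean_text` helper are illustrative assumptions, not the paper's implementation; in the distributed setting described here, such a function would typically be applied record-by-record, for example as a user-defined function within an Apache Spark job.

```python
import re
import unicodedata

def clean_text(raw: str) -> list[str]:
    """Sketch of the three preprocessing stages: noise removal,
    normalization, and tokenization (patterns are illustrative)."""
    # Noise removal: strip URLs, user mentions, and HTML-like tags
    text = re.sub(r"https?://\S+|@\w+|<[^>]+>", " ", raw)
    # Normalization: Unicode NFKC folding, lowercasing, punctuation
    # removal, and whitespace collapsing
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Tokenization: simple whitespace split; a production pipeline
    # would use a proper NLP tokenizer at this step
    return text.split()

tokens = clean_text("Check THIS out!! https://example.com <b>Great</b> news @user")
# tokens == ["check", "this", "out", "great", "news"]
```

A whitespace tokenizer is used only to keep the sketch self-contained; the semantic validation and anomaly detection stages described in the paper would operate on the resulting token stream.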
Keywords: Real-Time Data Cleaning; Unstructured Text Processing; Scalable NLP Pipelines; High-Volume Datasets; Semantic Validation; Apache Spark.
Received on: 09/05/2024, Revised on: 28/07/2024, Accepted on: 10/09/2024, Published on: 14/12/2024
AVE Trends in Intelligent Computing Systems, 2024 Vol. 1 No. 4, Pages: 209-218