Authors:
Vivekchowdary Attaluri
Addresses:
Department of Cyber, Identity Access Management, Capital One, Plano, Texas, United States of America.
With the exponential growth of real-time applications, cleaning unstructured text data poses significant challenges, particularly at large volumes. Advanced data cleaning pipelines have therefore become crucial for maintaining the accuracy, completeness, and consistency of data, as well as its smooth integration into analytics and machine learning applications. This paper discusses recent methodologies and frameworks for constructing real-time data cleaning pipelines suited to unstructured text. Our approach combines automated preprocessing steps such as tokenization, noise removal, and text normalization with dynamic anomaly detection and context-aware semantic validation. Latency and scalability challenges are addressed by integrating state-of-the-art NLP models with distributed computing frameworks such as Apache Spark. Experimental results on a large dataset of text from social media and news articles show that the system significantly outperforms existing approaches in processing time and most data quality metrics. The paper also provides a comparative analysis of existing tools, an evaluation of computational trade-offs, and a discussion of challenges such as real-time deployment and domain-specific bias. The proposed pipeline framework delivers robust data quality for diverse real-time applications, including customer sentiment analysis, fraud detection, and personalized recommendation systems.
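To make the preprocessing stages named above concrete, the following is a minimal per-record sketch of noise removal, text normalization, and tokenization in plain Python. The regex patterns and the `clean_text` helper are illustrative assumptions, not the paper's implementation; in the distributed setting described here, such a function would typically be applied record-by-record, for example as a user-defined function within an Apache Spark job.

```python
import re
import unicodedata

def clean_text(raw: str) -> list[str]:
    """Sketch of the three preprocessing stages: noise removal,
    normalization, and tokenization (patterns are illustrative)."""
    # Noise removal: strip URLs, user mentions, and HTML-like tags
    text = re.sub(r"https?://\S+|@\w+|<[^>]+>", " ", raw)
    # Normalization: Unicode NFKC folding, lowercasing, punctuation
    # removal, and whitespace collapsing
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Tokenization: simple whitespace split; a production pipeline
    # would use a proper NLP tokenizer at this step
    return text.split()

tokens = clean_text("Check THIS out!! https://example.com <b>Great</b> news @user")
# tokens == ["check", "this", "out", "great", "news"]
```

A whitespace tokenizer is used only to keep the sketch self-contained; the semantic validation and anomaly detection stages described in the paper would operate on the resulting token stream.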
Keywords: Real-Time Data Cleaning; Unstructured Text Processing; Scalable NLP Pipelines; High-Volume Datasets; Semantic Validation; Apache Spark.
Received on: 09/05/2024, Revised on: 28/07/2024, Accepted on: 10/09/2024, Published on: 14/12/2024
AVE Trends in Intelligent Computing Systems, 2024 Vol. 1 No. 4, Pages: 209-218