Image Document Classification Prediction Based on SVM and Gradient-Boosting Algorithms

Image document classification is crucial in various domains, including healthcare, finance, and security. Automatically categorizing images into predefined classes can significantly improve data management and decision-making processes. In this research, we investigate the effectiveness of two machine learning algorithms, Support Vector Machines (SVM) and Gradient Boosting, for image document classification. First, we preprocess the image data by extracting relevant features, such as image embeddings, to create a feature vector for each image. These features are essential for representing the content of the images accurately. Next, we apply SVM, a robust supervised learning algorithm, to train a classification model. SVM aims to determine the optimal hyperplane that separates the images into different classes while maximizing the margin. Furthermore, we explore the Gradient Boosting algorithm, an ensemble learning method that combines multiple weak learners to create a robust classifier. We conducted classification experiments with ten classes. Multiple measures, including accuracy, precision, recall, F1-score, and ROC-AUC, are used to assess the performance of the SVM and Gradient Boosting models. SVM achieved the higher result of 0.964, compared with 0.853 for AdaBoost.


INTRODUCTION
All corporate communication and record-keeping depend on documents [1]. Automatic document information extraction is difficult [2]. Typically, prior to information extraction, physical documents are scanned or photographed. Many document image processing pipelines require document classification, which improves document processing systems [3]. Thus, several document classification methods leverage text content [5], document structure [7], or both [12]. This field has advanced, especially with deep learning [14]. Since AlexNet [16], deep neural network research and performance have grown significantly. Numerous experiments and studies have been undertaken with computer vision models, including VGGNet [17], ResNet [18], and InceptionNet [19]. These models performed well on classification tasks because the documents are treated as images [27]. Many documents are visually almost identical but contain different text. Thus, these document images communicate high-level structural information, but the low-level traits that can distinguish visually comparable images were neglected for a long time [28]. Several articles [13] consider adding features to increase accuracy, and these papers yielded state-of-the-art results. They employ a robust OCR engine, such as [20], to extract text from document images and to learn both textual and visual characteristics. This helps to resolve visually similar document instances. In this paper, we use SVM and AdaBoost for image document classification prediction. The structure of this paper is as follows: Section 2 reviews related work on classification. Section 3 presents the proposed methodology. Section 4 explains the results and discusses the classification. Finally, Section 5 presents the conclusion and future prospects.

RELATED WORK
Much work has been done on the classification of document images. Prior research mostly concentrated on extracting textual characteristics from the documents, which led to the task being widely known as text document categorization [24]. In this paper, we apply SVM and AdaBoost algorithms to document images. Afzal et al. [1] used AlexNet as a classifier model and treated documents as images, which facilitated further investigation into the inclusion of visual features. Their study demonstrated that transfer learning significantly enhanced the outcomes. Furthermore, the dataset was used to train the model by transfer learning. Subsequently, this model was employed to categorize the Tobacco-3482 dataset. Harley et al. [8] created the large RVL-CDIP dataset comprising 400,000 document images. Abuelwafa et al. [25] proposed an unsupervised categorization methodology. The authors explicitly state that the model was trained exclusively on the input images and did not use any annotated data. The purpose of the model was to analyze the visual characteristics of the input image. The authors trained the model [26] on an auxiliary task in which each input was linked to a distinct label and expanded into several images using a data augmentation technique. This strategy significantly enhanced the performance of the model. In a separate investigation, Roy et al. [26] employed generic, compact, and powerful CNNs to analyze the characteristics of the input images. The authors of [26], Two Stream Deep Networks for Document Image Classification, utilized the Reuters and NLPCC 2014 news categorization datasets.

THE PROPOSED METHODOLOGY
This study aimed to determine which representation of the Tobacco 3482 dataset produces the best results, as different feature extraction strategies can affect classification performance in different ways. The study describes how orientation was categorized in the Tobacco 3482 dataset using image models and classifiers. Optimized machine-learning techniques based on the SVM and AdaBoost algorithms were employed to enhance image classification performance. The proposed solution addresses these challenges by combining feature extraction and machine learning. Feature extraction is crucial in defining image content and extracting essential information from documents. Multiclass classification was performed using advanced machine learning methods such as AdaBoost and SVM. These algorithms used the extracted features to handle complex document classification tasks across multiple classes. Figure 1 provides a schematic representation of the research model.

A. AdaBoost Algorithm:
AdaBoost, short for Adaptive Boosting, is a popular machine-learning approach for classification and regression. As an ensemble learning method, it combines the outputs of multiple weak learners, such as decision trees or other simple models, to create a robust learner. AdaBoost can adapt to the dataset, mitigate overfitting, and work well with diverse weak learners. However, this approach may be sensitive to noise and outliers [19].
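The boosting idea described above can be illustrated with a minimal sketch. It is not the implementation used in this study: it assumes binary labels in {-1, 1}, one-dimensional toy inputs, and threshold "decision stumps" as the weak learners, and all function names and data values are hypothetical. Each round fits the stump with the lowest weighted error, assigns it a vote weight, and increases the weights of the samples it misclassified.

```python
import math

def train_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best threshold stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def stump_predict(x, threshold, polarity):
    """A stump votes `polarity` at or above the threshold and the opposite below it."""
    return polarity if x >= threshold else -polarity

def adaboost_fit(xs, ys, rounds=3):
    """Fit `rounds` stumps, reweighting the samples after each round."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this stump
        ensemble.append((alpha, threshold, polarity))
        # Misclassified samples (y * prediction = -1) gain weight.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, x):
    """Weighted majority vote of all stumps."""
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data: no single threshold separates the two classes.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1, 1, -1, -1, 1, 1]
model = adaboost_fit(xs, ys, rounds=3)
train_preds = [adaboost_predict(model, x) for x in xs]
```

On this toy set no single stump is correct on more than four of the six samples, but the weighted vote of three stumps reproduces all six labels. A ten-class problem such as the one studied here would additionally need a multiclass extension or a one-vs-rest scheme.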

Support Vector Machines:
The structural risk minimization principle serves as the foundation for SVM. The main goal is to increase classification accuracy by constructing an optimal separating hyperplane. SVM is frequently used to solve binary classification problems. Assume that there are n training samples forming the sample set S = {(x1, y1), …, (xi, yi), …, (xn, yn)}, where xi is an input vector and yi is a classification result with a value of -1 or 1. For linearly separable problems, the SVM model seeks the best separating hyperplane. For non-linearly separable problems, a non-linear function ϕ(x) is used to map the input space to a feature space with a higher number of dimensions. Within this higher-dimensional feature space, we can derive the optimal separating plane, ϕ(x)•w + b = 0. The SVM model depends on the data only through (xi•xj), the inner product of two vectors. As a result, all that is required is to compute the inner product of the non-linear functions, K(xi, xj) = (ϕ(xi)•ϕ(xj)), where K(xi, xj) is known as the kernel function. There are numerous forms of the kernel function, including polynomial, RBF, and others. In this study, the SVM kernel is taken to be the Gaussian function.
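As an illustration of the kernel idea, the sketch below evaluates a Gaussian (RBF) kernel on a few vectors. The exact parametrization used in this study is not recoverable from the text, so the common form K(xi, xj) = exp(-gamma * ||xi - xj||^2) is assumed here; the width parameter gamma, its default value, and the sample vectors are all hypothetical.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian kernel exp(-gamma * squared Euclidean distance); gamma is an assumed parameter."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def gram_matrix(vectors, gamma=0.5):
    """Kernel value for every pair of input vectors."""
    return [[rbf_kernel(u, v, gamma) for v in vectors] for u in vectors]

# Hypothetical 2-D stand-ins for image embedding vectors.
embeddings = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
K = gram_matrix(embeddings)
```

The kernel equals 1 for identical vectors and decays toward 0 as two vectors move apart, so it acts as a similarity score computed without ever forming ϕ(x) explicitly.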
In this stage, feature vectors are passed to a machine-learning model for document-type prediction. The model can learn to recognize patterns and relationships between the extracted features and the types of documents.

Precision = TP/(TP + FP)    (1)
Recall, or sensitivity, on positive instances depends on the numbers of true positives (TP) and false negatives (FN).
Accuracy is the percentage of correct predictions. Sensitivity, which accounts for false negatives (FN), is calculated by the following equation.

Sensitivity = TP/(TP + FN)    (4)
Specificity is the proportion of negative documents that are correctly classified as negative.
The F-measure combines precision and recall into a single score.
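The metrics in this section can all be computed from the four confusion counts of a single class. The sketch below follows equations (1) and (4) for precision and sensitivity. The formulas for specificity, accuracy, and F1 are not given in the text, so their standard textbook definitions are assumed, and the function name and example counts are hypothetical.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard metrics from true/false positive/negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # also called recall
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }

# Hypothetical counts for one document class treated as "positive".
m = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```

For a multiclass problem, these counts would be taken per class in a one-vs-rest fashion and the per-class scores then averaged.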
Correct classifications are counted as true positives (TP) and true negatives (TN), while incorrect classifications are counted as false positives (FP) and false negatives (FN). A test's document classification accuracy is determined by its sensitivity and specificity. The ROC curve illustrates the trade-off between the true positive rate and the false positive rate. When false negatives are ignored, the score primarily reflects precision, that is, the accuracy of the predicted positives. Conversely, when false positives are ignored, the score reflects recall. The Area Under the Curve (AUC) measures classifier efficiency [26]. Note that this classifier can be reinforced by an optimization method [31][32][33], in which a learning agent is trained to solve a complex problem by taking the best action in each state, with the probability of taking each action in each state defined by a policy. An example is running a maze, where the position of each cell is the state, the four possible directions of movement are the actions, and the probabilities of moving in each direction at each cell (state) form the policy. This will be a topic of future work [34][35][36].

RESULTS AND DISCUSSION FOR CLASSIFICATION:
This section presents the classification results obtained with Support Vector Machines (SVM) and Gradient Boosting on the Tobacco 3482 dataset. In the confusion matrix, the diagonal shows the percentages of correct classifications, ideally 100% for a perfect model. The off-diagonal cells show misclassifications, which ideally should be 0%. The matrix shows variability in the model's ability to classify the different types of documents correctly. Some classes, like 'ADVE' and 'News', are predicted with higher accuracy, while others, like 'Report', 'Resume', and 'Scientific', have lower accuracy and higher confusion. The matrix helps in understanding which classes are confused by the model, indicating where improvements are needed. For example, the model seems to struggle to distinguish 'Report', 'Resume', and 'Scientific' documents accurately, perhaps due to similarities in their features. The confusion matrix is critical for diagnosing classification models and guiding further refinement.
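The way such a percentage confusion matrix is read can be sketched as follows. The counts below are invented for illustration and are not the results of this study; the three class names are borrowed from the dataset only as labels, and the helper names are hypothetical. Rows are assumed to hold the actual class and columns the predicted class.

```python
def row_percentages(counts):
    """Convert raw counts to percentages of each actual class (each row sums to 100)."""
    result = []
    for row in counts:
        total = sum(row)
        result.append([100.0 * c / total if total else 0.0 for c in row])
    return result

def diagonal(percentages):
    """Share of each class that was classified correctly."""
    return [row[i] for i, row in enumerate(percentages)]

def most_confused(percentages, labels):
    """For each actual class, the wrong class that receives the largest share."""
    pairs = {}
    for i, row in enumerate(percentages):
        off_diag = [(share, j) for j, share in enumerate(row) if j != i]
        _, j = max(off_diag)
        pairs[labels[i]] = labels[j]
    return pairs

# Hypothetical counts: rows = actual class, columns = predicted class.
labels = ["Letter", "Memo", "Report"]
counts = [
    [7, 2, 1],
    [1, 6, 3],
    [1, 3, 4],
]
pct = row_percentages(counts)
correct = diagonal(pct)
confused = most_confused(pct, labels)
```

The diagonal gives the per-class correct rate, and the largest off-diagonal entry in each row identifies the class pair that most needs attention.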

Table 2. Confusion Matrix Of A Classification Model's Predictions For Gradient Boosting
Table 3 is a confusion matrix of the classification model's predictions for Support Vector Machines (SVM). Here is a detailed explanation of the results: the classes along the top and the left side are ADVE, Email, Form, Letter, Memo, News, Note, Report, Resume, and Scientific. These represent the document types the model has been trained to identify.
The diagonal from the top left to the bottom right shows the percentage of instances of each class that were correctly identified (true positives). For example, the model correctly identified 90.0% of the actual emails as emails and 96.7% of the actual resumes as resumes, which are relatively high success rates. The off-diagonal numbers represent misclassifications (errors). For example, 5.6% of the forms were incorrectly classified. A relatively small percentage of classifications is spread across unrelated categories, indicating that while the model does make mistakes, it does not often confuse completely dissimilar document types. This confusion matrix is crucial for model evaluation, as it provides insight into the overall accuracy and into how the model performs in each class. Incorporating a Swin transformer with transformation techniques supports the process of selecting, combining, generating, or adapting several features to efficiently address accuracy and computation-time problems. One of the motivations for studying the Swin transformer is to build systems that can handle a range of problems rather than solving just one problem [29][30].

CONCLUSION AND FUTURE WORK
We presented an efficient multimodal machine learning multiclass classifier for the classification of document images. We demonstrate that the models perform adequately even with limited data. Our experiments involve training the suggested model on the Tobacco-3482 dataset. We achieved an accuracy of 0.964 for SVM and 0.853 for the AdaBoost algorithm. In the future, we plan to improve the model in several ways: acquiring extra training data for classes that are not performing well; improving the feature extraction procedure to better capture the unique characteristics of the documents; employing methods such as resampling or class weights in model training to address class imbalance; investigating more intricate models or ensembles of models that could better represent the subtle differences between classes; and analyzing the misclassified documents through error analysis to find out why the model fails in specific cases. By iteratively resolving these issues, the model's performance can be improved across all document classes.
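One of the future-work items, class weights for imbalanced classes, can be sketched briefly. The heuristic assumed here is the widely used "balanced" weighting, n_samples / (n_classes * class_count), which is also the rule scikit-learn documents for class_weight="balanced". The label counts below are hypothetical and do not reflect the actual class distribution of the dataset.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical imbalanced label list.
labels = ["Memo"] * 6 + ["Letter"] * 3 + ["Resume"] * 1
weights = balanced_class_weights(labels)
# Each class contributes the same total weight, so the weighted sample count is unchanged.
weighted_total = sum(weights[label] for label in labels)
```

Rare classes receive larger weights, so their errors count more during training, while the total weight over all samples stays equal to the number of samples.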
As a future proposal, a hybrid transformer architecture [37] can be used that can perform all the types of tasks [38] required and handle other sensors [39], which could lead to a more general artificial intelligence model [40]. From our work, we specify a method for artificial general intelligence that can simulate human intelligence, implemented by taking in any form of arbitrary input data [41], the method comprising learning to transform the arbitrary input data into an internal numerical format [42].
Then performing a plurality of numerical operations, the plurality of numerical operations comprises learned and

Fig. 1. Framework of multiclass Document Classification model. It effectively conveys that the following subsections will delve into more detailed discussions of each phase.
1. Document Image Dataset: In the context of image document prediction, one typically has a dataset of document images. These images can be of various types, such as memos, letters, reports, scientific research papers, or any other document category. This paper is based on the Tobacco 3482 dataset, a widely used resource in document analysis and optical character recognition (OCR). It was created to facilitate research and development in these areas. As suggested by its name, the dataset contains 3,482 scanned document images. These images came from legacy tobacco industry documents and were made publicly available following legal action against major tobacco companies. The dataset contains various document types, such as memos, letters, reports, and scientific research

3. Feature Extraction: When we talk about feature extraction or data embedding, we mean that we use the layers of the VGG-19 model to transform input images into a more
5. Prediction: Model: This section describes the algorithm employed in the construction of a document categorization model.
In the SVM confusion matrix, 5.6% of the forms were classified as ADVE, and 3.8% of emails were misclassified as memos. The most accurately predicted class is 'Resume' with 96.7% accuracy, followed by 'Email' with 90.0% and 'Form' with 81.2%. These high percentages indicate that the model is particularly good at classifying these types of documents. The least accurately predicted classes are 'ADVE' and 'Scientific', with 7.2% and 62.1% of the predictions being correct, respectively. These lower percentages could indicate that these categories are more challenging for the model, possibly due to similarities with other classes or insufficient training data. The model commonly confuses 'Note' with 'Letter', as indicated by the 25.4% misclassification rate.

Figure 3 Fig 2
Figure 3 depicts the Receiver Operating Characteristic (ROC) curve, which shows the prediction models' performance across different classification thresholds for Support Vector Machines (SVM) and Gradient Boosting. ROC curve comparisons between multiple classifiers are common; here the comparison covers SVM (a discriminative classifier formally defined by a separating hyperplane) and Gradient Boosting (a machine learning technique that builds a prediction model as an ensemble of weak prediction models, often decision trees). The gradient-boosting model has a very small advantage at specific thresholds, but otherwise the two curves almost overlap and show extremely comparable performance characteristics in the plot. High performance is suggested by the fact that both curves are very near the plot's upper-left corner. A curve that reaches the upper-left corner (0,1) indicates a perfect classifier, whereas a curve that follows the 45-degree line indicates random guessing. For the majority of threshold values, both models have a high TPR and a low FPR, indicating that both operate effectively for the task at hand. Nevertheless, selecting the appropriate model requires careful consideration
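The construction of an ROC curve and its AUC can be sketched for a single class treated as positive against all others. This is a minimal illustration under stated assumptions, not the evaluation code of this study: the scores and labels are invented, the function names are hypothetical, and the area is computed with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Sweep the decision threshold from high to low and record (FPR, TPR) points."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Samples with tied scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical classifier scores; label 1 marks the positive class.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
area = auc(curve)
```

An area of 1.0 corresponds to a curve through the upper-left corner (0,1), and an area of 0.5 corresponds to the 45-degree line of random guessing. In this toy example, eight of the nine positive-negative pairs are ranked correctly, which gives an area of 8/9.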

Table 1. The Performance Of Various Document Classification Algorithms On The Tobacco Dataset.
The accuracy, precision, recall, and F1 score metrics are provided. The results are shown in Table 1.
Journal port Science Research. Available online www.jport.co, Volume 6, No: 4, 2023

Table 2 is a confusion matrix of the classification model's predictions for Gradient Boosting. Here is a detailed explanation of the results:
ADVE (Advertisements): Correctly predicted 80.6% of the time, which is relatively high. It is sometimes confused with News (5.7%) and Note (3.7%).
Email: Correctly classified 78.6% of the time, with a small amount of confusion with Memo (5.0%) and Form (2.1%).
Form: Correctly predicted 65.4% of the time, with notable confusion with Resume (14.0%) and Letter (2.1%).
Letter: Correctly predicted 51.0% of the time, one of the lower correct prediction rates. There is a significant amount of confusion with Memo (13.3%) and Email (3.5%).
Memo: Correctly predicted 44.9% of the time, with confusion occurring with Letter (21.1%) and Report (16.4%).
News: Correctly predicted 71.1% of the time, a high rate, with some confusion with ADVE (6.8%) and Note (3.7%).
Note: Correctly classified 64.0% of the time, and often confused with Memo (7.4%) and Email (3.3%).
Report: Correctly predicted 36.8% of the time, which is lower than others and indicates significant confusion with other classes, notably Letter (10.5%) and Memo (7.9%).
Resume: Correctly predicted 41.9% of the time, with confusion with Form (14.0%) and Scientific (12.4%).
Scientific: Correctly predicted 30.3% of the time, the highest value in its row but relatively low overall. It is often confused with Resume (8.1%) and Report (11.5%).

Table 3. Comparison Of The Suggested Approach With Related Works.