



Faculty of Agriculture, Dalhousie University, Truro, NS, Canada B2N 5E. Email: tayab.soomro@dal.ca
Dept. of Computer Science, Dalhousie University, Halifax, NS, Canada B3H 4R. Email: ab249429@dal.ca
Abstract—With the increasing number of gadgets (e.g., cellphones, smart-watches, laptops, etc.), an enormous amount of data is generated every second. It is equally crucial to be able to glean important information from large amounts of data quickly and efficiently. This is where text summarizers come into the picture. Text summarizers are machine learning models which, given a large document, return a summarized version. In this project, we sought to compare two types of text summarizers (extraction and abstraction) on the WikiHow dataset in order to gain insight into which method works best for article-style documents. We implemented SumBasic as our extraction summarizer and Seq2Seq (i.e., sequence-to-sequence) as our abstraction summarizer. Using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) method for evaluation, SumBasic obtained ROUGE-1 F-measures of 22.6%, 25.2%, and 39.6% and ROUGE-L F-measures of 9.7%, 15.0%, and 13.19% on the three test documents, respectively. The Seq2Seq model obtained ROUGE-1 F-measures of 19.8%, 20.6%, and 30.0% and ROUGE-L F-measures of 15.4%, 14.7%, and 18.9% on the same three documents, respectively. This analysis concludes that the SumBasic model is marginally superior to the Seq2Seq model in terms of ROUGE scores.
In this day and age, the amount of textual data being generated by various technological means is phenomenal. As an example, there are roughly 5 million tweets sent every day [1], and some 2 million research articles are published each year [2]. With all this data being generated at an unprecedented pace, it is extremely important to be able to automatically parse these gigantic document sets and to extract the most important, summarized information from them. The applications for such tools are endless. Search engines such as Google try to summarize the information from a web page or PDF document right within the search results, eliminating the need for users to actually
go into the webpage and find the answer manually. This is especially true for general knowledge questions. For example, if you type the search query "what is inertia?" into Google, you will get a card-like view (shown in Figure 1) containing the most relevant summary, obtained from Wikipedia. This information is generated using some form of text summarization technique.
Figure 1. Search result from Google when queried with the sentence: “what is inertia?”.
There are various other applications for text summarization techniques, for example in academia. With enormous numbers of research articles being written and published every day in countless journals, it is paramount for various parties, such as medical and industry professionals, to be able to summarize papers and read only the important points in order to make informed decisions.
The aim of this project is to compare two different automatic text summarization techniques for summarizing WikiHow articles. The two techniques are compared in terms of their performance on the given WikiHow dataset. The summaries generated by the two summarizers are evaluated using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) method, which compares the N-grams of a reference summary with the system-generated summary. Based on this comparison and the evaluation scores, we conclude with a recommendation of a single summarizer for this type of dataset.
This dataset was created from the online WikiHow platform and can be downloaded from https://ucsb.app.box.com/s/ap23l8gafpezf4tq3wapr6u8241zz358. The WikiHow dataset contains 215,364 articles. Each article has multiple paragraphs which, combined together, form the article to be summarized. Apart from the article text, each record also has a Headline field, which serves as the reference summary of the article, and a Title field, which gives the title of the article. Figure 2 below describes the dataset.
Figure 2. The metadata for the dataset used in this analysis. A more detailed description can be found at https://github.com/mahnazkoupaee/WikiHow-Dataset.
For our assignment, the WikiHow dataset was divided into three parts: training, development, and test sets.
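As a minimal sketch of the loading and splitting step (the file name, column names, and 80/10/10 proportions below are assumptions based on the dataset's GitHub page, not necessarily the exact values used):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # 'text' holds the full article and 'headline' its reference summary
    df = pd.read_csv("wikihowAll.csv").dropna()

    # Illustrative 80/10/10 split into training, development, and test sets
    train, rest = train_test_split(df, test_size=0.2, random_state=42)
    dev, test = train_test_split(rest, test_size=0.5, random_state=42)
    print(len(train), len(dev), len(test))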
Data clean-up and pre-processing is one of the main steps that needs to be performed in any machine learning project, to ensure that the model is built on quality data. Data pre-processing is the process of extracting relevant information from raw data. Raw data obtained from its sources will likely contain inconsistencies such as null values and other deformities. It is crucial to clean these up before model training, which helps in achieving optimal and accurate scores [4]. In our case, the WikiHow dataset contained null values, emoticons, symbols, contractions, and other deformities that needed to be cleaned. In this assignment, we performed multiple steps to clean and process the data. First, we dropped the columns which were unnecessary for modelling. After eliminating the redundant columns, we performed the following data pre-processing operations on the textual data:
A custom function was developed using the regular expression library to remove hyperlinks, hashtags, emoticons, and contractions. We also used natural language processing techniques to remove stop words and to lemmatize words. The most common stop words, such as "I", "am", etc., tend to skew the model results. Lemmatization is a technique whereby words are transformed into their lemma; for example, the word "walking" becomes "walk". To perform these modifications, the prebuilt NLTK library for Python was used.
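A minimal sketch of such a cleaning function, assuming NLTK's stop-word list and WordNet lemmatizer (the exact regular expressions and contraction handling in our implementation differ slightly):

    import re
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    STOP_WORDS = set(stopwords.words("english"))  # requires nltk.download("stopwords")
    LEMMATIZER = WordNetLemmatizer()              # requires nltk.download("wordnet")

    def clean_text(text):
        """Lower-case, strip hyperlinks/hashtags/symbols, drop stop words, lemmatize."""
        text = text.lower()
        text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # hyperlinks
        text = re.sub(r"#\w+", " ", text)                   # hashtags
        text = re.sub(r"[^a-z\s]", " ", text)               # symbols, emoticons, digits
        tokens = [t for t in text.split() if t not in STOP_WORDS]
        return " ".join(LEMMATIZER.lemmatize(t) for t in tokens)

    print(clean_text("I am walking to https://example.com #fitness :)"))  # -> "walking"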
Quite similar to other components of machine learning, data visualization is a crucial component, as it provides visual insight into the kind of data the model is dealing with. More accurate and predictable insights can be drawn from a model whose input data is well understood. As such, we created various visualizations of the input data to our model. We also did a comparative analysis of the data before and after the pre-processing step, to assess the impact pre-processing has on the data.
Figure 3. Frequency of words in articles before data pre-processing.

The figure above plots the article lengths: the x-axis shows the number of words in an article's text and the y-axis shows the count of articles with that length. The large lengths of some articles can be explained by characters added to the article text that do not count towards the total number of characters allowed.
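A matplotlib sketch of how such a plot can be produced, assuming the DataFrame loaded earlier (the column name 'text' is an assumption from the dataset description):

    import matplotlib.pyplot as plt

    # Histogram of article lengths (in words), as in Figure 3
    word_counts = df["text"].str.split().str.len()
    plt.hist(word_counts, bins=50)
    plt.xlabel("Number of words in article")
    plt.ylabel("Number of articles")
    plt.title("Article length distribution before pre-processing")
    plt.show()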
Algorithm 1: SumBasic
initialize an empty summary;
Step 1: compute the probability distribution over all words w_i in the document, p(w_i) = n_i / N, where n_i is the number of occurrences of w_i and N is the total number of words;
while the desired summary length is not reached do
    Step 2: calculate the weight of each sentence S_j as the average of the probabilities of the words it contains;
    Step 3: pick the sentence S_j which scores the best and add it to the summary;
    for each word w_i in S_j do
        update the probability of w_i: p_new(w_i) = p_old(w_i) · p_old(w_i);
    end
end
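A compact Python sketch of Algorithm 1; the tokenization and the summary-length stopping criterion are simplified assumptions:

    from collections import Counter

    def sumbasic(sentences, max_words=100):
        """Minimal SumBasic sketch. `sentences` is a list of pre-tokenized
        sentences (lists of lower-cased words); returns the selected sentences."""
        total = sum(len(s) for s in sentences)
        counts = Counter(w for s in sentences for w in s)
        p = {w: c / total for w, c in counts.items()}      # Step 1: p(w_i) = n_i / N

        summary, used = [], set()
        while sum(len(s) for s in summary) < max_words and len(used) < len(sentences):
            # Step 2: weight each remaining sentence by its average word probability
            scores = {i: sum(p[w] for w in s) / len(s)
                      for i, s in enumerate(sentences) if i not in used and s}
            if not scores:
                break
            best = max(scores, key=scores.get)             # Step 3: best-scoring sentence
            summary.append(sentences[best])
            used.add(best)
            for w in sentences[best]:                      # squash seen-word probabilities
                p[w] = p[w] * p[w]
        return summary

    # Hypothetical usage with three toy sentences:
    doc = [["cats", "like", "mats"], ["dogs", "like", "mats"], ["birds", "fly"]]
    print(sumbasic(doc, max_words=5))

The probability update in the inner loop is what prevents SumBasic from repeatedly selecting sentences dominated by the same frequent words.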
3.1.2. Seq2Seq. For textual data that involves sequential information, a sequence-to-sequence model can be developed. In our assignment, we built a text summarization model using the concept of sequence-to-sequence modelling, where the inputs were articles and the outputs were the predicted summaries of those articles. The figure below shows the sequence-to-sequence architecture.
Figure 5. A representative model diagram for Sequence-to-Sequence text summarization model. Image borrowed from Analytics Vidhya
The sequence-to-sequence model developed in the assignment has two core components: an encoder and a decoder. The encoder side was supplied with the tokenized articles as input, and the output side was supplied with the maximum vocabulary of the summaries, both built using the Keras tokenizer [6]. The training steps, sketched below, followed the standard encoder-decoder recipe.
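A minimal Keras sketch of such an encoder-decoder model (the layer sizes, vocabulary limits, and sequence lengths below are illustrative assumptions, not the exact values used):

    from tensorflow.keras.layers import Input, LSTM, Embedding, Dense
    from tensorflow.keras.models import Model

    MAX_ARTICLE_LEN = 400   # assumed maximum article length (tokens)
    MAX_SUMMARY_LEN = 40    # assumed maximum summary length (tokens)
    ARTICLE_VOCAB = 20000   # assumed tokenizer vocabulary sizes
    SUMMARY_VOCAB = 8000
    LATENT_DIM = 256

    # Encoder: embeds article tokens and compresses them into state vectors
    encoder_inputs = Input(shape=(MAX_ARTICLE_LEN,))
    enc_emb = Embedding(ARTICLE_VOCAB, LATENT_DIM)(encoder_inputs)
    _, state_h, state_c = LSTM(LATENT_DIM, return_state=True)(enc_emb)

    # Decoder: generates the summary token by token, seeded with encoder states
    decoder_inputs = Input(shape=(MAX_SUMMARY_LEN,))
    dec_emb = Embedding(SUMMARY_VOCAB, LATENT_DIM)(decoder_inputs)
    dec_outputs, _, _ = LSTM(LATENT_DIM, return_sequences=True, return_state=True)(
        dec_emb, initial_state=[state_h, state_c])
    outputs = Dense(SUMMARY_VOCAB, activation="softmax")(dec_outputs)

    model = Model([encoder_inputs, decoder_inputs], outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")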
3.1.3. Hyperparameter tuning and prediction on the test set. To find the optimal parameters for our machine learning algorithms, which were trained using the training and development datasets, we performed hyperparameter tuning. For the TensorFlow Keras model, we selected the parameters based on the lowest loss obtained during the epoch runs, and these parameters were used to predict the target summaries for the test file.
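As a sketch, the lowest-loss criterion can be expressed with standard Keras callbacks; the dataset variable names and the patience value below are illustrative assumptions:

    from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

    # Keep the weights from the epoch with the lowest validation loss
    callbacks = [
        EarlyStopping(monitor="val_loss", patience=2, restore_best_weights=True),
        ModelCheckpoint("best_seq2seq.h5", monitor="val_loss", save_best_only=True),
    ]

    # train_articles / dev_articles etc. are placeholders for the tokenized,
    # padded splits; summaries_in is the summary shifted right for teacher
    # forcing, summaries_out the summary shifted left (the targets).
    model.fit([train_articles, train_summaries_in], train_summaries_out,
              validation_data=([dev_articles, dev_summaries_in], dev_summaries_out),
              epochs=20, batch_size=64, callbacks=callbacks)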
3.1.4. Evaluation. Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is a metric for evaluating the automatic text summarization results of machine learning models. The evaluation is reference-based and thus relies on a reference summary against which the system-generated summary is compared. In the most basic terms, ROUGE scores a summary based on how much of the reference summary is recovered by the system summary, and on how much of the system summary is relevant to the reference. These two concepts are referred to as recall and precision, respectively [7]. To see why both are needed for the final score, consider a hypothetical pair of summaries: if the reference is "the cat sat on the mat" and the system output is "the cat was found right on the mat", then five of the six reference words are recovered (recall = 5/6 ≈ 0.83), while only five of the eight system words match the reference (precision = 5/8 ≈ 0.63).
These two quantities are combined into a single F-score:

    F-score = 2 · (precision · recall) / (precision + recall)

For the hypothetical pair above, this gives 2 · (5/8 · 5/6) / (5/8 + 5/6) = 5/7 ≈ 0.71. For the evaluation of the two models, we used ROUGE-1 and ROUGE-L and their respective F-scores to make the final conclusions about the choice of model. ROUGE-1 uses uni-grams of a summary to compare the system and reference summaries; a uni-gram is just a single word, so it compares the individual words of the summaries (as in the example mentioned above). In contrast to ROUGE-1, ROUGE-L uses the longest common subsequence of words to evaluate a summary. This reflects sentence-level word order and is thus a good complementary measure of summary quality.
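In practice, these scores can be computed with an off-the-shelf package. The sketch below uses the open-source rouge-score package, which is an assumption for illustration, not necessarily the implementation we used:

    # pip install rouge-score
    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    reference = "the cat sat on the mat"
    system = "the cat was found right on the mat"

    scores = scorer.score(reference, system)   # score(target, prediction)
    for name, s in scores.items():
        print(f"{name}: P={s.precision:.2f} R={s.recall:.2f} F={s.fmeasure:.2f}")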
In this project, we set out to compare two automatic text summarization methods: extraction and abstraction. For the extraction method we used SumBasic, and for the abstraction method we used Seq2Seq. We used WikiHow articles as our input dataset and generated summaries for those articles. Using the ROUGE evaluation metric, we compared the two models on ROUGE-1, ROUGE-L, and their respective F-scores. The scores for our models are tabulated in Tables 1 and 2. Based on the evaluation of the two models, we concluded that for the given WikiHow dataset, SumBasic performs marginally better.
4.0.1. Comparison of ROUGE between SumBasic and Seq2Seq. As shown in Tables 1 and 2, Seq2Seq has an average F-score of 23% for ROUGE-1 and 16% for ROUGE-L, whereas SumBasic has an average F-score of 29% for ROUGE-1 and 13% for ROUGE-L. To our surprise, we find that SumBasic outperforms, albeit marginally, the Seq2Seq model on ROUGE-1. SumBasic has an average ROUGE-1 recall of 23%, compared with 16% for Seq2Seq, indicating that SumBasic on average recovers 23% of the words that are in the reference summary. On average ROUGE-1 precision, Seq2Seq (52%) is somewhat ahead of SumBasic (45%). As mentioned in the model evaluation earlier, the precision of a model indicates how relevant the output results are; in our case, roughly half of the words in a generated summary are relevant.

Seq2Seq        Text 1   Text 2   Text 3   Average
ROUGE-1
  Precision     0.36     0.28     0.92     0.52
  Recall        0.14     0.16     0.17     0.16
  F-Score       0.20     0.20     0.29     0.23
ROUGE-L
  Precision     0.28     0.20     0.60     0.36
  Recall        0.11     0.12     0.11     0.11
  F-Score       0.15     0.15     0.19     0.16
TABLE 1. EVALUATION OF THE SEQ2SEQ MODEL USING PRECISION, RECALL, AND F-SCORE FOR ROUGE-1 AND ROUGE-L.
SumBasic       Text 1   Text 2   Text 3   Average
ROUGE-1
  Precision     0.41     0.51     0.44     0.45
  Recall        0.15     0.17     0.36     0.23
  F-Score       0.23     0.25     0.40     0.29
ROUGE-L
  Precision     0.19     0.30     0.15     0.21
  Recall        0.06     0.10     0.13     0.10
  F-Score       0.10     0.15     0.14     0.13
TABLE 2. EVALUATION OF THE SUMBASIC MODEL USING PRECISION, RECALL, AND F-SCORE FOR ROUGE-1 AND ROUGE-L.
It is quite interesting to find that SumBasic, a frequentist approach which does not use any high-end machine learning machinery such as feature engineering or hyperparameter tuning, outperforms Seq2Seq, a very versatile model which leverages a series of encoders and decoders. One of the key characteristics of SumBasic is its frequentist approach to generating extraction-based summaries. Briefly, the frequentist approach decides whether to add a sentence to the summary based on the score of the sentence, which is the average of the probabilities of all the words in that sentence. Therefore, sentences which contain highly frequent words will have higher scores and, by extension, a higher chance of being selected into the summary. This is a very basic concept; however, it has profound effects. It has previously been shown in the literature that this phenomenon, that highly frequent words are likely to be included in the summary, is to some extent also true of human-generated summaries [3].
[8] Vashisht, A. (2019, November 5). Edmundson Heuristic Method for text summarization. OpenGenus IQ: Learn Computer Science. https://iq.opengenus.org/edmundson-heuristic-method-for-text-summarization/
[9] Misra, S. (n.d.). Let's give some 'Attention' to Summarising Texts. Towards Data Science. Retrieved November 30, 2020, from https://towardsdatascience.com/lets-give-some-attention-to-summarising-texts-d0af2c4061d
[10] Nallapati, R., Zhou, B., Dos Santos, C., Gulcehre, C., & Xiang, B. (2016). Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond. 280–290. https://doi.org/10.18653/v1/K16-