Development and Evaluation of Stemming Algorithms

BSc Computing Proposal

Abstract:

The purpose of this project is to implement and assess the effectiveness of different Stemming algorithms in Java. In order to do this effectively, three Stemming algorithms (Lovins, 1968; Porter, 1980; and Paice/Husk, 1990) will be implemented into a single language and environment, such that they can be compared and contrasted for an overall fair evaluation. Stemming algorithms are used in natural language systems (e.g., IR systems) for stripping the endings from words, so that related words such as “oceaneering”, “oceanic”, “oceanics”, “oceanization”, “oceans” can be matched to their stem “ocean”. Currently Paice/Husk has only been implemented in Pascal and in C programming languages but not Java. However, the platform independence of Java makes it a valuable language in which to re-implement the stemming algorithms. Upon completion and proof of the implementations in Java of the three Stemming algorithms, a testing and evaluation process shall occur in which the stemmers will be compared and evaluated in a series of techniques, Identifiable errors during the stemming of words in a sample, the computing of indices of overstemming and understemming errors.

Author: Chris O’Neill
Supervisor: Dr. Chris D. Paice
June 1^st 2000

1 Introduction

The aim of this project is to implement and assess the effectiveness of three different Stemming algorithms (Lovins, 1968; Porter, 1980; and Paice/Husk, 1990) in Java. The Stemming algorithms will be programmed in the manner of software components such that they will have strong object cohesion. This will be in order that the same software driver programs can be used to interface with the algorithms, which will import the text file of words and output the results of the stemming operation.

The advantage stemming from this will be that the algorithms will be made easily transferable for future work, or use in other products where a stemming algorithm is required. Examples of products using stemming algorithms would be search engines, thesauruses and other products using natural language processing for the purpose of information retrieval. Therefore the algorithms will be programmed in Java as stemming algorithms programmed in java are in much demand at the present time. The benefit of implementing these stemmers in Java would be that it is an object orientated language, and hence allows for the strong cohesion and lose coupling that will be required for the algorithms to be programmed as components, in order that they can be utilised subsequently at a later date. A stage of validation and verification will be undertaken after the algorithms have been programmed in Java. This will ensure that the algorithms have been implemented correctly through formal testing methods and defect testing methods. In conclusion, once all three algorithms have been implemented and tested, they will be evaluated. The evaluation will consist of the comparison and contrasting of the algorithms in various ways recently defined within papers (Paice 1996, Fox and Frakes 1999 ). The results of this will then be compiled and presented in a report with suggested conclusions drawn.

Section 2 of this report describes in more detail the background and research, which has been carried out in this area. Section 3 elaborates upon the methods and aims of this project, Section 4 describes a program of work for the project. Section 5 details the resources needed. Section 6 lists references used in this report.

2 Problem Domain

In the context of this project, the problem domain concerns several areas. These areas being: the different stemmers, types of stemming error, evaluation methods of stemmers, the choice of text to be stemmed, and it’s correct grouping. These areas are best served by being covered individually.

2.1 Different types of Stemmer

There are three main different types of Stemmer: Lovins, Porter and Paice/Husk. Each of these three stemmers work in different ways, such that they all have different properties and stemming rules, which result in output from each. Moreover, their stemming performance can vary depending on the given input.

Frequently, programmers of Natural Language Processing Systems or other systems using IR stemming methods have problems in deciding which Stemmer to use. The problem being that they have to decide upon which Stemmers are available in the language in which they are programming, and which would perform best under, the circumstances that it is likely to be under in their given project. At the present time, this is made difficult by the fact that only the Porter algorithm is implemented in java. These particular implementations are known to have errors, and there is limited information of how in general the stemmers perform as little work has been done on contrasting and comparing the different stemmers.

2.2 Stemming Errors

Due to the nature of modern day English language, there are many grammatical anomalies specific to the language, for example there are words that do not conform to the rules of regular grammar. Hence because of this, no Stemmer can be expected to work perfectly. Therefore there are cases where words that should be grouped together are not (understemming), and words that shouldn’t be grouped are (overstemming).

For example given the words “Divide”, “Dividing”, “Divided”, “Division”, “Divisor”, “Divine” and “Divination”, Ideally we would use stemming to group the words into two groups “Divide”, “Dividing”, “Divided”, “Division” and “Divisor” in one group, “Divine” and “Divination” in the other group.

However if a simple truncating stemming algorithm is used to truncate to five letter long stems then we have three groups “Divide”, “Dividing” and “Divided” in one group, “Division” and “Divisor” in a second group, and “Divine” and “Divination” in a third. This is understemming as the first and second groups should be joined and they are not.

If the same type of algorithm is used to truncate to four letter long stems then there is only one group with all the words in the same group. This is overstemming as the words “Divine” and “divination should be in a second group (see figure 1 bellow).

FULL WORD	Truncate (5)	Truncate (4)
Divide	Divid	Divi
Dividing	Divid	Divi
Divided	Divid	Divi
Division	Divis	Divi
Divisor	Divis	Divi

Divine	Divin	Divi
Divination	Divin	Divi

Figure 1: Stem Diagram

2.3 Methods of Evaluation

There are two main methods currently available for evaluating stemmers. Firstly there is information retrieval performance such that the Stemmer is used in an information retrieval system and it’s performance monitored and evaluated. Secondly there is evaluation based on error counting where indices of the Stemmer are generated by assessing the understemming and overstemming errors (Paice 1996). Other less well-known methods are available, such as those described in Fox & Frakes 1999, to do with Stemmer strength and similarity metrics.

2.4 Choice of Text

The Grouping of sample words is one of the most problematic issues within Stemmer evaluation. As the actual correct grouping of words to different stems can prove a contentious issue, for impartial reasons, it is desirable to have these samples grouped by independent groups from the project.

3 Proposed Project

The project should be completed according to a predefined plan of action and programme of work, including full documentation to support each stage of the work.

3.1 Methodology

The project will be developed using a waterfall method. This method will be used because of its structured nature. This will assist in the planning and completion of the project within its limited time scale and will also assist in the documentation of the project development. The Stemming Algorithms have already been implemented in languages other than java and as such have clear specifications within papers. Therefore the requirements and design stages of the algorithms should be solid, which is a requirement of the waterfall method. After implementation of the algorithms they will have to be rigorously tested to ensure that they are implemented correctly. The software engineering approach will be of an object-orientated nature as it is more suitable than a functional approach. With objects readily identifiable and it provides the strong coherence and loose coupling of the components such as the individual stemmers that the project requires.

3.2 Aims

The aims of this project are as follows:

Implementation of Stemmers in Java:

The project will implement the three stemmers of Porter, Lovins and Paice/Husk in Java. This is required as stemmers in java are in much demand. Currently only Porters Stemmer has been implemented in Java. However, the numerous versions of this implementation, have been identified as being problematic, that is, the correct results as given by the algorithm are not given by these implementations. The nature of this implementation should be as components such that they are implemented with strong coherence and loose coupling. This is in order that the same tools can be used to test all three different stemmers. The other purpose for utilising this method is that it will enable the stemmers to be easily transferred into other java programs if desired for future use.

Provide a set of tools in Java for the purpose of evaluating stemmers:

The project will implement a set of tools in Java for the purpose of testing and evaluating the stemmers. This is required to evaluate the stemmers that will be implemented in Java. These tools will be required to be able to interface with the stemming components, to allow the user to select the Stemmer to be used, to select the text file and the output method. It would also be advantageous for the output method to include some statistical analysis of the results of the Stemmer, which can be then viewed by the user and documented.

Evaluation of stemmers in scientific fair tests:

The Evaluation of the stemmers in scientific fair tests, this is required as it is valuable to know before the implementation of a Stemmer in a project or programme the answer to the question, “which is the best Stemmer to use?” Without this information, the only way to answer this question this is to implement the project more than once with different stemmers, and to compare the performance this is obviously very costly and time consuming. Therefore, with this in mind, it is an aim of this project to produce information relating to the performance and evaluation of the three stemmers Lovins, Porter and Paice/Husk

4 Programme of Work

4.1 Schedule

Figure 2 illustrates the work schedule in order to complete the Aims above. The dissertation lasts a total of twenty weeks, starting on 10^th October 2001. The work schedule follows a standard approach, which can be customised where necessary.

		WEEK
TASK		1	2	3	4	5	6		7	8	9		10	11	12	13	14	15	16	17		18	19	20
1	Java and C Orientation
2	Algorithm Analysis	1^st Report
3	Algorithm Implementation
4	Algorithm Testing & Correction							2^nd Report
5	Evaluation Methods Analysis							3^rd Report
6	Tools Design											4^th Report
7	Tools Implementation
8	Tools Testing & Correction																5^th Report
9	Algorithm Evaluation																		6^th Report
10	Final Report																				7^th Report

Figure 2: Work Schedule

4.2 Tasks

1. Java and C Orientation

Successful completion of the project will require further accumulation of knowledge in the areas of Java and C programming so that the analysis of the current implementations of the stemmers can be completed. Moreover without this further knowledge the completion of the implementation of the stemmers and the tools would prove unworkable. For these reasons I have allocated a large period of time for this task (see figure 2) that covers the time periods from the start of the project to the end of the testing at which point the code should be completed and no more changes will need to be made.

2. Algorithm Analysis

In order to enable successful completion of the project, thorough analysis of the algorithms and their previous implementations in other languages is needed. This will be in order to produce the plan of their design (see figure 2, 1^st Report), which will be used in the implementation of the stemmers (see task 3).

3. Algorithm Implementation

The Implementation of the stemmers in Java is one of the main aims of the project should take three weeks (see figure 2) due to the clear nature of the design plan of the stemming algorithms (see task 2).

4. Algorithm Testing & Correction

On completion of the implementation of the stemmers in Java a period of testing and correction will be required in order to confirm that that implemented Stemmer is true to the algorithm. This will require testing using formal and defecting testing methods. Three weeks has been allocated for this and the production of testing reports (see figure 2, 2^nd Report).

5. Evaluation Methods Analysis

The Project will require a period of time for the analysis of the different methods of Evaluation of stemmers such that requirement can be drawn up for the evaluation tools three and a half weeks has been allocated for this (see figure 2, 3^rd Report).

6. Tools Design

The Tools Design phase will encompass the designing of the software tools that will be used to evaluate the stemmers. This will use the information from the implementation of the stemmers in java, together with the analysis of the evaluation methods. This should take two weeks (see figure 2, 4^th Report)

7. Tools Implementation

The implementation of the Tools will be in Java. This phase will use information from the design and analysis phases described above. This stage should take three weeks (see figure 2)

8. Tools Testing and Corrections

The Tools will be tested both structurally and stress tested with test data to ensure they are working correctly and accurately. The Interface will be tested to ensure that it is easily usable. Two weeks has been allocated for these tasks, together with and any corrections to the code that are required (see figure 2, 4^th Report).

9. Algorithm Evaluation

The Evaluation of the Stemming Algorithms implemented in Java is the main aim of the project. During this phase the evaluation of the stemmers will take place and a report will be produced giving the results of the evaluation. This should take approximately one and a half weeks (see figure 2, 5^th Report).

10. Write Final Report

The final report will be compiled covering the project in its entirety from a completed perspective. This will be compiled from the previous progress reports and the project diary. This has been allocated three and a half weeks (see figure 2, 6^th Report)

4.3 Deliverables

The deliverables will consist of the three stemming algorithms with test data, the algorithm evaluation report and full development documentation including four progress reports and the final report.

5 Resources

· Pentium 200MMX 64MB RAM PC – O/S Windows 95 or Linux Mandrake 7

For developing and testing software and producing reports

· Star Office or Microsoft Word

For producing reports

· Simple text editor.

To code the software

· Java Integrated Development Environment (IDE) Includes compiler - current version: 1.3

For the development of software

6 References

6.1 Online References

Porter Stemming Algorithm

· http://www.muscat.com/~martin/stem.html

ACL's Natural Language Processing Software Registry

· http://www.dfki.com/lt/registry/morphan-over.html

Stemming and Co-occurrence on a Larger Corpus,

http://ciir.cs.umass.edu/whatsnew/stemming.html

Interactive Information Text Classification,

http://ai.iit.nrc.ca/II_public/classification

The Microsoft Research Institute Web IR & IE,

http://www.mir.mq.edu.au/~einat/web_ir/

Java.Sun, Sun Mircosystems,

http://java.sun.com

6.2 Books

Frakes, W. B. & Baeza-Yates, R. (eds.) "Information Retrieval: Data Structures & Algorithms" Prentice-Hall 1992. ISBN 0-13-463873-9 {1-> 2}

Kowalski, G. "Information Retrieval Systems: Theory & Implementation" Kluwer 1997. ISBN 0-7923-9926-9 {1}

Paice, Chris D. “Information Retrieval and the Computer” Macdonald and Jane’s 1977. ISBN 0-345-04095-2

Salton, G. & McGill, M. "An Introduction to Modern Information Retrieval" McGraw-Hill 1983. ISBN 0-07-054484-0 {3: out of print}

6.3 Papers

Hull, D.A. “Stemming Algorithms: A Case Study For Detailed Evaluation.” Journal of the American Society for Information Science 47 (1), 1996, 70-84.

Fox, C.J. and Frakes, W.B.“Strength and Similarity of Affix Removal Stemming Algorithms” 1999.

Lovins, J.B. “Development of a Stemming Algorithm.” Mechanical Translation and Computational Linguistics 11, 1968, 22-31.

Paice, Chris D. “Another Stemmer.” SIGIR Forum 24 (3), 1990, 56-61.

Paice, Chris D. “Method for Evaluation of Stemming Algorithms Based on Error Counting.” Journal of the American Society for Information Science 47(8), August 1996, 632-649.

Porter, M.F. “An Algorithm for Suffix Stripping.” Program 14, 1980, 130-137.