Development and Evaluation of Stemming
Algorithms

BSc Computing Proposal
Abstract:
The purpose of this project is to implement and assess the effectiveness of different Stemming algorithms in Java. In order to do this effectively, three Stemming algorithms (Lovins, 1968; Porter, 1980; and Paice/Husk, 1990) will be implemented into a single language and environment, such that they can be compared and contrasted for an overall fair evaluation. Stemming algorithms are used in natural language systems (e.g., IR systems) for stripping the endings from words, so that related words such as “oceaneering”, “oceanic”, “oceanics”, “oceanization”, “oceans” can be matched to their stem “ocean”. Currently Paice/Husk has only been implemented in Pascal and in C programming languages but not Java. However, the platform independence of Java makes it a valuable language in which to re-implement the stemming algorithms. Upon completion and proof of the implementations in Java of the three Stemming algorithms, a testing and evaluation process shall occur in which the stemmers will be compared and evaluated in a series of techniques, Identifiable errors during the stemming of words in a sample, the computing of indices of overstemming and understemming errors.
|
|
Author: Chris O’Neill |
|
|
The aim of this
project is to implement and assess the effectiveness of three different
Stemming algorithms (Lovins, 1968; Porter, 1980; and Paice/Husk, 1990) in Java.
The Stemming algorithms will be programmed in the manner of software components
such that they will have strong object cohesion. This will be in order that the
same software driver programs can be used to interface with the algorithms,
which will import the text file of words and output the results of the stemming
operation.
The advantage
stemming from this will be that the algorithms will be made easily transferable
for future work, or use in other products where a stemming algorithm is
required. Examples of products using stemming algorithms would be search
engines, thesauruses and other products using natural language processing for
the purpose of information retrieval. Therefore the algorithms will be
programmed in Java as stemming algorithms programmed in java are in much demand
at the present time. The benefit of implementing these stemmers in Java would
be that it is an object orientated language, and hence allows for the strong
cohesion and lose coupling that will be required for the algorithms to be
programmed as components, in order that they can be utilised subsequently at a
later date. A stage of validation and verification will be undertaken after the
algorithms have been programmed in Java. This will ensure that the algorithms
have been implemented correctly through formal testing methods and defect
testing methods. In conclusion, once all three algorithms have been implemented
and tested, they will be evaluated. The evaluation will consist of the
comparison and contrasting of the algorithms in various ways recently defined
within papers (Paice 1996, Fox and Frakes 1999 ). The results of this will then
be compiled and presented in a report with suggested conclusions drawn.
Section 2 of this
report describes in more detail the background and research, which has been
carried out in this area. Section 3 elaborates upon the methods and aims of
this project, Section 4 describes a program of work for the project. Section 5
details the resources needed. Section 6 lists references used in this report.
In the context of this project, the problem domain concerns
several areas. These areas being: the different stemmers, types of stemming
error, evaluation methods of stemmers, the choice of text to be stemmed, and
it’s correct grouping. These areas are best served by being covered
individually.
There are three main different types of Stemmer: Lovins,
Porter and Paice/Husk. Each of these three stemmers work in different ways,
such that they all have different properties and stemming rules, which result
in output from each. Moreover, their stemming performance can vary depending on
the given input.
Frequently, programmers of Natural Language Processing
Systems or other systems using IR stemming methods have problems in deciding
which Stemmer to use. The problem being that they have to decide upon which
Stemmers are available in the language in which they are programming, and which
would perform best under, the circumstances that it is likely to be under in
their given project. At the present time, this is made difficult by the fact
that only the Porter algorithm is implemented in java. These particular
implementations are known to have errors, and there is limited information of
how in general the stemmers perform as little work has been done on contrasting
and comparing the different stemmers.
Due to the nature
of modern day English language, there are many grammatical anomalies specific
to the language, for example there are words that do not conform to the rules
of regular grammar. Hence because of this, no Stemmer can be expected to work
perfectly. Therefore there are cases where words that should be grouped
together are not (understemming), and words that shouldn’t be grouped
are (overstemming).
For example given
the words “Divide”, “Dividing”, “Divided”, “Division”, “Divisor”, “Divine” and
“Divination”, Ideally we would use stemming to group the words into two groups
“Divide”, “Dividing”, “Divided”, “Division” and “Divisor” in one group,
“Divine” and “Divination” in the other group.
However if a
simple truncating stemming algorithm is used to truncate to five letter long
stems then we have three groups “Divide”, “Dividing” and “Divided” in one
group, “Division” and “Divisor” in a second group, and “Divine” and
“Divination” in a third. This is understemming as the first and second
groups should be joined and they are not.
If the same type
of algorithm is used to truncate to four letter long stems then there is only
one group with all the words in the same group. This is overstemming as
the words “Divine” and “divination should be in a second group (see figure 1
bellow).
|
FULL WORD |
Truncate (5) |
Truncate (4) |
|
Divide |
Divid |
Divi |
|
Dividing |
Divid |
Divi |
|
Divided |
Divid |
Divi |
|
Division |
Divis |
Divi |
|
Divisor |
Divis |
Divi |
|
|
|
|
|
Divine |
Divin |
Divi |
|
Divination |
Divin |
Divi |
Figure 1: Stem Diagram
There are two
main methods currently available for evaluating stemmers. Firstly there is
information retrieval performance such that the Stemmer is used in an
information retrieval system and it’s performance monitored and evaluated.
Secondly there is evaluation based on error counting where indices of the
Stemmer are generated by assessing the understemming and overstemming errors
(Paice 1996). Other less well-known methods are available, such as those
described in Fox & Frakes 1999, to do with Stemmer strength and similarity
metrics.
The Grouping of
sample words is one of the most problematic issues within Stemmer evaluation.
As the actual correct grouping of words to different stems can prove a
contentious issue, for impartial reasons, it is desirable to have these samples
grouped by independent groups from the project.
The project
should be completed according to a predefined plan of action and programme of
work, including full documentation to support each stage of the work.
The project will be developed using a waterfall method. This
method will be used because of its structured nature. This will assist in the
planning and completion of the project within its limited time scale and will
also assist in the documentation of the project development. The Stemming
Algorithms have already been implemented in languages other than java and as
such have clear specifications within papers. Therefore the requirements and
design stages of the algorithms should be solid, which is a requirement of the
waterfall method. After implementation of the algorithms they will have to be
rigorously tested to ensure that they are implemented correctly. The software
engineering approach will be of an object-orientated nature as it is more
suitable than a functional approach. With objects readily identifiable and it
provides the strong coherence and loose coupling of the components such as the
individual stemmers that the project requires.
The aims of this project are as follows:
Implementation of Stemmers in Java:
The project will implement the three stemmers of Porter,
Lovins and Paice/Husk in Java. This is required as stemmers in java are in much
demand. Currently only Porters Stemmer has been implemented in Java. However,
the numerous versions of this implementation, have been identified as being
problematic, that is, the correct results as given by the algorithm are not
given by these implementations. The nature of this implementation should be as
components such that they are implemented with strong coherence and loose
coupling. This is in order that the same tools can be used to test all three
different stemmers. The other purpose for utilising this method is that it will
enable the stemmers to be easily transferred into other java programs if
desired for future use.
Provide a set of tools in Java for the purpose of evaluating
stemmers:
The project will implement a set of tools in Java for the
purpose of testing and evaluating the stemmers. This is required to evaluate
the stemmers that will be implemented in Java. These tools will be required to
be able to interface with the stemming components, to allow the user to select
the Stemmer to be used, to select the text file and the output method. It would
also be advantageous for the output method to include some statistical analysis
of the results of the Stemmer, which can be then viewed by the user and
documented.
Evaluation of stemmers in scientific fair tests:
The Evaluation of the stemmers in scientific fair tests,
this is required as it is valuable to know before the implementation of a Stemmer
in a project or programme the answer to the question, “which is the best
Stemmer to use?” Without this information, the only way to answer this
question this is to implement the project more than once with different
stemmers, and to compare the performance this is obviously very costly and time
consuming. Therefore, with this in mind, it is an aim of this project to
produce information relating to the performance and evaluation of the three
stemmers Lovins, Porter and Paice/Husk
Figure 2
illustrates the work schedule in order to complete the Aims above. The
dissertation lasts a total of twenty weeks, starting on 10th October
2001. The work schedule follows a standard approach, which can be customised
where necessary.
|
|
WEEK |
||||||||||||||||||||||||||
|
TASK |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
14 |
15 |
16 |
17 |
18 |
19 |
20 |
|||||||
|
1 |
Java and C Orientation |
|
|
||||||||||||||||||||||||
|
2 |
Algorithm Analysis |
1st Report |
|
||||||||||||||||||||||||
|
3 |
Algorithm Implementation |
|
|
|
|||||||||||||||||||||||
|
4 |
Algorithm Testing & Correction |
|
2nd Report |
|
|||||||||||||||||||||||
|
5 |
Evaluation Methods Analysis |
|
3rd Report |
|
|||||||||||||||||||||||
|
6 |
Tools Design |
|
4th Report |
|
|||||||||||||||||||||||
|
7 |
Tools Implementation |
|
|
|
|||||||||||||||||||||||
|
8 |
Tools Testing & Correction |
|
5th Report |
|
|||||||||||||||||||||||
|
9 |
Algorithm Evaluation |
|
6th Report |
|
|||||||||||||||||||||||
|
10 |
Final Report |
|
7th Report |
||||||||||||||||||||||||
Figure 2: Work Schedule
1. Java and C Orientation
Successful completion of the project will require further
accumulation of knowledge in the areas of Java and C programming so that the
analysis of the current implementations of the stemmers can be completed.
Moreover without this further knowledge the completion of the implementation of
the stemmers and the tools would prove unworkable. For these reasons I have
allocated a large period of time for this task (see figure 2) that covers the
time periods from the start of the project to the end of the testing at which
point the code should be completed and no more changes will need to be
made.
2. Algorithm Analysis
In order to enable successful completion of the project,
thorough analysis of the algorithms and their previous implementations in other
languages is needed. This will be in order to produce the plan of their design
(see figure 2, 1st Report), which will be used in the implementation
of the stemmers (see task 3).
3. Algorithm Implementation
The Implementation of the stemmers in Java is one of the
main aims of the project should take three weeks (see figure 2) due to the
clear nature of the design plan of the stemming algorithms (see task 2).
4. Algorithm
Testing & Correction
On completion of the implementation of the stemmers in Java
a period of testing and correction will be required in order to confirm that
that implemented Stemmer is true to the algorithm. This will require testing
using formal and defecting testing methods. Three weeks has been allocated for
this and the production of testing reports (see figure 2, 2nd
Report).
5. Evaluation
Methods Analysis
The Project will require a period of time for the analysis
of the different methods of Evaluation of stemmers such that requirement can be
drawn up for the evaluation tools three and a half weeks has been allocated for
this (see figure 2, 3rd Report).
6. Tools
Design
The Tools Design phase will encompass the designing of the
software tools that will be used to evaluate the stemmers. This will use the
information from the implementation of the stemmers in java, together with the
analysis of the evaluation methods. This should take two weeks (see figure 2, 4th
Report)
7. Tools Implementation
The implementation of the Tools will be in Java. This phase
will use information from the design and analysis phases described above. This
stage should take three weeks (see figure 2)
8. Tools Testing and Corrections
The Tools will be tested both structurally and stress tested
with test data to ensure they are working correctly and accurately. The
Interface will be tested to ensure that it is easily usable. Two weeks has been
allocated for these tasks, together with and any corrections to the code that
are required (see figure 2, 4th Report).
9. Algorithm
Evaluation
The Evaluation of the Stemming Algorithms implemented in
Java is the main aim of the project. During this phase the evaluation of the
stemmers will take place and a report will be produced giving the results of
the evaluation. This should take approximately one and a half weeks (see figure
2, 5th Report).
10. Write Final Report
The final report will be compiled covering the project in
its entirety from a completed perspective. This will be compiled from the
previous progress reports and the project diary. This has been allocated three
and a half weeks (see figure 2, 6th Report)
The deliverables will consist of the three stemming
algorithms with test data, the algorithm evaluation report and full development
documentation including four progress reports and the final report.
·
Pentium 200MMX 64MB RAM PC – O/S Windows 95 or Linux Mandrake 7
For developing and testing software and producing
reports
·
Star Office or Microsoft Word
For producing reports
·
Simple text editor.
To code the software
· Java Integrated Development Environment (IDE) Includes compiler - current version: 1.3
For the development of software
Porter Stemming Algorithm
·
http://www.muscat.com/~martin/stem.html
ACL's Natural Language
Processing Software Registry
· http://www.dfki.com/lt/registry/morphan-over.html
Stemming and Co-occurrence on a Larger Corpus,
Interactive Information Text Classification,
The Microsoft Research Institute Web IR & IE,
Java.Sun, Sun Mircosystems,
Frakes, W. B. & Baeza-Yates, R. (eds.) "Information Retrieval: Data Structures & Algorithms" Prentice-Hall 1992. ISBN 0-13-463873-9 {1-> 2}
Kowalski, G.
"Information Retrieval Systems: Theory & Implementation" Kluwer 1997. ISBN 0-7923-9926-9 {1}
Paice, Chris D. “Information Retrieval and the Computer” Macdonald and Jane’s 1977. ISBN 0-345-04095-2
Salton, G. & McGill, M. "An Introduction to Modern Information Retrieval" McGraw-Hill 1983. ISBN 0-07-054484-0 {3: out of print}
Hull, D.A. “Stemming Algorithms: A Case
Study For Detailed Evaluation.” Journal of the American Society for Information
Science 47 (1), 1996, 70-84.
Fox, C.J. and Frakes, W.B.“Strength and
Similarity of Affix Removal Stemming Algorithms” 1999.
Lovins,
J.B. “Development of a Stemming Algorithm.” Mechanical Translation and
Computational Linguistics 11, 1968, 22-31.
Paice,
Chris D. “Another Stemmer.” SIGIR Forum 24 (3), 1990, 56-61.
Paice,
Chris D. “Method for Evaluation of Stemming Algorithms Based on Error
Counting.” Journal of the American Society for Information Science 47(8),
August 1996, 632-649.
Porter,
M.F. “An Algorithm for Suffix Stripping.” Program 14, 1980, 130-137.