
PorterStemmer.d

I’ve been interested in NLP for a while, and I’ve been into D for almost a whole month! I wanted to implement an NLP subtask and continue familiarizing myself with D. A simple (but not trivial) algorithm to implement is the Porter stemmer. A stemmer takes words (English in this case) and resolves them down to a common form.

Ex:

  • resolution -> resolut
  • resolute -> resolut

Notice that the result of the stem is not a valid English word. The Porter stemmer uses a deterministic algorithm that follows a simple set of translations and calculations to remove suffixes. It’s a relatively simple implementation: it only took a couple of hours to write, despite running a couple hundred lines. I also wanted some confirmation of D’s performance superiority over the preeminent existing library implementation, the Porter stemmer in Python’s NLTK.
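To give a flavor of the rules, here is a minimal sketch (not my full implementation) of how one such translation, step 1a from the paper, might look in D:

  import std.algorithm.searching : endsWith;

  // step 1a of the paper: SSES -> SS, IES -> I, SS -> SS, S -> (nothing);
  // only the longest matching suffix is applied
  string step1a(string word){
    if (word.endsWith("sses")) return word[0 .. $ - 2]; // caresses -> caress
    if (word.endsWith("ies"))  return word[0 .. $ - 2]; // ponies   -> poni
    if (word.endsWith("ss"))   return word;             // caress   -> caress
    if (word.endsWith("s"))    return word[0 .. $ - 1]; // cats     -> cat
    return word;
  }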

In this article I will discuss my implementation of the Porter stemmer, the necessity and qualitative experience of constructing the test suite, and the benchmark I used to compare against the NLTK implementation.

Differences from the Python NLTK Porter stemmer

The Python implementation differs from the standard by handling a couple of cases differently. For example, the “fully” suffix is stripped in the Python implementation, but not in the standard:

  • (carefully -> carefulli) standard
  • (carefully -> care) Python

I decided to stick to the standard Porter stemmer, as defined in the original paper. I do, however, see the advantages of the Python decisions. One case where I sided with Python was skipping words with fewer than 3 characters. The Python stemmer doesn’t process words under three characters, and that seemed right to me (“is” should not stem to “i”).
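In code, that decision amounts to a trivial guard at the top of the stemmer. A minimal sketch, with the actual steps elided:

  string stem(string word){
    // match NLTK: words under three characters are returned unchanged
    if (word.length < 3) return word;
    // ... the five suffix-stripping steps follow ...
    return word;
  }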

Tests

This algorithm would have been impossible to implement without a test suite. Firstly, the definition is far from clear; the notation only became clear to me after testing, and it is far from consistent. For example, a representation of the stem is sometimes shown on the left-hand side of a rule but missing from the right, even though the stem should not be removed. Other times the stem is omitted from the left-hand side. I also had to look at the source of the NLTK implementation to realize that the measure refers to the measure of the candidate stem, not of the whole word.
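For reference, the definition views a word as [C](VC)^m[V], and the measure m counts the VC sequences. A minimal sketch of computing it over a candidate stem, assuming ASCII input and treating ‘y’ as a vowel when it follows a consonant:

  import std.algorithm.searching : canFind;

  // count the vowel-consonant sequences in a candidate stem
  size_t measure(string s){
    size_t m = 0;
    bool prevVowel = false;
    foreach (i, c; s){
      bool vowel = "aeiou".canFind(c) || (c == 'y' && i > 0 && !prevVowel);
      if (prevVowel && !vowel) ++m; // closed off another VC sequence
      prevVowel = vowel;
    }
    return m;
  }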

Still, by constructing from the top down, following the paper’s examples as test cases, updating the expected output as the number of implemented steps grew, and checking against the NLTK implementation, I was eventually able to pin down the definition.
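D’s built-in unittest blocks made that workflow pleasant; the paper’s examples translate directly into assertions against my stem function. A small sketch:

  unittest {
    assert(stem("caresses") == "caress");
    assert(stem("ponies") == "poni");
    assert(stem("resolution") == "resolut");
    assert(stem("resolute") == "resolut");
  }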

Benchmarks

To compare my implementation with the NLTK version, I wanted to isolate the actual stemming code as much as possible and make the two benchmarks as similar as possible. Below are excerpts from the D benchmark and the Python version.

D:

  // imports needed by this excerpt
  import std.stdio : writeln;
  import std.file : readText;
  import std.regex : regex, split;
  import std.datetime.stopwatch : benchmark;

  // load the text and split into alphanumeric tokens
  auto txt = readText("./test/de-bello-gallico.txt");
  auto r = regex(r"[^\w]+");
  auto tokens = split(txt, r);
  string[] stems;

  // loop over the tokens, stemming each and filling the output
  void f1 (){
    stems = [];
    foreach(t; tokens){
      stems ~= stem(t);
    }
    writeln("Cycle completed...");
  }

  writeln("\nRunning Benchmark...\n");
  int cycles = 10;

  // run the function 10 times
  auto res = benchmark!(f1)(cycles);
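benchmark returns one Duration per benched function; a small sketch of reading it out, assuming the std.datetime.stopwatch version of benchmark:

  // total time across all 10 cycles, and the per-cycle average
  writeln(res[0].total!"msecs", " ms total, ",
          res[0].total!"msecs" / cycles, " ms per cycle");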

Python:

# imports needed by this excerpt
import re
import time
import nltk

# load file and split into alphanumeric tokens
p = nltk.PorterStemmer()
b = re.compile(r"\b\w+\b")
tokens = b.findall(open("./text/de-bello-gallico.txt").read().lower())

# loop tokens and fill the output (run 10 times)
def test(tokens):
    stems = []
    print("call...")
    for i in range(0, 10):
        stems = []
        print("loop...", i)
        for t in tokens:
            stems.append(p.stem(t))

    return stems

start = time.time()
stems = test(tokens)
end = time.time()

You’ll notice a couple of places where similarity between the two tests was given pride of place. Firstly, the text is loaded into memory all at once, instead of streamed or loaded in chunks. Also, the text is split all at once; tokens are not found one at a time. My intuition is that handling tokens as they appear in the text would be faster, but I didn’t want any advantage D might have in IO to influence the results.

One noticeable difference is that the D program loops outside of the benchmarked method, while the Python function loops inside. My intuition is that this would only slow the D benchmark relative to the Python one, and in any case the call overhead is dwarfed by the time spent in the stemming method.

Performance

As you can see above, the text of De Bello Gallico is stemmed 10 times in each bench. The text totals around 87k tokens, bringing the grand total to 870k tokens per test. The Python stemmer took roughly 0.5 seconds to stem every word; my implementation took 0.3 seconds. That works out to over 1.5 million words per second for the NLTK implementation (870k / 0.5 s ≈ 1.7M) and over 2 million for the D implementation (870k / 0.3 s ≈ 2.9M). Significantly faster, but not too astounding.

Conclusion

I’m not sure that I’ll be doing exploratory NLP work in D; having to implement each tool would be a major hassle, and list comprehensions are really useful. But I will be looking for more reasons to use D. There are other features of the language I’m excited to explore.
