Dlang D C Python NLP Stemming

PorterStemmer.d: Working on my Bench (C, Python, D)

by Ross Harrison

Feb 27, 2016

In my previous post I compared my D implementation of the PorterStemmer to the Python NLTK version. The results were ok, the D stemmer ran almost twice as fast.

However, John Colvin was disatified by the performance of my implementation. With his changes He was able to reduce the runtime to an astounding 0.04s per cycle, down from 0.6s. Quite, the speedup. I was excited to try it out against the C implementation provided by Martin Porter himself.

D implementation changes

This was quite impressive, considering the changes were relatively simple. The three major changes were adjustments to iterations, changes in the character replacement, and using static tuples for the adjustment pairs.

Loop Changes

The loop adjustments involved moving from for to foreach, even for indexed loop, ex:

foreach (i; 1 .. len)
  ...

This didn’t just improve the runtime, it is also more idiomatic for D, and is less prone to error than a traditional for.

Removing Suffixes

Previously I was removing suffixes by adjusting the length of the string s.length--. Using slices of the source string proved to be a significant improvement s = s[0 .. $-1].

Substring replacement

John’s final improvement was to use slice indices to directly replace the characters instead of using unicode compatible methods. i was originally replacing substrings like this replaceInPlace(s, s.length - 1, s.length, "i");. John simply used slices s[$-1] = 'i';. I was initially concerned about breaking compatibility with the text set I like to work with. Below I go over those concerns.

#Python updates

I also spend a small amount of time (like 5 minutes) speeding up the python benchmark. By adjusting the benchmark script to use list comprehension (instead of appending to an array, I was able to reduced runtime from 0.7 to about 0.5 seconds per iteration. This is still over 10x slower than the updated D implementation.

Comparison with C implementation

Initially I wanted to test against the Snowball generated implementation. However, it turned out to be a massive pain to get working. So I opted to go with the version manually coded by Dr. Porter.

The C implementation is doing a bit more work than the Python or D language benchmarks. The stemfile function is reading the file for strings, as well as stemming each string. Originally it was also printing to stdout. In order to make them more similar, I ended up removing output from the D and Python implementations to make the benchmarks more similar. This had a dramatic effect on the C, but not the D or Python benchmarks (likely because they were using a more efficient output method).

The C implementation is indeed fast. It completes each cycle in around 0.015s, 2x faster than the D implementation. This isn’t too surprising.

Still unicode compatible?

Initially, I thought that I needed to use handle unicode compatibility for a historical text set that I like to work with. This text contains mostly English characters. But also some extra Latin, Germanic, French, Greek, and other characters from western and southern European languages. However, when I began to test this assumption against the C and D implementations, I was surprised. Both implementations handled a wide variety of characters that I assumed would crash. The below example passed in both implementations

"⚷☧Œstraße" -> "⚷☧Œstraß")

This, despite using direct access to the string indices. The only conclusion I can come up with is that I don’t really understand what unicode is…..

Conclusion

With a couple of changes the D implementation was able to run in one tenth the time. Bringing it from the realm of Python performance closer to the C implementation.

My misgivings of John’s changes proved to be unfounded. It turns out I don’t understand character sets as well as I thought, but the D implementation still handle the character set I need it too work with, and that’s All I really care about.