Home Parsing Emails with Java and Go on AWS Lambda
Post
Cancel

Parsing Emails with Java and Go on AWS Lambda

As part of my design for a temporary email service, I created a microservice for parsing received emails. This microservice receives SES messages via an SNS source topic parses some data out of the mime messages. The returned data is then consumed by another SNS topic configured as the destination.

Lambda Destination

The lambdas themselves (I ended up writing two), were written in Kotlin an Go. In the rest of the post, I’ll be discussing those implementations, a very brief performance comparison, and go over an issue I had with SQS and go over an issue I had connecting destinations.

All of my configuration was done using AWS CDK. CDK is great, and I’ll probably write about in the future.

Why no SQS queue?

I first tried to use an SQS Queue as the source for the lambda event. However, the result message would never reach the destination topics. The logs showed that the incoming message were properly parsed and the return value looked correct. It took an embarrassing amount of debugging, but I found the answer.

When a function is invoked asynchronously, Lambda sends the event to an internal queue. A separate process reads events from the queue and executes your Lambda function. When the event is added to the queue

Lambda polls the queue and invokes your Lambda function synchronously

Destinations only work when the Lambda functions are executed asynchronously. SQS event sources will cause the Lambda functions to execute synchronously, so function sourced by SQS queues will never deliver to a destination. In addition, the asynchronous execution uses an “internal queue”, so SQS is redundant.

Using SNS topics worked as expected.

On the bright side I found out how to invoke Lambda functions directly using the cli:

1
2
3
4
5
aws lambda invoke --debug \
  --function-name [FUNCTION_NAME] \
  --invocation-type Event
  --payload '{"data": "its got what you want"}'\
  --cli-binary-format raw-in-base64-out response.json

This helped me confirm the issue was with my configuration, and not the implementation.

Kotlin Implementation

The Kotlin implementation uses the Gson library to parse the incoming JSON event data. It uses the Javax mail and the Apache commons MimeMessageParser libraries to parse the email content. Below you can see the POM dependencies I added. Lastly I needed the aws-lambda-java-core and aws-lambda-java-events packages to handle the incoming messages.

With destinations the request handler logic becomes pretty simple. Loop over incoming records, parse the records, and return a serialized string of the results:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
class ParseEmailHandler : RequestHandler<SNSEvent, String> {
    private val gson = GsonBuilder().setPrettyPrinting().create()

    override fun handleRequest(event: SNSEvent, context: Context): String {
        if (event.records == null) {
            throw Error("No records to process")
        }

        val results = event.records.map { snsRecord ->
            EmailNotification.fromSNSRecord(snsRecord)
        }

        return gson.toJson(results)!!
    }
}

The bulk of the work is in deserializing the received JSON objects and Mime message. This was done in the EmailNotification class.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
// classes receive
data class SNSJsonData(@Expose val content: String? = null)

data class EmailNotification(
    val from: String,
    val to: List<String>,
    val subject: String,
    val htmlContent: String,
) {
    companion object {
        /**
         *  Parse message body from string, and email from content property
         */
        fun fromSNSRecord(snsRecord: SNSEvent.SNSRecord): EmailNotification {
            val gson = GsonBuilder().setPrettyPrinting().create()
            val SNSJsonData = gson.fromJson(snsRecord.sns.message, SNSJsonData::class.java)!!
            val emailParser = parseEmailContent(SNSJsonData.content!!)!!

            return EmailNotification(
                from = emailParser.from,
                to = emailParser.to.map { (it as InternetAddress).address },
                subject = emailParser.subject,
                htmlContent = emailParser.htmlContent,
            )
        }

        /**
         *  Return parsed
         */
        private fun parseEmailContent(content: String): MimeMessageParser? {
            return ByteArrayInputStream(content.toByteArray()).use { contentBytes ->
                try {
                    val session = Session.getDefaultInstance(Properties())
                    val parser = MimeMessageParser(MimeMessage(session, contentBytes))
                    parser.parse()
                    parser
                } catch (ex: MessagingException) {
                    null
                } catch (ex: IOException) {
                    null
                }
            }
        }
    }
}

You can see that I’m using the null assert !! operators. This will throw if the email parsing failed. Not the best error handling for production, but it simplifies the program for this prototype.

The Kotlin implementation got me off the ground, but I wasn’t stoked about the performance. The lambda would consistently take more than a second to execute, event after warming.

Go Implementation

The Go implementation is a direct translation of the Kotlin function. Then handler loops over the received records, parses, and returns a set of results (this time not serialized).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
package main

import (
	"context"
	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
	"gitlab.com/vapormail/mailbox-services-go/lib/dtos"
	"log"
)

func main() {
	lambda.Start(parserHandler)
}

func parserHandler(ctx context.Context, event events.SNSEvent) ([]dtos.Email, error) {
	result := []dtos.Email{}

	for _, record := range event.Records {
		email, err := dtos.FromSnsEventRecord(record)

		if err == nil {
			result = append(result, email)
		} else {
			log.Printf("Unable to parse record: %s \n", err)
		}

	}

	log.Printf("RESULT: %s", result)
	return result, nil
}

Like the Kotlin implementation, the bulk of the work is in parsing the event record and the mim data. For the mime data, the Go function use the enmime library.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
package dtos

import (
	json2 "encoding/json"
	"github.com/aws/aws-lambda-go/events"
	"github.com/jhillyerd/enmime"
	"log"
	"strings"
)

type sqsJson struct {
	Content string
}

type Email struct {
	HTML      string
	From      string
	To        string
}

func FromSnsEventRecord(message events.SNSEventRecord) (Email, error) {
	var sqsData sqsJson
	err := json2.Unmarshal([]byte(message.SNS.Message), &sqsData)

	if err != nil {
		log.Printf("Unable to parse message: %s \n", err)
		return Email{}, err
	}

	envelope, err := enmime.ReadEnvelope(strings.NewReader(sqsData.Content))

	if err != nil {
		log.Printf("Unable to parse email body: %s \n", err)
		return Email{}, err
	}

	return Email{
		HTML: envelope.HTML,
		From: envelope.GetHeader("from"),
		To: envelope.GetHeader("to"),
	}, nil
}

Which one to keep?

Performance is always important, but is paramount with AWS lambda, where the cost is directly determined by how much memory and compute time is used. Below you can see a table comparing the invocations times for trial of 50 invocations in a minutes. The results are striking. I was expecting a bit of a difference. But the Go lambda is faster by three orders of magnitudes.

1
2
3
4
5
6
| Kotlin Invocation Time ms |  | Go Invocation Time ms |
|-------------|-------------|  |-------------|---------|
| Min:        | 148.44      |  | Min:        | 3.0487  |
| Max:        | 9,508       |  | Max:        | 24.94   |
| Average:    | 2,760       |  | Average:    | 9.2079  |
|-------------|-------------|  |-------------|---------|

This is barely a benchmark and it’s hardly a one to one comparison. There are a number of optimizations that I could make to the Kotlin application to make it run a bit faster: pruning dependencies, removing Kotlin and using Java, tuning when I’m allocating objects. However, it’s highly unlikely that I’d be able to get close to the execution times that were observed for the Go function without any optimization effort.

There has been a lot of work done to address the performance issues associated with Java in Lambda. It’s not my intention to address any of that. And my performance comparison is insufficient to draw any broad conclusions. I’m simply presenting it in order to make clear my motivation for continuing with Go. It’ll be way cheaper for me.

Conclusion

Using destinations makes writing lambdas almost trivial. All the function needs to do is parse incoming messages and return some output. There is no need to configure or connect with the destination (or even know what it is). I’m planning on using this feature extensively as it completely isolates business logic.

Creating a Go implementation and doing a performance comparison was really easy with sources and destinations. I simply had to create and new lambda and connect it to the same source and topics. It’s a really nice paradigm, and I’m excited to work with it some more.

-->