At SigParser we’re in the business of capturing data from unstructured emails. When we started building SigParser, we tried all the open source solutions for parsing emails. None of them were accurate enough, so we built the most accurate email body parser in existence.
Say you have an email body like this…
Great talking with you. Let's catchup soon.
Thanks,
Mark Anderson
VP of Engineering
888-222-4444
On Fri, Nov 19, 2018 at 12:03 PM, Paul Johnson <paul@example.com> wrote:
> Let's talk at 11.
> Thanks
> Paul Johnson
And you want the first message only…
Great talking with you. Let's catchup soon.
Or maybe the second message body…
Let's talk at 11.
How do you do that easily? We’ll cover various programming solutions below.
- Challenges of splitting email bodies
- SigParser’s Email Parsing API and Libraries
- Pricing
- Mailgun vs SigParser Parsing Libraries
Why is this hard?
We spent years building email parsers. There are a lot of issues that need to be solved when writing your own email parser:
- Signature identification
- Various formats for headers
  - On Fri, Nov 19th…
  - On 10/9/2018
  - Headers that wrap across lines
  - From:, To:, Date: style headers
- Reply chains indicated by > or multiple >>>
- Some lines look like signatures but aren’t
- Corrupted email headers
  - Common for plain text emails to split reply headers
- Multi-language support is required even if no one speaks another language on your team
- Header formats change over time
- Email clients change over time
Still don’t believe us? Look at our change logs. We’re constantly finding new edge cases.
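To make a couple of the issues above concrete, here is a minimal sketch of the kind of naive splitter people tend to write first. It is not how SigParser works; the function name and regular expression are illustrative assumptions. It handles a single-line "On … wrote:" header, but it leaves the signature in place and misses a header that wraps across two lines.

```python
import re

# Naive pattern: a Gmail-style reply header that fits on a single line.
REPLY_HEADER = re.compile(r"^On .+ wrote:$")


def naive_first_message(body: str) -> str:
    """Return the text above the first reply header or quoted (>) line."""
    kept = []
    for line in body.splitlines():
        if REPLY_HEADER.match(line) or line.startswith(">"):
            break
        kept.append(line)
    return "\n".join(kept).strip()


simple = (
    "Great talking with you. Let's catchup soon.\n"
    "Thanks,\n"
    "Mark Anderson\n"
    "On Fri, Nov 19, 2018 at 12:03 PM, Paul Johnson <paul@example.com> wrote:\n"
    "> Let's talk at 11.\n"
)

# Same idea, but the mail client wrapped the reply header across two lines.
wrapped = (
    "Another response in the chain.\n"
    "\n"
    "On Mon, May 11, 2020 at 9:40 AM Outlook Tester <\n"
    "outlook.tester@salesforceemail.com> wrote:\n"
    "\n"
    "> This is a reply from the test account.\n"
)

print(naive_first_message(simple))   # reply header removed, signature still there
print("---")
print(naive_first_message(wrapped))  # wrapped header leaks into the result
```

The first result still contains "Thanks," and "Mark Anderson", and the second still contains both halves of the wrapped header. Each fix for one of these cases tends to break another one.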
Because of this, we suggest not coding your own signature parsing algorithm. It is non-trivial. There are also a number of half-baked open source efforts out there. We’ve tried them all, and most of our users tried those first before using SigParser.
SigParser’s Cross Platform Email Parsing Tools
Our simple email parsing tools provide a consistent JSON result.
- Clean email bodies of signatures and reply chains
- Get email bodies for forwarded emails
- Capture nested email chains in a single MIME message or .eml file
- REST API option - POST https://ipaas.sigparser.com/api/Mime/ParseString
- Windows, Linux and AWS Lambda deployment options
- .eml, .msg, or JSON format inputs
- Frequent updates as email clients and patterns change
- Usage based and unlimited plans available
The output structure will look like this.
{
"CleanedBodyPlain": "Another response in the chain.\r\n\r\n",
"CleanedBodyHtml": "<div dir=\"ltr\"><div dir=\"ltr\"><div>Another response in the chain. </div><div><br clear=\"all\"></div></div></div>",
"IsSpammyLookingEmailMessage": false,
"IsSpammyLookingSender": false,
"EmailTypes": [
"NormalEmail"
],
"Emails": [
{
"CleanedBodyPlain": "Another response in the chain.\r\n\r\n",
"CleanedBodyHtml": "<div dir=\"ltr\"><div dir=\"ltr\"><div>Another response in the chain. </div><div><br clear=\"all\"></div></div></div>",
"Subject": null,
"Date": "2020-05-11T16:41:16+00:00",
"FromEmailAddress": "paul@example.com",
"FromName": "Paul Mendoza",
"To": [
{
"Name": "Outlook Tester",
"EmailAddress": "outlook.tester@salesforceemail.com"
}
],
"Cc": []
},
{
"CleanedBodyPlain": "This is a reply from the test account.\r\n\r\n",
"CleanedBodyHtml": null,
"Subject": null,
"Date": "2020-05-11T09:40:00",
"FromEmailAddress": "outlook.tester@salesforceemail.com",
"FromName": "Outlook Tester",
"To": [],
"Cc": []
},
{
"CleanedBodyPlain": null,
"CleanedBodyHtml": null,
"Subject": "One more test email at 3:25 PM",
"Date": "2020-04-12T15:25:00",
"FromEmailAddress": "paul@example.com",
"FromName": "Paul Mendoza",
"To": [
{
"Name": "Outlook Tester",
"EmailAddress": "outlook.tester@salesforceemail.com"
}
],
"Cc": []
}
],
"Subject": "Re: One more test email at 3:25 PM",
"Date": "2020-05-11T16:41:16+00:00",
"Headers": {
"mime-version": "1.0",
"date": "Mon, 11 May 2020 09:41:16 -0700",
"references": "<CAL5Lp9VcCVNqeiw0Rry7BHQaTct46qv3BnUvR5-HNqWZO-Xxiw@mail.gmail.com>\r\n\t<BY5PR04MB6819EFA89CDABDFCB9D67D2F8AA10@BY5PR04MB6819.namprd04.prod.outlook.com>",
"in-reply-to": "<BY5PR04MB6819EFA89CDABDFCB9D67D2F8AA10@BY5PR04MB6819.namprd04.prod.outlook.com>",
"message-id": "<CAL5Lp9X0RjYNOo68Y_boL8OOw32gU-SWxLW3WjgYj93eTfUsyQ@mail.gmail.com>",
"subject": "Re: One more test email at 3:25 PM",
"from": "Paul Mendoza <paul@example.com>",
"to": "Outlook Tester <outlook.tester@salesforceemail.com>",
"content-type": "multipart/alternative; boundary=\"00000000000001bd4705a5620460\""
},
"FullPlainTextBody": "Another response in the chain.\n\n*Paul Mendoza*, Founder\nMobile 760-917-3753\nSigParser\npaul@example.com\nSchedule a meeting with me here <https://www.meetingbird.com/m/xxxxxx>\n\nListen to podcasts? I was recently on the *FutureTech Podcast*\n<https://www.futuretechpodcast.com/podcasts/digging-up-the-data-your-company-has-needs-and-cant-access-paul-mendoza-sigparser/>\ntalking about SigParser and use cases other customers are using it for.\n\n\nOn Mon, May 11, 2020 at 9:40 AM Outlook Tester <\noutlook.tester@salesforceemail.com> wrote:\n\n> This is a reply from the test account.\n>\n>\n>\n> *From:* Paul Mendoza <paul@example.com>\n> *Sent:* Sunday, April 12, 2020 3:25 PM\n> *To:* Outlook Tester <outlook.tester@salesforceemail.com>\n> *Subject:* One more test email at 3:25 PM\n>\n>\n>\n>\n> *Paul Mendoza, *Founder\n>\n> Mobile 760-917-3753\n>\n> SigParser\n>\n> paul@example.com\n>\n> Schedule a meeting with me here <https://www.meetingbird.com/m/xxxxxx>\n>\n> Listen to podcasts? I was recently on the *FutureTech Podcast*\n> <https://www.futuretechpodcast.com/podcasts/digging-up-the-data-your-company-has-needs-and-cant-access-paul-mendoza-sigparser/>\n> talking about SigParser and use cases other customers are using it for.\n>\n",
"FullHtmlBody": "<div dir=\"ltr\"><div dir=\"ltr\"><div>Another response in the chain. </div><div><br clear=\"all\"><div><div dir=\"ltr\" class=\"gmail_signature\" data-smartmail=\"gmail_signature\"><div dir=\"ltr\"><div><div dir=\"ltr\"><div><div dir=\"ltr\"><div><div dir=\"ltr\"><div dir=\"ltr\"><div dir=\"ltr\"><div dir=\"ltr\"><font color=\"#3d85c6\" face=\"tahoma, sans-serif\" style=\"font-size:12.8px\"><b>Paul Mendoza</b></font><font color=\"#3d85c6\" face=\"tahoma, sans-serif\" style=\"font-size:12.8px;font-weight:bold\">, </font><span style=\"font-size:12.8px;color:rgb(61,133,198);font-family:tahoma,sans-serif\">Founder</span><div style=\"font-size:12.8px\"><div><font color=\"#666666\" size=\"2\" face=\"arial narrow, sans-serif\">Mobile 760-917-3753</font></div><div><font color=\"#666666\" size=\"2\" face=\"arial narrow, sans-serif\">SigParser</font></div><div><a href=\"mailto:paul@example.com\" style=\"font-family:tahoma,sans-serif;font-size:12.8px;color:rgb(17,85,204)\" target=\"_blank\">paul@example.com</a><br></div><div><a href=\"https://www.meetingbird.com/m/xxxxxx\" target=\"_blank\">Schedule a meeting with me here</a></div><div><img src=\"https://drive.google.com/a/sigparser.com/uc?id=1GUhMvrGnJMCfkge1HMqyKFQCLSJNXcw-&export=download\" width=\"200\" height=\"90\"><br></div></div>Listen to podcasts? I was recently on the <a href=\"https://www.futuretechpodcast.com/podcasts/digging-up-the-data-your-company-has-needs-and-cant-access-paul-mendoza-sigparser/\" target=\"_blank\"><b>FutureTech Podcast</b></a> talking about SigParser and use cases other customers are using it for. 
</div></div></div></div></div></div></div></div></div></div></div></div><br></div></div><br><div class=\"gmail_quote\"><div dir=\"ltr\" class=\"gmail_attr\">On Mon, May 11, 2020 at 9:40 AM Outlook Tester <<a href=\"mailto:outlook.tester@salesforceemail.com\">outlook.tester@salesforceemail.com</a>> wrote:<br></div><blockquote class=\"gmail_quote\" style=\"margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex\">\n\n\n\n\n\n<div lang=\"EN-US\">\n<div class=\"gmail-m_-2662285044572695259WordSection1\">\n<p class=\"MsoNormal\">This is a reply from the test account.<u></u><u></u></p>\n<p class=\"MsoNormal\"><u></u> <u></u></p>\n<div style=\"border-right:none;border-bottom:none;border-left:none;border-top:1pt solid rgb(225,225,225);padding:3pt 0in 0in\">\n<p class=\"MsoNormal\"><b>From:</b> Paul Mendoza <<a href=\"mailto:paul@example.com\" target=\"_blank\">paul@example.com</a>> <br>\n<b>Sent:</b> Sunday, April 12, 2020 3:25 PM<br>\n<b>To:</b> Outlook Tester <<a href=\"mailto:outlook.tester@salesforceemail.com\" target=\"_blank\">outlook.tester@salesforceemail.com</a>><br>\n<b>Subject:</b> One more test email at 3:25 PM<u></u><u></u></p>\n</div>\n<p class=\"MsoNormal\"><u></u> <u></u></p>\n<div>\n<p class=\"MsoNormal\"><br clear=\"all\">\n<u></u><u></u></p>\n<div>\n<div>\n<div>\n<div>\n<div>\n<div>\n<div>\n<div>\n<div>\n<div>\n<div>\n<div>\n<p class=\"MsoNormal\"><b><span style=\"font-size:9.5pt;font-family:Tahoma,sans-serif;color:rgb(61,133,198)\">Paul Mendoza, </span></b><span style=\"font-size:9.5pt;font-family:Tahoma,sans-serif;color:rgb(61,133,198)\">Founder</span><u></u><u></u></p>\n<div>\n<div>\n<p class=\"MsoNormal\"><span style=\"font-size:10pt;font-family:"Arial Narrow",sans-serif;color:rgb(102,102,102)\">Mobile 760-917-3753</span><span style=\"font-size:9.5pt\"><u></u><u></u></span></p>\n</div>\n<div>\n<p class=\"MsoNormal\"><span style=\"font-size:10pt;font-family:"Arial 
Narrow",sans-serif;color:rgb(102,102,102)\">SigParser</span><span style=\"font-size:9.5pt\"><u></u><u></u></span></p>\n</div>\n<div>\n<p class=\"MsoNormal\"><span style=\"font-size:9.5pt\"><a href=\"mailto:paul@example.com\" target=\"_blank\"><span style=\"font-family:Tahoma,sans-serif;color:rgb(17,85,204)\">paul@example.com</span></a><u></u><u></u></span></p>\n</div>\n<div>\n<p class=\"MsoNormal\"><span style=\"font-size:9.5pt\"><a href=\"https://www.meetingbird.com/m/xxxxxx\" target=\"_blank\">Schedule a meeting with me here</a><u></u><u></u></span></p>\n</div>\n<div>\n<p class=\"MsoNormal\"><span style=\"font-size:9.5pt\"><img border=\"0\" width=\"200\" height=\"90\" style=\"width: 2.0833in; height: 0.9375in;\" id=\"gmail-m_-2662285044572695259_x0000_i1025\" src=\"https://ci6.googleusercontent.com/proxy/TTpjUlFcjmphqTPKcbTFGb7TsHUk5MzP3P1Wt2uZYLjMzlO0UPeF7MAgaUwFk4hqlFafCMhmzlmkc3FUbGH4ijNXkqx9DAsv-_3CFnCTmZaZhMlONJqrrR-oGfWMfwqGpDgk301HHsijRMhsymfOCkhNKg=s0-d-e1-ft#https://drive.google.com/a/sigparser.com/uc?id=1GUhMvrGnJMCfkge1HMqyKFQCLSJNXcw-&export=download\"></span><span style=\"font-size:9.5pt\"><u></u><u></u></span></p>\n</div>\n</div>\n<p class=\"MsoNormal\">Listen to podcasts? I was recently on the <a href=\"https://www.futuretechpodcast.com/podcasts/digging-up-the-data-your-company-has-needs-and-cant-access-paul-mendoza-sigparser/\" target=\"_blank\">\n<b>FutureTech Podcast</b></a> talking about SigParser and use cases other customers are using it for.\n<u></u><u></u></p>\n</div>\n</div>\n</div>\n</div>\n</div>\n</div>\n</div>\n</div>\n</div>\n</div>\n</div>\n</div>\n</div>\n</div>\n</div>\n\n</blockquote></div></div>\n"
}
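As a rough sketch of how this structure might be consumed, the snippet below walks the Emails array and collects the cleaned plain-text body of each message in the chain. The dictionary is a trimmed copy of the sample above, and the helper name chain_bodies is made up for illustration.

```python
# Trimmed copy of the sample response above; only the fields used here are kept.
parsed = {
    "CleanedBodyPlain": "Another response in the chain.\r\n\r\n",
    "Emails": [
        {
            "FromName": "Paul Mendoza",
            "Date": "2020-05-11T16:41:16+00:00",
            "CleanedBodyPlain": "Another response in the chain.\r\n\r\n",
        },
        {
            "FromName": "Outlook Tester",
            "Date": "2020-05-11T09:40:00",
            "CleanedBodyPlain": "This is a reply from the test account.\r\n\r\n",
        },
        {
            "FromName": "Paul Mendoza",
            "Date": "2020-04-12T15:25:00",
            "CleanedBodyPlain": None,
        },
    ],
}


def chain_bodies(result: dict) -> list:
    """Return (sender, cleaned text) pairs in the order the response lists them."""
    pairs = []
    for email in result.get("Emails", []):
        body = email.get("CleanedBodyPlain")
        if body:  # the last message in the sample has a null body, so skip it
            pairs.append((email.get("FromName"), body.strip()))
    return pairs


for sender, body in chain_bodies(parsed):
    print(f"{sender}: {body}")
```

Checking for a null body matters because, as the sample shows, a message in the chain can carry headers such as Subject and Date without any cleaned body text.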
Command Line (Linux or Windows)
Consume SigParser from any shell. Provide it with a JSON file of the email, an EML file, or an MSG file, and it will return a structured JSON response with the fields listed above. You can also tell it to output to a directory.
SigParser API called with Python
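Here is an example of how to call our assembly from Python. You’ll need to write the JSON out to the input.json file first.
import subprocess

# Run the SigParser CLI on input.json and capture what it prints to stdout.
cmd = ["SigParserEmailUtils", "cleanedemail", "--filename", "input.json"]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
output = result.stdout
print(output)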
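Writing the input file could look something like the sketch below. The field names are borrowed from the Lambda test email further down in this article. That the command line tool accepts the same JSON shape is an assumption, and write_input_json is a hypothetical helper, so check both against your SigParser documentation.

```python
import json
from pathlib import Path


def write_input_json(path, from_email, from_name, text_body=None, html_body=None):
    """Write a single email to a JSON file for the CLI to read."""
    # Assumed input shape: same keys as the Lambda test email in this article.
    email = {
        "FromEmailAddress": from_email,
        "FromName": from_name,
        "TextBody": text_body,
        "HtmlBody": html_body,
    }
    target = Path(path)
    target.write_text(json.dumps(email, indent=2), encoding="utf-8")
    return target


write_input_json(
    "input.json",
    from_email="mary.johnson@fake.com",
    from_name="Mary Johnson",
    html_body="<p>Hi John,</p><p>Let's get coffee tomorrow.</p><p>Thanks Mary Johnson</p>",
)
```

Using json.dumps rather than building the string by hand takes care of escaping quotes and line breaks in the email body.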
Lambda Deployment Option
AWS Lambda is a great service for deploying SigParser’s email parsing tools. Each email gets its own dedicated RAM and CPU, Lambdas are kept warm for around 5 minutes, which reduces the startup time per email, and they scale really well.
Deploying Your Lambda
To configure, create a .NET Core 2.1 (C#/PowerShell) Lambda function. The name doesn’t matter.
In the Function code section, set the Handler to SigParser.EmailParsing.Lambda::SigParser.EmailParsing.Lambda.Function::GetCleanedEmailAsync
Upload the SigParser.EmailParsing.Utils.Lambda.zip file.
Set the SigParserLicenseKey environment variable to your Cryptolens license key. Contact us to get one.
Set the Memory to 2048 MB of RAM. SigParser needs quite a bit of RAM to run all the machine learning systems quickly.
Click Save, then click Test and use the test email below. It should return a JSON result. The first invocation can be slow, but after that it tends to be fast.
{
"FromEmailAddress": "mary.johnson@fake.com",
"FromName": "Mary Johnson",
"TextBody": null,
"HtmlBody": "<p>Hi John,<\\/p>\\r\\n\\r\\n<p>Let\\'s get coffee tomorrow.<\\/p>\\r\\n\\r\\n<p>Thanks Mary Johnson<\\/p>"
}
Invoke Lambda Function
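One way to invoke the deployed function from Python is through boto3's Lambda client, sketched below. The function name "sigparser-email-parser" is a placeholder for whatever you named your Lambda, and the assumption is that the function returns the JSON structure shown earlier in this article.

```python
import json


def invoke_sigparser_lambda(client, function_name, email):
    """Invoke the Lambda synchronously and decode its JSON response."""
    response = client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # wait for the result
        Payload=json.dumps(email).encode("utf-8"),
    )
    payload = response["Payload"].read()
    # boto3 sets FunctionError when the function itself raised an error.
    if "FunctionError" in response:
        raise RuntimeError(f"Lambda returned an error: {payload!r}")
    return json.loads(payload)


# Same shape as the test email above (HTML simplified for readability).
test_email = {
    "FromEmailAddress": "mary.johnson@fake.com",
    "FromName": "Mary Johnson",
    "TextBody": None,
    "HtmlBody": "<p>Hi John,</p><p>Let's get coffee tomorrow.</p><p>Thanks Mary Johnson</p>",
}

# Usage (needs `pip install boto3` and configured AWS credentials):
#   import boto3
#   client = boto3.client("lambda")
#   result = invoke_sigparser_lambda(client, "sigparser-email-parser", test_email)
#   print(result["CleanedBodyPlain"])
```

Passing the client in as an argument keeps the helper easy to test with a stub and lets you reuse one client across many emails.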
RAM Usage Explained
SigParser needs 2048 MB of RAM per email to safely execute without running out of memory when processing emails. The average real human email needs 962 MB of RAM. The 99th percentile needs 1605 MB.
SigParser’s email parser is incredibly CPU intensive. In AWS, the more RAM you give a Lambda, the more CPU speed AWS gives that Lambda. So having lots of RAM isn’t wasteful, since the function executes faster.
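For a quick sense of the headroom those figures leave, the arithmetic can be checked in a few lines. The numbers come straight from the paragraph above; the variable names are just for illustration.

```python
configured_mb = 2048  # memory setting recommended above
average_mb = 962      # average real human email
p99_mb = 1605         # 99th percentile email

headroom_mb = configured_mb - p99_mb
print(f"Headroom above the 99th percentile: {headroom_mb} MB "
      f"({headroom_mb / configured_mb:.1%} of configured memory)")
print(f"Average email uses {average_mb / configured_mb:.1%} of configured memory")
```

So even a 99th percentile email leaves roughly a fifth of the configured memory free.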
Mailgun vs SigParser Parsing Libraries
We get compared to Mailgun’s open source email parsing library, but the two are very different in what they do and how well they perform.
Feature | SigParser | Mailgun |
---|---|---|
Accuracy: estimated accuracy for signature line identification | 99.9% | 92% |
Strip signatures off emails | Yes | Yes |
Supported languages: how many languages can it split emails for? | English, German, Spanish, French, Portuguese, Russian, Dutch, Norwegian, Korean, Chinese, Turkish, Swedish, Czech | English |
Forward extraction: capture forwarded messages | Yes | No |
ML knowledge: how much machine learning knowledge do you need? | None | Some. You'll need to find your own training data too, since the 200 email samples they give you aren't a very robust set. |
Deliverables: what do you get? | Linux assembly, Windows assembly, Lambda zip file, NuGet package | Python source code |