{"id":629,"date":"2012-05-06T23:47:22","date_gmt":"2012-05-06T11:47:22","guid":{"rendered":"http:\/\/www.funk.co.nz\/blog\/?p=629"},"modified":"2013-06-04T17:30:43","modified_gmt":"2013-06-04T05:30:43","slug":"webmail-referral-source-medium-rollup-filter","status":"publish","type":"post","link":"https:\/\/www.funk.co.nz\/blog\/online-marketing\/webmail-referral-source-medium-rollup-filter","title":{"rendered":"Get rid of live.com load-balanced domains with a regular expression: GA Webmail Referral Traffic Source Rollup Filter"},"content":{"rendered":"<p><span style=\"text-decoration: underline;\"><em><strong>The problem?<\/strong><\/em><\/span> Nasty load balanced domains in Google Analytics reports like this:<\/p>\n<table width=\"500\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\">\n<colgroup>\n<col width=\"186\" \/>\n<col width=\"75\" \/> <\/colgroup>\n<tbody>\n<tr>\n<td width=\"186\" height=\"13\"><strong>Source<\/strong><\/td>\n<td width=\"75\"><strong>Visits<\/strong><\/td>\n<\/tr>\n<tr>\n<td height=\"13\">36ohk6dgmcd1n-c.c.yom.mail.yahoo.net \/ referral<\/td>\n<td align=\"right\">149<\/td>\n<\/tr>\n<tr>\n<td height=\"13\">mail.google.com \/ referral<\/td>\n<td align=\"right\">131<\/td>\n<\/tr>\n<tr>\n<td height=\"13\">du114w.dub114.mail.live.com \/ referral<\/td>\n<td align=\"right\">43<\/td>\n<\/tr>\n<tr>\n<td height=\"13\">du103w.dub103.mail.live.com \/ referral<\/td>\n<td align=\"right\">25<\/td>\n<\/tr>\n<tr>\n<td height=\"13\">sn124w.snt124.mail.live.com \/ referral<\/td>\n<td align=\"right\">23<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><span style=\"text-decoration: underline;\"><em><strong>The solution?<\/strong><\/em><\/span> Clean nicely segmented source lines that \"roll up\" into one:<\/p>\n<table width=\"500\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\">\n<colgroup>\n<col width=\"202\" \/>\n<col width=\"75\" \/> <\/colgroup>\n<tbody>\n<tr>\n<td 
width=\"202\" height=\"13\"><strong>Source\/Medium<\/strong><\/td>\n<td width=\"75\"><strong>Visits<\/strong><\/td>\n<\/tr>\n<tr>\n<td height=\"13\">Webmail (live.com) \/ email<\/td>\n<td align=\"right\">643<\/td>\n<\/tr>\n<tr>\n<td height=\"13\">Webmail (yahoo.com) \/ email<\/td>\n<td align=\"right\">258<\/td>\n<\/tr>\n<tr>\n<td height=\"13\">Webmail (google.com) \/ email<\/td>\n<td align=\"right\">105<\/td>\n<\/tr>\n<tr>\n<td height=\"13\">Webmail (aol.com) \/ email<\/td>\n<td align=\"right\">13<\/td>\n<\/tr>\n<tr>\n<td height=\"13\">Webmail (libero.it) \/ email<\/td>\n<td align=\"right\">12<\/td>\n<\/tr>\n<tr>\n<td height=\"13\">Webmail (laposte.net) \/ email<\/td>\n<td align=\"right\">23<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Cleaning these up requires two filters:<\/p>\n<ul>\n<li>Webmail Source Rollup (search and replace the source with \"Webmail (brand)\")<\/li>\n<li>Webmail Medium Rollup (swap \/ referral for \/ email as the medium, for further rolling up!)<\/li>\n<\/ul>\n<p>Both filters use the same core regex that I've worked out, which consolidates 99% of the world's webmail systems without, in theory, pulling in any false positives.<\/p>\n<h2>The Magic Regex Code<\/h2>\n<p>Here is the filter regex that I've been using in my production Google Analytics advanced filters since 15 Feb 2012 to clean these up, that is, to roll up the load-balanced domains that you often get in referrals:<\/p>\n<p><span style=\"text-decoration: underline;\"><strong><em>My starting point:<\/em><\/strong><\/span><\/p>\n<p>(messag|courrier|zimbra|imp|mail)(.*)\\.(.*)\\..{2,4}<\/p>\n<p>This grabs any domain with, say, \"mail\" in it, but runs a check on the ending of the domain: it needs to have at least 2 dots after the keyword and a TLD between 2 and 4 chars long. It will miss go.mail.ru and mail.com, because each has only one dot after \"mail\". It will also miss \"Mail Campaign \/ email\" because this is not a proper domain. 
So far so good \ud83d\ude42<\/p>\n<p><span style=\"text-decoration: underline;\"><em><strong style=\"font-style: italic;\">My improvement with exceptions (This is the one to use!):<\/strong><\/em><\/span><\/p>\n<p>(messag|courrier|zimbra|^imp|mail).*\\.(.*\\..{2,4}|go\\.mail\\.ru|promail\\.co\\.nz|service\\.mail\\.com|3c\\.web\\.de|outlook\\.com)<\/p>\n<p>The domain of the webmail platform is in capture group $A2.<\/p>\n<p>My improved regex has a new exception section at the end to let some special cases (go.mail.ru, 3c.web.de, service.mail.com) through the filter. These are hard-coded, so they skip the safety net and autonomy provided by pairing the wide-open keyword matching with a domain name restriction that ensures \"two more dots and a TLD\".<\/p>\n<p>Maybe that is an excessive use of regex, but at least you can now see your word-of-mouth \/ word-of-email traffic nice and tidy!<\/p>\n<h2><strong>How does it work?<\/strong><\/h2>\n<p>I'll break the formula down into sections:<\/p>\n<h3>(messag|courrier|zimbra|^imp|<strong>mail<\/strong>)<\/h3>\n<p>This looks for the really obvious and common keywords in webmail services, the main one being <strong>mail<\/strong>. This will match sn124w.snt124.<strong>mail<\/strong>.live.com but it will also grab e<strong>mail<\/strong>chimp.com and a huge number of others that you really don't want to catch with this filter. If we were looking for <strong>live<\/strong> (but we aren't in this case), then a site like <a title=\"Answering Oliver page about Stray\" href=\"http:\/\/www.answeringoliver.com\/2012\/04\/days-3-5-new-zealands-stunning-bay-of.html\" target=\"_blank\">www.answeringo<strong>LIVE<\/strong>r.com<\/a> would get picked up in the crossfire. 
I use a ^ in front of imp so that domains like dimpost.wordpress.com don't get caught.<\/p>\n<p>Basically, the first part of the domain name (<strong>sn124w.snt124<\/strong>.mail) gets deleted by this filter, so you could stand to lose quite a lot of data with the \"mail\" and \"imp\" keywords if these were the only parts of the expression! So the next bit of the filter is designed to put the domain through another, tougher test involving the dots and TLDs...<\/p>\n<h3>(.*)\\.(.*)\\..{2,4}<\/h3>\n<p>This makes sure that the domain bits after <strong>mail<\/strong> or <strong>zimbra<\/strong> or whatever always have two dots and a TLD (top-level domain extension, e.g. .nz, .jp). That matches the end bit of sn124w.snt124.mail<strong>.live.com<\/strong> and the end of www.funk<strong>.co.nz<\/strong>.<\/p>\n<p>The (.*) part means match anything, including nothing, and the \\. means there must be a dot, so (.*)\\.(.*)\\. means there have to be at least two more dots in this domain name <em><strong>coming up after the \"live\"<\/strong><\/em>. That is how Answering Oliver escapes the test for <strong>live<\/strong>: only one dot follows \"live\" in www.answeringoliver.com. The next part, .{2,4}, is all about the top-level domain or TLD. These can be 2, 3 or 4 letters long, like .co, .com and .mobi. The curly braces specify how many times the preceding token . (which means any single character, but not nothing) can appear, written as {min,max}.<\/p>\n<p>Then in the middle is a pipe |, which cuts the regex open and lets through some really hard-to-match exceptions for smaller webmail systems. The only reason you see the web.de one also appear on the left of the central | is that this is such a wacky domain name that it doesn't match the \"mail\" keyword, which catches most of the webmail systems on the planet. Germans, aye? \ud83d\ude42<\/p>\n<p>* (Hi Devon! 
Thanks for sending traffic to <a title=\"Stray Asia (we do NZ as well I just wanna send Asia some link love)\" href=\"http:\/\/www.straytravel.asia\/\" target=\"_blank\">Stray Travel<\/a>. I found you while researching this post.)<\/p>\n<h2>Screenshots<\/h2>\n<h3>Campaign Source Filter<\/h3>\n<p>Advanced filter.<\/p>\n<p>Field A -&gt; Extract A: Campaign Source:<\/p>\n<p>(messag|courrier|zimbra|^imp|mail).*\\.(.*\\..{2,4}|go\\.mail\\.ru|promail\\.co\\.nz|service\\.mail\\.com|3c\\.web\\.de|outlook\\.com)<\/p>\n<div><\/div>\n<p>Field B -&gt; Extract B: <em>[leave both blank]<\/em><\/p>\n<p>Output To -&gt; Constructor: Campaign Source: Webmail ($A2.$A3)<\/p>\n<div id=\"attachment_674\" style=\"width: 736px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-674\" class=\"size-full wp-image-674\" title=\"webmail-source-rollup\" src=\"https:\/\/www.funk.co.nz\/blog\/wp-content\/uploads\/2012\/05\/webmail-source-rollup.png\" alt=\"GA Webmail Rollup Filter\" width=\"726\" height=\"547\" srcset=\"https:\/\/www.funk.co.nz\/blog\/wp-content\/uploads\/2012\/05\/webmail-source-rollup.png 726w, https:\/\/www.funk.co.nz\/blog\/wp-content\/uploads\/2012\/05\/webmail-source-rollup-600x452.png 600w, https:\/\/www.funk.co.nz\/blog\/wp-content\/uploads\/2012\/05\/webmail-source-rollup-650x489.png 650w\" sizes=\"auto, (max-width: 726px) 100vw, 726px\" \/><p id=\"caption-attachment-674\" class=\"wp-caption-text\">GA Webmail Rollup Filter<\/p><\/div>\n<h2>Medium Rollup Filter<\/h2>\n<p>Advanced filter.<\/p>\n<p>Field A -&gt; Extract A: Campaign Source:<\/p>\n<p>(messag|courrier|zimbra|^imp|mail).*\\.(.*\\..{2,4}|go\\.mail\\.ru|promail\\.co\\.nz|service\\.mail\\.com|3c\\.web\\.de|outlook\\.com)<\/p>\n<div><\/div>\n<p>Field B -&gt; Extract B: <em>[leave both blank]<\/em><\/p>\n<p>Output To -&gt; Constructor: Campaign <strong>Medium<\/strong>: email<\/p>\n<div id=\"attachment_630\" 
style=\"width: 640px\" class=\"wp-caption alignnone\"><a href=\"https:\/\/www.funk.co.nz\/blog\/wp-content\/uploads\/2012\/05\/webmail-rollup.png\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-630\" class=\"size-full wp-image-630\" title=\"webmail-rollup\" src=\"https:\/\/www.funk.co.nz\/blog\/wp-content\/uploads\/2012\/05\/webmail-rollup.png\" alt=\"GA Webmail Rollup Filter\" width=\"630\" height=\"506\" srcset=\"https:\/\/www.funk.co.nz\/blog\/wp-content\/uploads\/2012\/05\/webmail-rollup.png 630w, https:\/\/www.funk.co.nz\/blog\/wp-content\/uploads\/2012\/05\/webmail-rollup-600x482.png 600w\" sizes=\"auto, (max-width: 630px) 100vw, 630px\" \/><\/a><p id=\"caption-attachment-630\" class=\"wp-caption-text\">GA Webmail Rollup Filter<\/p><\/div>\n<p>&nbsp;<\/p>\n<h3>Additional reference domains to check:<\/h3>\n<p>The domains below are the really rare webmail clients that are hard to extract:<\/p>\n<p>mail175-236.sinamail.sina.com.cn<br \/>\ngo.mail.ru<br \/>\nservice.mail.com<br \/>\npromail.co.nz<br \/>\n3d.web.de<br \/>\nch1prd0310.outlook.com<\/p>\n<p>This is why I needed to grab the full domain with (.*\\..{2,4}) rather than the first version's (.*)\\.(.*)\\..{2,4}, which would have grabbed only \"sinamail\". Now we get sinamail.sina.com.cn.<\/p>\n<p>These can be checked with: sina\\.com\\.cn|go\\.mail\\.ru|mail\\.com|promail\\.co\\.nz|web\\.de|outlook\\.com<\/p>\n<h3>Errors and Issues<\/h3>\n<p>Currently the filter incorrectly tracks the following referral as word of email:<\/p>\n<p>dailymail.co.uk<\/p>\n<p>I will update the filter to address this issue at some point.<\/p>\n<h3>References<\/h3>\n<p>Thanks to <a href=\"http:\/\/www.resoneo.com\/\">Olivier Resoneo<\/a> for the original inspiration (French). 
His code was:<\/p>\n<blockquote>\n<div id=\"LC1\">Group all French-language webmail under the main domain name<\/div>\n<div id=\"LC2\"><\/div>\n<div id=\"LC3\">Custom filter<\/div>\n<div id=\"LC4\">Advanced<\/div>\n<div id=\"LC5\">Field A : Campaign Source : (messag|courrie|zimbra|ima?p|mail|prd[0-9]+)(.*)\\.(.*)\\..{2,4}<\/div>\n<div id=\"LC6\">Field B : (nothing) -<\/div>\n<div id=\"LC7\">Output To -&gt; Constructor : Campaign Source : Webmail - $A3<\/div>\n<div id=\"LC8\">Yes<\/div>\n<div id=\"LC9\">No<\/div>\n<div id=\"LC10\">Yes<\/div>\n<div id=\"LC11\">No<\/div>\n<div id=\"LC12\"><\/div>\n<div id=\"LC13\"><\/div>\n<div id=\"LC14\">You can also adapt this to force the medium to 'email' when Campaign Source matches, AND Campaign to email-non-taggue for example, to get the medium\/source\/campaign triplet<\/div>\n<div><\/div>\n<\/blockquote>\n","protected":false},"excerpt":{"rendered":"<p>The problem? Nasty load balanced domains in Google Analytics reports like this: Source Visits 36ohk6dgmcd1n-c.c.yom.mail.yahoo.net \/ referral 149 mail.google.com \/ referral 131 du114w.dub114.mail.live.com \/ referral 43 du103w.dub103.mail.live.com \/ referral 25 sn124w.snt124.mail.live.com \/ referral 23 The solution? 
Clean nicely segmented source lines that &#8220;roll up&#8221; into one: Source\/Medium Visits Webmail (live.com) \/ email 643 Webmail (yahoo.com) \/ email 258 Webmail [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[21,17],"tags":[24],"class_list":["post-629","post","type-post","status-publish","format-standard","hentry","category-google-analytics","category-online-marketing","tag-regular-expressions"],"_links":{"self":[{"href":"https:\/\/www.funk.co.nz\/blog\/wp-json\/wp\/v2\/posts\/629","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.funk.co.nz\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.funk.co.nz\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.funk.co.nz\/blog\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/www.funk.co.nz\/blog\/wp-json\/wp\/v2\/comments?post=629"}],"version-history":[{"count":0,"href":"https:\/\/www.funk.co.nz\/blog\/wp-json\/wp\/v2\/posts\/629\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.funk.co.nz\/blog\/wp-json\/wp\/v2\/media?parent=629"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.funk.co.nz\/blog\/wp-json\/wp\/v2\/categories?post=629"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.funk.co.nz\/blog\/wp-json\/wp\/v2\/tags?post=629"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}