[olug] First Ruby on Rails user group meeting 08/20/2008 (and regex question)

Christopher Cashell topher-olug at zyp.org
Mon Aug 18 15:57:38 UTC 2008


On Mon, Aug 18, 2008 at 9:58 AM, Dan Linder <dan at linder.org> wrote:
> On Mon, Aug 18, 2008 at 9:18 AM, Adam Haeder <adamh at aiminstitute.org> wrote:
>> I'm attempting to write a shell script that will pull email addresses out
>> of a file. These addresses may appear anywhere in a line. I think what I'm
>> essentially looking for is a 'substring grep'. I want a grep that will
>> give me part of a line that matches a regex. I've been toying with the
>> idea of doing something like this:

[Snip: test input]

> $ perl -ne 'printf "%s\n", $1 if m/([\w\.]+\@[\w\.]+)/' testfile.emails

Catching (or validating) e-mail addresses is one of those things that
seems really easy on the surface, but is actually a lot more
complicated than you would expect (I learned that the hard way, after
trying to roll my own regex for it and finding constant cases where it
didn't work).

For example, the above regex would not catch the following addresses
whole, because neither '-' nor '+' is in its character class:
user-foo@bar.com, user+foo@bar.com, user@foo-bar.com.
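
To see what the simpler pattern does with those three, here is a
rough sketch using grep -o (print only the matched part of each line),
which is about as close to a 'substring grep' as it gets.  It assumes
a grep that supports -o (GNU and BSD grep both do).  extract_simple is
a made-up name, and [[:alnum:]_.] is an assumed ERE stand-in for
Perl's [\w\.].

```shell
# Hypothetical helper: print every substring that matches the simpler
# pattern.  [[:alnum:]_.] approximates Perl's [\w\.] in POSIX ERE syntax.
extract_simple() {
    grep -oE '[[:alnum:]_.]+@[[:alnum:]_.]+' "$@"
}

# Feed it the three problem addresses, one per line.
printf '%s\n' 'user-foo@bar.com' 'user+foo@bar.com' 'user@foo-bar.com' |
    extract_simple
```

Each address comes back cut short: foo@bar.com, foo@bar.com, and
user@foo.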

Those are some of the simpler ones to catch.  If you want a "pretty
good" regex that will catch almost all common e-mail addresses (but
will also allow through some invalid ones), the best simple one I've
found is:

perl -ne 'print $1 . "\n" if
m/(\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b)/i' emails.datafile

Where this one will fail is with non-English letters, a few valid but
potentially difficult to handle special characters (like single
quotes; better escape them if this will go in a SQL database),
addresses under TLDs longer than four letters (.museum, since the
pattern ends in [A-Z]{2,4}), etc.  It also allows through invalid
e-mails like foo@...bar.  For quick and dirty work, it's almost always
good enough for me, but again, to really do it right, things get ugly
in a hurry.
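
Those limits are easy to check from the shell.  The sketch below
feeds the same "pretty good" pattern to grep -oiE instead of perl.
It assumes GNU grep (for -o and the \b word-boundary extension);
extract_emails and the sample lines are made up for illustration.

```shell
# Hypothetical helper: print every substring that matches the
# "pretty good" pattern, case-insensitively.  Reads stdin or files.
extract_emails() {
    grep -oiE '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b' "$@"
}

# Two good addresses on one line, one .museum address, one invalid one.
printf '%s\n' \
    'Contact: user+foo@bar.com, or user@foo-bar.com today.' \
    'Curator: curator@example.museum' \
    'Broken: foo@...bar' |
    extract_emails
```

This prints user+foo@bar.com, user@foo-bar.com and foo@...bar, one per
line.  The .museum address is dropped and the invalid one slips
through.  Unlike the perl one-liner, which prints only the first match
on a line, grep -o prints every match.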

If you want to be sure you're catching all valid e-mail addresses,
while also not including any invalid ones, your best bet is to find a
regex-based e-mail address handling library that will do the heavy
lifting for you.

For more information on this, one of the better short write-ups I've
seen on the issue can be found at
http://www.regular-expressions.info/email.html.  That's also where the
regex that I listed above came from.  Alternatively, if you google for
'e-mail regex' you'll find lots of hits and discussion about this
whole issue.

> Dan

-- 
Christopher


