Data preprocessing on an email dataset

-1

I have a relatively large dataset (5000+ lines) containing names, email addresses and other characters.

I want to extract only the email addresses without breaking up the lines. My first thought was to use a negated regex match on the email addresses. I got the expression right, but every line containing a match was deleted in full.

Here's a sample of the dataset:

Qxvpbbj <ftnqfqkfrk@jnmg.twzrsws.bvj>,jplj.kpuof@mycnk.jlq
Vxclvës Wxaunfq <hdyminv.mhlygqn@mnajemi.sv>,"""askn.mgpxn@njemi.psh <mhfc.swsrv@dqefy.gjt>, frtnqzqvri@llgqvph.wx <vcsavrvsrr@ljairox.hw>"""
hucs@grezzmamkd.etl,misa.dhrnv@cdgun.hnt
Pivsvgz Semxmkxnu Ps Glymnzedu <xkaqhfwsk@fsxvhzyekrn.mo>,JPLJ.KPUOF@MYCNK.JLQ

And the result would look like the following:

ftnqfqkfrk@jnmg.twzrsws.bvj, jplj.kpuof@mycnk.jlq
hdyminv.mhlygqn@mnajemi.sv, mhfc.swsrv@dqefy.gjt, vcsavrvsrr@ljairox.hw
hucs@grezzmamkd.etl, misa.dhrnv@cdgun.hnt
xkaqhfwsk@fsxvhzyekrn.mo, JPLJ.KPUOF@MYCNK.JLQ

Any advice on software or methods I could use to get that formatting?

flowgunso

Posted 2016-07-31T23:44:03.557

Reputation: 1

1

It's a mystery to me why you'd want to do that. The result isn't any easier to read, let alone parse, than the original. If you would be happy extracting just the emails without retaining the formatting, there are plenty of options. I personally would go for Perl and use the module Email::Address.

– Edward – 2016-08-16T08:01:12.840

@Edward simply put, I want to find out which web services I registered to, and this is the first step. I've never used Perl, I'll give it a try to automate the full process, thanks! – flowgunso – 2016-08-17T18:24:57.283

You may not even need to use Perl if there's another tool you know better, but retaining the formatting seems like an unnecessary burden to me. If you can do away with the formatting, and you don't need to handle very unusual addresses, I'd suggest having a look at this site: http://www.regular-expressions.info/email.html

– Edward – 2016-08-23T07:51:15.677

And there's this answer over at Stack Overflow, too: https://stackoverflow.com/q/201323/2866527

– Edward – 2016-08-23T07:53:44.583

Answers

0

I've managed to clean up the dataset using a data quality tool, DataCleaner. The results are not perfect, though.
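For comparison, the line-preserving extraction described in the question can also be scripted. The following is only a minimal sketch in Python, not what DataCleaner does. It makes two assumptions: a deliberately simple address pattern (not a full RFC 5322 parser) is good enough, and an address immediately followed by `<...>` is a display name to drop, which is what the expected output in the question suggests. The names `EMAIL`, `extract_emails` and `clean_line` are hypothetical, introduced just for this example.

```python
import re

# Simplified address pattern (an assumption): it will miss quoted local
# parts and other unusual but valid addresses.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(line):
    """Return the addresses found on one line, in order of appearance."""
    kept = []
    for match in EMAIL.finditer(line):
        rest = line[match.end():].lstrip()
        # Assumption: an address directly followed by "<...>" is being
        # used as a display name, so only the bracketed one is kept.
        if rest.startswith("<"):
            continue
        kept.append(match.group())
    return kept


def clean_line(line):
    """Rebuild one output line: the kept addresses joined by ', '."""
    return ", ".join(extract_emails(line))


# Two lines taken from the sample in the question.
sample = [
    'Qxvpbbj <ftnqfqkfrk@jnmg.twzrsws.bvj>,jplj.kpuof@mycnk.jlq',
    'Vxclvës Wxaunfq <hdyminv.mhlygqn@mnajemi.sv>,"""askn.mgpxn@njemi.psh '
    '<mhfc.swsrv@dqefy.gjt>, frtnqzqvri@llgqvph.wx <vcsavrvsrr@ljairox.hw>"""',
]

for line in sample:
    print(clean_line(line))
```

Because each input line is processed on its own and produces exactly one output line, the line structure is kept, and a line with no address simply becomes an empty line instead of being removed.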

flowgunso

Posted 2016-07-31T23:44:03.557

Reputation: 1