The SitePoint Forums have moved.

You can now find them here.
This forum is now closed to new posts, but you can browse existing content.
You can find out more information about the move and how to open a new account (if necessary) here.
If you get stuck you can get support by emailing forums@sitepoint.com

If this is your first visit, be sure to
check out the FAQ by clicking the
link above. You may have to register
before you can post: click the register link above to proceed. To start viewing messages,
select the forum that you want to visit from the selection below.

Help with cleaning up misquoted values in a tab-delimited file

Hello,

I'm hoping someone can help me out here because I've almost lost the will to live. I'm not sure if this is the right place but I figure that as you guys are dealing with data then you may well have come up against something similar.

I'm trying to prepare a large amount of products (about 5000) to insert into a shopping cart system. I have a text file with all the products in with the data delimited by tabs. Amendments need to be made to the data in Excel before its inserted using the carts admin system. This is all fine until I export the data from Excel.

Aside from Excel replacing random characters with other characters (which has taken me a stupid amount of time to sort out), it surrounds any text field that contains a comma with double quotes (eg "This is a product, description"). The problem is that it only does this to certain fields ie the ones with commas and there doesn't seem to any way to turn it off.

The system I'm inserting the data into has an option to strip off double quotes but because the data also contains inch marks and valid uses of double quotes the system chokes and can't distinguish between the valid uses and the ones I want it to strip.

Is there a way to get Excel to either put double quotes around all fields (I could then remove them with a search and replace in BBEdit) OR force it to not use double quotes at all?

I've looked in the help files and they're next to useless - All they say is "Excel places quotation marks around text with commas" as if its a handy feature.

If someone could help me out here I would really really really appreciate it.

It is a handy feature--if it did not quote the fields, you would now have two fields, breaking your table import. And most CSV

I don't understand - It's a tab separated file not CSV. Why would having a comma in it create two fields? Surely there must be an option to turn it off?

Originally Posted by wwb_99

One option would be to import the data into access and use access' text export features to push it out of the db. You could then have it quote all the fields and the system might be able to handle it.

Another option from access would be using something that can talk to it--like most any scripting package that can talk to ODBC--to generate actual INSERT statements from the data.

Sorry if I seem ungrateful but this just seems like a lot of work. I know nothing about Access. I thought Excel was designed to handle spreadsheets of data. It seems like a pretty basic function to export a file without it adding stuff I haven't asked it to. Is there no other way without involving Access?

I don't understand - It's a tab separated file not CSV. Why would having a comma in it create two fields? Surely there must be an option to turn it off?

Sorry if I seem ungrateful but this just seems like a lot of work. I know nothing about Access. I thought Excel was designed to handle spreadsheets of data. It seems like a pretty basic function to export a file without it adding stuff I haven't asked it to. Is there no other way without involving Access?

Sorry, I misread your initial post about the tab stuff. Anyhow, the access method is very quick and easy--just import spreadsheet, export text file. The advantage is the access has a much, much better text export wizard which will let you do stuff like not export the quotes. And trust me, I have munged alot of spreadsheets into usable data in my days.

Which leads me to your second point--spreadsheets are not data. They are NOT schema bound in any meaningful way. They lack row integrety and data type enforcement. Excel is a great business intelligence and arguably one of the best reporting tools on the planet. But it is NOT a database in any meaningful way, even if it can be abused as one.

a code snippet for pre-processing the file with PHP comes to mind. would you be interested in that?

Yes please. Yes please. Yes please. If it can be done by a running a script on the actual exported .txt file that would be amazing (Is that what you were thinking of?). If it could be done like this it would save me having to go back and redoing the work thats been done to the file since it was exported from Excel.

I think it might be quite complicated though (although maybe a bit of a challenge ).

What I think would be needed is to strip off all double quotes that:

1) Have a tab immediately in front of them but only those that are followed by another double quote immediately before the next tab character in sequence

2) Have a tab immediately after them (or a line break) but only those that are preceded by another double quote with the previous tab character in sequence immediately before that

I think it would have to be done like this because some of the data is in quotes and some not and some actually should start/end with a quote. Example data:

Sorry, I misread your initial post about the tab stuff. Anyhow, the access method is very quick and easy--just import spreadsheet, export text file. The advantage is the access has a much, much better text export wizard which will let you do stuff like not export the quotes. And trust me, I have munged alot of spreadsheets into usable data in my days.

The problem with this (although admittedly I didn't mention it in my first post) is that I've made quite considerable changes to the file since it was exported from Excel. I didn't realise that the extra quotes were going to cause such a problem until I tried to insert the "finished" file. So it would mean taking the original file into Access doing the necessary changes again, then exporting, and then redoing all the other changes. Because of that, processing the existing file is a lot more attractive to me - But if that doesn't work I may have to look at using Access as you describe or another method. Thanks.

The problem with this (although admittedly I didn't mention it in my first post) is that I've made quite considerable changes to the file since it was exported from Excel. I didn't realise that the extra quotes were going to cause such a problem until I tried to insert the "finished" file. So it would mean taking the original file into Access doing the necessary changes again, then exporting, and then redoing all the other changes. Because of that, processing the existing file is a lot more attractive to me - But if that doesn't work I may have to look at using Access as you describe or another method. Thanks.

You can import from tab delimited to access (or excel), and it handles conditional quotes well. All of that is a moot point because you are on a mac and they are a bit hamstrung in the desktop db sector.

You should be able to get someone to cook up something to process the original file.

it appears you want to eliminate double quotes if the first and the last character of a field is quotes, right? you can do this in SQL, but the exact syntax varies depending on the server type. here's some mysql compatible SQL:

The problem is I can't get the correct data into the correct fields in the database due to the excess quotes. By the time it gets into the DB some fields have already been joined together and others have nothing in when they should (Due to the confusion caused by the quote marks). Thats why I thought it may be better to strip the quotes before its gets near the DB?

I've already done many search and replaces on the file to correct all the junk that Excel put into it and I'm comfortable doing this. The problem I have is as you describe - There are many valid instances of " at the start and end of the fields which I don't want to strip. There are just too many to not worry about them (at least 100s and maybe a lot more) and it would mean checking through all the products after the search and replace.

I don't think there is any way to distinguish the valid from the invalid using a standard search and replace. The only way to "tell them apart" is as Longneck described above (i.e. the invalid ones are those where the first and the last character of a field is quotes).

Is there a way to do this in PHP rather than SQL? Should I be posting this in the PHP forum?

The technique longneck described above is essentially finding and replacing anything with quotes at the edge of a field. And, yes, you could do this with PHP. You are probably going to need to use some regular expressions to get there.

In furtherance of the 'fix it with php' effort, I am moving this to the PHP forum.

The technique longneck described above is essentially finding and replacing anything with quotes at the edge of a field. And, yes, you could do this with PHP. You are probably going to need to use some regular expressions to get there.

In furtherance of the 'fix it with php' effort, I am moving this to the PHP forum.

Could anyone help me with this? Regular expressions are a bit beyond me...

To clarify, what I'm after is a script that will strip the first and last character from all fields but only if both those characters (First and last in that field) are double quotes.

So it would strip: "Speaker"
But not: Speaker 19"

If someone could help me out with this I would really appreciate it.
Thanks.

I realize this post is almost a year old but... In case anyone else stumbles across it, here's a function that will remove quotes from the beginning and end of a string only if there is a quote at the beginning AND the end. Should work for all of the cases outlined by magpie in the Jun 26, 2007 15:35 post but would fail under conditions like:
[tab]"quoted text" other text "more quoted text"[tab] --- (This would be stripped even tho the desired result would be to have it left untouched)

I'm sure regex gurus could do a lot better but.. For a quick and dirty, this should work.