My situation is sounds rather simple. All I am doing is matching a spreadsheet
of products against our database. My job is to find any matches against
existing products and determine which ones are new, which ones are
replacements for older products, and which ones just need to have the
publication date (and page count, price, whatever) refreshed.
Publisher is no problem. What I have for each "feed" is a title and (most of
the time) an ISBN or other identification assigned by the publisher.
Matching by product ID is easy (assuming there aren't any mistakes in the
current or previous feeds); but the publisher might or might not change the
product ID when they update a report. That's why I also run a match by title,
and that's where all the trouble comes from.
The publisher might or might not include a mix of old and new products in a
feed. The publisher might change the title of an existing product, either on
purpose or by accident; they might simply be sloppy about their spelling; or
(and this is where it is critical) the title might include a reference to some
time period such as a year or a quarter.
I think we'd better pull the plug on this discussion. It doesn't seem like
there's a ready solution. Fortunately our database is small, and most feeds
are only a few hundred products.
Regards,
Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032
860.674.8796 / FAX: 860.674.8341
E-mail: jerry@stripped
Web site: www.the-infoshop.com
>-----Original Message-----
>From: shawn wilson [mailto:ag4ve.us@stripped]
>Sent: Tuesday, May 03, 2011 4:08 PM
>Cc: mysql mailing list
>Subject: Re: Join based upon LIKE
>
>I'm actually enjoying this discussion because I have the same type of issue.
>However, I have done away with trying to do a full text search in favor of
>making a table with unique fields where all fields should uniquely identify
>the group. If I get a dupe, I can clean it up.
>
>However, like you, they don't want me to mess with the original data. So,
>what I have is another table with my good data that my table with my unique
>data refers to. If a bad record is creased, I don't care I just create my
>relationship to the table of data I know (read think - I rarely look at this
>stuff) is good.
>
>So, I have 4 fields that should be unique for a group. Two chats and two
>ints. If three of these match a record in the 'good data' table - there's my
>relationship. If two or less match, I create a new record in my 'good data'
>table and log the event. (I haven't gotten to the logging part yet though,
>easy enough just to look sense none of the fields in 'good data' should
>match)
>
>I'm thinking you might have to dig deeper than me to find 'good data' but I
>think its there. Maybe isbn, name, publisher + address, price, average
>pages, name of sales person, who you guys pay for the material, etc etc etc.
>
>
>On May 3, 2011 10:59 AM, "Johan De Meersman" <vegivamp@stripped> wrote:
>>
>>
>> ----- Original Message -----
>> > From: "Jerry Schwartz" <jerry@stripped>
>> >
>> > I'm not sure that I could easily build a dictionary of non-junk
>> > words, since
>>
>> The traditional way is to build a database of junk words. The list tends
>to be shorter :-)
>>
>> Think and/or/it/the/with/like/...
>>
>> Percentages of mutual and non-mutual words between two titles should be a
>reasonable indicator of likeness. You could conceivably even assign value to
>individual words, so "polypropylbutanate" is more useful than "synergy" for
>comparison purposes.
>>
>> All very theoretical, though, I haven't actually done much of it to this
>level. My experience in data mangling is limited to mostly
>should-be-fixed-format data like sports results.
>>
>>
>> --
>> Bier met grenadyn
>> Is als mosterd by den wyn
>> Sy die't drinkt, is eene kwezel
>> Hy die't drinkt, is ras een ezel
>>
>> --
>> MySQL General Mailing List
>> For list archives: http://lists.mysql.com/mysql>> To unsubscribe: http://lists.mysql.com/mysql?unsub=1>>

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.