There’s actually a lot of low-hanging fruit in web scraping, with the usual associated legal issues.

One challenge with both scraping and user-inputted product data is normalization. “KLNX 6PK 3CT POUCH” is understandable to both a grocer and a POS operator as a 6-pack of 3-pouch containers of Kleenex, but extracting that information into a machine-readable format is somewhat difficult.
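To make the difficulty concrete, here is a minimal sketch of a token-based parser for strings like that one. The abbreviation table and the unit suffixes (`PK`, `CT`, `OZ`, and so on) are assumptions I made up for illustration; real retailer data is much less regular, which is exactly the problem.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical abbreviation table; a real one would be large and hand-curated.
BRAND_ABBREVIATIONS = {"KLNX": "Kleenex", "BURTS B": "Burt's Bees"}

# Assumed token shapes: a number directly followed by a unit suffix.
PACK_RE = re.compile(r"^(\d+)PK$")
COUNT_RE = re.compile(r"^(\d+)CT$")
MEASURE_RE = re.compile(r"^(\d*\.?\d+)(OZ|LB|ML|G)$")


@dataclass
class ParsedDescription:
    brand: Optional[str] = None
    pack: Optional[int] = None                   # e.g. 6 from "6PK"
    count: Optional[int] = None                  # e.g. 3 from "3CT"
    measure: Optional[Tuple[float, str]] = None  # e.g. (0.5, "OZ")
    rest: List[str] = field(default_factory=list)  # tokens we could not classify


def parse_description(text: str) -> ParsedDescription:
    result = ParsedDescription()
    upper = text.upper()
    # Try the longest abbreviation first so multi-word ones win.
    for abbr in sorted(BRAND_ABBREVIATIONS, key=len, reverse=True):
        if upper == abbr or upper.startswith(abbr + " "):
            result.brand = BRAND_ABBREVIATIONS[abbr]
            upper = upper[len(abbr):]
            break
    for token in upper.split():
        pack = PACK_RE.match(token)
        count = COUNT_RE.match(token)
        measure = MEASURE_RE.match(token)
        if pack:
            result.pack = int(pack.group(1))
        elif count:
            result.count = int(count.group(1))
        elif measure:
            result.measure = (float(measure.group(1)), measure.group(2))
        else:
            result.rest.append(token)
    return result


print(parse_description("KLNX 6PK 3CT POUCH"))
```

Anything the patterns do not recognize lands in `rest`, which is where the hard part (“POUCH”, “LIP GLOSS NUDE”) still needs a human or an NER model.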

Once you start trying to use the manufacturer prefix from the EAN/UPC to assist with manufacturer normalization, there’s another problem: often GS1 prefixes are either made up or have been transferred by acquisition so many times that they have nothing to do with the products bearing them.

Open-source NLP pipeline projects (like Stanford CoreNLP) are valuable in this pursuit; many research NER models are already trained to recognize company names and quantities as entities, so there’s a lot to work off of.

I posit that an open “crowdsourced” product database should be heavily user-dependent for normalization and should adhere to a strict schema to start; “scan a GTIN and upload a picture and name” will very quickly become almost more frustrating than having no data.

Manufacturer has many manufacturers recursively (really "subsidiaries" I suppose)
Manufacturer has many brands
Brand has many products
Product has many SKUs
SKU has many SKUs (recursive, to represent e.g. "case of bottles" or "pallet of cases of bottles")
SKU has an association with a numeric (count) quantity
SKU has an association with an absolute (measure) quantity
SKU has many barcodes
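A rough sketch of that schema as plain Python dataclasses, in case it helps to see the shape. The field names are my own, and treating `count` as "how many of each child SKU this SKU contains" is one interpretation of the recursive relationship, not the only one.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Sku:
    name: str
    count: Optional[int] = None         # numeric (count) quantity
    measure: Optional[float] = None     # absolute (measure) quantity
    measure_unit: Optional[str] = None  # unit for `measure`, e.g. "ml"
    barcodes: List[str] = field(default_factory=list)
    children: List["Sku"] = field(default_factory=list)  # recursive SKUs


@dataclass
class Product:
    name: str
    skus: List[Sku] = field(default_factory=list)


@dataclass
class Brand:
    name: str
    products: List[Product] = field(default_factory=list)


@dataclass
class Manufacturer:
    name: str
    subsidiaries: List["Manufacturer"] = field(default_factory=list)
    brands: List[Brand] = field(default_factory=list)


def base_units(sku: Sku) -> int:
    """Number of leaf items inside a SKU (assumed meaning of `count`)."""
    multiplier = sku.count if sku.count is not None else 1
    if not sku.children:
        return multiplier
    return multiplier * sum(base_units(child) for child in sku.children)


# "pallet of cases of bottles", with made-up quantities
bottle = Sku("bottle", measure=500.0, measure_unit="ml")
case = Sku("case of bottles", count=12, children=[bottle])
pallet = Sku("pallet of cases", count=40, children=[case])
print(base_units(pallet))
```

In a real database these would be tables with foreign keys (and a self-referencing join for SKUs and manufacturers), but the relationships are the same.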

This product is listed as “BURTS B LIP GLOSS NUDE 0.5OZ” in a retailer’s SKU table I found online, so you can see how the simple “UPC, name” tuple often doesn’t quite cut it as useful information.

For user input I suppose an aggressive auto-complete-based system with a lot of required fields could be useful, although there’s a balance to strike between ease of entry (which maximizes data ingress) and data quality.

We provided a facility for users to upload pictures of products we didn’t know about and then we would manually input the information (ourselves, interns, etc.) to ensure the best possible normalization and data quality.

Some databases I’ve seen attach even more metadata to the SKU, for example “Form” (in the Burt’s Bees example, that’d be “stick”).

One difficulty with this schema was handling GS1 prefixes to try to automatically infer a manufacturer or brand when we didn’t have the specific product. We first tried associating them with brand - that got messy fast. Then we tried associating them with manufacturer, because that’s how the registration process is supposed to work, but that turned out to be lossy. Prefixes ended up being their own enormous independent hierarchy of prefix-brands and prefix-organizations that would eventually resolve to either a brand -or- manufacturer.
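Very roughly, the lookup can be sketched like this: a longest-prefix match on the GTIN digits to find a prefix node, then a walk up the prefix hierarchy until some node resolves to a brand or a manufacturer. Every name, key, and prefix below is invented for illustration; this is not our actual code or data.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class PrefixNode:
    key: str
    parent: Optional[str] = None        # key of the parent prefix node, if any
    brand: Optional[str] = None         # set if this node resolves to a brand
    manufacturer: Optional[str] = None  # set if it resolves to a manufacturer


# Made-up hierarchy: one prefix-organization and two prefix-brands.
NODES: Dict[str, PrefixNode] = {
    "org:acme": PrefixNode("org:acme", manufacturer="Acme Holdings"),
    "pb:acme-snacks": PrefixNode("pb:acme-snacks", parent="org:acme"),
    "pb:crunchy": PrefixNode("pb:crunchy", parent="org:acme", brand="Crunchy"),
}

# Made-up digit prefixes mapped to node keys.
PREFIXES: Dict[str, str] = {
    "0123456": "pb:acme-snacks",
    "012345678": "pb:crunchy",
}


def resolve(gtin: str) -> Optional[Tuple[str, str]]:
    """Return ("brand", name) or ("manufacturer", name), or None if unknown."""
    # Longest-prefix match: try the full code first, then shorter prefixes.
    key = None
    for length in range(len(gtin), 0, -1):
        key = PREFIXES.get(gtin[:length])
        if key is not None:
            break
    seen = set()  # guards against cycles in a messy hierarchy
    while key is not None and key not in seen:
        seen.add(key)
        node = NODES.get(key)
        if node is None:
            return None
        if node.brand:
            return ("brand", node.brand)
        if node.manufacturer:
            return ("manufacturer", node.manufacturer)
        key = node.parent
    return None


print(resolve("012345678905"))
print(resolve("012345600000"))
```

The point of keeping prefixes in their own tree is that a node only has to know its parent; whether the chain ends at a brand or a manufacturer is decided wherever the data happens to be good enough.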

Interesting; I think the linked article is out of date, as I know that at least Georgia used to use an “encrypted” barcode format provided by “L-1 Solutions”, but as of 2012 it moved to the AAMVA-recommended PDF417 approach.

But if you need legacy support, I see where a historic driver’s license archive could be useful.

Often bar bouncers will have a “big book” which they use when attempting to authenticate licenses; have you looked into obtaining one of those for your product? It sounds like “open” is useful but not a necessity.