WikiProject Chemicals and WikiProject Pharmacology are validating the content of the infoboxes {{chembox}} and {{drugbox}}. Values in the infobox are compared with values reported in the literature, and when the values match, the current revision is stored in the index for chembox or the index for drugbox, respectively. This is typically done for values that are 'immutable' (e.g., the boiling point of a chemical compound: the boiling point of water under standard conditions is 99.98°C, and there is no plausible reason to suspect it will change).

CheMoBot follows changes to these articles and is set up to update the infoboxes. When it detects a change to a value, it changes parameters in the infobox accordingly. The template uses these parameters to show the status of each field in the box.

If you encounter a page with a {{chembox}} or {{drugbox}} that shows a cross (N), please check whether the current value is wrong (in which case it can simply be changed back to the value in the verified revision; the bot will do the rest) or whether there is a mistake in the verified revision (in which case the index may need updating; if you need help with that, please ask the appropriate WikiProject).
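The comparison against a verified revision can be sketched roughly as follows. This is only an illustration, not CheMoBot's actual code: the function name compare_to_verified and the dictionary representation of an infobox are assumptions, and the status words 'correct' and 'changed' are borrowed from the {{cascite}} convention described on this page.

```python
def compare_to_verified(current: dict, verified: dict) -> dict:
    """Return 'correct' or 'changed' for every field present in the verified revision.

    Both arguments map infobox parameter names to their values
    (a hypothetical representation chosen for this sketch).
    """
    status = {}
    for field, verified_value in verified.items():
        current_value = current.get(field, "")
        if current_value.strip() == verified_value.strip():
            status[field] = "correct"
        else:
            status[field] = "changed"
    return status


# Example: the formula was edited after the revision was verified.
verified = {"CASNo": "71-43-2", "Formula": "C6H6"}
current = {"CASNo": "71-43-2", "Formula": "C6H5"}
print(compare_to_verified(current, verified))
```

A field reported as 'changed' is the case where a human needs to decide whether the article or the verified revision is wrong.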

Verification - tagging references

CheMoBot adds a template to a _Ref parameter (e.g. for CASNo, CASNo_Ref will be filled with {{cascite|correct|XXX}}) when the bot finds the field correct. The first parameter of the template is 'correct' or 'changed', and the box will show a tick or a cross accordingly on CASNo. The second parameter records where the value was verified. As we are at the moment verifying all fields against the CAS commonchemistry.org site, the bot replaces XXX with 'CAS' (i.e., {{cascite|correct|CAS}}). When using another source to verify the CASNo, please adapt this parameter accordingly; the bot will try to retain this field throughout. Once there are significantly more verifications against sources other than commonchemistry.org, I will instruct the bot to fill the field by default with {{cascite|correct|??}} or something similar.
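As a small illustration of the tagging convention, the value of a _Ref parameter could be assembled like this. The helper name cascite is hypothetical and this is not the bot's own code; only the template name, the two status words and the 'CAS' source come from the description above.

```python
def cascite(status: str, source: str = "CAS") -> str:
    """Build the wikitext for a CASNo_Ref value, e.g. {{cascite|correct|CAS}}."""
    if status not in ("correct", "changed"):
        raise ValueError("status must be 'correct' or 'changed'")
    return "{{cascite|" + status + "|" + source + "}}"


print(cascite("correct"))         # verified against commonchemistry.org
print(cascite("changed", "CAS"))  # value differs from the verified revision
```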

Method of work

Our approach is to start by checking that the CAS registry number and the structure match with the name. This will be used as a foundation upon which we can build a broader validation effort. Once we have the structure verified, we have the formula, and hence the molar mass, and we can also generate other machine representations such as SMILES, InChI and InChIKey.
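To illustrate the step from formula to molar mass, here is a minimal sketch. It assumes a flat formula without brackets, hydrates or charges, and uses a small table of rounded standard atomic weights. The function name molar_mass and the table are assumptions made for this example, not part of any project tool.

```python
import re

# Rounded standard atomic weights in g/mol; extend the table as needed.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def molar_mass(formula: str) -> float:
    """Molar mass of a flat formula such as 'C6H12O6' (no brackets or hydrates)."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol not in ATOMIC_MASS:
            raise ValueError(f"no atomic mass for element {symbol!r}")
        # A missing count means a single atom of that element.
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total


print(round(molar_mass("C6H6"), 3))     # benzene
print(round(molar_mass("C6H12O6"), 3))  # glucose
```

A value computed this way can be compared with the MolarMass field once the structure, and hence the formula, has been verified.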

First 1000

After our IRC meeting on January 13, 2009, we used an Excel file to validate the first 1000 entries from the CAS XML file. This is available to project members here, on the password-protected site. Meanwhile, User:Physchim62 validated the inorganics separately, and these can be found in the CAVer file.

The work

We are now beginning to work through the list of "problem articles" found by User:Beetstra, and listed at User:Beetstra/CASFoundCorrect. A description of the process will be added soon.

Notes

Different CAS numbers are used for each form of a substance. For example, something simple like alanine will have one CAS# for the D form, another for L, another for "unspecified" and a fourth one for racemic. There would be another four CAS#s for the hydrochloride, four for the (1:1) sulfate, four for the (2:1) sulfate, etc. It is very important that we match the correct form CAS# to our Chemboxes!

Be aware that CAS uses an unusual system for representing some formulae, which may seem "wrong" to us. For example, salts such as sodium nitrate are described as HNO3·Na, and organic salts follow a similar system. Do not use such formulae on WP, although they are not "wrong": they are merely a representation, not a formal structure. This system also results in an incorrect MolarMass in the FW section of the SDF file for salts.
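The effect on the molar mass can be seen with a little arithmetic. The sketch below assumes that a formula weight is computed directly from the formula as written and uses rounded standard atomic weights; the helper name mass is hypothetical.

```python
# Rounded standard atomic weights in g/mol.
ATOMIC_MASS = {"H": 1.008, "N": 14.007, "O": 15.999, "Na": 22.990}


def mass(counts: dict) -> float:
    """Sum atomic weights for a mapping of element symbol to atom count."""
    return sum(ATOMIC_MASS[element] * n for element, n in counts.items())


cas_style = {"H": 1, "N": 1, "O": 3, "Na": 1}  # HNO3·Na as CAS writes it
conventional = {"Na": 1, "N": 1, "O": 3}       # NaNO3 as used on WP

print(round(mass(cas_style), 3))
print(round(mass(conventional), 3))
print(round(mass(cas_style) - mass(conventional), 3))  # the extra hydrogen
```

Under these assumptions, the CAS-style formula is heavier than NaNO3 by the mass of one hydrogen atom, which is why the FW value should not be copied into a Chembox for salts without checking.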

For complex chiral structures, such as bleomycin, which may be drawn very differently in WP than in Common Chemistry, I found it best to assign R/S for each center and compare that way. (And yes, Farseer drew bleomycin perfectly!)

The CAS No. in a Chembox will receive a green tick (check mark) once {{cascite}} is added. This does not happen yet in the Drugbox (there is no change at present), but we hope to enable a similar system there too, if WP:PHARM is in agreement.

Fields to check/upload

Chemboxes

Check structure, CAS no., Formula, MolarMass.
Notes:
1: The bot 'divides' the fields into two sets, watched and unwatched. All changes are reported, but the watched fields are the ones we really want to take care of: those are the fields that contain hardcore, verifiable data that are very unlikely to change (such as the boiling point of water, the CAS number of benzene, or the number of carbons in glucose). N.B. the list of 'watched' fields may need to be updated.
2: The bot regards an empty field as 'unknown'. It will report changes to this field, but will assign a lower 'warning level' to it.
3: Things between <!-- and --> are 'comments', they can be saved and appear in the editbox, but do not produce visible wikicode.
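The three notes above can be combined into a rough sketch of how a change might be graded. Everything specific here is an assumption made for illustration: the set WATCHED_FIELDS, the numeric levels, the 'Appearance' field used in the example, and the choice to treat a field containing only a comment as empty. The bot's real rules may differ.

```python
import re

# Hypothetical watched set; the real list is maintained by the project.
WATCHED_FIELDS = {"CASNo", "Formula", "MolarMass"}


def visible(value: str) -> str:
    """Strip <!-- ... --> comments, which produce no visible wikicode."""
    return re.sub(r"<!--.*?-->", "", value, flags=re.DOTALL).strip()


def warning_level(field: str, old: str, new: str):
    """Return None if nothing visible changed, else a level from 0 to 2.

    Assumed scheme: watched fields start at 2, unwatched at 1, and a change
    involving an empty ('unknown') value is lowered by one. Every change is
    still reported, even at level 0.
    """
    old_v, new_v = visible(old), visible(new)
    if old_v == new_v:
        return None
    level = 2 if field in WATCHED_FIELDS else 1
    if not old_v or not new_v:
        level -= 1
    return level


print(warning_level("CASNo", "71-43-2", "71-43-3"))   # watched, both known
print(warning_level("CASNo", "", "71-43-2"))          # watched, was unknown
print(warning_level("Appearance", "<!-- todo -->", "colourless liquid"))
```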

When a 'better' version of a page comes up, change the revision number on the index page. If there are two revids for the same page, the bot uses the one closest to the bottom of the index page (the page is parsed top to bottom, with later values replacing earlier ones when duplicates occur).
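The "last entry wins" behaviour follows naturally from parsing top to bottom into a mapping. A minimal sketch, assuming a made-up index line format of "Page title|revid" (the real index markup is not described here) and a hypothetical function name parse_index:

```python
def parse_index(lines):
    """Map page title to verified revid; later duplicates replace earlier ones."""
    revids = {}
    for line in lines:
        title, sep, revid = line.strip().rpartition("|")
        revid = revid.strip()
        if not sep or not revid.isdigit():
            continue  # skip blank lines and anything that is not an entry
        revids[title.strip()] = int(revid)  # overwrites any earlier entry
    return revids


index = ["Benzene|100", "Alanine|200", "Benzene|300"]
print(parse_index(index))
```

With this approach there is no need to delete the old entry when a better revision is found, although keeping a single entry per page makes the index easier to read.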

This is very strange, it is trans,trans in the union file and cis,cis in the wikichem file (I have been using the union file to verify CAS numbers). I need to look into this. Ambix (talk) 12:47, 12 February 2009 (UTC)

Glucose 1-phosphate: One chiral center is not specified (it should be up to match CAS). This is probably a result of copying the glucose skeleton, in which this atom is not chiral.

See anomer. It is likely that both forms (alpha and beta-glucopyranoside) are described by this CAS number. --Tweenk (talk) 21:41, 15 November 2009 (UTC)

Valine: The structure diagram has one carbon atom with two wedge bonds attached, making verification difficult (the stereochemistry should be S here, and is)

Threonine: The structure diagram does not specify the stereochemistry at the two chiral centres (should be 2S,3R)

Endrin: The structure diagram appears to show the endo-isomer whereas the CASRN is for the exo-isomer (or vice versa, I never was very good at this particular bit of nomenclature! in any case, it's not the same compound!) We should recheck with Dieldrin (CASRN [60-57-1]) as well. Neither compound has the stereochemistry correctly specified.

I've rechecked Dieldrin, adding the implicit hydrogens to the WP structure and drawing in chemsketch, I also copied the CAS structure exactly and had the program assign stereo labels. They match, which leads me to think my initial verify is OK. It maybe should be noted that while the carbon skeletons look to be the same projection, WP is from above and CAS (turns out to be) from below. If you are still unhappy could you describe your assignment in more detail? I'll try the chemsketch method with Endrin and hopefully we can compare notes Ambix (talk) 23:27, 6 February 2009 (UTC)

I have checked Endrin with the same process and it does not match. There is an older version of this image Endrin.png and this does match. Given the difficulties of transposing a 3D structure to more conventional form it would probably be better to have a more conventional structure as well for compounds like this but I would suggest we avoid removing 3D structures providing it is possible to validate them. I will investigate further.

I suggest that for our validated structure on such compounds, we should explicitly show the stereochemistry of each chiral centre, which is not the case at present on Endrin and Dieldrin (even if a knowledgeable chemist can figure out what it must be from the diagram). That doesn't necessarily mean changing the structures in the chemboxes (our images for inorganics don't always give a clear idea of the structure), but we should insist on the chembox information being correct and not-misleading, and that the full details be available in the article (maybe in a separate image). Physchim62(talk) 23:23, 9 February 2009 (UTC)

501-600

Trimethylaluminium: The WP structure is the dimer, while CAS gives the monomer. Is this significant, and will CAS have a dimer listed?

Camphor: Both the WP page and the CAS entry are for unspecified stereoisomers. However, if we follow the naturally occurring rule, should the WP page be changed to the natural isomer and the unspecified CAS be relegated to an 'other'?

601-700

701-800

801-900

901-1000

Inorganics

The 677 "inorganics" (neutral compounds without C–C or C–H bonds) have now all been checked. 496 entries gave a perfect match, 74 entries had some sort of problem in the article (often minor and already fixed) and 100 entries had no appropriate corresponding article on Wikipedia. A full report will be available in due course.

Elements and ions

These will require special treatment: please contact Physchim62 for more details.
