[{"id":17130,"pmid":22171320,"pmcid":null,"title":"Human urinary glycoproteomics; attachment site specific analysis of N- and O-linked glycosylations by CID and ECD.","year":2012,"pages":null,"doi":null,"keywords":[],"mesh":[],"abstractText":"Urine is a complex mixture of proteins and waste products and a challenging biological fluid for biomarker discovery. Previous proteomic studies have identified more than 2800 urinary proteins but analyses aimed at unraveling glycan structures and glycosylation sites of urinary glycoproteins are lacking. Glycoproteomic characterization remains difficult because of the complexity of glycan structures found mainly on asparagine (N-linked) or serine/threonine (O-linked) residues. We have developed a glycoproteomic approach that combines efficient purification of urinary glycoproteins/glycopeptides with complementary MS-fragmentation techniques for glycopeptide analysis. Starting from clinical sample size, we eliminated interfering urinary compounds by dialysis and concentrated the purified urinary proteins by lyophilization. Sialylated urinary glycoproteins were conjugated to a solid support by hydrazide chemistry and trypsin digested. Desialylated glycopeptides, released through mild acid hydrolysis, were characterized by tandem MS experiments utilizing collision induced dissociation (CID) and electron capture dissociation fragmentation techniques. In CID-MS(2), Hex(5)HexNAc(4)-N-Asn and HexHexNAc-O-Ser/Thr were typically observed, in agreement with known N-linked biantennary complex-type and O-linked core 1-like structures, respectively. Additional glycoforms for specific N- and O-linked glycopeptides were also identified, e.g. tetra-antennary N-glycans and fucosylated core 2-like O-glycans. Subsequent CID-MS(3), of selected fragment-ions from the CID-MS(2) analysis, generated peptide specific b- and y-ions that were used for peptide identification. 
In total, 58 N- and 63 O-linked glycopeptides from 53 glycoproteins were characterized with respect to glycan- and peptide sequences. The combination of CID and electron capture dissociation techniques allowed for the exact identification of Ser/Thr attachment site(s) for 40 of 57 putative O-glycosylation sites. We defined 29 O-glycosylation sites which have, to our knowledge, not been previously reported. This is the first study of human urinary glycoproteins where \"intact\" glycopeptides were studied, i.e. the presence of glycans and their attachment sites were proven without doubt.","journal":null,"figures":[],"_authors":null},{"id":6,"pmid":16344560,"pmcid":null,"title":"Diversification of transcriptional modulation: large-scale identification and characterization of putative alternative promoters of human genes.","year":2006,"pages":null,"doi":null,"keywords":[],"mesh":[],"abstractText":"By analyzing 1,780,295 5'-end sequences of human full-length cDNAs derived from 164 kinds of oligo-cap cDNA libraries, we identified 269,774 independent positions of transcriptional start sites (TSSs) for 14,628 human RefSeq genes. These TSSs were clustered into 30,964 clusters that were separated from each other by more than 500 bp and thus are very likely to constitute mutually distinct alternative promoters. To our surprise, at least 7674 (52%) human RefSeq genes were subject to regulation by putative alternative promoters (PAPs). On average, there were 3.1 PAPs per gene, with the composition of one CpG-island-containing promoter per 2.6 CpG-less promoters. In 17% of the PAP-containing loci, tissue-specific use of the PAPs was observed. The richest tissue sources of the tissue-specific PAPs were testis and brain. 
It was also intriguing that the PAP-containing promoters were enriched in the genes encoding signal transduction-related proteins and were rarer in the genes encoding extracellular proteins, possibly reflecting the varied functional requirement for and the restricted expression of those categories of genes, respectively. The patterns of the first exons were highly diverse as well. On average, there were 7.7 different splicing types of first exons per locus partly produced by the PAPs, suggesting that a wide variety of transcripts can be achieved by this mechanism. Our findings suggest that use of alternate promoters and consequent alternative use of first exons should play a pivotal role in generating the complexity required for the highly elaborated molecular systems in humans.","journal":null,"figures":[],"_authors":null},{"id":5,"pmid":15489334,"pmcid":null,"title":"The status, quality, and expansion of the NIH full-length cDNA project: the Mammalian Gene Collection (MGC).","year":2004,"pages":null,"doi":null,"keywords":[],"mesh":[],"abstractText":"The National Institutes of Health's Mammalian Gene Collection (MGC) project was designed to generate and sequence a publicly accessible cDNA resource containing a complete open reading frame (ORF) for every human and mouse gene. The project initially used a random strategy to select clones from a large number of cDNA libraries from diverse tissues. Candidate clones were chosen based on 5'-EST sequences, and then fully sequenced to high accuracy and analyzed by algorithms developed for this project. Currently, more than 11,000 human and 10,000 mouse genes are represented in MGC by at least one clone with a full ORF. The random selection approach is now reaching a saturation point, and a transition to protocols targeted at the missing transcripts is now required to complete the mouse and human collections. 
Comparison of the sequence of the MGC clones to reference genome sequences reveals that most cDNA clones are of very high sequence quality, although it is likely that some cDNAs may carry missense variants as a consequence of experimental artifact, such as PCR, cloning, or reverse transcriptase errors. Recently, a rat cDNA component was added to the project, and ongoing frog (Xenopus) and zebrafish (Danio) cDNA projects were expanded to take advantage of the high-throughput MGC pipeline.","journal":null,"figures":[],"_authors":null},{"id":4,"pmid":14702039,"pmcid":null,"title":"Complete sequencing and characterization of 21,243 full-length human cDNAs.","year":2004,"pages":null,"doi":null,"keywords":[],"mesh":[],"abstractText":"As a base for human transcriptome and functional genomics, we created the \"full-length long Japan\" (FLJ) collection of sequenced human cDNAs. We determined the entire sequence of 21,243 selected clones and found that 14,490 cDNAs (10,897 clusters) were unique to the FLJ collection. About half of them (5,416) seemed to be protein-coding. Of those, 1,999 clusters had not been predicted by computational methods. The distribution of GC content of nonpredicted cDNAs had a peak at approximately 58% compared with a peak at approximately 42% for predicted cDNAs. Thus, there seems to be a slight bias against GC-rich transcripts in current gene prediction procedures. The rest of the cDNAs unique to the FLJ collection (5,481) contained no obvious open reading frames (ORFs) and thus are candidate noncoding RNAs. About one-fourth of them (1,378) showed a clear pattern of splicing. 
The distribution of GC content of noncoding cDNAs was narrow and had a peak at approximately 42%, relatively low compared with that of protein-coding cDNAs.","journal":null,"figures":[],"_authors":null},{"id":1211,"pmid":12975309,"pmcid":null,"title":"The secreted protein discovery initiative (SPDI), a large-scale effort to identify novel human secreted and transmembrane proteins: a bioinformatics assessment.","year":2003,"pages":null,"doi":null,"keywords":[],"mesh":[],"abstractText":"A large-scale effort, termed the Secreted Protein Discovery Initiative (SPDI), was undertaken to identify novel secreted and transmembrane proteins. In the first of several approaches, a biological signal sequence trap in yeast cells was utilized to identify cDNA clones encoding putative secreted proteins. A second strategy utilized various algorithms that recognize features such as the hydrophobic properties of signal sequences to identify putative proteins encoded by expressed sequence tags (ESTs) from human cDNA libraries. A third approach surveyed ESTs for protein sequence similarity to a set of known receptors and their ligands with the BLAST algorithm. Finally, both signal-sequence prediction algorithms and BLAST were used to identify single exons of potential genes from within human genomic sequence. The isolation of full-length cDNA clones for each of these candidate genes resulted in the identification of >1000 novel proteins. A total of 256 of these cDNAs are still novel, including variants and novel genes, per the most recent GenBank release version. The success of this large-scale effort was assessed by a bioinformatics analysis of the proteins through predictions of protein domains, subcellular localizations, and possible functional roles. 
The SPDI collection should facilitate efforts to better understand intercellular communication, may lead to new understandings of human diseases, and provides potential opportunities for the development of therapeutics.","journal":null,"figures":[],"_authors":null},{"id":2,"pmid":12477932,"pmcid":null,"title":"Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences.","year":2002,"pages":null,"doi":null,"keywords":[],"mesh":[],"abstractText":"The National Institutes of Health Mammalian Gene Collection (MGC) Program is a multiinstitutional effort to identify and sequence a cDNA clone containing a complete ORF for each human and mouse gene. ESTs were generated from libraries enriched for full-length cDNAs and analyzed to identify candidate full-ORF clones, which then were sequenced to high accuracy. The MGC has currently sequenced and verified the full ORF for a nonredundant set of >9,000 human and >6,000 mouse genes. Candidate full-ORF clones for an additional 7,800 human and 3,500 mouse genes also have been identified. All MGC sequences and clones are available without restriction through public databases and clone distribution networks (see http://mgc.nci.nih.gov).","journal":null,"figures":[],"_authors":null},{"id":1263,"pmid":11181995,"pmcid":null,"title":"The sequence of the human genome.","year":2001,"pages":null,"doi":null,"keywords":[],"mesh":[],"abstractText":"A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated by the whole-genome shotgun sequencing method. The 14.8-billion bp DNA sequence was generated over 9 months from 27,271,853 high-quality sequence reads (5.11-fold coverage of the genome) from both ends of plasmid clones made from the DNA of five individuals. Two assembly strategies, a whole-genome assembly and a regional chromosome assembly, were used, each combining sequence data from Celera and the publicly funded genome effort. 
The public data were shredded into 550-bp segments to create a 2.9-fold coverage of those genome regions that had been sequenced, without including biases inherent in the cloning and assembly procedure used by the publicly funded group. This brought the effective coverage in the assemblies to eightfold, reducing the number and size of gaps in the final assembly over what would be obtained with 5.11-fold coverage. The two assembly strategies yielded very similar results that largely agree with independent mapping data. The assemblies effectively cover the euchromatic regions of the human chromosomes. More than 90% of the genome is in scaffold assemblies of 100,000 bp or more, and 25% of the genome is in scaffolds of 10 million bp or larger. Analysis of the genome sequence revealed 26,588 protein-encoding transcripts for which there was strong corroborating evidence and an additional approximately 12,000 computationally derived genes with mouse matches or other weak supporting evidence. Although gene-dense clusters are obvious, almost half the genes are dispersed in low G+C sequence separated by large tracts of apparently noncoding sequence. Only 1.1% of the genome is spanned by exons, whereas 24% is in introns, with 75% of the genome being intergenic DNA. Duplications of segmental blocks, ranging in size up to chromosomal lengths, are abundant throughout the genome and reveal a complex evolutionary history. Comparative genomic analysis indicates vertebrate expansions of genes associated with neuronal function, with tissue-specific developmental regulation, and with the hemostasis and immune systems. DNA sequence comparisons between the consensus sequence and publicly funded genome data provided locations of 2.1 million single-nucleotide polymorphisms (SNPs). A random pair of human haploid genomes differed at a rate of 1 bp per 1250 on average, but there was marked heterogeneity in the level of polymorphism across the genome. 
Less than 1% of all SNPs resulted in variation in proteins, but the task of determining which SNPs have functional consequences remains an open challenge.","journal":null,"figures":[],"_authors":null},{"id":9283,"pmid":10493829,"pmcid":null,"title":"Genome duplications and other features in 12 Mb of DNA sequence from human chromosome 16p and 16q.","year":1999,"pages":null,"doi":null,"keywords":[],"mesh":[],"abstractText":"Several publicly funded large-scale sequencing efforts have been initiated with the goal of completing the first reference human genome sequence by the year 2005. Here we present the results of analysis of 11.8 Mb of genomic sequence from chromosome 16. The apparent gene density varies throughout the region, but the number of genes predicted (84) suggests that this is a gene-poor region. This result may also suggest that the total number of human genes is likely to be at the lower end of published estimates. One of the most interesting aspects of this region of the genome is the presence of highly homologous, recently duplicated tracts of sequence distributed throughout the p-arm. Such duplications have implications for mapping and gene analysis as well as the predisposition to recurrent chromosomal structural rearrangements associated with genetic disease.","journal":null,"figures":[],"_authors":null},null,null]