Human sperm protein associated with the nucleus on the X chromosome (SPANX) consists of a five-member gene family (SPANXA1, SPANXA2, SPANXB, SPANXC and SPANXD) clustered at Xq27.1. Evolved from an ancestral SPANX-N gene family (at Xq27 and Xp11) present in all primates as well as in rats and mice, the SPANXA/D family is present only in humans, bonobos, chimpanzees and gorillas. Among hominoid-specific genes, the SPANXA/D gene family is considered to be undergoing rapid positive selection in its coding region. In this study, RT-PCR of human testis mRNA from individuals showed that, although all SPANXA/D genes are expressed in humans, differences are evident. In particular, SPANXC is expressed only in a subset of men. The SPANXa/d protein localized to the nuclear envelope of round, condensing and elongating spermatids, specifically to regions that do not underlie the developing acrosome. During spermiogenesis, the SPANXa/d-positive domain migrated into the base of the head as the redundant nuclear envelope that protrudes into the residual cytoplasm. Post-testicular modification of the SPANXa/d proteins was noted, as were PEST (proline, glutamic acid, serine, and threonine rich regions) domains. It is concluded that the duplication of the SPANX-N gene family that occurred 6-11 MYA resulted in a new gene family, SPANXA/D, that plays a role during spermiogenesis. The SPANXa/d gene products are among the few examples of X-linked nuclear proteins expressed following meiosis. Their localization to non-acrosomal domains of the nuclear envelope adjacent to regions of euchromatin and their redistribution to the redundant nuclear envelope during spermiogenesis provide a biomarker for the redundant nuclear envelope of spermatids and spermatozoa.

BACKGROUND: An underlying tenet of the epigenetic code hypothesis is the existence of protein domains that can recognize various chromatin structures. To date, two major candidates have emerged: (i) the bromodomain, which can recognize certain acetylation marks and (ii) the chromodomain, which can recognize certain methylation marks. RESULTS: The Epc-N (Enhancer of Polycomb-N-terminus) domain is formally defined herein. This domain is conserved across eukaryotes and is predicted to form a right-handed orthogonal four-helix bundle with extended strands at both termini. The types of amino acid residues that define the Epc-N domain suggest a role in mediating protein-protein interactions, possibly specifically in the context of chromatin binding, and the types of proteins in which it is found (known components of histone acetyltransferase complexes) strongly suggest a role in epigenetic structure formation and/or recognition. There appear to be two major Epc-N protein families that can be divided into four unique protein subfamilies. Two of these subfamilies (I and II) may be related to one another in that subfamily I can be viewed as a plant-specific expansion of subfamily II. The other two subfamilies (III and IV) appear to be related to one another by duplication events in a primordial fungal-metazoan-mycetozoan ancestor. Subfamilies III and IV are further defined by the presence of an evolutionarily conserved five-center-zinc-binding motif in the loop connecting the second and third helices of the four-helix bundle. This motif appears to consist of a PHD followed by a mononuclear Zn knuckle, followed by a PHD-like derivative, and will thus be referred to as the PZPM. All non-Epc-N proteins studied thus far that contain the PZPM have been implicated in histone methylation and/or gene silencing. In addition, an unusual phyletic distribution of Epc-N-containing proteins is observed. CONCLUSION: The data suggest that the Epc-N domain is a protein-protein interaction module found in chromatin associated proteins. It is possible that the Epc-N domain serves as a direct link between histone acetylation and methylation statuses. The unusual phyletic distribution of Epc-N-containing proteins may provide a conduit for future insight into how different organisms form, perceive and respond to epigenetic information.

BACKGROUND: Post-translational modification by Small Ubiquitin-like Modifiers (SUMO) has been implicated in protein targeting, in the maintenance of genomic integrity and in transcriptional control. But the specific molecular effects of SUMO modification on many target proteins remain to be elucidated. Recent findings point at the importance of SUMO-mediated histone NAD-dependent deacetylase (HDAC) recruitment in transcriptional regulation. RESULTS: We describe the RENi family of SUMO-like domain proteins (SDP) with the unique feature of typically containing two carboxy-terminal SUMO-like domains. Using sequence analytic evidence, we collect family members from animals, fungi and plants, most prominent being yeast Rad60, Esc2 and mouse NIP45 (http://mendel.imp.univie.ac.at/SEQUENCES/reni/). Different proteins of the novel family are known to interact directly with histone NAD-dependent deacetylases (HDACs), structural maintenance of chromosomes (SMC) proteins, and transcription factors. In particular, the highly non-trivial designation of the first of the two successive SUMO-domains in non-plant RENi provides a rationale for previously published functionally impaired mutant variants. CONCLUSIONS: Until now, SUMO-like proteins have been studied exclusively in the context of their covalent conjugation to target proteins. Here, we present the exciting possibility that SUMO domain proteins, similarly to ubiquitin modifiers, have also evolved in a second line - namely as multi-domain proteins that are non-covalently attached to their target proteins. We suggest that the SUMO stable fusion proteins of the RENi family, which we introduce in this work, might mimic SUMO and share its interaction motifs (in analogy to the way that ubiquitin-like domains mimic ubiquitin). This presumption is supported by parallels in the spectrum of modified or bound proteins, e.g. transcription factors and chromatin-associated proteins, and in the recruitment of HDAC-activity.

A nuclear targeting determinant for SATB1, a genome organizer in the T cell lineage.

Cell Cycle. 2005; 4: 1099-106

SATB1 is a nuclear protein, which acts as a cell-type specific genome organizer and gene regulator essential for T cell differentiation and activation. Several functional domains of SATB1 have been identified. However, the region required for nuclear localization remains unknown. To delineate this region, we employed sequence analysis to identify phylogenetically diverse members of the SATB1 protein family, and used hidden Markov model (HMM)-based analysis to define conserved regions and motifs in this family. One of the regions conserved in SATB1- and SATB2-like proteins in mammals, fish, frog and bird is located near the N-terminus of family members. We found that the N-terminus of human SATB1 was essential for the nuclear localization of the protein. Furthermore, fusing residues 20-40 to a cytoplasmic green fluorescent protein (GFP) fused to pyruvate kinase (PK) was sufficient to quantitatively translocate the pyruvate kinase into the nucleus. The nuclear targeting sequence of human SATB1 (residues 20-40) is novel and does not contain clusters of basic residues, typically found in 'classical' nuclear localization signals (NLSs). We investigated the importance of four well-conserved residues (Lys29, Arg32, Glu34, and Asn36) in this nuclear targeting sequence. Remarkably, full-length SATB1 harboring a single point mutation at either Lys29 or Arg32, but not Glu34 or Asn36, did not enter the nucleus. Our results indicate that SATB1 N-terminal residues 20-40 represent a novel determinant of nuclear targeting.
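The contrast drawn above between the SATB1 targeting sequence and 'classical' NLSs rests on the absence of clusters of basic residues. A minimal sketch of how such clusters might be flagged is given below. The window size, the threshold and the function name `basic_clusters` are assumptions chosen for illustration and are not taken from the study. The positive control is the well-known SV40 large T antigen NLS (PKKKRKV); the second peptide is an invented placeholder and is not the SATB1 sequence.

```python
def basic_clusters(peptide, window=5, min_basic=4):
    """Return (start, segment) for every window rich in Lys/Arg.

    Hypothetical heuristic: a window of `window` residues counts as a
    basic cluster if it holds at least `min_basic` K or R residues.
    """
    peptide = peptide.upper()
    hits = []
    for start in range(len(peptide) - window + 1):
        segment = peptide[start:start + window]
        if sum(residue in "KR" for residue in segment) >= min_basic:
            hits.append((start, segment))
    return hits


# Classical monopartite NLS from SV40 large T antigen (positive control).
sv40_nls = "PKKKRKV"
# Invented placeholder peptide with scattered basic residues only.
placeholder = "MASGLEKDRTENASQL"

print(basic_clusters(sv40_nls))
print(basic_clusters(placeholder))
```

A scan of this kind only tests for the classical pattern. It says nothing about whether a sequence without basic clusters can target a protein to the nucleus, which is the experimental question the abstract addresses.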

Discovery of the principal specific transcription factors of Apicomplexa and their implication for the evolution of the AP2-integrase DNA binding domains.

Nucleic Acids Res. 2005; 33: 3994-4006

The comparative genomics of apicomplexans, such as the malarial parasite Plasmodium, the cattle parasite Theileria and the emerging human parasite Cryptosporidium, has suggested an unexpected paucity of specific transcription factors (TFs) with DNA binding domains that are closely related to those found in the major families of TFs from other eukaryotes. This apparent lack of specific TFs is paradoxical, given that the apicomplexans show a complex developmental cycle in one or more hosts and a reproducible pattern of differential gene expression in the course of this cycle. Using sensitive sequence profile searches, we show that the apicomplexans possess a lineage-specific expansion of a novel family of proteins with a version of the AP2 (Apetala2)-integrase DNA binding domain, which is present in numerous plant TFs. About 20-27 members of this apicomplexan AP2 (ApiAP2) family are encoded in different apicomplexan genomes, with each protein containing one to four copies of the AP2 DNA binding domain. Using gene expression data from Plasmodium falciparum, we show that guilds of ApiAP2 genes are expressed in different stages of intraerythrocytic development. By analogy to the plant AP2 proteins and based on the expression patterns, we predict that the ApiAP2 proteins are likely to function as previously unknown specific TFs in the apicomplexans and regulate the progression of their developmental cycle. In addition to the ApiAP2 family, we also identified two other novel families of AP2 DNA binding domains in bacteria and transposons. Using structure similarity searches, we also identified divergent versions of the AP2-integrase DNA binding domain fold in the DNA binding region of the PI-SceI homing endonuclease and the C-terminal domain of the pleckstrin homology (PH) domain-like modules of eukaryotes. Integrating these findings, we present a reconstruction of the evolutionary scenario of the AP2-integrase DNA binding domain fold, which suggests that it underwent multiple independent combinations with different types of mobile endonucleases or recombinases. It appears that the eukaryotic versions have emerged from versions of the domain associated with mobile elements, followed by independent lineage-specific expansions, which accompanied their recruitment to transcription regulation functions.

A complex between peptide:N-glycanase and two proteasome-linked proteins suggests a mechanism for the degradation of misfolded glycoproteins.

Proc Natl Acad Sci U S A. 2004; 101: 13774-9

Peptide:N-glycanase (PNGase) has been proposed to participate in the proteasome-dependent glycoprotein degradation pathway. The finding that yeast PNGase interacts with the 19S proteasome subunit through the protein Rad23 supports this hypothesis. In this report, we have used immunofluorescence, subcellular fractionation, coimmunoprecipitation, and in vitro GST pull-down techniques for detecting intracellular localization and interactions of PNGase, HR23B, and S4 by using human (h) and mouse (m) homologs. Immunofluorescence studies revealed that hPNGase, hHR23B, and hS4 are present in close proximity to the endoplasmic reticulum (ER) when calnexin was used as an ER marker in HeLa cells. Subcellular fractionation suggests not only cytoplasmic but also ER association of hPNGase in HeLa cells. Immunoprecipitation analysis revealed the interaction of h/mPNGase with the 19S proteasome subunit, hS4, through hHR23B. Using an in vitro GST pull-down assay, we also have shown that recombinant mPNGase requires its N terminus and middle domain for interaction with mHR23B. Finally, using misfolded yeast carboxypeptidase Y and chicken ovalbumin as glycoprotein substrates, we have established that mHR23B acts as a receptor for deglycosylated proteins. Based on this finding, we propose that after deglycosylation of misfolded glycoproteins by PNGase, the aglyco forms of these proteins are recognized by HR23B and targeted for degradation.

Yeast RAD4, its human ortholog Xp-C and their orthologs in other eukaryotes are DNA repair proteins which participate in nucleotide excision repair through a ubiquitin-dependent process. However, no conserved globular domains that might have shed light on their origin or functions have been reported for these proteins. By using sequence profile analysis, we show that RAD4/Xp-C proteins contain the ancient transglutaminase fold and are specifically related to the recently characterized peptide-N-glycanases (PNGases) which remove glycans from glycoproteins during their degradation. The PNGases retain the catalytic triad that is typical of this fold and are predicted to have a reaction mechanism similar to that involved in transglutamination. In contrast, the RAD4/Xp-C proteins are predicted to be inactive and are likely to only possess the protein interaction function in DNA repair. These proteins also contain a long, low-complexity insert in the globular transglutaminase domain. The RAD4/Xp-C proteins, along with other inactive transglutaminase-fold proteins, represent a case of functional re-assignment of an ancient domain following the loss of the ancestral enzymatic activity.

WHSC1, a 90 kb SET domain-containing gene, expressed in early development and homologous to a Drosophila dysmorphy gene maps in the Wolf-Hirschhorn syndrome critical region and is fused to IgH in t(4;14) multiple myeloma.

Hum Mol Genet. 1998; 7: 1071-82

Wolf-Hirschhorn syndrome (WHS) is a malformation syndrome associated with a hemizygous deletion of the distal short arm of chromosome 4 (4p16.3). The smallest region of overlap between WHS patients, the WHS critical region, has been confined to 165 kb, of which the complete sequence is known. We have identified and studied a 90 kb gene, designated as WHSC1, mapping to the 165 kb WHS critical region. This 25 exon gene is expressed ubiquitously in early development and undergoes complex alternative splicing and differential polyadenylation. It encodes a 136 kDa protein containing four domains present in other developmental proteins: a PWWP domain, an HMG box, a SET domain also found in the Drosophila dysmorphy gene ash-encoded protein, and a PHD-type zinc finger. It is expressed preferentially in rapidly growing embryonic tissues, in a pattern corresponding to affected organs in WHS patients. The nature of the protein motifs, the expression pattern and its mapping to the critical region led us to propose WHSC1 as a good candidate gene to be responsible for many of the phenotypic features of WHS. Finally, as a serendipitous finding, of the t(4;14) (p16.3;q32.3) translocations recently described in multiple myelomas, at least three breakpoints merge the IgH and WHSC1 genes, potentially causing fusion proteins replacing WHSC1 exons 1-4 by the IgH 5'-VDJ moiety.

By the middle of 1993, > 30,000 protein sequences had been listed. For 1000 of these, the three-dimensional (tertiary) structure has been experimentally solved. Another 7000 can be modelled by homology. For the remaining 21,000 sequences, secondary structure prediction provides a rough estimate of structural features. Predictions in three states range between 35% (random) and 88% (homology modelling) overall accuracy. Using information about evolutionary conservation as contained in multiple sequence alignments, the secondary structure of 4700 protein sequences was predicted by the automatic e-mail server PHD. For proteins with at least one known homologue, the method has an expected overall three-state accuracy of 71.4% (evaluated on 126 unique protein chains).
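The accuracy figures above are per-residue scores over three secondary structure states. A minimal sketch of how such a score is commonly computed is shown below, assuming the usual convention of helix (H), strand (E) and other (C) and a plain percentage of matching residues. The function name `three_state_accuracy` and the example strings are illustrative assumptions and are not taken from the PHD server.

```python
def three_state_accuracy(predicted, observed):
    """Percentage of residues whose predicted state matches the observed one.

    Both arguments are equal-length strings over the assumed alphabet
    H (helix), E (strand) and C (other).
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Invented 17-residue example: two residues are mispredicted.
observed = "CCHHHHHHCCEEEEECC"
predicted = "CCHHHHHCCCEEEECCC"
print(round(three_state_accuracy(predicted, observed), 1))
```

On a whole data set the score can be pooled over all residues or averaged per chain. The abstract does not say which convention underlies the 71.4% figure, so the sketch covers only the single-chain case.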