Dear all,
I am assembling a 40 Mb genome with an expected coverage of 181x. I am using 76 bp Illumina reads with an insert size of 200 bp (SD 20 bp). I have tried Velvet for these assemblies (k-mer settings 21,55,2); 86-99% of the reads were used, with an N50 of 80 kb. The strange thing is that every assembly comes out at only 19 Mb. The whole genome was covered during library preparation. What could be the reason for this? Is it due to repeat elements, given that some of my NODEs have more than 5000x coverage? I would appreciate your suggestions.

Yes, collapsed repeats can lead to a smaller than expected assembly size. See Myers et al. (2000) for a good discussion of how to detect collapsed repeat contigs. If this is the case, then you have a very repetitive genome on your hands.
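
To make the collapsed-repeat idea concrete, here is a rough Python sketch (not the method from Myers et al.) that flags high-coverage nodes and guesses what the assembly size would be if those repeats were expanded. It assumes Velvet-style contig headers of the form NODE_<id>_length_<n>_cov_<x>, and it assumes the median node coverage is a fair stand-in for single-copy coverage. The function names and the 1.5x cutoff are my own choices for illustration. As far as I know, Velvet reports length and coverage in k-mer units, so treat the result as a ballpark figure only.

```python
import re
from statistics import median

# Assumed header format, e.g. ">NODE_12_length_3456_cov_87.5"
HEADER = re.compile(r"NODE_(\d+)_length_(\d+)_cov_([\d.]+)")

def parse_nodes(headers):
    """Return (node_id, length, coverage) tuples from Velvet-style headers."""
    nodes = []
    for line in headers:
        m = HEADER.search(line)
        if m:
            nodes.append((int(m.group(1)), int(m.group(2)), float(m.group(3))))
    return nodes

def estimate_uncollapsed_size(nodes, ratio_cutoff=1.5):
    """Estimate copy number per node as coverage / median coverage.

    Returns (assembled_size, estimated_size, flagged_node_ids).
    Assumption: the median node coverage approximates single-copy coverage.
    """
    single_copy_cov = median(cov for _, _, cov in nodes)
    assembled = 0
    estimated = 0
    flagged = []
    for node_id, length, cov in nodes:
        ratio = cov / single_copy_cov
        copies = max(1, round(ratio))  # never count a node less than once
        assembled += length
        estimated += length * copies
        if ratio >= ratio_cutoff:
            flagged.append(node_id)
    return assembled, estimated, flagged

if __name__ == "__main__":
    # Made-up headers purely for illustration, not real data.
    example = [
        ">NODE_1_length_1000_cov_100.0",
        ">NODE_2_length_2000_cov_110.0",
        ">NODE_3_length_500_cov_5000.0",
        ">NODE_4_length_1500_cov_90.0",
        ">NODE_5_length_800_cov_105.0",
    ]
    print(estimate_uncollapsed_size(parse_nodes(example)))
```

If the estimated size lands much closer to 40 Mb than the assembled 19 Mb does, collapsed repeats become a plausible explanation for the gap.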

Also, have you confirmed that your observed sequencing throughput matches your expected throughput? You can check this by mapping the reads against a single-copy locus that was previously isolated from your species of interest. If the library or the sequencing run was poor, you may have lower coverage than you think, which could lead to a partial assembly, although at the coverage you are describing this seems unlikely.
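
The arithmetic behind that throughput check is simple enough to sketch in Python. The figures below are the ones from the original post (181x, 40 Mb, 76 bp reads, 19 Mb assembled); the function names are mine, and the k-mer coverage formula C_k = C * (L - k + 1) / L is the usual conversion for reads of length L, included here on the assumption that the assembler reports coverage in k-mer units.

```python
def total_bases(coverage, genome_size):
    """Total sequenced bases implied by a base coverage and a genome size."""
    return coverage * genome_size

def base_coverage(n_bases, genome_size):
    """Average base coverage for a given number of sequenced bases."""
    return n_bases / genome_size

def kmer_coverage(base_cov, read_len, k):
    """Convert base coverage to k-mer coverage: C_k = C * (L - k + 1) / L."""
    return base_cov * (read_len - k + 1) / read_len

if __name__ == "__main__":
    expected_cov = 181      # from the original post
    expected_size = 40e6    # 40 Mb
    assembled_size = 19e6   # 19 Mb
    read_len = 76

    bases = total_bases(expected_cov, expected_size)
    print("reads needed:", round(bases / read_len))
    # Coverage the same data would give if the genome really were 19 Mb
    print("coverage on 19 Mb:", round(base_coverage(bases, assembled_size), 1))
    # Expected k-mer coverage at the largest k tried (55)
    print("k-mer coverage at k=55:", round(kmer_coverage(expected_cov, read_len, 55), 1))
```

Comparing these expected values with the coverage you actually observe on a known single-copy locus tells you whether the data are short of what you planned, or whether the unique nodes are being covered roughly as expected.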

It is possible indeed. It is strange that so many reads assemble successfully and yet the genome comes out so small. Have you tried to BLAST your unassembled reads against a database containing only repetitive elements (e.g. Repbase)?

It also depends on how you obtained your reads. Maybe the whole genome isn't in your sample. If the genome contained a lot of repeated elements, they would be present in multiple copies, and a large set of reads that are nearly identical to one another would not assemble, so you would not expect almost 100% of the reads to be used.