Guessing the number of real protein-coding genes is an ‘ancient’ bioinformatics game, and now a new argument from a newish research field has been applied to the problem. Proteogenomics can refer to different types of study, but the basic idea is that mass spectrometry peptide/protein evidence is used to improve genome annotations. Now a joint Spanish–British–US team makes an interesting argument based on the LACK of mass spec evidence (negative proteogenomics, mind you) in a pre-print deposited in arXiv entitled The shrinking human protein coding complement: are there fewer than 20,000 genes? The argument and claim in a nutshell: by pulling together and reanalysing 7 large-scale proteomics studies and mapping the peptides to the GENCODE v12 annotation, they identified peptides for about 60% of protein-coding genes. Then, applying multiple non-coding features (the real meat of the study), they narrowed the remainder down to a set of 2,001 genes, of which they think roughly 1,500 do not actually code for proteins at all.
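To make the bookkeeping behind that 60% figure concrete, here is a minimal sketch of the peptide-to-annotation step. Everything in it (gene IDs, peptide sequences, the mapping) is invented for illustration; the actual study mapped reanalysed peptide identifications onto GENCODE v12 with far more care (uniqueness filters, FDR thresholds, etc.).

```python
# Toy sketch of mapping observed peptides back to annotated genes.
# All identifiers and sequences below are made up for illustration.

# Hypothetical annotation: gene ID -> set of tryptic peptides predicted from it
annotation = {
    "GENE_A": {"LVNEVTEFAK", "AEFAEVSK"},
    "GENE_B": {"TCVADESAENCDK"},
    "GENE_C": {"YLYEIAR"},       # never observed in any dataset
    "GENE_D": {"QTALVELVK"},     # never observed in any dataset
}

# Peptides confidently identified across the pooled proteomics studies
observed_peptides = {"LVNEVTEFAK", "TCVADESAENCDK", "AEFAEVSK"}

# A gene counts as "confirmed" if at least one of its peptides was observed;
# the genes left over are the candidates for the negative-evidence argument.
confirmed = {g for g, peps in annotation.items() if peps & observed_peptides}
unconfirmed = set(annotation) - confirmed

coverage = len(confirmed) / len(annotation)
print("confirmed:", sorted(confirmed))      # GENE_A, GENE_B
print("unconfirmed:", sorted(unconfirmed))  # GENE_C, GENE_D
print(f"coverage: {coverage:.0%}")          # 50% in this toy example
```

The study's extra twist is that the unconfirmed set is then filtered further by non-coding features (conservation, gene age, protein-like features) rather than declared non-coding outright.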
Whether the bioinformatics argument is flawless or not, the following findings are exciting:
“We find that there are surprisingly strong correlations between peptide detection and cross-species conservation, gene age and the presence of protein-like features. The age of the gene and its conservation across vertebrate species are key indicators of whether a peptide will be detected in proteomics experiments. We find peptides for most highly conserved genes and for practically all genes that evolved before bilateria. At the same time there is little or no evidence for protein expression for genes that have appeared since primates or that do not have any protein-like features or conservation.”
By the way, of the 7 proteomics datasets pulled together, I processed one at PRIDE/ProteomeXchange; you can download the data under the ProteomeXchange identifier PXD000134.
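If you want to script your way to the dataset, a PXD identifier resolves to a ProteomeCentral landing page with a predictable URL. The pattern below is the GetDataset URL I know from ProteomeCentral; treat it as an assumption and use the landing page itself for the authoritative download links.

```python
# Sketch: build the ProteomeCentral landing-page URL for a ProteomeXchange
# dataset identifier. The URL pattern is an assumption based on how
# ProteomeCentral dataset pages are typically addressed.

def px_dataset_url(px_id: str) -> str:
    """Return the ProteomeCentral URL for a PXD identifier."""
    if not (px_id.startswith("PXD") and px_id[3:].isdigit()):
        raise ValueError(f"not a PXD identifier: {px_id!r}")
    return f"http://proteomecentral.proteomexchange.org/cgi/GetDataset?ID={px_id}"

print(px_dataset_url("PXD000134"))
```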