Human proteome project: 21000 genes/1 protein, 10 years, $1 billion?

In order to have the slightest chance of designing a robust, systemic life extension technology, we need to accumulate systematic macromolecular-, cellular-, tissue- and organ-level data on the normal, physiological human body, connect the trillions of nodes with scalable software algorithms, and later extract a draft of the proper sequence of consecutive treatment/regeneration steps. Fortunately, life extension technology is not the only field that needs systems biology projects (that alone is not enough to win grants); more importantly, the effective design of new drug targets and the discovery of disease biomarkers are clearly crying out for the systemic level. The urgent diagnostic and therapeutic demands are sufficient to launch international, many-lab projects.

Finally a complete ‘Human Proteome Project’ is in the pipeline (illustration via BioMed Search). It aims at complete tissue-level knowledge of the human proteome, revealing “which proteins are present in each tissue, where in the cell each of those proteins is located and which other proteins each is interacting with”. Keep in mind also that the roughly 21,000 human genes encode around 1 million different proteins, and that the effort cannot localize exactly which cell types in a given tissue are producing which protein. According to Nature’s Helen Pearson in Biologists initiate plan to map human proteome:

“Those involved in the draft plan say that a human proteome project is now feasible partly because estimates of the number of protein-coding genes have shrunk. It was once thought that there might be around 50,000 or 100,000, but now, just 21,000 or so are thought to exist, making the scale of human proteomics more manageable. And the group plans to focus on only a single protein produced from each gene, rather than its many forms.

The plan is to tackle this with three different experimental approaches. One would use mass spectrometry to identify proteins and their quantities in tissue samples; another would generate antibodies to each protein and use these to show its location in tissues and cells; and the third would systematically identify, for each protein, which others it interacts with in protein complexes. The project would also involve a massive bioinformatics effort to ensure that the data could be pooled and accessed, and the production of shared reagents.”

The idea is to analyze and list all the proteins encoded by chromosome 21 within 3 years as a pilot study and then finish the whole project within 10 years. Chromosome 21 is the smallest child in the family and likely contains between 200 and 400 genes, so the pilot study can yield us a couple hundred proteins. Another powerful idea (actually I prefer this one) is to start with the human mitochondrial proteome, which comprises around 1,000-1,500 proteins as far as I know, that is, at least 3 times as many as are encoded by chromosome 21.

“Steven Carr, director of proteomics at the Broad Institute in Cambridge, Massachusetts, says there is likely to be broad support for a large-scale proteomics effort, but much debate about how best to do it. Rather than analyse the proteome of one chromosome, he says it may be better to tackle the proteome of mitochondria or the cell membrane because it would reveal more about biology and diseases related to those structures. “It’s time to think about something in a systematic fashion — whether this is the project is a different question,” he says.”

8 thoughts on “Human proteome project: 21000 genes/1 protein, 10 years, $1 billion?”

  1. Yes, I agree that chromosome 21 seems arbitrary for a project like this. What is the advantage of picking a single chromosome? If we were back 10 years ago doing sequence walks or systematic deletion/mutagenesis then I could (and did) see it.

  2. I recall that there are about 2300 different proteins expressed in a given cell type at any given time. I know this from counting spots on giant 2D gels in the 80’s.

    It seems odd, then, that there would only be 21000 genes, because this would mean that about 10% of all the proteins the genome encodes are being expressed in a single cell type. Seems orders of magnitude off, perhaps explained by post-translational modifications. Maybe all of those spots on 2D gels, which were denaturing, were overestimating protein expression, but that would still account for less than a factor of 10. Just curious how the 21000 number was derived?

  3. OK, I am an idiot when it comes to writing code in comments.

    The missing phrase is “I know this from counting spots on giant 2D gels in the 80’s”.

    Somehow, the web site was picked up. Maybe I’ll learn how to be an IT geek at SciFoo?

  4. I think “Solar System of the Human Liver Proteome” would make a great title for a Bad Plus song (if they ever get over their fixation on references to marginal medal winners from the Olympics).
    If this jazz trio is unknown to anyone (Ed: No way!) they have a nice blog at

  5. The #1 issue with systems biology is simply that of language. You have a bunch of biologists who speak in binary terms, who study the things they study “because we can”, and who don’t know any computing languages beyond C++, HTML and Perl. Then you have the mathematicians speaking of multi-dimensional space and algorithms, all the while not knowing how to present their data in a graphical form that biologists will understand.

    Just the other day I met with a bioinformaticist to discuss possible solutions for a data set with 6 dimensions, and his solution was essentially to inform me that my data set was not complete enough and that I needed to go away and get more data. The problem was, the data I have took at least 2 years to gather, and obtaining the amount of data he was requesting would take another couple of decades! The mathematicians think it is real simple to just tie down the independent variables in a biological system, pull some levers and the data just drops out; they have no appreciation for the sheer amount of hard labor involved in generating even the most basic/simple of data sets. If animals are involved in any way, multiply the time by an order of magnitude.

    A classic example is dose responses to an inhibitor of a protein kinase. Everyone throws in 10 µM of PD98059 or U0126, sees an effect, shouts out “it involves the ERK pathway”, and we all go home happy. There’s no dose response, no control coefficient for ERK, no ability for the mathematicians to gauge the size of the effect, because we have a graph with only 2 points (zero and one dose) and we don’t know the shape of the curve in between.
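    The two-point problem above can be sketched in a few lines of Python (a hypothetical illustration, not part of the original comment): with only an untreated control and a single dose, standard Hill-type inhibition curves with very different steepness pass through both measured points, so the data cannot distinguish them.

    ```python
    # Hypothetical sketch: fractional kinase activity remaining at a given
    # inhibitor dose, using a simple Hill-type inhibition model.
    def hill_inhibition(dose_uM, ic50_uM, hill_coeff):
        """Activity remaining (1.0 = untreated) at dose_uM of inhibitor."""
        return 1.0 / (1.0 + (dose_uM / ic50_uM) ** hill_coeff)

    # Two very different dose-response curves (shallow vs steep)...
    shallow = lambda d: hill_inhibition(d, ic50_uM=10.0, hill_coeff=0.5)
    steep = lambda d: hill_inhibition(d, ic50_uM=10.0, hill_coeff=4.0)

    # ...agree exactly at the only two measured points (0 and 10 uM):
    print(round(shallow(0.0), 2), round(steep(0.0), 2))    # → 1.0 1.0
    print(round(shallow(10.0), 2), round(steep(10.0), 2))  # → 0.5 0.5

    # ...but diverge badly at any intermediate dose, e.g. 1 uM:
    print(round(shallow(1.0), 2), round(steep(1.0), 2))    # → 0.76 1.0
    ```

    The two curves are experimentally indistinguishable from a two-point assay, which is exactly why a full dose-response series is needed before any control coefficient can be estimated.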

    Another example – we study protein phosphorylation and we think it’s important. Why? Because we can! We have the methods and tools to measure it, but all along we have absolutely no evidence that phosphorylation is any more important than the hundreds of other protein modifications that we know exist. There could be upwards of 1000 PTMs on every single protein, and we have the technology to accurately study about 5 of them.

    A third example is Peipei Ping’s really nice paper on the mitochondrial proteome, which makes a single but massively important point – if you’re going to isolate mitochondria and study their proteome, you’d better have a pretty good idea about the quality of the mitochondrial preparation before you start – i.e., do they respire and behave like mitochondria should? Only then are you allowed to go to the effort of doing the 2D gels to get the proteomics data. How many people prepare “cytosolic extracts” and actually check that they have a cytosolic extract before proceeding with an experiment? Overall, quality control of samples is very poor in proteomics.

    Bottom line, almost 100% of all biological data is simply not yet objective enough, detailed enough, or of high enough quality and integrity, or in a format that can be fed into the mathematics engines to get meaningful answers out the other end. Unless 90% of the budget of this proteomics initiative is devoted to improving the data, the enterprise will fail… you put crap in one end, you get crap out the other end.
