Pimm - Partial immortalization

A Biotech Geek Blogger’s adventures through science, technology and the web…

Archive for the 'data' Category


Human proteome project: 21000 genes/1 protein, 10 years, $1 billion?

Posted by attilachordash on April 23, 2008

In order to have the slightest chance of designing a robust, systemic life extension technology, we need to accumulate all the systemic macromolecular, cellular, tissue- and organ-level data of the normal, physiological human body, connect the trillions of nodes with scalable software algorithms, and later extract a draft of the proper sequence of consecutive treatment/regeneration steps. Fortunately, it is not only life extension technology that needs systems biology projects (that alone is not enough for getting grants); more importantly, the effective design of new drug targets and the discovery of disease biomarkers are clearly crying out for a systemic-level approach. The urgent diagnostic and therapeutic demands are sufficient to launch international, many-lab projects.

Finally a complete ‘Human Proteome Project’ is in the pipeline (illustration via BioMed Search). It aims at complete tissue-level knowledge of the human proteome, revealing “which proteins are present in each tissue, where in the cell each of those proteins is located and which other proteins each is interacting with”. Keep in mind also that around 21,000 human genes encode 1 million different proteins, and that the effort cannot pinpoint exactly which cell types in a given tissue are producing which protein. According to Nature’s Helen Pearson, in Biologists initiate plan to map human proteome:

“Those involved in the draft plan say that a human proteome project is now feasible partly because estimates of the number of protein-coding genes have shrunk. It was once thought that there might be around 50,000 or 100,000, but now, just 21,000 or so are thought to exist, making the scale of human proteomics more manageable. And the group plans to focus on only a single protein produced from each gene, rather than its many forms.

The plan is to tackle this with three different experimental approaches. One would use mass spectrometry to identify proteins and their quantities in tissue samples; another would generate antibodies to each protein and use these to show its location in tissues and cells; and the third would systematically identify, for each protein, which others it interacts with in protein complexes. The project would also involve a massive bioinformatics effort to ensure that the data could be pooled and accessed, and the production of shared reagents.”

The idea is to analyze and list all the proteins manufactured by chromosome 21 within 3 years as a pilot study and then finish the whole project within 10 years. Chromosome 21 is the smallest child in the family and likely contains between 200 and 400 genes, so the pilot study can yield a couple of hundred proteins. Another powerful idea (actually I prefer this one) is to start with the human mitochondrial proteome, which is around 1000-1500 proteins as far as I know, that is, several times as many as are encoded by chromosome 21.

Posted in Nature, Nature Publishing Group, bioinformatics, biology, data, partial immortalization, proteome, science, systems biology | 6 Comments »

How much data is produced by a life scientist/day?

Posted by attilachordash on March 3, 2008

The current operational idea behind Google’s Palimpsest Project is to ship a 3 TB (terabyte = 1.0995 × 10^12 bytes) drive array (Linux RAID-5) to scientists, who upload their data and FedEx the hard drives back to Google. Google then makes those data publicly available and manageable. This file transfer method was heavily criticized by Dai Davies in Ars Technica (“This is a bit like using Flintstones technology in the Internet era.”), although there are arguments behind this choice; see Jon Trowbridge’s 11th slide. Forget about this uploading/updating problem for the duration of this post. Here I only care about the end user, the scientist who is provided with whatever tool to upload 3 TB of research and measurement data on behalf of her research facility. While for an astronomer hundreds of gigabytes/day can seem a normal output, my angle is on how a life scientist and her data fit into this 3 TB equation and eventually into the Palimpsest Project. Accordingly, my question is this:

How much data is produced by an average wet lab scientist, biomedical researcher/day?

I will try to come up with a rough guess, in the hope of subtle corrections from the commenters. I assume the following (rather busy) daily production of data by our average scientist in an average lab:

running a gel - making a gel photo 300 KB .tiff

preparing 5 samples for sequencing at the core facility, output: 500 KB - 1MB ab1, seq files

FACS sorting of different cell populations: 1 MB of special FACS files and 100 KB pdf out of it
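To turn the items above into a number, here is a minimal Python sketch that sums the three listed outputs and estimates how long such a scientist would need to fill a 3 TB array. It uses binary units (1 TB = 2^40 bytes, matching the terabyte definition above) and assumes that the 500 KB - 1 MB sequencing figure is the total for all 5 samples; the names DAILY_ITEMS, daily_range and days_to_fill are illustrative only. It covers just the items shown here, so treat the result as a lower bound on the daily output.

```python
# Binary units, matching the post's definition of a terabyte (2**40 bytes).
KB = 1024
MB = 1024 ** 2
TB = 1024 ** 4

# (low, high) daily output in bytes for the three items listed above.
# Assumption: the sequencing figure is the total for all 5 samples.
DAILY_ITEMS = {
    "gel photo (.tiff)": (300 * KB, 300 * KB),
    "sequencing (ab1, seq files)": (500 * KB, 1 * MB),
    "FACS files + pdf": (1 * MB + 100 * KB, 1 * MB + 100 * KB),
}


def daily_range(items):
    """Return the (low, high) total daily output in bytes."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high


def days_to_fill(capacity_bytes, bytes_per_day):
    """Number of days needed to fill a drive at a constant daily rate."""
    return capacity_bytes / bytes_per_day


if __name__ == "__main__":
    low, high = daily_range(DAILY_ITEMS)
    print(f"daily output: {low / MB:.2f} to {high / MB:.2f} MB")
    fastest = days_to_fill(3 * TB, high)
    slowest = days_to_fill(3 * TB, low)
    print(f"days to fill 3 TB: {fastest:,.0f} to {slowest:,.0f}")
```

With only these three items the daily total stays around 2 MB, so a single wet lab scientist would need well over a million working days to fill the array; the interesting question is how much the remaining, larger data types change that picture.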


Posted in bioinformatics, biology, biotechnology, data, science, technology | 2 Comments »

Venter on the Web 2.0 summit, Mayer on Google Health and petabytes

Posted by attilachordash on October 17, 2007

The ongoing mainstream Web 2.0 summit has a little coverage of health and biomedicine too:

an upcoming conversation with genomics maverick, uncovered Craig Venter and

a past presentation by Marissa Mayer, Google’s Vice President for Search Products & User Experience, on health information. Sarah Milstein says: “They’re also interested in helping you store and access your own health records. While giving people more control over their own data is an important idea, not to mention a trend we hope to see more of, Google may have to build (or rebuild?) user trust before people make it the repository of their most sensitive information.”

From the Wired post: “Mayer mentioned that 2 billion x-rays are taken every year, each of which would take 10 megabytes of data. That’s 200 petabytes of info. ‘The word petabytes gets us really excited,’ says Mayer, ‘because that’s what we’re good at: handling large amounts of information, organizing and storing it.’”
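A quick sanity check of the numbers in the quote, as a minimal Python sketch. It assumes decimal units (1 MB = 10^6 bytes, 1 PB = 10^15 bytes), which the quote does not specify; the function name yearly_petabytes is illustrative only.

```python
# Decimal storage units (an assumption; the quote does not state a convention).
MB = 10 ** 6
PB = 10 ** 15

# Figures taken from the quoted Wired post.
XRAYS_PER_YEAR = 2_000_000_000
MB_PER_XRAY = 10


def yearly_petabytes(count, mb_each):
    """Total yearly volume in petabytes for `count` images of `mb_each` MB."""
    return count * mb_each * MB / PB


if __name__ == "__main__":
    total = yearly_petabytes(XRAYS_PER_YEAR, MB_PER_XRAY)
    print(f"{total:.0f} PB per year")
```

With the quoted inputs the product comes out to 20 petabytes per year rather than 200, so one of the three figures in the quote appears to be off by a factor of ten. Either way, the volume is firmly in petabyte territory.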

This reminds me of another Google project, nicknamed Palimpsest.

more on Medicine 2.0

Posted in IT&BT, USA, data, google, media, medicine, o'reilly, presentation | 3 Comments »

Google’s Palimpsest project: promiscuous distribution of all science data sets

Posted by attilachordash on September 25, 2007

Google’s Palimpsest project, once realized (in the near future), has the potential to change the way science is done by accepting gigantic (raw?) data sets from all disciplines and making them open and free (including dark data?). Jon Trowbridge from Google Inc. (you know, The Facebook of information) gave a presentation at SciFoo 2007 at the Googleplex that was not well documented, but you can download his slides on the project as presented at XTech 2007 in Paris this May: Making Massive Datasets Universally Accessible and Useful Presentation. You are not restricted to the zip file, as Jon kindly gave permission to publish his slides with SlideShare here. From his intro: This talk will discuss a project underway at Google to collect and distribute large scientific datasets using a 21st century “Sneakernet”: multi-terabyte disk arrays shipped via FedEx and other common carriers.

The project is strictly non-profit, but fits well with Google’s mission.

Other links:

Scifoo: Google and large scientific datasets

Google helps terabyte data swaps

Posted in IT, Sci Foo, SciFoo, USA, data, google, googleplex, science, science slideshows, technology | 22 Comments »

Freeing dark, negative research data is the next in open access science?

Posted by attilachordash on September 23, 2007

Positive, published scientific data form the tip of the iceberg of all scientific data produced in labs. As at least 90% (my guess) of all experiments fail or lead to negative results, those data sets become “dark data”. But those dark data are as important for making science happen as positive data, and this information must be free, argues Thomas Goetz, Wired’s deputy editor (and another SciFoo camper), in an opinionated piece in the October issue of Wired (available only offline at this moment; update: it is now online) called Mind the gaps. The idea is to push open access science to its limits.

“Liberating dark data makes many scientists deeply uncomfortable, because it calls for them to reveal their “failures”. But in this data-intense age, those apparent dead ends could be more important than the breakthroughs….Your dead end may be another scientist’s missing link. Freeing up dark data could represent one of the biggest boons to research in decades, fueling advances in genetics, neuroscience, and biotech.”

“Advocating the release of dark data is one thing, but it’s quite another to actually collect it, juggling different formats and standards. There’s the issue of storage….Google, among others, is lending a hand with its Palimpsest project, offering to store and share monster-size data sets (making the data searchable isn’t part of the effort.)”

Stop for a minute! The Palimpsest project was entertainingly presented at SciFoo by Jon Trowbridge (my iPhone shot of one of his slides is published here with Jon’s permission), and my guess is that this presentation is the source of Thomas Goetz’s sentence. I tried to hint at this project in my SciFoo Camp, 2007: data (Google) publishing (Nature) geeks (O’Reilly) post:

“scientific data”

One of the most frequently used key terms was “scientific data”. And the question is: how to collect, upload, organize and index them. With the exponentially increasing data sets that are produced by scientists worldwide, it is obvious that we need really powerful tools to benefit from them. After a couple of beta years it is highly probable that Google (according to its mission statement) will offer new ways to manage the enormous amount of valuable scientific data. Without that, the efficiency of the science industry will dramatically decline.

But it was Deepak who later shared his experience of the presentation in detail:
Scifoo: Google and large scientific datasets

Here is my favorite part of Goetz’s article about the science culture problem of freeing dark data:
“If their research is successful, many academics guard their data like Gollum, wringing all the publication opportunities they can out of it over years. If the research doesn’t pan out, there’s strong incentive to move on ASAP, and a disincentive to linger in eddies that may not advance one’s job prospects.”

Wait for a sec! During the summer I did 2 experiments that failed (= negative data), but then I explored the literature to find out exactly why I failed, and now this knowledge and insight will presumably lead me to successful experiments.

Posted in Wired, biology, data, google, open science, open-access, science, science publishing | 10 Comments »