Petabyte Age Wiredesque lesson on what science can learn from Google

I have argued many times here that biology-based biotechnology is the next information technology, but to get there biotech must harness good IT patterns and mimic its massive computing practices to handle the enormous amount of constantly accumulating data. The trend can be summarized simply: keep your eye on Google and conduct thought experiments in advance in which science is done in a Googleplex-like environment, in terms of computing and financial resources and an algorithm-heavy engineering culture. Use Python and learn cluster computing and MapReduce. With the expected launch this year of Google's massive scientific dataset hosting service – nicknamed Palimpsest – a direct interface between scientists and Googlers finally emerges and hopefully opens up possibilities for scientists to cooperate with Google. (Remember my joke about Google BioLabs back in 2006?) I have been getting emails from biologists and bioinformaticians asking me how to get hired by Google ever since.

As I tweeted yesterday: I increasingly have the impression that “being ambitious” today means ‘has worked, is currently working, or is going to work at/for Google’. Taking Google’s inter-industrial power into consideration, I see a real chance that some day the “Google of Biotechnology” title goes not to a startup yet to emerge, not to Genentech, not to 23andMe, but… to Google itself. No kidding. Fortunately, Google’s model is “to build a killer app, then monetize it later,” says Andy Rubin, the man behind Google’s Android mobile software, in the July issue of Wired – so scientists working for the big G probably won’t have to worry about turning their scientific killer app into an instant cash machine.
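For readers who have never seen it, the MapReduce pattern I keep recommending is simple enough to sketch on a single machine. This is a toy illustration of the idea only – the function names are mine, not Google's API – and a real deployment distributes the map and reduce phases across a cluster:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy single-machine MapReduce: map each record to (key, value)
    pairs, group the values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# The classic word-count example.
lines = ["petabyte age", "petabyte science"]
counts = map_reduce(
    lines,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda word, ones: sum(ones),
)
# counts == {"petabyte": 2, "age": 1, "science": 1}
```

The point of the pattern is that the mapper and reducer are stateless, so the same two functions scale from a laptop to thousands of machines.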

And now in that very issue of Wired magazine (not online yet) there is an exciting cover story on the same pattern I have talked about concerning the life sciences, but in the broader context of every kind of science, under the provocative, Fukuyama-like title The End of Science. There is a witty and short essay by editor-in-chief Chris Anderson entitled The End of Theory, followed by examples of the ‘new science’: the Large Hadron Collider, expected to generate 10 petabytes of data per second; the Sloan Digital Sky Survey, the sky-catalog maker that has accumulated 25 terabytes of data so far; the skeleton-scanning project of Sharmila Majumdar; and the Many Eyes project, “where users can share their own dynamic, interactive representations of big data”.

For many people around the globe, Chris Anderson is a freeconomist and the author of a popular airport book, but fewer people are aware that he was actually trained as a (quantum) physicist and even worked at Los Alamos (after a three-minute search on Google Scholar and the like, I gave up trying to find any peer-reviewed article Anderson coauthored). So when he writes on science, readers should keep in mind that what he really understood once was the practice of physics in the ’80s (I am looking forward to a biology-oriented Wired editor-in-chief in the near future – Thomas Goetz perhaps?). Anderson talks about the end of traditional science, or at least the end of science according to the most popular philosophical account of how science operates: a testable hypothesis about the underlying mechanism and causation; model, test by experiment, confirm or falsify.

“The reason physics has drifted into theoretical speculation about n-dimensional grand unified models over the past few decades is that we don’t know how to run the experiments that would falsify the hypotheses – energies are too high, accelerators too expensive, and so on.”

Anderson concludes:

“There is now a better way. Petabytes allow us to say: ‘Correlation is enough.’ We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.”
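The hypothesis-free, correlation-first analysis Anderson describes is easy to sketch in miniature: take a data table, rank every pair of columns by correlation strength, and let the strongest patterns surface with no model of the underlying biology. The data and function names below are invented for illustration:

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def strongest_correlations(table, top=3):
    """Rank all column pairs by |r| -- pattern finding with no hypothesis."""
    scored = [(a, b, pearson(table[a], table[b]))
              for a, b in combinations(table, 2)]
    return sorted(scored, key=lambda t: -abs(t[2]))[:top]

# Toy measurements: one pair of variables co-varies, one is noise.
data = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [2.1, 3.9, 6.2, 7.8],   # tracks gene_a closely
    "noise":  [5.0, 1.0, 4.0, 2.0],
}
top = strongest_correlations(data)
# top[0] names gene_a and gene_b, with r close to 1.0
```

Scaled from three columns to millions, and from Pearson correlation to fancier statistics, this brute-force ranking is exactly what a petabyte cluster buys you – what it does not buy is an explanation of why the pattern exists.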

Interesting point, but isn’t it too early to call all this the End of Science, in typically overreaching journalistic lingo? From a cultural point of view, it is just not cool to borrow the vulgar philosophical (and vulgar Hegelian) generalization of a stone-conservative philosopher/political theorist who is one of the biggest enemies of everything biotech, especially in the magazine famous for its cool technophilia and geekery. (Think about what Fukuyama would say about Venter’s synthetic biology, a biological example Anderson uses in his essay.)

On the other hand, Anderson may be right about the other sciences he mentions, like psychology or sociology, and most notably his scientific home, physics. But individually conducted, hypothesis-based experimental biology is far from over, or at least that is my own academic experience. Although more and more wet-lab biologists find themselves dried out and choose bioinformatics or computational biology instead, it is still entirely possible to master hypothesis-based, carefully designed (good controls!), beautiful experiments using fewer than, say, 20 main variables with an expected outcome, and I see no reason why that should change.

Even I (trained and having worked as an experimental scientist so far, minus five years of philosophy) now have growing bioinformatics, database-building and large-scale pattern-seeking dreams, but that is due to my main commitment – the reason I chose biology in the first place – to robust, healthy life extension. One reason scientists in the biomedical sciences are switching to computation-heavy problem solving is that it is the way to tackle really big, complex issues like cancer, diabetes or aging. But not every scientist wants to solve problems like these.

Anderson is right about the tendencies and trends (yes, science can and must learn a lot from Google – isn’t that obvious?), but even the petabyte age will need and produce Einstein-like scientists: brilliant theoreticians, creative thought-experimenters and methodological outliers, not just perfect statisticians.

Update: What about the large-scale statistical models? – asks Fernando Pereira via Three-Toed Sloth via Bill Hooker on FriendFeed, where people are mostly underwhelmed by Anderson’s arguments.

8 thoughts on “Petabyte Age Wiredesque lesson on what science can learn from Google”

  1. Attila, there is a lot more to biological data than just managing it. What are the biological problems you are trying to solve? What questions are you trying to answer? Do you know how to present the information to different groups in your organization? These are problems that a non-scientist cannot answer, so what we need is a marriage of the two minds. I doubt Google has any interest in being a scientific company per se. It’s hard work, and they would essentially have to be two companies. IBM has done some good science in its day (some brilliant science, actually), but that is still not its core business, and shouldn’t be.

    What we need is the mindset, and the realization that we need to think about new ways of managing data and making the results available to scientists and decision makers. We need to start thinking about distributed computing paradigms, but those are things science has always learned from the tech industry. We’ve just been too slow doing it.

  2. I’ll add that there are multiple avenues for Google to be an enabling platform (e.g. Android, Palimpsest, Google Code, Google Scholar, etc.).

  3. I haven’t read my Wired yet, but I loved the cover and have been thinking about it since I got it – of course with respect to biotechnology. There has definitely been a trend in biotechnology towards screening-type experiments and away from individual hypothesis-driven experiments. There are arguments both ways about whether high-throughput screening has led to the Demise of Big Pharma, but there are some cool ways to use the technology. I was involved in the first group to apply screening towards ‘discovering’ environments to drive stem cell fate, which is currently being used internally for BD’s cell therapy initiative and for external commercialization. While the platform allows for screening hundreds of conditions, there are still a lot of hypotheses being asked per experiment. BD had a 10+ person informatics team to customize the informatics for the 30+ strong biology group – so the importance of computing power to push that forward was very obvious. I could totally see how that or other platforms could be extended, and with infinite resources (~$300M/year?) you could just screen everything and look for what works. While that isn’t happening today, I can see how 10–30 years down the line (where Chris Anderson & Wired try to ‘predict’) non-hypothesis-driven work will dramatically increase its ‘market share’ of even academic science. With that, there is the potential of a snowball effect – fewer scientists learning how to shape and test hypotheses, more screening/informatics approaches – towards the ‘end of science’; that is the story that has developed in my head since seeing Wired’s cover. So while the End of Science is far from a reality, it isn’t so far-fetched to think it could happen in this millennium.
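The “screen everything and look for what works” approach this comment describes amounts to running every condition and ranking by the measured outcome, with no prior hypothesis about which condition should win. A toy sketch – the substrates, growth factors and assay function are all invented stand-ins, not BD’s actual platform:

```python
import random
from itertools import product

random.seed(0)

def simulated_assay(substrate, growth_factor):
    """Stand-in for a plate-reader measurement; a real screen would
    measure a phenotype such as a differentiation marker's intensity."""
    base = {"laminin": 0.6, "collagen": 0.4, "fibronectin": 0.5}[substrate]
    boost = {"BMP4": 0.1, "FGF2": 0.3, "none": 0.0}[growth_factor]
    return base + boost + random.gauss(0, 0.01)  # small measurement noise

# Screen every combination and simply rank by outcome -- no hypothesis needed.
conditions = product(["laminin", "collagen", "fibronectin"],
                     ["BMP4", "FGF2", "none"])
results = {cond: simulated_assay(*cond) for cond in conditions}
best = max(results, key=results.get)
# best is the (substrate, growth_factor) pair with the highest readout
```

With nine conditions this is trivial; the informatics burden the comment mentions comes from doing the same bookkeeping, quality control and ranking over hundreds of conditions and replicates.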

  4. I find it interesting that this concept appears diametrically opposed to the DIYbio movement: crunching petabytes of data will require enormous computational power, and thus enormous wealth. I understand the idea of pooling resources, but the necessity of a massive, centralized processor seems insurmountable.

    Am I missing something?

  5. Attila,

    Very interesting post. Ever since reading your first post about the potential Google has in the science/health field, I have agreed with you – because I had similar thoughts. Housing and mining this data with Google’s infrastructure and brainpower could lead to valuable insights and to a whole new field of science (even a paradigm shift).

    I will have to run out and get the new Wired magazine.

    P.S. GlaxoSmithKline just released a huge database of cancer cell line data (

Comments are closed.