How much data is produced by a life scientist/day?

The current operational idea behind Google’s Palimpsest Project is to ship a 3 TB (terabyte = 2^40 ≈ 1.0995 × 10^12 bytes) drive array (Linux RAID-5) to scientists, who upload their data and FedEx the hard drives back to Google. Google then makes those data publicly available and manageable. This file transfer method was heavily criticized by Dai Davies in Ars Technica: “This is a bit like using Flintstones technology in the Internet era.” There are arguments behind this choice, though; see Jon Trowbridge’s 11th slide (and the transfer-time sketch after the conclusion below). Let’s set the uploading/updating problem aside for the purposes of this post. Here I only care about the end user, the scientist who is provided with whatever tool to upload 3 TB of research and measurement data on behalf of her research facility. While for an astronomer hundreds of gigabytes per day can seem like a normal output, my angle is how a life scientist and her data fit into this 3 TB equation and, eventually, into the Palimpsest Project. Accordingly, my question is this:

How much data is produced by an average wet-lab scientist or biomedical researcher per day?

I’ll try to come up with a rough guess in the hope of corrections from the commenters. I assume the following (rather busy) daily production of data by our average scientist in an average lab:

running a gel and taking a gel photo: one 300 KB .tiff file

preparing 5 samples for sequencing at the core facility, output: 500 KB-1 MB of .ab1 and .seq files

FACS sorting of different cell populations: 1 MB of special FACS files and a 100 KB PDF generated from them

conducting a microarray measurement in a collaboration: roughly 100 MB of raw microarray data without analysis (plus further Excel files)

taking high-resolution cell culture pictures: 20 pictures, 40 MB of .jpg files

and optionally (count two per week), though this makes up the bulk of the data: recording an overnight time-series confocal microscopy “movie” in the microscope’s own file format (say .lsm), 1-1.5 GB, plus a corresponding .avi file, say 100 MB

With a rough (and decimal) overestimation that is about 250 MB/day; with the two additional microscopy movies per week it comes to no more than about 4.5 GB/week, which adds up to no more than 20 GB/month, that is roughly 240 GB = 0.24 TB/year. For a 3 TB drive array, that is about 13 years of work by one individual scientist. For instance, counting a lab with 70 people, of whom say 50 are producing experimental data, they produce roughly 3 TB within 3 months.
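For the record, here is a minimal Python sketch of the arithmetic, using the daily figures assumed in the list above; the rounding is mine, so the totals come out slightly below the round numbers quoted in the text.

```python
# Minimal sketch of the estimate above, using the (assumed) daily figures
# from the list; rounding is mine.

MB, GB, TB = 10**6, 10**9, 10**12    # decimal units, in bytes

daily_items = {
    "gel photo (.tiff)":            0.3 * MB,
    "sequencing (.ab1/.seq)":       1.0 * MB,
    "FACS files + PDF":             1.1 * MB,
    "microarray raw data":          100 * MB,
    "cell culture pictures (.jpg)": 40 * MB,
}
daily = sum(daily_items.values())            # ~142 MB/day
daily_overestimate = 250 * MB                # rounded up, as in the post

movie = 1.5 * GB + 100 * MB                  # overnight confocal movie (.lsm + .avi)
weekly = 5 * daily_overestimate + 2 * movie  # ~4.45 GB/week
yearly = weekly * 52                         # ~231 GB/year

print(f"summed daily output:                   {daily / MB:.0f} MB")
print(f"weekly output:                         {weekly / GB:.2f} GB")
print(f"yearly output:                         {yearly / GB:.0f} GB")
print(f"years for one scientist to fill 3 TB:  {3 * TB / yearly:.1f}")
print(f"months for 50 scientists to fill 3 TB: {12 * 3 * TB / (50 * yearly):.1f}")
```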

Conclusion: mid-sized biomedical labs are within range of the Google Palimpsest Project, but they definitely won’t be the limiting factor for the hard drives in the next 1-2 years.
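As a footnote to the Flintstones criticism above, here is a quick back-of-the-envelope sketch of why shipping drives is not as absurd as it sounds; the upload speeds are my own assumptions, not figures from Google or from the Ars Technica piece.

```python
# Back-of-the-envelope sketch: how long would it take to push a full 3 TB
# drive array through the wire? The sustained upload speeds below are
# assumed for illustration only.

TB = 10**12                      # decimal terabyte, in bytes
payload_bits = 3 * TB * 8        # the full drive array, in bits

assumed_uplinks_mbps = {
    "T1 line (1.5 Mbps)": 1.5,
    "10 Mbps uplink": 10,
    "100 Mbps uplink": 100,
}

for label, mbps in assumed_uplinks_mbps.items():
    days = payload_bits / (mbps * 10**6) / 86400
    print(f"{label}: {days:.1f} days to upload 3 TB")
```

Even a sustained 100 Mbps uplink needs close to three days per drive array, and an ordinary T1 line needs about half a year, so a courier is not obviously the slower option.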

2 thoughts on “How much data is produced by a life scientist/day?”

  1. I suppose Ars thinks that everyone has the network bandwidth to send terabytes of data. The biggest challenge is going to be next-gen sequencing. I was talking to a scientist at ISB some time ago, and they actually throw away the raw data after a while (I did tell him about Google). Once more machines come online as costs go down, people will be churning out data like crazy.

    For things like search, which generate tons of data, the individual files are small, but the moment you get to image data or something like that the individual files get so big that communication becomes the bottleneck. You should see the challenges companies face with remote data access. It’s a royal pain.

  2. I used to prepare and run plates full of samples for sequencing (and taking into account that a 16-capillary sequencer can only do 12 runs, in other words two full plates, in a day)…and the fact that all these files fit into even half of my smallest key…then yeah.

    The problem comes when larger-format microarrays (i.e. Illumina 100K) come into play. Our in-house arrays had ~5700 data points, so I never had a problem. The sample files were about 4 MB each and I only had about 150-200 or so microarrays done.

    The Illumina microarrays we were using, on the other hand, have 100K data points (and they were developing a 300 chip), so when we started working with Illumina, Excel wasn’t enough to handle the data (the crash message was something like “Excel cannot open more than 27K rows”).

    I think the biggest problem will be that certain labs are already developing data-rich technologies and churning out more data than can be handled or stored without having to buy their own servers (like we had)…but then, that’s only a few labs around the world…
