The current operational idea behind Google’s Palimpsest Project is to ship a 3 TB drive array (Linux RAID-5; 1 terabyte = 2^40 ≈ 1.0995 × 10^12 bytes) to scientists, who upload their data and FedEx the hard drives back to Google. Google then makes those data publicly available and manageable. This file transfer method was heavily criticized by Dai Davies in Ars Technica: “This is a bit like using Flintstones technology in the Internet era.” There are arguments behind this choice, though; see Jon Trowbridge’s 11th slide. Let us set the uploading/updating problem aside for the length of this post. Here I only care about the end user, the scientist who is given some tool to upload 3 TB of research and measurement data on behalf of her research facility. While hundreds of gigabytes per day can seem like normal output for an astronomer, my angle is how a life scientist and her data fit into this 3 TB equation and, eventually, into the Palimpsest Project. Accordingly, my question is this:
How much data does an average wet-lab scientist or biomedical researcher produce per day?
I will try to come up with a rough guess, in the hope that commenters will correct it where needed. I assume the following (rather busy) daily data production by our average scientist in an average lab:
running a gel and taking a gel photo: 300 KB .tiff file
preparing 5 samples for sequencing at the core facility: 500 KB to 1 MB of ab1 and seq files
FACS sorting of different cell populations: 1 MB of special FACS files and a 100 KB pdf generated from them
conducting a microarray measurement in a collaboration: 100 MB of raw data, not counting the analysis (further Excel files)
taking high-res cell culture pictures: 20 pictures, 40 MB of jpg files
and optionally (count 2 per week), though this makes up the bulk of the data: recording an overnight time-series confocal microscopy “movie”: 1 to 1.5 GB in the microscope’s own file format (say, lsm) plus a corresponding avi file of, say, 100 MB
As a rough (and decimal) overestimate, that is 250 MB/day. With the 2 additional microscopy movies per week it comes to no more than 4.5 GB/week, which adds up to no more than 20 GB/month, or roughly 240 GB = 0.24 TB/year. At that rate, filling a 3 TB drive array would take one individual scientist about 13 years. A lab of 70 people, of whom say 50 are producing experimental data, would produce roughly 3 TB within 3 months.
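The back-of-the-envelope arithmetic above can be re-run as a small Python sketch, which makes it easier for commenters to plug in their own numbers. The item sizes are the upper ends of the ranges listed above. The 5 working days per week is my assumption (it is what makes 250 MB/day plus 2 movies land just under 4.5 GB/week), and the function and variable names are purely illustrative.

```python
# Sizes in decimal MB, upper end of each range from the list above.
DAILY_ITEMS_MB = {
    "gel photo (.tiff)": 0.3,
    "sequencing files (ab1, seq)": 1.0,
    "FACS files + pdf": 1.0 + 0.1,
    "microarray raw data": 100.0,
    "cell culture pictures (jpg)": 40.0,
}
MOVIE_MB = 1500.0 + 100.0      # lsm file + avi file
MOVIES_PER_WEEK = 2

DAILY_ROUNDED_MB = 250.0       # the post's rounded-up daily figure
WORKING_DAYS_PER_WEEK = 5      # assumption, not stated in the post
MONTHLY_ROUNDED_GB = 20.0      # the post's rounded-up monthly figure
DRIVE_TB = 3.0                 # size of the shipped drive array


def estimate(producers=1):
    """Return the rough data-volume figures for `producers` scientists."""
    itemised_mb = sum(DAILY_ITEMS_MB.values())
    weekly_gb = (DAILY_ROUNDED_MB * WORKING_DAYS_PER_WEEK
                 + MOVIE_MB * MOVIES_PER_WEEK) / 1000.0
    yearly_tb = MONTHLY_ROUNDED_GB * 12 / 1000.0      # per scientist
    years_to_fill = DRIVE_TB / (yearly_tb * producers)
    return {
        "itemised_mb_per_day": itemised_mb,
        "gb_per_week": weekly_gb,
        "tb_per_year_per_scientist": yearly_tb,
        "years_to_fill_drive": years_to_fill,
        "months_to_fill_drive": years_to_fill * 12,
    }


if __name__ == "__main__":
    for n in (1, 50):
        print(n, "producer(s):", estimate(n))
```

With these inputs the itemised daily total is about 142 MB (so 250 MB/day is indeed a generous rounding), one scientist needs about 12.5 years to fill 3 TB, and 50 data-producing scientists need about 3 months.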
Conclusion: Mid-sized biomedical labs are within range of the Google Palimpsest Project, but as far as the hard drives are concerned they definitely won’t be the limiting factor in the next 1-2 years.