[BDCSG2008] Handling Large Datasets at Google: Current Systems and Future Directions (Jeff Dean)

Jeff was the main proponent of the MapReduce model (the MapReduce guy). He reviews Google's infrastructure and their heterogeneous, at-least-petabyte-scale datasets; the goal is to maximize performance per dollar. Data centers and locality are also key in the equation. The hardware is low cost: no redundant power supplies, no RAID disks, standard networking, and Linux on all production machines. The software therefore needs to be resilient to failures (nodes, disks, or whole racks going dead).

A scheduler runs across the cluster to schedule jobs, a cluster-wide file system (GFS) sits on top, and there is usually a Bigtable cell as well.

GFS has a centralized master that manages metadata, chunk allocation, and replication; clients talk to the master first and then talk directly to the chunk servers (a read-path sketch follows these notes).

Bigtable helps applications that need somewhat more structured storage: values are indexed by (row key, column, timestamp), and the system provides garbage-collection policies for old versions. The data is broken into tablets (ranges of rows), each managed by a single machine, and the system can split tablets as they grow. Bigtable provides transactions and lets applications specify locality groups of columns to store together. Replication policies across data centers are also supported (see the data-model sketch below).

MapReduce is a model that fits some of their programs nicely (for instance, reverse index creation). It allows moving computation closer to the data and also enables load balancing (see the inverted-index sketch below).

GFS works fine within a cluster, but they do not have a global view across data centers. For instance, they are looking at unique naming of data and, with such integration, at letting a data center keep working when it becomes disconnected. They are also looking at data distribution policies.
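
A minimal sketch of the GFS interaction pattern described above: the client asks the master only for metadata (chunk handle and replica locations) and then fetches the bytes from a chunk server. The class and function names here (Master, lookup, read) and the in-memory dictionaries are illustrative assumptions, not the real GFS API; the 64 MB chunk size is from the GFS paper.

    CHUNK_SIZE = 64 * 1024 * 1024  # GFS chunk size (64 MB)

    class Master:
        """Centralized master: file-to-chunk mapping plus replica locations."""
        def __init__(self):
            self.chunks = {}      # (path, chunk_index) -> chunk handle
            self.locations = {}   # chunk handle -> list of chunk server addresses

        def lookup(self, path, chunk_index):
            handle = self.chunks[(path, chunk_index)]
            return handle, self.locations[handle]

    def read(master, chunkservers, path, offset, length):
        """Client read: one metadata call to the master, then data from a replica."""
        chunk_index = offset // CHUNK_SIZE
        handle, replicas = master.lookup(path, chunk_index)   # talk to the master
        data = chunkservers[replicas[0]][handle]               # then to a chunk server
        start = offset % CHUNK_SIZE
        return data[start:start + length]

    # Toy usage with one chunk server holding one chunk.
    master = Master()
    master.chunks[("/logs/part-0", 0)] = "h1"
    master.locations["h1"] = ["cs17"]
    servers = {"cs17": {"h1": b"hello world"}}
    print(read(master, servers, "/logs/part-0", 6, 5))   # b'world'

Keeping the master out of the data path is what lets a single metadata node serve a whole cluster.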
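
The Bigtable data model can be sketched as a map from (row key, column, timestamp) to value, with a garbage-collection policy that bounds how many versions are kept. This is a toy in-memory sketch under that assumption; tablets, locality groups, and persistence are omitted, and the Table class below is hypothetical, not Bigtable's API.

    import time
    from collections import defaultdict

    class Table:
        """(row, column, timestamp) -> value, keeping at most max_versions per cell."""
        def __init__(self, max_versions=3):
            self.rows = defaultdict(lambda: defaultdict(list))  # row -> column -> cells
            self.max_versions = max_versions                    # GC policy

        def put(self, row, column, value, timestamp=None):
            ts = timestamp if timestamp is not None else time.time()
            cells = self.rows[row][column]
            cells.append((ts, value))
            cells.sort(key=lambda c: c[0], reverse=True)   # newest first
            del cells[self.max_versions:]                  # garbage-collect old versions

        def get(self, row, column, timestamp=None):
            """Most recent value at or before the given timestamp (latest if None)."""
            for ts, value in self.rows[row][column]:
                if timestamp is None or ts <= timestamp:
                    return value
            return None

    t = Table(max_versions=2)
    t.put("com.cnn.www", "anchor:cnnsi.com", "CNN", timestamp=1)
    t.put("com.cnn.www", "anchor:cnnsi.com", "CNN Sports", timestamp=2)
    print(t.get("com.cnn.www", "anchor:cnnsi.com"))   # CNN Sports

Because rows are kept in sorted order, a tablet is just a contiguous range of row keys, which is what makes splitting a growing tablet cheap.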
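
The reverse-index example mentioned above maps naturally onto MapReduce: the map phase emits (word, document id) pairs and the reduce phase collapses each word's postings into a sorted list. The toy runner below only stands in for the framework's shuffle; the user-written parts are map_fn and reduce_fn, and all names here are illustrative.

    from collections import defaultdict

    def map_fn(doc_id, text):
        """Map: emit (word, doc_id) for every word in a document."""
        for word in text.split():
            yield word.lower(), doc_id

    def reduce_fn(word, doc_ids):
        """Reduce: collapse one word's postings into a sorted, de-duplicated list."""
        return word, sorted(set(doc_ids))

    def run_mapreduce(documents):
        intermediate = defaultdict(list)          # shuffle: group values by key
        for doc_id, text in documents.items():
            for key, value in map_fn(doc_id, text):
                intermediate[key].append(value)
        return dict(reduce_fn(k, v) for k, v in intermediate.items())

    docs = {"d1": "big data at google", "d2": "the google file system"}
    print(run_mapreduce(docs))   # e.g. 'google' -> ['d1', 'd2']

In the real system the map tasks are scheduled near the GFS chunks holding their input, which is the "move computation closer to the data" point, and stragglers are handled by rebalancing and re-executing tasks.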