Introduction by Jim Koch -
Eric Brewer is the co-founder and chief scientist at Inktomi, one of the recent great success stories in the internet arena.  He received his Ph.D. in parallel computing from MIT in 1994.  He is currently an assistant professor of electrical engineering and computer science at the University of California at Berkeley.  His research focuses on internet infrastructure, internet security, and mobile computing.  Certainly the work that Eric is involved with in the internet area will provide us with considerable insight into what we are going to do with all this storage capacity, in markets that have yet to be created.


Brewer's Biography

ERIC BREWER


 
    I actually have very little experience in the disk drive industry; I'm a user, that's about it.  But as a user of disk drives I think in an unusual way, a way I think will become more and more relevant to the evolution of the industry.  I'll talk about how we use disks in the internet, and also what I would like to see out of disk drives in the future, and it's not areal density.  So a little background: this chart shows the growth of the internet measured in backbone traffic.  For reference, the black line at the bottom is 100% annual growth.  That's a very interesting graph, because it means that Moore's Law is below the black line.  If you say we're going to build the internet and we're going to ride Moore's Law to keep up, there's no chance.  That means several things.  It means that over time you need more processors on the internet, because processors aren't going to get faster fast enough.  The same is true for disk drives: even at a 100% or 200% per year increase in density, it won't keep up.  Which is good for this audience, since it means selling more disk drives.
    So we'll talk about that, but Inktomi exists because of this picture.  Inktomi's strength is basically cluster computing.  We know how to make very large computers, as I'll show you in a little bit.  They act as one simple giant computer that can grow at a pace faster than Moore's Law.
    So, a short company overview.  It is INKT on NASDAQ.  It came out of UC Berkeley; it was founded by myself and one of my graduate students, actually my first graduate student.  I found out today that we are now 260 employees, a sign of fast growth.  We were at 80 earlier in the year, so it's kind of scary.  We have three applications, known mostly for the search engines.  If you use HotBot or Yahoo or Snap, those are all Inktomi search engines; we provide the search capacity they actually use to deliver their products.  So we are an OEM search provider, and actually an OEM in general for all these products.
    Network caching, which I will talk about, has to do with making the internet faster.  Online shopping, which you just keep hearing about in the news, allows you to buy things online, which fundamentally will be more efficient than any other distribution mechanism.  It's a very nice use of the internet.  So we have a lot of partners; I'll talk about a few of them a little bit.  But that's kind of the big view.
    I'll talk about the two applications that use lots of disk.  The first one is the search engine, and this is actually the most telling picture.  Here is the way a search works, if you were to do one.  You would go to, say, Yahoo, and type in your query.  That's pretty normal.  Then what actually happens to it is not intuitive.  It goes to Yahoo, which I believe is from around here, in Mountain View, though their office and their computers are in different places.  Then from Yahoo your query actually, currently, goes to Virginia.
    So Yahoo gets your query and sends it to Inktomi in Virginia, even though there is a cluster in Santa Clara that we will talk about in a little bit.  The cluster gives them back the answer in real time, and they then do the presentation, which is to say they convert it to HTML, and they insert whatever advertisements they'd like and other things, the little icons and such that show up on the page, maybe a link to get your stock quotes or something like that.  So they actually do the presentation, and then you get your answer.  There are actually quite a few steps in that: a step to Yahoo, then a step from Yahoo to Inktomi.  It turns out most search engines work this way.  In fact, what is interesting is that there is one big cluster that does most of the searches on the internet.  It's actually not very far from here; it's off Great America Parkway, I don't know exactly, but about a mile or two from here.
    That main cluster now has 166 nodes, and each of those has two CPUs.  So when I say a large virtual machine, what I mean is a machine with 166 nodes in it, more than 300 processors, that solves the search engine queries.  It works as a virtual computer.  It also has a lot of disks, and this is where I am going with this.  But I think what is interesting is what this picture implies: there is an infrastructure being built in the background that actually does what we call the heavy lifting.  And it's important because it is not only a centralization of CPU resources but a centralization of disk resources.  In fact, I think the most important trend from the internet for the disk drive industry is that most storage that end users own won't be in their house and won't be in their laptops; it's going to be in the infrastructure.  This means that the market is going to have to shift a little bit.
    So how do we do this?  I actually want to use this old picture to talk about the difference.  What we do is take a cluster of commodity nodes (I'll show you a picture in a minute), typically workstations like a two-processor Sun workstation, and we spread the database across the cluster.  That is actually the interesting part, because that has historically been very difficult to do.  When a query comes in it can go to any of the nodes in the cluster, and that node will give the query to all the nodes, in this case 166 nodes.  They perform that query in parallel and return their partial answers to the node that started the query, which collates them, picks the top 10, and returns that to the user.  So if you want to know what is the largest supercomputer you have ever used, it is almost certainly the Inktomi cluster, because it's a 330-some-processor supercomputer.  Every query you do on HotBot or Yahoo that's a web query actually goes to this cluster and uses all of its processors.  Not one of them; you get to use all of them in every query.
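    To make that scatter/gather flow concrete, here is a minimal sketch in Python, assuming a toy in-memory index slice per node; the names and data structures are illustrative assumptions, not Inktomi's actual code.

    # Minimal sketch of the scatter/gather search described above, using a toy
    # in-memory "index" per node. All names here are illustrative assumptions.
    from concurrent.futures import ThreadPoolExecutor
    import heapq

    # Each node owns one slice (partition) of the document database.
    partitions = {
        0: {"disk drives": [(0.9, "doc-a"), (0.4, "doc-b")]},
        1: {"disk drives": [(0.7, "doc-c")]},
        # ...one entry per node; the real cluster had 166 nodes.
    }

    def search_partition(node_id, query, top_k=10):
        """One node answers the query against only its own slice of the data."""
        return partitions[node_id].get(query, [])[:top_k]

    def cluster_search(query, top_k=10):
        """Any node can coordinate: scatter the query to every node in parallel,
        gather the partial answers, and collate the global top-k."""
        with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
            partials = pool.map(lambda n: search_partition(n, query, top_k),
                                partitions)
        return heapq.nlargest(top_k, (hit for part in partials for hit in part))

    print(cluster_search("disk drives"))   # e.g. [(0.9, 'doc-a'), (0.7, 'doc-c'), ...]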
    The reason we do that is subtle: it's because that is the way to get the most bandwidth and the most seeks out of the disks.  The architecture wins, and it literally wins because it gives much more I/O bandwidth and many more seeks per second than any other architecture.  This cluster can do about 60,000 seeks per second; you can only get about a hundred out of a disk.  We average about 65 seeks per second per disk, 24 hours a day, which by the way is very hard on disks; these disks are at extreme utilization all the time.  And by the way, that is how we know how many disks to buy, because the load is nearly perfectly balanced due to some randomization tricks we play.  All the disks are equally used, and they are all running at about 60% to 80% of capacity all the time.  By capacity I mean seeks-per-second capacity, not space capacity.  And the reason you get more bandwidth is, of course, that you get a separate I/O bus for every machine.  The aggregate I/O bandwidth of this architecture is incredible, much more than you can get with any kind of Cray or anything like that.
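    The disk count falls out of that arithmetic.  As a rough back-of-the-envelope sketch using the numbers above (the target utilization is an assumption chosen inside the stated 60-80% band), seeks per second, not gigabytes, decide how many drives to buy:

    # Sizing the cluster's disk population from seek demand, not capacity.
    cluster_seek_demand = 60_000   # total seeks/sec the cluster sustains
    seeks_per_disk      = 100      # roughly what one drive delivers
    target_utilization  = 0.75     # keep each disk inside the 60-80% band

    disks_needed = cluster_seek_demand / (seeks_per_disk * target_utilization)
    print(round(disks_needed))     # ~800 disks, independent of their capacity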
    So if you looked at it as a traditional supercomputer, based on metrics such as floating-point operations per second, it's actually relatively weak; it's in the top 100 but not in the top 10.  If you look at it in terms of I/O capacity per second, either by bandwidth or by seeks, it's by far number one.  But there's no such ranking for those machines, so we don't really know.  The closest one would be the Teradata server for Wal-Mart, which has about 700 processors, so it should be up there too; again, we don't know what its I/O capacity is.  But this is basically an I/O machine.
    Search engines basically require lots and lots of random seeks, because you don't know what people are going to ask for.  It has a lot of other interesting properties that I won't talk about too much, but the most interesting is probably the way you expand it.  When you would like more capacity, in terms of more users per day or more documents in your database, a bigger search engine database, the operation you perform is the same: you add another commodity node.  In fact we started with 5 nodes, then 10, 26, 36, 66, 80, 166, and we'll go to 200 shortly.  We've also gone from one cluster to three clusters, now two in Santa Clara and one in Virginia.  The one in Virginia, as you might have guessed, is for disaster tolerance.  You don't want all your clusters in California, certainly not in the same part of California.  We could not afford multiple clusters until recently.
    The other thing I think is interesting about this picture, and why I call it the old picture, is that we no longer use RAID storage.  It turns out RAIDs are not cost effective for us; we do our own management of faults, basically for recovery, and we now use only raw disks.  In fact the other change, which I'll talk about in a minute, is that we don't even use external disks anymore.  We use only internal disks, because they cost less to administer.  They actually turn out to be more reliable, which is counterintuitive.
    This is a picture of the cluster in Santa Clara, and the picture is very telling.  The most interesting thing for this audience is that there is no disk visible in the picture.  There is a terabyte and a half of storage in that picture; it's just not visible.  So don't take the "no disk" literally, there are actually about 800 disk drives or something like that, but they are not visible.  That turns out to be a very good thing in many ways.  In particular, it takes up less rack space to use internal drives than external drives, and as you can tell, rack space is an issue.  In general, for centralized services, which is where our disks are going to go in the future, rack space will be a dominant issue.  It also means that our field-replaceable unit has changed: when a disk fails we do not replace the disk, we replace the whole workstation that contains the disk.  That actually turns out to be a very nice mechanism; we just hot-swap entire nodes, and therefore we don't care whether the disks are hot swappable, which used to be practically the only thing we cared about.  It turns out that hot-swap disks are expensive, and they actually don't buy you that much, because there are lots of ways they can fail where the hot-swap part doesn't help you.  So now we just hot-swap the whole node with another node, and once we get the node out of the cluster we can go deal with its disk drives.
    Disk drives are still the component that fails most often.  We lose about two disk drives a week, and again, part of that is because they are running 24 hours a day at between 25 and 80 seeks per second all the time.  I mean all the time, even on weekends.  So that tends to be hard on them.
    The biggest change we made to increase reliability was to contractually obligate that the temperature in the room be held within a four-degree range.  If it goes outside that range, we basically don't pay.  So the rent we pay for this space, which is obviously expensive, is contingent on the temperature of the disk drives, and that is by far the most important thing we could do to improve the reliability of the disks: no temperature variation.  That is new.  It's not done widely in the industry yet, but as soon as people figure that out they'll converge on it.  It's not so much heat or cold, it's the variation that kills the disk.
    There are also no visible cables; that's so you can't trip over them.  Another reason not to use external enclosures is that you have to have a cable.  In fact, if a disk breaks and it's not in a critical system, we leave it there until Friday.  So most disks get replaced on Friday morning, and that's fine; we don't need them all up to be successful.  They are 4 gigabyte disks, which I think is interesting, because clearly you can get higher density disks than that.  It turns out that 4 gigabyte disks are more reliable and cost less per gigabyte.  And they are enough, meaning that having bigger disks wouldn't help us, because we could not get more bandwidth and seeks in and out of the disks anyway.  The ratios are wrong, and that's going to be a big thing that I'll come back to.  I would rather have eight little disks than one big disk of the same capacity.  One of the most important takeaways here is that I want more heads and less capacity, essentially.
    So that's the search engine; the other place we use disk is in network caching.  Give me a minute to explain what this is.  Normally the internet just deals with packets.  When you have an ISP, their job is to send you packets, basically IP packets.  If you want to go to a faraway site, your packets traverse that entire path and come back.  In fact, if the person in the cubicle next to you goes to the same site that you just went to, their packets make the same trip.  So it's like having to fly to Hollywood and back every time you want to watch a movie.  That's basically how internet content is distributed; there is no local distribution mechanism.  Caching is the local distribution mechanism.  So what happens is, in each of these big boxes we put local storage, which is basically disks, and these disks hold fresh local copies of internet data.
    What that means is that if you want to visit a site like Yahoo, there is a good chance that there is a local copy on a disk near you, so you don't actually have to go to Palo Alto and back.  And this matters even more if you are in Europe and have to go across the ocean to get your Yahoo packets.  So in fact there are lots of benefits from this; let me just outline them.
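    The core mechanism is simple.  As a minimal sketch, assuming a local disk spool and a fixed freshness window (both assumptions, not how Inktomi's Traffic Server is actually configured), a cache serves a fresh local copy when it has one and goes to the origin only on a miss:

    # Serve a fresh local copy if we have one; otherwise fetch it once from the
    # origin and keep the copy on local disk for everyone behind this cache.
    import hashlib, os, time, urllib.request

    CACHE_DIR   = "/var/cache/webcache"   # hypothetical local disk spool
    TTL_SECONDS = 300                     # how long a copy counts as "fresh"

    def cache_path(url):
        return os.path.join(CACHE_DIR, hashlib.sha1(url.encode()).hexdigest())

    def fetch(url):
        path = cache_path(url)
        # Cache hit: no trip across the backbone at all.
        if os.path.exists(path) and time.time() - os.path.getmtime(path) < TTL_SECONDS:
            with open(path, "rb") as f:
                return f.read()
        # Cache miss: go to the origin once, then store the copy locally.
        body = urllib.request.urlopen(url).read()
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(path, "wb") as f:
            f.write(body)
        return body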
    (My one marketing slide.)  The first thing is faster response time: 40 to 50% of the things you ask for are in the local copy, and getting them locally is much faster than getting them over the long internet.  It saves money: the 30% bandwidth savings comes from the fact that if your ISP doesn't have to make those extra trips across the country, they can get more utilization out of their network.  Essentially their effective capacity goes up, because you are absorbing traffic at the edges.  That's the main reason ISPs are deploying caching today: it reduces their operating cost by a lot.  It's more efficient.
    Surge protection is a little bit subtle.  It has to do with the fact that when something like TWA Flight 800 or the Starr report happens, many, many users try to go to the one site, and in fact they can't all get there.  The Starr report was the test of this phenomenon with caching.  AOL did this: AOL's average response time for the Starr report was less than 100 milliseconds.  The reason was that all their traffic was served out of a disk copy; they weren't actually going to CNN to get the report, or anywhere else frankly.  The first person went to CNN and brought it into the local copy, and everyone after that got it out of the local copy.  Which means the maximum capacity of AOL serving that document was about 300 times what any single internet site could have delivered.  Therefore there was no crush on that site, at least for AOL users.  A very interesting phenomenon; that was the first proof that surge protection actually works.  In fact AOL put out a press release about how they didn't have any trouble with the Starr report.  What they didn't tell you is why.  The real answer is that they had copies on local disk.
    This next feature is more subtle: once you've got disks and CPUs in the infrastructure, there are lots of other things you can do with them.  One of the most interesting in the short term is much smoother audio and video playback.  If you want to watch a video stream, you can't actually do that over the long internet, but you can do it from the local ISP POP to your home.  And @Home is going to do this with Traffic Server, which is the name of our network cache.  So there is quite a lot you can do with it once you've got these things in place.
    So, a little bit about scaling.  I think most users will have more disk in the infrastructure than they will at home.  AOL uses the network cache, and it sees about 1.25 billion hits a day.  Big websites like Yahoo are proud to have 100 million hits per day.  Well, what the cache sees is the aggregation of all those sites for all AOL users.  So the cache actually has to handle not only all AOL visits to Yahoo, but all AOL visits to every other internet site, and AOL is responsible for 30 to 40% of all dialup traffic in the U.S.  What I'm telling you is that the biggest cache in the world, the one at AOL, is already handling a billion hits every day, about 1.25 billion by now.  That translates to a peak of about 50,000 operations per second, and every one of those operations, in practice, needs a seek.  So we have to deploy systems with 50,000 seeks per second.  In fact that decides how many disks you have to buy; it has nothing to do with the capacity of the disks.  You have to wonder whether any cache, any network system, can keep up with this.  We don't know yet, but I'd be happy to try.
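    The rough arithmetic, as a sketch: the peak-to-average ratio below is an assumption chosen so the result lands near the 50,000 operations per second quoted above, and the per-disk seek budget is the same ballpark figure used earlier.

    # From hits per day to peak seeks per second to a minimum disk count.
    hits_per_day   = 1.25e9
    average_ops    = hits_per_day / 86_400      # ~14,500 ops/sec on average
    peak_factor    = 3.5                        # assumed peak-to-average ratio
    peak_ops       = average_ops * peak_factor  # ~50,000 ops/sec at peak
    seeks_per_disk = 100                        # roughly one seek per operation

    print(round(peak_ops), "ops/sec peak ->",
          round(peak_ops / seeks_per_disk), "disks minimum, before headroom")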

 

    So the last thing to realize is just how many places disks can go.  Users have a disk; that's their browser cache, their local disk.  It's actually not the primary copy of the data, it's just a cached copy.  There are now caches in the POPs and in the cable head end, so @Home has caching and can get higher quality response time.  Caches are slowly being placed in aggregation points; groups like PSINet and Digex are using caches in the backbone to reduce their operating cost.  Go across the internet to the server side and there are web-hosting caches too.  Basically, if your server is overloaded and you want to increase its capacity, the way to do it is to stick a bunch of higher-performance caches in front of it, which essentially act as a shock absorber.  If you are really clever you can push those shock absorbers out a little further and distribute the serving of your content.  That is a way, as a provider of content, to get more control over your own surge protection and your own capacity issues.  Which means there's a chance for something like four different disk copies of the same page between user and server, and there's no reason to believe there won't be at least two or three copies in practice.  All right, so if you have one copy for yourself, there are also three copies for you in the infrastructure.  Now, fortunately those copies are actually shared among all these people, which is what makes it work; it's not like you have a lot more copies of stuff out there in aggregate.  But it's amazing how much stuff in there is actually only for one person.  If you reach some obscure page that nobody else has reached, you still get to take up disk space in the infrastructure.  Of course we don't know which pages are going to be viewed by everybody and which ones are not.
    So there's a lot of room for disks here.  The other good thing is that it's growing at a net 200, 300, 500 percent per year slope.  That's good.
    So the last thing I will talk about is this inappropriate focus on areal density.  The bottom line is, who cares about disk capacity trends?  This is not new data.  This plot basically shows the capacity of a bunch of disks by when they were released.  The bottom axis is the release date of a particular disk, like a Seagate Barracuda.  The vertical axis is the size in megabytes of that disk, on a logarithmic scale.  So a straight line on this plot means exponential growth, roughly 100% per year.  That's not new; it's what people expect.
    Just what kind of interface are they using?  This is the same set of disks, but showing the seek time in milliseconds, and that improves about 10% per year, historically.  With that 10x difference in relative improvement, it doesn't take many years for capacity to become completely irrelevant.  And I would argue that, except for home PCs, it's already irrelevant.  We're stuck with 4 gigabyte disks because we can't get enough seeks per second out of bigger disks; we can't make use of the increased storage.  And that's going to get increasingly true.  In fact it gets truer every year, at a factor-of-ten difference in growth rates.
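    To see how fast those two growth rates diverge, here is a small worked example, assuming the roughly 100% per year capacity growth and 10% per year seek improvement quoted above:

    # Capacity compounds at ~100%/year while seek rate compounds at ~10%/year,
    # so the ratio of capacity to achievable seeks explodes within a few years.
    capacity_growth = 2.00   # ~100% per year
    seek_growth     = 1.10   # ~10% per year

    for years in (1, 3, 5, 10):
        ratio = (capacity_growth / seek_growth) ** years
        print(f"after {years:2d} years the mismatch has grown by {ratio:6.1f}x")
    # After 10 years capacity has outgrown seek rate by roughly 400x.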
    This slide is very, very fundamental, so what I'd really like you to take away from it is that what I want is not capacity but two different things.  I have a short-term wish and a long-term one.  The short-term wish is small disk packs: I'd like each of these to have eight 1-inch or 2-inch disks in it, ideally hot swappable in the same case as the workstation, but I can live without that.  I want to maximize seeks and bandwidth per cubic inch.  In the infrastructure, that's the metric that actually matters.  Rack space is an issue; capacity won't be an issue, but bandwidth and seeks per second definitely are.  The RAID is actually optional.  Even though it's eight disks, we can do the RAID in software, or you can make a special disk controller that does the RAID; either option is fine.  Don't get stuck on it having to be RAID.  It does not have to be RAID.  In fact this same set of criteria is what the database industry wants too, whether they really know it or not, for the same reasons.
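    Doing the RAID in software, as mentioned above, can be as little as a striping layer over the eight small disks.  A minimal sketch, with an assumed stripe unit and no redundancy (real software RAID adds parity or mirroring and recovery):

    # Map a logical block address onto eight small disks, round-robin, so
    # consecutive blocks land on different spindles and all eight head
    # assemblies can be busy at once: more seeks and bandwidth per cubic inch.
    NUM_DISKS  = 8
    BLOCK_SIZE = 64 * 1024   # 64 KiB stripe unit (an assumption)

    def locate(logical_block):
        """Return (disk index, byte offset on that disk) for a logical block."""
        disk   = logical_block % NUM_DISKS
        offset = (logical_block // NUM_DISKS) * BLOCK_SIZE
        return disk, offset

    for lb in range(4):
        print(lb, "->", locate(lb))   # blocks 0-3 land on disks 0-3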
    The long-term wish is that I would also like to see the return of "drums".  I put that in quotes; I don't literally mean drums, but I don't really understand why I have a moving head at all.  I can understand it historically, but a moving head is just a bad idea.  It's just a bad idea!  I would replace it with what I'd call a bar head, because it would be a bar that spans the disk radially.  I have no idea whether this is new or not, but it's still worth talking about.  What I think is interesting now is that I know how to build one of these, and I didn't before.  The way I can build it now is by making it a single etched piece of silicon.  You can make very long pieces of silicon now, and in fact it doesn't have to be one piece, it can be segmented.  What it means is that you don't have any moving heads; if you want a disk head, it is etched into the CMOS.  You just hold the whole bar floating above the disk, and the disk spins under it.  In fact you can have multiple bars per disk if you want to reduce your rotational delay.  The other interesting thing is that you can start playing games with DSPs, where multiple parallel tracks are readable by multiple heads, and you can average the readings you get to get a more precise reading and reduce the error rate.
    What are the games you can play down the road?  The big one I care about is that the seek time goes away and you get more bandwidth, because you can easily read parallel tracks.  In fact they don't need to be really close together and they can still be very useful.  So what I really want is a bar head that has maybe 16 or 32 parallel track outputs, and you just multiplex the 3,000 to 10,000 tracks, whatever it is, over the 16 or 32 that get switched out to the controller.  My first reaction to this was that you are probably going to lose a lot of density, because you probably cannot play the cool aerodynamic games you get with flying heads and GMR stuff.  The answer is: so what?  It doesn't matter.  You forfeit some density in the short term, but you get it back just by waiting a couple of years.  Density is not the issue anyway for infrastructure servers.  You can also get higher lifetime and lower cost because there are no moving parts.  I don't know how to evaluate that.  I think there is room for high-quality persistent storage infrastructure, much more than people realize.  The real issue is that it's got to be seek oriented and bandwidth oriented, not capacity oriented.


Q.  How much are people willing to pay?
A.  I think if you reduce the footprint, you can argue that people would pay a lot more, because we are already buying 4 gigabyte disks instead of 9 gigabyte disks since we don't need the capacity.  We are not opposed to paying the price of the 9s; it's just that they don't do anything for us.  People paid 2x or more for RAID for a long time, and what I am suggesting offers them a lot more than RAID.  I do not know how much.

Q.  What read/write ratios do you encounter, and how does this factor affect the search engine?
A.  It depends a lot on the application.  In search engines there is probably a ratio of 100 to 1 for reads to writes.  For the cache it's actually slightly more writing than reading, because you'll write stuff that you'll never read again.  We don't know yet what the actual ratio is, but it is much more even.  In that game there are other tricks you can play.  Typically there is a master copy of any piece of data somewhere else, which means it's okay for the disk to lose data as long as it doesn't corrupt it.  That really changes the games you play, and it's probably the reason we can get away without RAID: we don't care about lost data, we just don't want corrupted data.  We can avoid that with checksums and basically run the disks full tilt, and when we lose them we have to recover, but it's not that bad.
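As a minimal sketch of that checksum idea, assuming a simple per-object on-disk format (not Inktomi's actual layout): store a checksum alongside each cached object, verify it on read, and treat a mismatch as a miss to be refetched from the master copy.

    # Detect corruption on read; lost or corrupted objects are just refetched.
    import hashlib

    def store(path, data):
        digest = hashlib.md5(data).hexdigest().encode()
        with open(path, "wb") as f:
            f.write(digest + b"\n" + data)    # checksum header, then payload

    def load(path):
        with open(path, "rb") as f:
            digest, data = f.read().split(b"\n", 1)
        if hashlib.md5(data).hexdigest().encode() != digest:
            return None    # corrupted: treat as a cache miss and refetch
        return data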

Q.  Your search engine would seem to be unique among applications, since you have all of your eggs in one basket.  What is your disaster recovery operation?
A.  Our disaster recovery is probably better than what most businesses have already.  For example, we were not affected by the San Francisco electrical problems: anyone using the Santa Clara cluster was automatically moved to the Virginia cluster, and that cluster then automatically reduced its database size to handle the essentially unexpected increase in total queries it had to serve.  That has been tested somewhat, but basically the bottom line is that our uptime is 99.999%, and I think we would transfer over quite naturally to the Virginia cluster in the event of an earthquake.  That's certainly the plan.  You have to do extensive testing to keep these things working.  Certainly at this point our customers do demand the kinds of things you're talking about.  Yahoo would be furious if our back end were down and they couldn't deliver searches to their customers, and Microsoft has contractual obligations on us in terms of uptime and reliability as well.