Episode Show Notes & Transcript
Show Highlights
- Tobi’s Twitter: https://twitter.com/superguenter
- LinkedIn URL: https://www.linkedin.com/in/tobiasknaup/
- Personal site: https://tobi.knaup.me/
Transcript
Tobi Knaup: In this case, when we say multi-cloud, it's often not actually one of the big three cloud providers that they’re thinking about.
Corey Quinn: Welcome to Screaming in the Cloud. I’m Corey Quinn. I’m joined this week by Tobi Knaup, the co-founder and CTO of D2IQ, which you probably have not heard of, and what used to be called Mesosphere, which you most assuredly have. Tobi, welcome to the show.
Tobi Knaup: Thank you for having me.
Sponsor: This episode is sponsored in part by my day job, the Duckbill Group. Do you have a horrifying AWS bill? That can mean a lot of things.
Predicting what it's going to be. Determining what it should be. Negotiating your next long term contract with AWS. Or just figuring out why it increasingly resembles a phone number, but nobody seems to quite know why that is. To learn more, visit duckbillgroup.com. Remember, you can't duck the duck bill, Bill.
And my CEO informs me that is absolutely not our slogan.
Corey Quinn: Of course. So let’s start with the, I guess, burning question that at least is on my mind, if not a bunch of other folks, Mesosphere was a company that everyone in the infrastructure space at least had a vague awareness that there was that thing over there. And last year, I think it was last year, time is speeding up, the company rebranded. What was behind that?
Tobi Knaup: Yeah. So, what’s behind that is, in hindsight, it wasn’t a very good idea to put a technology name into our company name, to be honest, because technologies change over time. And we obviously started the company, Mesosphere, in 2013 around Apache Mesos. That was the core open-source project that my co-founders and I had been using at Airbnb and Twitter. And we wanted to start a company around that to help every Enterprise out there adopt Apache Mesos. But very quickly, we actually started helping people with other technologies from the cloud-native ecosystem. We help folks automate things like Kafka, and Cassandra, and Spark, and build these data pipelines on it. And very quickly, we actually got involved in Kubernetes as well, in the first year when it was announced. And so, over time, the name Mesosphere as a company name became sort of a stumbling block for us, because we always had to explain that, yes, we are the Mesos company, but we also do all these other things. We help you build data pipelines, and we help you with Kubernetes, too. And so it kind of became this anchor, and we decided it’s maybe not a good idea to have a specific technology in our company name, and so we decided to rebrand and we wanted to pick a name that kind of expresses what we really do, what we help our customers with. And that is, we help them on Day-2, we help them be successful on Day-2 and be smart about the Day-2 operations. So, Day-2 in the sense of the DevOps concept of Day-2, the ongoing operations and maintenance of production systems. So that’s what’s behind that.
Corey Quinn: You always could have gone down the path that I did, where I started with a newsletter, Last Week in AWS, a consulting company that had no bearing on any of it, and this podcast, Screaming in the Cloud. There were three brands instead of one, which means that whenever anyone asked me, “So what do you do?” my answer was always, “Well, it depends. Can you contextualize that question for me a bit more?” It winds up effectively having to lead us down this weird path of branding things very differently. And then, of course, I started another podcast with a completely separate name on top of that, called the AWS Morning Brief, and at this point, I just sound like I’m professionally confused. Naming is hard, especially once you have a name that is no longer accurate in some ways, but it’s something that people have a definite affinity for, you have brand recognition. We had a guest on previously, from palantir.net, which predates the terrifying Palantir in the Valley by about 10 years. And it seems like their tagline has become, “We’re Palantir. No, not that one.”
Tobi Knaup: [laughing] That’s lovely. Yeah. Obviously, like you said, naming things is hard. And renaming a company is hard, too. We built up a lot of brand equity over the years. And so, what was important to us is actually that we don’t give that up. And so the name Mesosphere actually lives on. It’s now the name of our product family around Mesos. So, the name lives on, just the company has a different name.
Corey Quinn: So are you finding that—I guess it’s, obviously, from the time that you started Mesosphere back when—when was that?
Tobi Knaup: 2013. So we’re almost seven years old.
Corey Quinn: Forever ago in Internet time. There’s been some, let’s say, upheavals in the infrastructure space. Back then, I would have frankly bet the farm on Mesos. It seemed like the right answer. A lot of the big shops were doing that. And today, whenever you suggest that to people, they look at you a bit strangely and say, “Yeah, if we’re doing anything net new, it’s probably going to be on top of Kubernetes,” which I have a laundry list of complaints about. But I’m curious to get your take: how have you seen Mesos’s rise and fall through the eyes of what you do for customers?
Tobi Knaup: Right. So, I think what we’ve seen with Kubernetes is really the power of community. When I talk to folks and ask them, “Why Kubernetes?” that’s the thing that people most commonly mention. It’s the community in the broadest sense, meaning there’s a place online where I can go to learn about Kubernetes and related technologies. There’s a place I can recruit talent from. There’s people that want to have that on their resume. And obviously the community is so much bigger than any single vendor could ever be. And so, that’s where a lot of innovation happens. And innovation happens much faster in that community. So that’s really the most common reason we hear. Mesos started as an abstraction layer for large compute clusters. And, while we do a lot with Kubernetes now, and we have an entire product line around it, we also still have our Mesos product line. And it is still the platform of choice for those large scale deployments. So we have customers with hundreds of thousands of nodes in production, and they’re running Mesos, and they will be running Mesos for a while. So it’s really a best tool for the job kind of situation. If you’re a small shop, you’re getting started with cloud data, you have maybe a 10, 20, 30 node cluster (20, 30 nodes is where we see most clusters out there in the industry), Mesos may be not the right choice because it is built for scale. And so what we said is, “Hey, let’s offer our customers what they want. Let’s give them Kubernetes. Developers want that. And we still keep the Mesos platform for those large scale deployments.”
Corey Quinn: Are you seeing net new activity around Mesos in 2020?
Tobi Knaup: We do, actually. So one thing that we built, that we invested a lot of time in over the years, is helping customers automate data services. Building end-to-end data pipelines with Kafka, Spark, technologies like that. And the experience that they get around that on Mesos doesn’t exist the same way yet on top of Kubernetes. We’re working on making that happen. And obviously there’s a lot of activity around building Kubernetes operators. There’s various different approaches to building operators. We started an open-source project about a year and a half ago called Kudo that aims to make building operators very easy. It’s based on our learnings on top of Mesos. And so, the ecosystem is going to get there but it’s not quite there. And so we’re actually seeing a lot of people still start new projects around these data infrastructure projects on top of Mesos.
Corey Quinn: It’s interesting you bring up releasing open-source offerings around, I guess, anything in the infrastructure world. Lately, it seems that there’s been a bit of a pretty persistent narrative around the danger of open-source as a business model because then someone like AWS comes in and launches effectively what you do as a managed service. Is that something that’s currently on your threat radar? Is that something that you don’t see as being particularly credible? Or am I missing something entirely?
Tobi Knaup: It’s definitely on our radar. And I think, while this is a threat that everyone’s facing, there are also opportunities to build a differentiated product for maybe a different use case, for a different customer demographic. So what we see a lot, these days, is folks wanting to run any combination of hybrid or multi-cloud scenarios. So they want a public-cloud-like experience like they can get from AWS, but they want it on the infrastructure that they choose. So we see a lot of activity, we work with a lot of customers that have industrial IoT use cases. So let’s say they have a manufacturing plant, a factory where they have thousands or tens of thousands of sensors that produce data in real-time that they need to process and do things like predictive maintenance, finding outliers in the sensor data, and things like that. Those factories are often in areas where they don’t have a high-quality connection to the cloud. So it’s not feasible to send all that data in real-time to a public cloud, you have to kind of process it locally. And so essentially, what those customers need is a mini Edge Cloud. They obviously don’t have highly skilled cloud-native engineers in every one of their manufacturing plants, and some of those people have over a hundred of these plants. So what they need is really a public-cloud-like experience sort of in a box that they can deploy on the Edge. Now, they also want to run a bunch of infrastructure on the public cloud, and they also want to run a bunch of infrastructure in their existing data centers. So how do you do that? How do you operate, pick Kafka as an example or Spark, in a consistent way across all of these platforms? That’s one thing we’re focusing on and where, yes, you can go to AWS and you can get a managed Kafka, but you can’t get it in a manufacturing plant or in an air-gapped case, right, where you don’t have any Internet connection.
So there’s still a lot of these use cases out there, and that’s how we differentiate, or it’s one way we differentiate.
Corey Quinn: I’ve always said that one of the most effective attack ads you could come up with about running Kubernetes would be to send someone who’s considering it to a three-day Kubernetes workshop, and by the time they come back, they will understand that here be dragons. And that has sort of continued to be the case, as far as talking to anyone who’s doing anything at significant scale in the Kubernetes ecosystem, is just the sheer level of abstraction built upon abstraction that fundamentally turns into something that is incredibly difficult and opaque to understand what’s going on underneath the hood. So, it’s not the Day-1 experience it’s the Day-2 experience, as you alluded to earlier in the recording, that once you have something running and then you see a degradation or an intermittent failure, it becomes super challenging to figure out what’s causing that issue and why.
Tobi Knaup: That’s absolutely right. And the typical journey that we see a lot of people go through is something like, they decide to do cloud-native, they decide to do containers, or their boss tells them to. They go on the Internet, they go on Stack Overflow or wherever and they find Kubernetes. They try it out, they download it onto their laptop and have a great experience. The first touch experience with Kubernetes is really great. You can get a container up and running quickly or get your guestbook example up and running quickly. And so, too many people assume that putting it in production is going to be a similar experience. And the first common mistake we see is that people assume that Kubernetes is all they need. That Kubernetes gives them all the tools that they need to put a container stack into production at an Enterprise. And that’s just not the case. You need a bunch of other tools from the cloud-native ecosystem around Kubernetes. You need a monitoring stack, you need logging, you need networking, load balancing, all of those things. And because people kind of take this fairly agile approach where they try it out, and then when they hit a wall, they figure it out. Let’s say I start a container, a stateless container, and that’s a great experience. Now, I need to add state to it. How do I do that? How do I get volumes? They kind of take it step by step, and that’s where we see a lot of cloud-native projects failing is because, like you said, at some point, they face the complexity and they’re like, “Oh, wow, there’s actually a lot of things that I hadn’t thought about.” What we like to do there, is make sure that people are educated about that. So we say, “Hey, when you need to go to production, these are all the things you should pay attention to. Make sure you have proper monitoring, make sure you have proper logging, you need a networking layer, and so on.” And that’s part of what we teach in our Kubernetes trainings, too. 
So, we do these free trainings, in the field, in various different cities in the world, to just highlight these problems because, like you said, a lot of people just aren’t aware of those.
Sponsor: Here at the Duckbill Group, one of the things we do with, you know, my day job, is we help negotiate AWS contracts. We just recently crossed five billion dollars of contract value negotiated. It solves for fun problems such as how do you know that your contract that you have with AWS is the best deal you can get?
How do you know you're not leaving money on the table? How do you know that you're not doing what I do on this podcast and on Twitter constantly and sticking your foot in your mouth? To learn more, come chat at duckbillgroup.com. Optionally, I will also do podcast voice when we talk about it. Again, that's duckbillgroup.com.
Corey Quinn: One of the, I guess, arguments in favor of Kubernetes historically has been the hybrid story, which I’m sympathetic to, and the multi-cloud story, to which I’m slightly less sympathetic, in that on paper, it looks fantastic. In practice, it means that you’re not just dealing with one cloud provider’s deficiencies, you’re dealing with all of them. And that’s been a recurring subject of some debate on this show for a while now. Where do you stand on the idea of multi-cloud as a best practice?
Tobi Knaup: Yeah, it’s one of my favorite topics. So I think multi-cloud is where everything is going to move. To me, multi-cloud also includes hybrid, because every large Enterprise has massive workloads that they want to keep on-prem for various reasons, whether it’s they want to protect their data, whatever, and at a certain scale, actually running your own gear becomes more cost-effective, too. So I think ultimately, every Enterprise is going to be there. Now their reasons for why they want to do multi-cloud vary, and I think a lot of folks, when they hear multi-cloud, the first thing they go to is, “Oh, I’m gonna have this abstraction layer, Kubernetes, or whatever it may be. And I’m going to dynamically move my workloads around, and I’m going to look at where the costs are optimal, or, I’m going to optimize for other things.”
That’s typically not the main reason why people do that, although we are working with some customers that are fairly sophisticated, that are literally doing that. They’re watching the spot instance price market on all the different cloud providers and then, hour by hour, deciding where things should go. But that’s only a handful and they’re ahead of the pack. They’re fairly sophisticated customers. For most folks, the reasons are something different. We work with a lot of companies that work globally, that work in a lot of different countries and jurisdictions. And so they need to take a look at data privacy laws and regulations around that. So with the infrastructure they stand up in China, the data that they process there for their Chinese customers often cannot leave the country, so they need to run on a Chinese cloud provider. They may be operating in Europe, so they need to run on European infrastructure and, within Europe, in each country.
And so, in this case, when we say multi-cloud, it's often not actually one of the big three cloud providers that they’re thinking about. This may be some fairly small infrastructure-as-a-service provider in one specific country that they need to run on top of, for these data privacy reasons. And so, in this scenario, multi-cloud makes a lot of sense, because you want to architect your stack once. You want to build it on top of an abstraction layer like Kubernetes, and then be able to stand that up in multiple countries on different IaaS providers, for those reasons. That’s a common one that we see. Obviously, that’s with companies that act globally, that are working in a lot of different jurisdictions. Another reason we see for multi-cloud, often, is that they want to handpick certain cloud provider services that they like. So they may want to go to Provider A for their machine-learning stack. And they want to go to Provider B because they have the better managed databases. So it’s more of those reasons, I think, and not so much what most people go to immediately, which is dynamically moving the workloads around.
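To make the two placement drivers Tobi describes concrete, here is a minimal sketch in Python. It assumes a placement decision that first restricts candidates to providers inside the required jurisdiction and only then compares price. Every name and number in it (`Provider`, `pick_provider`, the provider names, the hourly prices) is hypothetical and made up for illustration; nothing here reflects a D2IQ product or any real cloud provider's API or pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str            # hypothetical provider label
    country: str         # jurisdiction in which the provider keeps the data
    hourly_price: float  # made-up current spot price per node-hour


def pick_provider(data_country, providers):
    """Return the cheapest provider that keeps data inside data_country."""
    # Data residency is a hard constraint, so it filters before price is considered.
    eligible = [p for p in providers if p.country == data_country]
    if not eligible:
        raise LookupError(f"no provider available in {data_country}")
    # Among compliant providers, price is the tie-breaker.
    return min(eligible, key=lambda p: p.hourly_price)


# Illustrative catalogue: two large providers plus small in-country IaaS vendors.
providers = [
    Provider("big-cloud-a", "US", 0.031),
    Provider("big-cloud-b", "US", 0.027),
    Provider("local-iaas-de", "DE", 0.045),
    Provider("local-iaas-cn", "CN", 0.040),
]
```

Under these assumptions, `pick_provider("DE", providers)` can only ever return the in-country vendor, whatever the larger providers charge, which is the point made above: for most customers the jurisdiction decides the placement, and price only matters among the options that remain.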
Corey Quinn: And the dynamic movement of those workloads seems to be what people put up as “Oh, it would be great to be able to magically deploy our entire application anywhere we need to at any point in time.” Except data gravity always makes it a bit of a challenge.
Tobi Knaup: That’s right.
Corey Quinn: The joy of trying to get even that baseline fundamental, consistent experience working between two providers, even when one of them is on-prem and you control virtually every aspect of it, is non-trivial. An argument I’ve enjoyed for a while now has been, “Great, take your primary cloud provider, whichever one it happens to be, I don’t care, you probably care, I don’t care, and try and go multi-region. Be able to span to multiple regions of the same provider and see what breaks.” It’s a good baseline story for the things you’re going to have to start thinking about, and then some, when you start going multi-cloud. Now there are workloads that justify that level of work, and experience, and stress. But it’s certainly not, I’d say, worth an awful lot of companies’ time and effort to do it.
Tobi Knaup: Yeah, you’re absolutely right. That experience is very similar to what you’re going to have to do in multi-cloud. And there’s one more use case I forgot to mention earlier, and that is, people in certain industries that are regulated, they actually have to go with multiple vendors. They have to, for regulatory reasons, pick two or more cloud providers, so they’re kind of forced to do that. One of the main things you’re going to have to build on your own, or it’s actually something we help our customers with, is replicating your data. Like you said, data has gravity. And so the people that we see that are successful at doing multi-cloud or multi-region, they do things like using Kafka to replicate their data, or using Cassandra to replicate the data asynchronously between different infrastructures. So that’s not something the cloud provider offers, but we help folks manage Cassandra, manage Kafka, so it makes that a little easier.
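The asynchronous replication mentioned here can be pictured with a small Python sketch. It is a plain in-memory illustration of the general idea, assuming an append-only log and a replicator that remembers how far it has copied. It is not how Kafka or Cassandra implement replication, and the `Log` and `Replicator` classes are hypothetical names invented for this example.

```python
class Log:
    """Append-only list of records, standing in for a replicated topic or table."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)


class Replicator:
    """Copies records from a source log to a target log, tracking its own offset."""

    def __init__(self, source, target):
        self.source = source
        self.target = target
        self.offset = 0  # index of the next source record to copy

    def run_once(self, batch_size=100):
        """Copy up to batch_size pending records; return how many were copied."""
        batch = self.source.records[self.offset:self.offset + batch_size]
        for record in batch:
            self.target.append(record)
        self.offset += len(batch)
        return len(batch)


# Writes land on the primary immediately; the replica catches up later.
primary, replica = Log(), Log()
replicator = Replicator(primary, replica)
primary.append("order-1")
primary.append("order-2")
lag_before = len(primary.records) - len(replica.records)
copied = replicator.run_once()
```

The design choice the sketch highlights is that writers never wait on the remote site: the primary accepts records on its own, and the second infrastructure lags behind until the next replication pass. That lag is the price of not blocking on a slow cross-provider link.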
Corey Quinn: So tell me a little bit about where you came from. Most people don’t decide to spring fully formed from the forehead of some ancient god in the form of a co-founder of a company in the infrastructure space everyone has heard of. Where were you before Mesosphere, if there can be said to be a time before the Mesospheric era.
Tobi Knaup: There is definitely a time before the Mesospheric era. Yeah, my exposure to the Internet and infrastructure and HA basically started as a teenager. So my co-founder, Flo Leibert and I, we grew up in Germany in the same town. And when we were teenagers, we started building websites. And we started building some adventure games and things like that. So we knew how to build websites. And we grew up in a fairly small town, 50,000 people, and this was the late 90s. And even in that neck of the woods, companies started to hear about the Internet. And so they’re basically wondering what this thing is. Someone told them, “Hey, you need to be on the Internet, you need to have a website, and you need to have an email address as a business,” but they had no idea how to think about this and how to approach it. And at the time it was really hard to get a website because you basically had to work with three different companies. You had to find someone to design it for you, you had to find someone to program it, and then someone to host it. Those were typically three different companies. And so what Flo and I did is, we said, “Hey, we know how to build websites, and we know how to run Linux servers.” We just dabbled with that on the side. And so we actually convinced my mom to register a company, and so we could program websites for people and host them. So that was our first experience with infrastructure, and even back then we did HA things. We bought two servers, not one. One would have been enough to host all of our clients, but we wanted it to be highly available. So got some experience with that, running Linux servers, running production infrastructure.
And then when you grow up in Germany, or anywhere outside of Silicon Valley, and you’re in tech, then you hear the stories. You hear about Silicon Valley, and I’ve always imagined it to be the super-futuristic place, and I kind of wanted to check it out at some point. And so in college, I found an internship at this startup down in Redwood City, and joined them for three months, and helped them build their websites, PHP and Ruby on Rails, at the time. So that was my first exposure to Silicon Valley. And I just loved the energy, the people that are full of ideas, and the speed at which things get built. And so, after I finished college, I joined that same company where I did that internship, worked full time and built the infrastructure there, built the website.
And then my next job was with Airbnb. I joined them pretty early on as engineer four, and so wore a lot of different hats there. And one of the things I did there was also design and build the infrastructure for their massive growth. Hired the engineering team and did some machine-learning work there too. That’s my other passion besides infrastructure. And at Airbnb, that’s when we started using Apache Mesos. We built data infrastructure there based on Apache Mesos, which my third founder, Ben, was working on in Berkeley at the time. I should have mentioned Ben and Flo and I, the three founders, we’ve known each other for a long time. Flo and I grew up together, and Flo did a student exchange, and stayed with Ben’s family in high school too. So, we all love computers, we talked about Mesos, and that’s how, you know, Twitter and Airbnb ended up using Mesos. And to us as the people running the infrastructure there, and the people with the pagers that would go off at three in the morning, sometimes, Mesos really felt like magic. It was a 10 times better solution, because we could automate a lot more things and the pager wouldn’t go off as much in the middle of the night.
And so that’s when we decided, hey, this is a great opportunity to start a company, because the problems we were solving there with automation, they were not unique to Twitter, or Airbnb, or any Silicon Valley tech company. They were infrastructure challenges that every company would face at some point. And this was around a time when this idea of “software is eating the world,” that Marc Andreessen wrote about, I think in 2009, that was still fairly new. But we saw that every company, whether it’s a bank, or an insurance company, or a car manufacturer, will have to run large scale cloud infrastructure at some point. And in fact, in order to stay competitive in the future, they’re going to have to use some of the same technologies that the best software companies in the world are using. And so that’s where we saw the opportunities. We saw that we had this tool that automated a lot more things, and made infrastructure more robust, and scalable, and cost-effective. The only challenge at the time was it was an open-source project. There were only a few people in the world that knew how to use it, and so we decided, let’s form a company around it. Let’s build an enterprise product around the open-source core. And that's how Mesosphere was formed in 2013.
Corey Quinn: I have vague recollections, back in the dawn of my version of the era of computing, we would configure a core switch at the office I worked at, then we rented a van and a few of us on the tech ops team drove it down to the data center about 30 miles away and did the installation. And we learned a few things. One we are super crappy movers. Two it is vaguely disturbing the company decided not to spring for professional insured bonded movers for this. And thirdly, there’s something very surreal about loading a piece of computer equipment that fits in a rack, two or three of you can lift it up into the back of a van, and that van costs less than the switch does. It was still such a strange and surreal experience. You don’t get to experience that in the world of cloud in quite the same way. But it’s more than made up for it with the other hilarious and sarcastically disturbing things that it has exposed for us.
Tobi Knaup: Yeah, absolutely. I think interacting with real hardware and a real data center, it’s an experience that really, really shaped how I think about stuff. And one big way is that—always expecting failure, because I could tell so many partially funny, partially painful stories, of things that went wrong in the data center in the physical world. That, I think, what that taught me is to just expect failure, always. Everything can fail at any point in time and then when you build software, even if you build it on top of a bunch of layers of cloud computing, you have to expect that. And even in the cloud, machines will fail, and you get that email from AWS that says, “This instance is now broken.” And I think if you don’t have that experience, racking and stacking gear, and seeing a bunch of physical failures, you might not think about it the same way. So we don’t want to miss that experience. But yeah, like you said, there’s all kinds of other funny behavior that happens in the cloud. It’s just abstracted away by a bunch of layers, but I’ve lived through some funny cloud outages there, too, where packets went in a circle, and then, that caused EBS to go crazy, and all kinds of fun stuff.
Corey Quinn: Oh, the cascading dependencies are always the story, the stuff of legend after the fact. And it makes sense, in hindsight. Every failure does, to some extent, but when you’re in the middle of it, you’re wondering if you’ve lost your mind, if the old rules no longer hold, this behavior is completely inexplicable, what happened? And I guess figuring that out, and living through that a few times is really, I think, the best way to learn to approach those things in a more methodical way. But, oof, some of those early failures were not fun. Seeing aspects of that manifest in cloud environments is absolutely something that is reminding me that old things are new again.
Tobi Knaup: That’s true. And one thing we shouldn’t forget, too, is that by using these cloud services, you give up a lot of control, too, because when things do fail, there’s only so many things you can do. APIs may all of a sudden be read-only, and you cannot restore your database from a backup all of a sudden, or you cannot promote your read-only database to a master instance all of a sudden. So there’s definitely that aspect, too, which that was much easier when we were running our old gear is, you’re in full control. You control the whole thing, and if you want to do something crazy to try and fix a problem, you can do that. Can’t do that on the cloud.
Corey Quinn: So last question, I suppose, before we wrap this up. I made a prediction about a year ago, I said five years, now, so we have four years to go, where I argued that in four years now, nobody is going to care about Kubernetes. And my argument was not that it’s going to dry up, blow away and be replaced with something else, but rather that it or something like it, is going to slip below the surface of awareness. Just like we don’t have to worry about what kernel version we’re running on an operating system anymore, we won’t care what’s handling orchestration in our various data centers and cloud providers. Do you think that that is an accurate prediction, or I’m going to be eating some crow?
Tobi Knaup: No, I think that’s absolutely accurate. It is a substrate, it’s becoming a substrate. And unless you’re directly involved in adding features to Kubernetes, or you’re using it in some other way that requires you to make changes to it directly, you’re probably going to use some other higher-level API. You’re probably going to be interfacing with a CI/CD system, as a developer. And maybe you know that it’s Kubernetes under the hood, just like you know right now that it’s Linux under the hood, but you’re not really interacting with it directly that much. Or, if you’re a data scientist, so you work in the data infrastructure world, you’re much more likely to use a tool like Kudo to deploy that service, versus trying to piece together your Kubernetes primitives in order to stand up that service. So I absolutely agree with that. I think that’s a trend. It’s sort of this abstraction layer wave that’s always behind us, or the rising tide of abstractions. And so, I think the same thing will be true for Kubernetes. I think most folks out there, individual developers or end-users of a platform, they’ll know that it’s there, but they’re going to be talking to other APIs at different levels. And we’re seeing a lot of activity around CI/CD right now. I think things like Argo and Tekton are super exciting. There’s a lot of activity around that, and people wanting to use GitOps approaches to deploy their software. So I think those are some of the signs that we’re seeing of the abstraction layer rising. And then, of course, serverless, too.
Corey Quinn: So if people want to hear more about your thoughts on these and other topics, where can they find you?
Tobi Knaup: So, they can find me on the usual places. I’m pretty active on Twitter. I’m on LinkedIn. I give talks at conferences sometimes. Those are some of the first places to find me.
Corey Quinn: Excellent, thank you so much for taking the time to speak with me today.
Tobi Knaup: Absolutely. Thank you so much for the opportunity.
Corey Quinn: Tobi Knaup, CTO and co-founder of D2IQ, formerly Mesosphere. I’m Corey Quinn, and this is Screaming in the Cloud. If you’ve enjoyed this podcast, please leave it a great rating on Apple Podcasts. If you hated this podcast, please leave it an even better rating on Apple Podcasts.