Episode 34: Slack and the Safety Dance of Chaos Engineering
In the early days, angry nerd corners of the Internet dismissed Slack and some of its predecessors as, “Oh, it’s just IRC. Now, you pay someone for it.” Many fell into the trap of wondering what value such systems offered. The big differentiator? Slack is built as a collaborative business tool.
Today, we’re talking to Holly Allen, who helped make government software better while serving as the director of engineering at 18F. Now, she’s a senior engineering manager at Slack, a collaborative chat program where you can do most of your work through a rich platform of integrations. Holly enjoys taking a weird set of skills that make a computer do things and convincing people who know how to make computers do things to do things.
Some of the highlights of the show include:
Safety engineering brings chaos and resilience engineering, incident management, and post-mortem processes together for resiliency and reliability
Slack strives to move really fast while being in complete control
Slack is primarily on AWS, but is working on a multi-Cloud strategy because if AWS is down, Slack still needs to work
Slack has a close relationship with AWS and is a collaborative company; it has immediate access to AWS staff anytime there’s a problem
Slack uses Terraform and Chef and is working to determine whether moving its production workflows to Kubernetes would be worthwhile
Disasterpiece Theater: Take a realistic failure scenario, predict what will happen, then run it for real; the goal is to teach Slack employees without causing production issues
Slack hires collaborative, empathetic people to create a collaborative environment where everyone works together toward a goal
Slack was firmly in a centralized operations model, but is transforming toward development teams to increase responsibility and service ownership
Slack doesn’t encourage remote work because it’s not in a position to put in that investment; day-to-day work happens in hallways and between desks
Slack sees itself as an enterprise software company; an enterprise software company must have enterprise software reliability, stability, and processes
Slack has thousands of servers, so events and disruptions happen more often; system needs to respond, react, and repair itself without human intervention
Links:
Holly Allen on Twitter
18F
Slack
Freenode IRC
HipChat
AWS
Kubernetes
Terraform
Chef
QCon
Datadog
Episode 33: The Worst Manager I Ever Had Spoke Only In Metaphor
If you’ve been doing DevOps for the past 10-20 years, things have really changed in the industry. There are no longer large pools of help desk support. People aren’t climbing around the data center, learning how to punch down cables and rack servers to gradually work their way up. Now, entry-level DevOps jobs require about five years of experience. So, that’s where internships play a major role. But how can an internship program be set up for success? Where is the next generation of SREs or DevOps professionals coming from? Where do we find them?
Today, we’re talking to Fatema Boxwala, who has been an intern at Rackspace, Yelp, and Facebook. She’s a computer science student at the University of Waterloo in Canada, where she’s involved with the Women in Computer Science Committee and Computer Science Club. Occasionally, she teaches people about Python, Git, and systems administration.
Some of the highlights of the show include:
Mentors made Fatema’s intern experience positive for her; made site reliability and operations something she wanted to do
Academic paths don’t tend to focus on such fields as SRE, and interns tend to come exclusively from specific schools
Fatema’s school requires five internships to graduate and receive a degree; upper-year students are already very qualified professional software engineers
Companies don’t have time to train and want to find someone with an exact skill set; instead of hiring someone, they spend months with an unfilled position
Continuity Problem: You can’t train someone to be a systems administrator if you aren’t willing to give them certain privileges because of their inexperience
Use a low-stakes environment to train, where mistakes can be made; most systems aren’t on a critical path - don’t keep people away from contributing
If you have never broken production, either you’re lying or you’ve been in an environment that didn’t trust you to touch things that mattered
Internship should mimic the kind of work that everyone else is doing; give them responsibilities where their work has an impact
Bad mentors lead to bad internships; if the person in charge of your success doesn’t have the necessary skills, things fail; a mentor needs to be a good communicator and set expectations
As the intern, ask about possible outcomes of internship early on; mentors should be clear about expectations, feedback, and offers
Links:
Fatema Boxwala
Fatema Boxwala on Twitter
Jackie Luo on Twitter
Julia Evans Zines on Twitter
SREcon MEA
Digital Ocean
Episode 32: Lambda School: A New Approach to “Hire Ed”
Are you interested in computer science? How would you like to go to school for free and learn what you need to in just a few months? Then, check out Lambda School!
Today, we’re talking to Ben Nelson, co-founder and CTO of Lambda School, which is a 30-week online immersive computer science academy. Lambda School has more than 500 students and takes a share of future earnings instead of traditional debt. So, it's free until students get a job.
Some of the highlights of the show include:
Bootcamps were created to address engineering shortages and quickly move people into technical careers
Lambda is not explicitly a bootcamp; its 30-week program gives students more instructions and more time spent on developing a portfolio
Lambda also makes time to cover computer science fundamentals; teaches C, Python, Django, and relational databases - not just JavaScript
Employers appreciate the school’s in-depth and advanced approach, which results in repeat hires
Lambda avoids the typical reputation of traditional for-profit educational institutions by being mission-driven and knowing its investors want ROI
Lambda aligns its incentives with those of students; an income share agreement means the school doesn’t make money, unless students are successful
Lambda’s 7-month program is less of a risk for someone later in their career; some don't have capital to support their family while going to school for 4 years
Lambda incentivizes healthy financial habits; after two years of repayment, students can put that money into retirement, savings, and investments
5 Tracks Now Offered by Lambda: iOS development, UX, Full Stack Web development, data science, and Android development
Mastery Based Progression System: When you're learning something sequentially, where knowledge builds, you don't move on until you’ve mastered it
Lambda’s acceptance rate is around 5% and based on people who can keep up
Lambda works with different partner companies to help them find qualified graduates - people they want to hire
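The income share agreement described above boils down to simple arithmetic: students owe a share of income only once they earn above a floor. Below is a minimal Go sketch of how such a payment might be computed; the 17% share, $50,000 floor, and sample income are hypothetical placeholders, not Lambda School’s actual terms.

```go
package main

import "fmt"

// isaPayment returns the monthly payment under a hypothetical
// income share agreement: a fixed share of monthly income, owed
// only once annual income exceeds a floor. The rates used by a
// real school would also include a lifetime repayment cap.
func isaPayment(annualIncome, share, incomeFloor float64) float64 {
	if annualIncome < incomeFloor {
		return 0 // below the floor, nothing is owed
	}
	return annualIncome / 12 * share
}

func main() {
	// Hypothetical graduate earning $60,000 with a 17% share
	// and a $50,000 income floor.
	fmt.Printf("$%.2f/month\n", isaPayment(60000, 0.17, 50000))
	// prints "$850.00/month"
}
```

A graduate earning below the floor pays nothing, which is the mechanism behind the bullet above: the school doesn’t make money unless its students do.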
Links:
Lambda School
Ben Nelson on Twitter
Y Combinator
Wealthfront
Datadog
Episode 31: Hey Sam, wake up. It’s 3am, and time to solve a murder mystery!
Have you ever been on-call duty as an IT person or otherwise? Woken up at 3 a.m. to solve a problem? Did you have to go through log files or look at a dashboard to figure out what was going on? Did you think there has got to be a better way to troubleshoot and solve problems?
Today, we’re talking to Sam Bashton, who previously ran a premier consulting partner with Amazon Web Services (AWS). Recently, he started runbook.cloud, which is a tool built on top of serverless technology that helps people find and troubleshoot problems within their AWS environment.
Some of the highlights of the show include:
Runbook.cloud looks at metrics to generate machine learning (ML) intelligence to pinpoint issues and present users with a pre-written set of solutions
Runbook.cloud looks at all potential problems that can be detected in context with how the infrastructure is being used without being annoying and useless
ML is used to do trend analysis and understand how a specific customer is using a service for a specific auto scaling group or Lambda functions
Runbook.cloud takes all aggregate data to influence alerts; if there’s a problem in a specific region with a specific service, the tool is careful to caveat it
Various monitoring solutions are on the market; runbook.cloud is designed for a mass market environment; it takes metrics that AWS provides for free and makes it so you don’t need to worry about them
Will runbook.cloud compete with or sell out to AWS? Amazon wants to build the underlying infrastructure and have other people use its APIs to build interfaces for users
Runbook.cloud is sold through AWS Marketplace; it’s a subscription service where you pay by the hour and the charges are added to your AWS bill
Amazon vs. Other Cloud Providers: Work is involved to detect problems that address multiple Clouds; it doesn’t make sense to branch out to other Clouds
Runbook.cloud was built on top of serverless technology for business financial reasons; way to align outlay and costs because you pay for exactly what you use
Analysis paralysis is real; it comes down to getting the emotional toll of making decisions down to as few decision points as possible
Save money on Lambda; instead of using several Lambda functions concurrently, put everything into a single function using Go
AWS responds to customers to discover how they use its services; it comes down to what customers need
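The cost-saving trick mentioned above, collapsing several concurrent Lambda functions into a single Go binary, usually means dispatching on an event type inside one handler. Here is a rough, self-contained sketch of that pattern; the `Event` shape and handler branches are hypothetical, and a real deployment would hand `handle` to `lambda.Start` from the aws-lambda-go library rather than call it from `main`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Event is a hypothetical envelope for every payload the single
// function receives; the Type field routes to the right logic.
type Event struct {
	Type string          `json:"type"`
	Body json.RawMessage `json:"body"`
}

// handle replaces what would otherwise be several separate Lambda
// functions with one dispatcher inside a single binary.
func handle(raw []byte) (string, error) {
	var ev Event
	if err := json.Unmarshal(raw, &ev); err != nil {
		return "", err
	}
	switch ev.Type {
	case "ingest":
		return "ingested", nil // formerly its own function
	case "report":
		return "reported", nil // formerly its own function
	default:
		return "", fmt.Errorf("unknown event type %q", ev.Type)
	}
}

func main() {
	out, err := handle([]byte(`{"type":"ingest"}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // prints "ingested"
}
```

The trade-off is coarser observability and permissions: one function’s metrics and IAM role now cover every code path, which is part of why this is a cost optimization rather than a default design.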
Links:
Sam Bashton on Twitter
runbook.cloud
How We Massively Reduced Our AWS Lambda Bill with Go
AWS
AWS Lambda
Microsoft Clippy
Honeycomb
AWS X-Ray
Kubernetes
Simon Wardley
Go
Secrets Manager
DynamoDB
EFS
Digital Ocean
Episode 30: How to Compete with Amazon
Trying to figure out if Amazon Web Services (AWS) is right for you? Use the “quadrant of doom” to determine your answer. When designing a Cloud architecture, there are factors to consider. Any system you design exists for one reason: to support a business. Think about services and their features to make sure they’re right for your implementation.
Today, we’re talking to Ernesto Marquez, owner and project director at Concurrency Labs. He helps startups launch and grow their applications on AWS. Ernesto especially enjoys building serverless architectures, automating everything, and helping customers cut their AWS costs.
Some of the highlights of the show include:
Amazon’s level of discipline, process, and willingness to recognize issues and fix them changed the way Ernesto sees how a system should be operated
Specialize in a specific service within AWS, such as S3 or EC2, because there are principles that need to be applied when designing an architecture
Sales and Delivery Cycle: Ernesto has a conversation with a client to discuss their different needs
Vendor Lock-in: Customers concerned about moving application to Cloud provider and how difficult it will be to move code and design variables elsewhere
For every service you include in your architecture, evaluate the service within the context of a particular business case
Identify failure scenarios, what can go wrong, and if something goes wrong, how it’s going to be remediated
CloudWatch detects events as they happen, and you can trigger responses to those events
Partnering with Amazon: Companies are pushing a multi-Cloud narrative; you gain visibility and credibility, but it’s not essential to be successful
Can you compete against Amazon? Depends on which area you choose
Expand product selection to grow, focus on user experience, and improve performance to compete against Amazon
MiserBot: Don’t freak out about your bill because Ernesto created a Slack chatbot to monitor your AWS costs
Links:
Concurrency Labs
Ernesto Marquez on Twitter
How to Know if an AWS Service is Right for You
MiserBot
AWS
RDS
Lambda
Digital Ocean
Episode 29: Future of Serverless: A Toy that will Evolve and Offer Flexibility
Are you a blogger? Engineer? Web guru? What do you do? If you ask Yan Cui that question, be prepared for several different answers.
Today, we’re talking to Yan, who is a principal engineer at DAZN. Also, he writes blog posts and is a course developer. His insightful, engaging, and understandable content resonates with various audiences. And, he’s an AWS serverless hero!
Some of the highlights of the show include:
Some people get tripped up because they don’t bring microservice practices they learned into the new world of serverless; face many challenges
Educate others and share your knowledge; Yan does, as an AWS hero
Chaos Engineering Meeting Serverless: Figuring out what types of failures to practice for depends on what services you are using
Environment predicated on specific behaviors may mean enumerating bad things that could happen, instead of building a resilient system that works as planned
API Gateway: Confusing for users because it can do so many different things; what is the right thing to do, given a particular context, is not always clear
Now, serverless feels like a toy, but it is good enough to run production workloads; the future of serverless is to continue evolving and offering more flexibility
Serverless is used to build applications; DevOps/IoT teams and enterprises are adopting serverless because it makes solutions more cost effective
Links:
Yan Cui on Twitter
DAZN
Production-Ready Serverless
Theburningmonk.com
Applying Principles of Chaos Engineering to Serverless
AWS Heroes
re:Invent
Lambda
Amazon S3 Service Disruption
API Gateway
Ben Kehoe
Digital Ocean
Episode 28: Serverless as a Consulting Cash Register (now accepting Bitcoin!)
Is your company thinking about adopting serverless and running with it? Is there a profitable opportunity hidden in it? Ready to go on that journey?
Today, we’re talking to Rowan Udell, who works for Versent, an Amazon Web Services (AWS) consulting partner in Australia. Versent focuses on specific practices, including helping customers with rapid migrations to the Cloud and going serverless.
Some of the highlights of the show include:
Australia is experiencing an increase in developers using serverless tool services and serverless being used for operational purposes
Serverless seems to be either a brilliant fit or not quite ready for prime time
Misconceptions include needing to keep functions warm by setting up scheduled invocations
Simon Wardley talked about how the flow of capital can be traced through an organization that has converted to serverless
Concept of paying thousands of dollars up front for a server is going away
Spend whatever you want, but be able to explain where the money is going (dev vs. prod); companies will re-evaluate how things get done
Serverless is seen as either an evolution or a revolution; it’s transformative to a point
We’re winding up with a large number of shops that, when something breaks, don’t have the experience to fix it; practical experience is gained through sharing
Seek developer feedback and perform testing, but know where and when to stop
With serverless, you have little control of the environment; focus on automated parts you do control
Serverless Movement: People have opinions and want you to know them
Understand continuum of options for running your application in the Cloud; learn pros and cons; and pick the right tool
Reconciliation between serverless and containers will need to play out; changes will come at some point
Blockchain + serverless + machine learning + Kubernetes + service mesh = raise entire seed round
Links:
Rowan Udell’s Blog
Rowan Udell on Twitter
Versent on Twitter
Lambda
Simon Wardley
Open Guide to AWS Slack Channel
Kubernetes
Aurora
Digital Ocean
Episode 27: What it Took for Google to Make Changes: Outages and Mean Tweets
Google Cloud Platform (GCP) turned off a customer that it thought was doing something out of bounds. This led to Internet outrage, and GCP tried to explain itself and prevent the problem in the future.
Today, we’re talking to Daniel Compton, an independent software consultant who focuses on Clojure and large-scale systems. He’s currently building Deps, a private Maven repository service. As a third-party observer, we pick Daniel’s brain about the GCP issue, especially because he wrote a post called, Google Cloud Platform - The Good, Bad, and Ugly (It’s Mostly Good).
Some of the highlights of the show include:
Recommendations: Use enterprise billing - costs thousands of dollars; add phone number and extra credit card to Google account; get support contract
Google describing what happened and how it plans to prevent it in the future seemed reasonable; but why did it take this for Google to make changes?
GCP has inherited cultural issues that don’t work in the enterprise market; GCP is painfully learning that they need to change some things
Google tends to focus on writing services aimed purely at developers; it struggles to put itself in the shoes of corporate-enterprise IT shops
GCP has a few key design decisions that set it apart from AWS; focuses on global resources rather than regional resources
When picking a provider, is there a clear winner? AWS or GCP? Consider company’s values, internal capabilities, resources needed, and workload
GCP’s tendency to end services that people are still using, versus AWS almost never ending a service, tends to push people in one direction
GCP has built a smaller set of services that are easy to get started with, while AWS has an overwhelming number of services
Different Philosophies: Not every developer writes software as if they work at Google; AWS meets customers where they are, fixes issues, and drops prices
GCP understands where it needs to catch up and continues to iterate and release features
Links:
Daniel Compton
Daniel Compton on Twitter
Google Cloud Platform - The Good, Bad, and Ugly (It’s Mostly Good)
Deps
The REPL
Postmortem for GCP Load Balancer Outage
AWS Athena
Digital Ocean
Episode 26: I’m not a data scientist, but I work for an AI/ML startup building on Serverless Containers
Do you deal with a lot of data? Do you need to analyze and interpret data? Veritone’s platform is designed to ingest audio, video, and other data through batch processes to process the media and attach output, such as transcripts or facial recognition data.
Today, we’re talking to Christopher Stobie, a DevOps professional with more than seven years of experience building and managing applications. Currently, he is the director of site reliability engineering at Veritone in Costa Mesa, Calif. Veritone positions itself as a provider of artificial intelligence (AI) tools designed to help other companies analyze and organize unstructured data. Previously, Christopher was a technical account manager (TAM) at Amazon Web Services (AWS); lead DevOps engineer at Clear Capital; lead DevOps engineer at ESI; Cloud consultant at Credera; and Patriot/THAAD Missile Fire Control in the U.S. Army. Besides staying busy with DevOps and missiles, he enjoys playing racquetball in short shorts and drinking good (not great) wine.
Some of the highlights of the show include:
Various problems can be solved with AI; companies are spending time and money on AI
Tasks that are too complex to handle with simple hand-written software can now be automated with AI
Machine learning (ML) models are applicable for many purposes; real people with real problems and who are not academics can use ML
Fargate is instant-on Docker containers as a service; handles infrastructure scaling, but involves management expense
Instant-on works with numerous containers, but there will probably be a time when it no longer delivers reasonable fleet performance on demand
Decision to use Kafka was based on workload, stream-based ingestion
Veritone writes code that tries to avoid provider lock-in; it wants to make an integration as decoupled as possible
People spend too much time and energy being agnostic to their technology and giving up benefits
If you dream about seeing your name up in lights, Christopher describes the process of writing a post for AWS
Pain Points: Newness of Fargate and unfamiliarity with it; limit issues; unable to handle large containers
Links:
Veritone
Christopher Stobie on LinkedIn
Building Real Time AI with AWS Fargate
SageMaker
Fargate
Docker
Kafka
Digital Ocean
Episode 25: Kubernetes is Named After the Greek God of Spending Money on Cloud Services
Google builds platforms for developers and strives to make them happy. There's a team at Google that wakes up every day to make sure developers have great outcomes with its services and products. The team listens to the developers and brings all feedback back into Google. It also spends a lot of time all over the world talking to and connecting with developer communities and showing stuff being worked on. It doesn't do the team any good to build developer products that developers don’t love.
Today, we’re talking to Adam Seligman, vice president of developer relations at Google, where he is responsible for the global developer community across product areas. He is the ears and voice for customers.
Some of the highlights of the show include:
Google tackles everything in an open source way: Shipping feedback, iteration, and building communities
Storytelling - the Tale of Kubernetes: in a short period of time, gone from being open source that Google spearheaded to something sweeping the industry
Rise of containerization inside Linux Kernel is an opportunity for Google to share container management technology and philosophy with the world
Google Next: Knative journey toward lighter-weight serverless-based applications; and GKE On-Prem, customers and teams working with Kubernetes running on premise
Innovation: When logging into the GCP console, you can terminate all billable resources assigned to a project and access a tab for building by hand
GCP's console development strategy includes hard work on documentation, making things easy to use, and building thoughtfulness in grouping services
Google is about design goals, tradeoffs, and metrics; it’s about hyper scale and global footprint of requirements, as well as supporting every developer
Misconception 1: Google only builds hyper-scale, read-centric, user-partitioned apps and doesn’t build globally consistent data-driven apps
Misconception 2: Software engineers at the top Internet companies write the code for amazing things instantly
12-Factor App: Opinions of how to architect apps; developers should have choices, but take away some cognitive and operating load complexity
Businesses are running core workloads on Google, which had to put atomic clocks in data centers and private fiber networking to make it all work
Perception that Google focuses on new things, rather than supporting what's been released; industry is on a treadmill chasing shiny things and creating noise
Industry needs to be welcoming and inclusive; there’s a demand for software, apps, and innovation, but the number of developers remains flat because not everyone is included
Human vs. Technology: More investment and easier onboarding with technology and an obligation to build local communities
Goal: Take database complexity and start removing it for lots of use cases, simplifying how users deal with replication, sharding, and consistency issues
DevFest: Google has about 800 Google developer groups that do a lot of things to build local communities and write code together
Links:
Adam Seligman on Twitter
12-Factor App
I Want to Build a World Spanning Search Engine on Top of GCP
DevFest
Kubernetes
Docker
Heroku
Google Next
Google Reader
Episode 24: Serverless Observability via the bill is terrible
What is serverless? What do people want it to be? Serverless is when you write your software, deploy it to a Cloud vendor that will scale and run it, and you receive a pay-for-use bill. It’s not necessarily “functions as a service,” but a concept.
Today, we’re talking to Nitzan Shapira, co-founder and CEO of Epsagon, which brings observability to serverless Cloud applications by using distributed tracing and artificial intelligence (AI) technologies. He is a software engineer with experience in software development, cyber security, reverse engineering, and machine learning.
Some of the highlights of the show include:
Modern renaissance of “functions as a service” compared to past history; it is as abstracted as it can be, which means almost no constraints
If you write your own software, ship it, and deploy it - it counts as serverless
Some treat serverless as event-driven architecture where code swings into action
Being strategic about efficiency means planning and developing an application around its specific, sometimes complicated, functions
Epsagon is a global observer for what the industry is doing and how it is implementing serverless as it evolves
Trends and use cases include thinking serverless-first instead of just Cloud-first
Economic Argument: Less expensive than running things all the time and offers ability to trace capital flow; but be cautious about unpredictable cost
Use the bill to determine how much has been spent on performance and execution time
Companies seem to be trying to support every vendor’s serverless offering; when it comes to serverless, AWS Lambda appears to be used most often
Not easy to move from one provider to another; on-premise misses the point
People starting with AWS Lambda need familiarity with other services, which can be a reasonable but difficult barrier that’s worth the effort
Managing serverless applications may have to be done through a third party
Systemic view of how applications work focuses on overall health of a system, not individual function
Epsagon is headquartered in Israel, along with other emerging serverless startups; Israeli culture fuels innovation
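The “use the bill” idea above can be made concrete: a Lambda bill is roughly a function of invocation count, duration, and memory size, so a back-of-the-envelope model shows where time and money go. The rates in this Go sketch are illustrative ballpark figures, not current AWS pricing.

```go
package main

import "fmt"

// lambdaCost estimates a monthly Lambda bill from invocation count,
// average duration, and memory size. Compute is billed per GB-second
// (memory x duration), plus a small per-request charge. Both rates
// below are illustrative; check current AWS pricing pages.
func lambdaCost(invocations int, avgMs float64, memMB int) float64 {
	const perGBSecond = 0.0000166667 // illustrative compute rate
	const perRequest = 0.0000002     // illustrative request rate
	gbSeconds := float64(invocations) * (avgMs / 1000) * (float64(memMB) / 1024)
	return gbSeconds*perGBSecond + float64(invocations)*float64(perRequest)
}

func main() {
	// 10M invocations/month, 120 ms average, 512 MB functions.
	fmt.Printf("$%.2f\n", lambdaCost(10_000_000, 120, 512))
	// prints "$12.00"
}
```

Because each term in the bill maps to an observable quantity, a jump in the invoice points directly at a jump in invocations, duration, or memory, which is what makes the bill itself a crude observability tool.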
Links:
Epsagon
Email Nitzan Shapira
Nitzan Shapira on Twitter
Heroku
Google App Engine
AWS Elastic Beanstalk
Lambda
Amazon CloudWatch
AWS X-Ray
Simon Wardley
Charity Majors
Start-Up Nation
Digital Ocean
Episode 23: Most Likely to be Misunderstood: The Myth of Cloud Agnosticism
It is easy to pick apart the general premise of Cloud agnosticism being a myth. What about reasonable use cases? Well, generally, when you have a workload that you want to put on multiple Cloud providers, it is a bad idea. It’s difficult to build and maintain. Providers change, some more than others. The ability to work with them becomes more complex. Yet, Cloud providers rarely disappoint you enough to make you hurry and go to another provider.
Today, we’re talking to Jay Gordon, Cloud developer advocate for MongoDB, about databases, distribution of databases, and multi-Cloud strategies. MongoDB is a good option for people who want to build applications quicker and faster but not do a lot of infrastructural work.
Some of the highlights of the show include:
It’s easier to consider distributed data to be reliable and available than to treat it as something that isn’t
People spend time buying an option that doesn’t work, at the cost of feature velocity
If Cloud provider goes down, is it the end of the world?
Cloud offers greater flexibility; but no matter what, there should be a secondary option when a critical path comes to a breaking point
Hand-off from one provider to another is more likely to cause an outage than a multi-region single provider failure
Exclusion of Cloud Agnostic Tooling: The more we create tools that do the same thing regardless of provider, the more agnosticism there will be from implementers
Workload-dependent where data gravity dictates choices; bandwidth isn’t free
Certain services are only available on one Cloud due to licensing; but tools can help with migration
Major service providers handle persistent parts of architecture, and other companies offer database services and tools for those providers
Cost may or may not be a factor in why businesses stay with one Cloud instead of going multi-Cloud
How much RPO (recovery point objective) and RTO (recovery time objective) play into a multi-Cloud decision
Selecting a database/data store when building; consider security and encryption
Links:
Jay Gordon on Twitter
MongoDB
The Myth of Cloud Agnosticism
Heresy in the Church of Docker
Kubernetes
Amazon Secrets Manager
JSON
Digital Ocean