Episode 22: The Chaos Engineering experiment that is us-east-1
Trying to convince a company to embrace the theory and idea of Chaos Engineering is an uphill battle. When a site keeps breaking, Gremlin’s plan involves breaking things intentionally. How do you introduce chaos as a step toward making things better?
Today, we’re talking to Ho Ming Li, lead solutions architect at Gremlin. He takes a strategic approach to deliver holistic solutions, often diving into the intersection of people, process, business, and technology. His goal is to enable everyone to build more resilient software by means of Chaos Engineering practices.
Some of the highlights of the show include:
Ho Ming Li previously worked as a technical account manager (TAM) at Amazon Web Services (AWS) to offer guidance on architectural/operational best practices
Difference between the solutions architect and TAM roles at AWS, and transitioning between them
Role of TAM as the voice and face of AWS for customers
Ultimate goal is to bring services back up and make sure customers are happy
Amazon Leadership Principles: Mutually beneficial to have the customer get what they want, be happy with the service, and achieve success with the customer
Chaos Engineering isn’t about breaking things to prove a point
Chaos Engineering takes a scientific approach
Other than during carefully staged DR exercises, DR plans usually don’t work
Availability Theater: A passive data center is not enough; exercise DR plan
Chaos Engineering is bringing it down to a level where you exercise it regularly to build resiliency
Start small when dealing with availability
Chaos Engineering is a journey of verifying, validating, and catching surprises in a safe environment
Get started with Chaos Engineering by asking: What could go wrong?
Embrace failure and prepare for it; business process resilience
Gremlin’s GameDay and Chaos Conf allow people to share experiences
Ho Ming Li on Twitter
Gremlin on Twitter
Gremlin on Facebook
Gremlin on Instagram
Gremlin: It’s GameDay
Chaos Engineering Slack
Chaos Conf
Amazon Leadership Principles
Adrian Cockcroft and Availability Theater
Digital Ocean
Episode 21: Remember when RealNetworks used to– BUFFERING
Are you about to head off to college? Interested in DevOps and the Cloud? Is there a good way for someone like you who is starting out in the world of technology to absorb the necessary skills? The Open Source Lab (OSL) at Oregon State University (OSU) is one program that helps students and serves as a career accelerator. OSL is a unicorn because OSU is willing to invest in open source.
Today, we’re talking to Lance Albertson, director of OSL at OSU. OSL does a variety of projects to provide private Clouds that are neutrally hosted on its premises. The lab also gives undergraduate students hands-on experience with DevOps skills, including dealing with configuration management, deploying applications, learning how applications deploy, working with projects, and troubleshooting issues. OSL is for any student who has a general interest or passion for it, and a willingness to learn.
Some of the highlights of the show include:
Workflow focuses on what students need to learn about Linux and giving access to various repos; then they experience the lab’s configuration management suite
Interview Process: Put out a posting, student submits an application online, each candidate is reviewed, student is given a screening quiz
If a student passes the screening process, they are brought in for an in-person interview for personality and technical questions
Students tend to initially have the least amount of experience and most difficulty with a repository that has multiple people committing to it and dealing with PRs
Spinning up VMs and understanding how configuration management is connected, how services communicate, and how to set up an application
Round-Robins and System Sprint Meetings: Focus on discussing and documenting processes, issues, suggestions, comments, and other information
Younger students are mentored by Lance and the older students; every generation has to evolve because the environment and industry evolve
OSL made OpenStack work on POWER8, PowerPC, and PowerPC little-endian; gateway into Cloud - having OpenStack instance to offer services
Vast majority of OSL’s revenue comes from donations; no direct support from the university; finding companies to serve as sponsors is beneficial to all
Future of OSL: Providing more Cloud-like services; creating a more internal, private Cloud and containerized ways of running or deploying applications
Apache Software Foundation
Digital Ocean
Episode 20: The Wizard of AWS
Today, we’re talking to Jeff Barr, vice president and chief evangelist at Amazon Web Services (AWS). He founded the AWS Blog in 2004 and has written more than 2,900 posts for it and another 1,100 for his personal blog. As chief evangelist, Jeff strives to explain the benefits of Cloud computing and Web services to anyone who will listen.
Jeff is the voice of AWS. He does what he does best - exploits his superpower of explaining technology in ways that people can understand. Jeff tries to be the same person all the time. He loves to meet people and go out of his way to say “Hello.” So, if you see him at re:Invent, say “Cheese” and take a selfie with him!
Some of the highlights of the show include:
Jeff uses AWS Workspaces for his blog; one of Jeff’s blogging principles is to not take anybody else's word for anything to the absolute best of his technical ability
Zero Client: Jeff has no rotating hardware, disk drives, just a zero client; wherever he is, it's the same workspace
AWS has something for everyone; it builds things in response to customers’ questions, requests, and feedback
Naming Services and Products: Is it helpful? Is it descriptive? Does it have any hidden meanings?
Amazonian DNA and Dog Friendly Workspace: Jeff went from super fearful to accepting, to now thinking of dogs as incredible creations because they add fun and excitement to the office
As part of hiring, each interviewer is assigned Amazon leadership principles (LPs) to ask questions that measure a candidate against those LPs
What is the secret to getting hired at Amazon? Study the LPs to understand what they're about and be able to express your philosophies and history with LPs
re:Invent makes sure customers understand services - What is it? What does it do? How do they put it to work? What are the best use cases for it?
Things can never be too simple; you start from zero, put a lot of different things in there, and then you need the feedback to build in simplicity
AWS is following a more on-demand approach than traditional reserved instances; it opens the door to being used in a lot of ways
AWS does a lot of work before a launch to make sure it’s got infrastructure, scaling, monitoring, and capacity in place
If you are a customer, talk to AWS and let them know what they're doing right or wrong; write a blog post, tweet about it, share it with them in some way
Is the breadth of product offerings from AWS too vast? Is it offering too many things?
AWS was not explicit about where it was going with Cloud computing, nor did it do analyses or projections about it; it simply launched SQS and let it speak for itself
Customer feedback shapes what Amazon works on; customers share and then AWS re-prioritizes to make sure it’s delivering the right thing at the right time
Remember: It's not just bits and bytes, it's about the organic life form
Jeff Barr on Twitter
Jeff Barr on LinkedIn
AWS Blog
Jeff Barr’s Blog
Amazon Machine Images
Zero Client
AWS Workspaces
AWS Lambda
Amazon Leadership principles
The Robot Uprising Will Have Very Clean Floors
Serverlessly Storing My Dad Jokes in a Dadabase
Days Until re:Invent
Episode 19: I want to build a world spanning search engine on top of GCP
Some companies that offer services expect you to do things their way or take the highway. However, Google expects people to simply adapt the tech company’s suggestions and best practices for their specific context. This is how things are done at Google, but this may not work in your environment.
Today, we’re talking to Liz Fong-Jones, a Senior Staff Site Reliability Engineer (SRE) at Google. Liz works on the Google Cloud Customer Reliability Engineering (CRE) team and enjoys helping people adapt reliability practices in a way that makes sense for their companies.
Some of the highlights of the show include:
Liz figures out an appropriate level of reliability for a service and how a service is engineered to meet that target
Staff SRE involves implementation, and then identifying and solving problems
Google’s CRE team makes sure Google Cloud customers can build seamless services on the Google Cloud Platform (GCP)
Service Level Objectives (SLOs) include error budgets, service level indicators, and key metrics to resolve issues when technology fails
Learn from failures through incident reports and shared post-mortems; be transparent with customers and yourself
GCP: Is it part of Google or not? It’s not a division between old and new.
Perceptions and misunderstandings of how Google does things and how it’s a different environment
Google’s efforts toward customer service and responsiveness to needs
Migrating between different Cloud providers vs. higher level services
How to use Cloud machine learning-based products
GCP needs to focus on usability to maintain a phase of growth
Offer sensible APIs; tear up, turn down, and update in a programmatic fashion
Promotion vs. Different Job: When you’ve learned as much as you can, look for another team to teach something new
What is Cloud and what isn’t? Cloud deployments require SRE to be successful but SREs can work on systems that do not necessarily run in the Cloud.
Cloud Spanner
Cloud Bigtable
Google Cloud Platform blog - CRE Life Lessons
Google SRE on YouTube
Episode 18: Sitting on the curb clapping as serverless superheroes go by
What’s serverless? Are you serverless now? Is going from enterprise to serverless a natural evolution? Or, is it a “that was fun, now let’s go ride our bikes” moment? Is serverless “just a toy?” Is it a wide and varied ecosystem, or is it Lambda plus some other randos? What's up with serverless vs. containers?
Today, Forrest Brazeal is here to answer those questions and discuss pros and cons of serverless. He was a senior Cloud architect prior to joining Trek10. Forrest spent several years leading AWS and serverless engineering projects at Infor. He understands the challenges faced by enterprises moving to the Cloud and enjoys building solutions that provide maximum business value at a minimal cost.
Some of the highlights of the show include:
Bimodality: Backend development going away and being replaced by managed services; undifferentiated items are being moved to the Cloud
Serverless is application designs with “Backend as a Service” (BaaS) and/or “Functions as a Service” (FaaS) platforms; everything is managed for you
AWS Lambda: Is it today’s trend, or is it a bias that everyone is using it? Lambda makes up 80% of current FaaS adoption
Serverless Ecosystem: You can build it however you want, and you’re doing it right; but don’t take that at face-value; no two Lambda environments are alike
Cloud services at this scale have not been knitted together to form applications that are serving major workloads; best practices need to be established
Native Cloud providers will consolidate, and individual frameworks will be created with components of application stacks tied together to build systems
Serverless vs. Containers: No need for disparity - we can learn to get along; people use containers because it is easier than going serverless
Serverless Heroes series features people thinking out-of-the-box and helps identify emerging trends; serverless is growing, and it’s not just about startups
Went from working with a Sharpie to Procreate for the FaaS and Furious cartoon series; serverless component of process is for invoicing
Changes? Packaging to handle sharing; more knobs on console; unified process needed because too many people are building their own workflow and tooling
Certification: Is it proof positive that you know what you’re talking about, or is it of questionable value if it is not backed up by expertise in the real world?
Forrest Brazeal on Twitter
Summon the vast power of certification - Dilbert cartoon
Trek10 blog
A Cloud Guru ThinkfaaS podcast
A Cloud Guru - Serverless Superheros
Why We’re Excited About AWS AppSync
Serverless Architectures with Mike Roberts
AWS Lambda
AWS Serverless Application Model (SAM)
AWS Certified Cloud Practitioner
Digital Ocean
Episode 17: Pouring Kubernetes on things with reckless abandon
DevOps as a service describes what Reactive Ops is trying to do, who it’s trying to help, and what problems it’s trying to solve. Its passion to deliver service where human beings help other human beings is realized through a group of engineers who are extremely good at solving problems.
Sarah Zelechoski is the vice president of engineering at Reactive Ops, which defines the world’s problems and solves them by pouring Kubernetes on top of them. The team focuses on providing expert-level guidance and a curated framework using Kubernetes and other open source tools. Sarah's greatest passion is helping others, which encompasses advocating for engineers and rekindling interest in the lost art of service in the tech space.
Some of the highlights of the show include:
Kubernetes is changing the way people work; it offers a way to release a product, provide access to it, and define its behaviors when you deploy it
Any person/business can use Kubernetes to mold their workflow
Kubernetes is complex and has sharp edges; it has only recently become productive because of its community finding and reporting issues
Business value of deploying Kubernetes to a new environment: Flexibility and uniform system of management; and it can provide a context shift
Implementation Challenges with Workshops/Tutorials: Valuable entry level strategy for people learning Kubernetes; but the translation is not easy
About 85% of the work Reactive Ops does to help its customers get onto Kubernetes is spent on application architecture
If thinking about moving to Kubernetes, how well will your current applications translate? Do you want to start over from scratch?
Value in paying someone to do something for you
Using Defaults: Try initially until you realize what you need; Kubernetes gives you options, but it’s a challenging path to go from defaults to advanced
Deploying a workload between all major Cloud providers is possible, but there are challenges in managing multiple regions or locations
Cluster Ops: Managed Kubernetes clusters where Reactive Ops stays on the map, watches them, and puts them on pager, so you can continue your work without having to worry
Sarah Zelechoski on Twitter
Reactive Ops
GKE from GCP
AKS from Azure
EKS from AWS
Episode 16: There are Still Servers, but We Don’t Care About Them
Are you interested in going beyond basic monitoring and visibility? Need tools to build and operate serverless applications and extract business intelligence? IOpipe provides extended visibility and metrics around AWS Lambda, including profiling, core dumps, and incoming input events.
Today, we’re talking to Erica Windisch, who is the founder and CTO of IOpipe. She brings her experience in building developer and operational tooling to serverless applications. Erica also has more than 17 years of experience designing and building Cloud infrastructure management solutions. She was an early and longtime contributor to OpenStack and maintainer of the Docker project.
Some of the highlights of the show include:
Nomenclature Battle: Serverless vs. stateless
Building a window of visibility into Lambda: Talking to users and assessing needs/pain points
Observability of the infrastructure: Necessary evil to get to automated healing
Using Lambda at significant levels of scale; some companies grow usage, others go all in right away
Current state of Lambda ecosystem
Is Lambda stable? Indications and no formal SLA
How issues manifest and are exposed
Trends include cold starts, hours-long failures, and multiple function invocations
Infrastructure powering IOpipe: Lambda issues may impact performance of monitoring system, but IOpipe is not necessarily dependent on Lambda
Future of Lambda: Builds applications a specific way, but there are limitations
What would Erica change about Lambda? Run function and define handlers
Lambda functions can be difficult to understand; some developers do not have familiarity and create bottlenecks
Capacity limits around Lambda can be difficult to establish
Erica Windisch on Twitter
Erica Windisch on Twitch
12-Factor App
Cloud Custodian in Lambda
Velocity London
ServerlessConf London
AWS Glue
Episode 15: Nagios was the Original Call of Duty
Let’s chat about the Cloud and everything in between. The people in this world are pretty comfortable with not running physical servers on their own, but trusting someone else to run them. Yet, people suffer from the psychological barrier of thinking they need to build, design, and run their own monitoring system. Fortunately, more companies are turning to Datadog.
Today, we’re talking to Ilan Rabinovitch, Datadog’s vice president of product and community. He spends his days diving into container monitoring metrics, collaborating with Datadog’s open source community, and evangelizing observability best practices. Previously, Ilan led infrastructure and reliability engineering teams at various organizations, including Ooyala and He’s active in the open source and DevOps communities, where he is a co-organizer of events, such as SCALE and Texas Linux Fest.
Some of the highlights of the show include:
Datadog is well-known, especially because it is a frequent sponsor
More organizations know their core competency is not monitoring or managing servers
Monitoring/metrics is a big data problem; Datadog takes monitoring off your plate
Alternate ways, other than using Nagios, to monitor instances and regenerate configurations
Datadog is first to identify patterns when there is a widespread underlying infrastructure issue
Trends of moving from on-premises to Cloud; serverless is on the horizon
How trends affect evolution of Datadog; adjusting tools to monitor customers’ environments
Datadog’s scope is enormous; the company tries to present relevant information as the scale of what it’s watching continues to grow
Datadog’s pricing is straightforward and simple to understand; how much Cloud providers charge to use Datadog is less clear
Single Pane of Glass: Too much data to gather in small areas (dashboards)
Why didn’t monitoring catch this? Alerts need to be actionable and relevant
How to use Datadog’s workflow for setting alerts and work metrics
Datadog’s first Dash user conference will be held in July in New York; addresses how to solve real business problems, how to scale/speed up your organization
Ilan Rabinovitch on Twitter
Docker Adoption Survey Results
Rubric for Setting Alerts/Work Metrics
Dash Conference
Episode 14: Cheslocked and loaded
Do you need data captured that lets you know when things don’t look quite right? Need to identify issues before they become major problems for your organization? Turn to Threat Stack, which has Cloud issues of its own, and helps its customers with their Cloud issues.
Today, I’m talking to Pete Cheslock, who runs technical operations at Threat Stack, which handles security monitoring, alerting, and remediation. The company uses Amazon Web Services (AWS), but its customer base can run anywhere.
Some of the highlights of the show include:
Challenges Threat Stack experienced with AWS and how it dealt with them
Threat Stack helps companies improve their security posture in AWS
Security shouldn’t be an issue, if providers do their job; shared responsibility
Education is needed about what matters regarding security, avoiding mistakes
Cloud is still so new; not many people have broad experience managing it
Scanning customer accounts against best practices to identify risks
Threat Stack’s scanning tool is worthwhile, but most tools lack judgement and perspective
Threat Stack offers context between host- and Cloud-based events; tying data together is the secret sauce
You shouldn’t have to pay a bunch of money to have a robust security system
Good operations is good security; update, patch, track, and perform other tasks
Lack of validation about what services are going to be successful or not
Vendor Lock-in: Understand your choices when building your system
Pervasiveness and challenge of containerization and Kubernetes
Cloud reduces cycle time and effort to bring a product to market
Amazon is a game changer with what it allows you to do and the problems it lets you solve
Pete Cheslock
Digital Ocean
Threat Stack
Episode 13: Serverlessly Storing my Dad Jokes in a Dadabase
Aurora, from Amazon Web Services (AWS), is a MySQL-compatible service for complex database structures. It offers capabilities and opportunities. But with Aurora, you’re putting a lot of trust in AWS to “just work” in ways not traditional to relational database services (RDS).
David Torgerson, Principal DevOps Engineer at Lucidchart, is a mystery wrapped in an enigma and virtually impossible to Google. He shares Lucidchart’s experience with migrating away from a traditional RDS to Aurora to free up developer time.
Some of the highlights of the show include:
Trade off of making someone else partially responsible for keeping your site up
Lucidchart’s overall database costs decreased 25% after switching to Aurora
Aurora unknowns: What is an I/Op in Aurora? When you write one piece of data, does it count as six I/Ops?
Multi-master Aurora is coming for failover time and disaster recovery purposes
Aurora drawbacks: No dedicated DevOps, increased failover time, and misleading performance speed
Providers offer ways to simplify your business processes, but not ways to get out of using their products due to vendor and platform lock-in
Lucidchart is skeptical about Aurora Serverless; will use or not depending on performance
Corey's architecture diagram on AWS
Lucidchart’s Data Migration to Amazon Aurora
Preview of Amazon Aurora Multi-master Sign Up
This is My Architecture
Digital Ocean
Episode 12: Like Normal Cloud Services, but More Depressing
Does your job challenge and motivate you? Does it utilize your skills? Or, are you ready to go job hunting? Do you want an awesome job that is a resume booster? Companies should be supportive of their employees finding a job that matches their skills and interests. Also, when hiring, companies should offer thoughtful processes for interviews.
Today, I’m talking to Sarah Withee, a polyglot software engineer, mentor, teacher, and robot tinkerer. Sarah went job hunting, and after several job interviews, she finally found a job that made her super happy at Arcadia Healthcare Solutions. Sarah compares the interview processes she experienced at big name tech companies that offer Cloud services.
Some of the highlights of the show include:
Companies sometimes lose sight of the fact that even interview interactions need to be a two-way sale
Interviews often involve talking to many people; and if several are bad, that forms a negative impression of the company
Companies need to provide interview training and follow the same standards
Don’t farm out challenging or unfamiliar issues when interviewing candidates
Sarah is very competent, but she is new to Cloud platforms; she is like a sponge, who enjoys learning and having a bare knowledge of new technology
How HIPAA regulations impact Sarah’s learning and software engineering work; she has to be more aware of security and safety of healthcare data
Being a teacher and mentor affects how Sarah learns new things; everybody learns slightly differently
In the Cloud space, know which direction you want to go and start with simpler things to learn the basics; focus on what is relevant to what you are working on
Sarah Withee on Twitter #speakerconfessions
Sarah Withee on Twitter
Sarah Withee Blog
Sarah Withee Resume
Digital Ocean
Episode 11: Hickory Dickory Docker
Docker went from being a small startup to an enterprise company that changed the way people think about their infrastructure to now, where its relevance is somewhat minimal. The conversation is no longer around the container level. Docker has become commonplace.
Today, we’re talking to Jérôme Petazzoni, formerly of Docker. While he was with the company for about 8 years, Docker definitely experienced a roller coaster ride.
Some of the highlights of the show include:
Amount of work conducted on the enterprise vs. community editions
Docker was so widely adopted because its core technology was open source
Challenge is to build a viable business and revenue model for the long run
Similarities between Docker and Red Hat open source platforms
Docker went from six people working in a garage to having a few hundred employees and $1.3 billion valuation
Changes happened, but they were gradual; the changes were necessary to be a profitable and sustainable company
Contingent of internal and external people believed that Docker was the answer for whatever problem surfaced; Docker would save you, but not always
Balancing Act: Pushing forward with a correct message and regulating enthusiasm
Networking and Docker for dummies; confusion and problems of things not working as expected have been resolved
Things will continue to shift; Kubernetes and the orchestration battle
What was unthinkable could happen as companies push the envelope and make progress
Will who you have as your Cloud provider stop mattering? It depends.
All major Cloud providers plan to offer managed Kubernetes services and what Jérôme thinks of them
Jérôme’s opinion on whether Kubernetes will follow the same path as Docker
What does the road ahead look like for infrastructure automation? There is potential and lots of best practices in Cloud environments.
Jérôme Petazzoni on Twitter
Docker Crunch Base
Digital Ocean
Red Hat
Corey's Heresy in the church of docker talk