Marco Pracucci

KubeCon Europe 2018 - Memo and Takeaways

by Marco Pracucci

KubeCon and CloudNativeCon Europe 2018 is just over and it was really a blast! I had the opportunity to meet and talk with a lot of talented and smart people, share ideas and learn a lot.

Although the conference and the city are quite expensive for people - like me - coming from the south of Europe, it was really worth the money and my feeling at the end of the conference was more or less this:

I usually take notes at each conference I attend, and KubeCon was no exception. In this post I will share random notes and takeaways from the conf, in the form of a publicly-shared personal memo. It only covers the talks I’ve attended, which means it only covers my personal interests within the wide CNCF landscape.

It’s an opinionated list of notes with inputs from talks and people I’ve met. It doesn’t claim to be discursive, complete or fully accurate. They’re just my notes, a starting point from which I will dive deeper over the next weeks and months.




Notes on the conference organization

First of all, a big shout out to CNCF and all the people involved in running the conference. Everything was perfect, from the beautiful location to the catering that managed to feed 4300 people without queues, from the evening party at Tivoli Gardens to the high quality keynotes and talks.

Well done!


Notes on autoscaling

Kubernetes supports three types of autoscaling:

Horizontal Pod Autoscaler (HPA)

Vertical Pod Autoscaler (VPA)

Cluster Autoscaler (CA)

What about the future?

Horizontal Pod Autoscaling with custom metrics (Prometheus)

Kubernetes 1.8 introduced the Resource and Custom Metrics APIs, as an attempt to decouple the HPA from Heapster/cAdvisor metrics and allow autoscaling based on custom metrics (e.g. Prometheus) or external metrics (from outside the cluster).

The Resource and Custom Metrics APIs are not an implementation, but an API specification implemented and maintained by vendors. The contract of such APIs is that each metric should return a single value upon which the HPA will scale the workload.

Resource Metrics API

Custom Metrics API

External Metrics API
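
The "single value per metric" contract is what lets the HPA apply one proportional rule regardless of where the metric comes from. The sketch below is not from the talks: it follows the formula described in the Kubernetes HPA documentation as I understand it, and the function name and the 0.1 tolerance default are assumptions of this example.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Proportional rule: scale the replica count by current/target."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    ratio = current_value / target_value
    # Inside the tolerance band the replica count is left untouched,
    # which avoids flapping on small metric fluctuations.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    # Round up so the workload is never under-provisioned; keep at least 1.
    return max(1, math.ceil(current_replicas * ratio))


# 4 replicas serving 200 req/s each against a target of 100 req/s per pod.
print(desired_replicas(4, 200, 100))
```

The metric value could come from the Resource, Custom or External Metrics API: the rule only needs the current value and the target.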


Notes on networking

Networking is hard and Kubernetes networking is harder. However, the overall outcome of this KubeCon is that we should stop being scared of Kubernetes networking and better understand how it works. It’s complicated, but there’s no magic behind it and - with some study and practice - it’s approachable.

Debugging networking issues

In case of a networking issue, the first thing to do is to collect information. Two suggested starting points (in case no overlay network is used):

  1. Dump the iptables rules with iptables-save, in order to get a snapshot of the current rules for later analysis (before they change and the issue maybe vanishes).
  2. Use conntrack -L to list all connections tracked by the Connection Tracking System, which is the module that provides stateful packet inspection for iptables and plays a primary role in packet routing.
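
Since the point is to capture the state before it changes, the two commands above can be wrapped in a tiny snapshot script. This is a minimal sketch: the function name, the output file naming and the injectable `commands` argument are my own choices, and running the real commands typically requires root on the node.

```python
import subprocess
import time
from pathlib import Path

# The two commands suggested above; both usually need root privileges.
COMMANDS = {
    "iptables": ["iptables-save"],
    "conntrack": ["conntrack", "-L"],
}


def snapshot(out_dir, commands=COMMANDS):
    """Run each command and store its output in a timestamped file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    written = []
    for name, argv in commands.items():
        target = out / f"{name}-{stamp}.txt"
        try:
            result = subprocess.run(argv, capture_output=True, text=True, check=False)
            body = result.stdout or result.stderr
        except FileNotFoundError as exc:
            # Keep going: a missing tool should not prevent the other dump.
            body = f"command not available: {exc}\n"
        target.write_text(body)
        written.append(target)
    return written
```

Calling `snapshot("/tmp/net-debug")` as soon as the issue shows up leaves two files that can be diffed against a later snapshot.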

Slides: Blackholes & Wormholes: Understand and Troubleshoot the “Magic” of k8s Networking

CNI

The CNI (Container Network Interface) is both a specification document and a set of “base” (common) plugins aiming to abstract the way applications (e.g. Kubernetes) add and remove network interfaces on the host. The CNI specification is not Kubernetes-specific, although Kubernetes supports it.

The goal is to provide an interface layer between the application and the host, and a set of interchangeable community/vendor driven plugins that do the real work to attach/detach/configure network interfaces.

The specification is very simple and currently supports the ADD and DEL (remove) operations, while a GET is going to be introduced soon. The specification is still a draft, but close to being finalized in version 1.0 (which should be released by the end of 2018), after which the specification will be considered feature-frozen and the whole effort will go into adding and/or improving plugins.
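
To give an idea of how small the contract is: as far as I understand the spec, a plugin is just an executable that receives the operation in the CNI_COMMAND environment variable (the remove operation is named DEL), the network configuration as JSON on stdin, and prints a JSON result. The sketch below is a toy dispatcher and not a working plugin: the function name is made up, the result is only loosely modelled on the spec's result shape, and the IP address is a placeholder.

```python
import json


def handle_cni_call(env, stdin_text):
    """Dispatch one CNI invocation and return a JSON-serialisable result."""
    command = env["CNI_COMMAND"]
    config = json.loads(stdin_text)  # network configuration arrives on stdin
    if command == "ADD":
        # A real plugin would create and configure the interface inside
        # the network namespace referenced by CNI_NETNS at this point.
        return {
            "cniVersion": config.get("cniVersion", "0.3.1"),
            "interfaces": [{"name": env["CNI_IFNAME"], "sandbox": env["CNI_NETNS"]}],
            "ips": [{"version": "4", "address": "10.1.0.5/16", "interface": 0}],
        }
    if command == "DEL":
        # A real plugin would tear the interface down; nothing to report.
        return None
    raise ValueError(f"unsupported CNI_COMMAND: {command}")


env = {"CNI_COMMAND": "ADD", "CNI_IFNAME": "eth0", "CNI_NETNS": "/var/run/netns/demo"}
print(json.dumps(handle_cni_call(env, '{"cniVersion": "0.3.1", "name": "demo"}')))
```

Because the interface is only "environment variables plus JSON", the same plugin binary can be driven by Kubernetes or by any other runtime.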

AWS has developed two CNI plugins:

Pods networking in Kubernetes using AWS Elastic Network Interfaces

The amazon-vpc-cni-k8s plugin lets you set up a Kubernetes cluster where pod networking is based on native VPC networking, using multiple ENIs attached to the EC2 instance. Using this plugin, all pods and all nodes will be addressable in the same VPC network space.

Since the number of ENIs that can be attached to an EC2 instance is limited, the plugin leverages the ability to add multiple secondary IPs to each ENI in order to increase the number of pods addressable on each node (the number depends on the instance type).

The plugin reserves 1 IP per ENI for its own purposes, while the remaining ones can be used for pod networking. For example, a c4.large instance can attach up to 3 interfaces, each with up to 10 addresses, so the total number of pods schedulable on a c4.large instance using ENI networking is 3 * (10 - 1) = 27, which sounds quite reasonable unless you have a bunch of micro pods.
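
The arithmetic above is easy to turn into a helper for checking other instance types. This is just the calculation from these notes; the function and parameter names are mine, and the per-instance ENI and address limits have to be looked up in the AWS documentation.

```python
def max_pods(enis, ips_per_eni, reserved_per_eni=1):
    """Pods addressable on a node when one IP per ENI is kept by the plugin."""
    return enis * (ips_per_eni - reserved_per_eni)


# c4.large: up to 3 ENIs with up to 10 addresses each.
print(max_pods(3, 10))  # 27
```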

The amazon-vpc-cni-k8s plugin is used in EKS as well. According to Anirudh Aithal (AWS Engineer) it can already be used for production workloads even though its official status is still alpha (it will be switched to stable once EKS enters GA). I’ve also talked to a few people at the conference who are already using it in production (with no issues).

Slides:


Notes on AWS EKS

I didn’t meet anyone with access to EKS (I haven’t been very lucky), even though most of the people I’ve met run Kubernetes on AWS. People running multi-cloud say GKE is a step above and works really well: there’s consensus on this.

Key Points:

Networking:

Authentication:

IAM authentication works like this:

  1. kubectl passes the AWS Identity to K8S API server
  2. The K8S API Server verifies the AWS Identity with the AWS Auth API
  3. On success, it authorizes the AWS Identity with RBAC - from this point on, IAM is not used anymore

[Figure: EKS IAM authentication flow]

Source: Introducing Amazon EKS
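
The key point of the three steps is the split of responsibilities: IAM only answers "who is this?", while RBAC answers "what can they do?". Here is a toy model of that split. Everything in it is hypothetical: the function name, the token format, the ARN and the rule table are made up for illustration, and the real verification is done by the API server against AWS.

```python
def authorize(token, verb, resource, verify_identity, rbac_rules):
    """Return True if the caller behind `token` may perform verb on resource."""
    # Step 2: the identity is verified against AWS (stubbed by a callable here).
    identity = verify_identity(token)
    if identity is None:
        return False
    # Step 3: from here on only RBAC matters, IAM is out of the picture.
    return (verb, resource) in rbac_rules.get(identity, set())


# Hypothetical data standing in for the AWS Auth API and the RBAC bindings.
known_tokens = {"token-abc": "arn:aws:iam::123456789012:user/alice"}
rules = {"arn:aws:iam::123456789012:user/alice": {("get", "pods"), ("list", "pods")}}

print(authorize("token-abc", "get", "pods", known_tokens.get, rules))
print(authorize("token-abc", "delete", "pods", known_tokens.get, rules))
```

A valid AWS identity with no RBAC binding can authenticate but is not allowed to do anything, which is exactly what the flow implies.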

Preview vs GA:

What about the future?

Slides: Introducing Amazon EKS (.pptx)


Notes on resource management

A quick recap on container resources

Each container in a pod can specify resource requests and limits (e.g. CPU and memory). When a pod has no resource requests / limits, it will run with the BestEffort QoS class, which is the most likely to get throttled and evicted.

The scheduler keeps track of “allocatable” resources on each node, which are typically lower than its capacity, since some resources are reserved for the system, the kubelet and a buffer for the hard eviction threshold.

[Figure: node allocatable resources]

You can get a summary of both "Node capacity" and "Node allocatable" resources via `kubectl describe node <name>`, e.g.:

Capacity:
 cpu:     2
 memory:  15657244Ki
 pods:    110
Allocatable:
 cpu:     2
 memory:  15554844Ki
 pods:    110
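
The relation between the two numbers can be written down as a one-liner. In the sample output the memory difference is exactly 102400Ki (100Mi), which would be consistent with a 100Mi hard eviction threshold and no kube/system reservations; that is my guess about how this node is configured, not something the output states. Function and parameter names are assumptions of this sketch.

```python
def allocatable_ki(capacity_ki, kube_reserved_ki=0, system_reserved_ki=0, eviction_hard_ki=0):
    """Allocatable = capacity minus reservations and the hard eviction buffer."""
    return capacity_ki - kube_reserved_ki - system_reserved_ki - eviction_hard_ki


# Memory figures from the output above, assuming a 100Mi eviction threshold.
print(allocatable_ki(15657244, eviction_hard_ki=100 * 1024))  # 15554844
```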

When the scheduler places a pod on a node it doesn’t allow overcommitting requested resources, so the total sum of requested resources must be <= the allocatable resources. Resource limits are currently not taken into account by the scheduler, although some initial work has been done (alpha and behind a feature gate).
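
The "no overcommit on requests" rule boils down to a simple check per resource. This is a simplified sketch and not the scheduler's actual code: the names are mine, CPU is expressed in millicores and memory in Ki, and limits are deliberately ignored, as described above.

```python
def fits(allocatable, running_pod_requests, new_pod_requests):
    """True if the new pod's requests fit in what is left of allocatable."""
    for resource, amount in new_pod_requests.items():
        already_requested = sum(pod.get(resource, 0) for pod in running_pod_requests)
        if already_requested + amount > allocatable.get(resource, 0):
            return False
    return True


node = {"cpu": 2000, "memory": 15554844}      # millicores, Ki
running = [{"cpu": 1500, "memory": 4000000}]  # requests of pods already placed

print(fits(node, running, {"cpu": 500, "memory": 1000000}))
print(fits(node, running, {"cpu": 600, "memory": 1000000}))
```

Note that actual usage never enters the picture: a node full of idle pods with large requests is still "full" for the scheduler.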

Each pod has a QoS class implicitly assigned based upon its containers’ CPU and memory requests and limits:

[Figure: QoS classes]
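
The assignment rules, as I understand them from the Kubernetes documentation, can be sketched in a few lines. It is a simplification: quantities are compared as plain strings (so "500m" and "0.5" would not be considered equal), and the function name and input shape are assumptions of this example.

```python
def qos_class(containers):
    """containers: list of dicts with optional 'requests' and 'limits' dicts."""
    anything_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            anything_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # When a request is omitted it defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not anything_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{}]))
print(qos_class([{"requests": {"cpu": "100m"}}]))
print(qos_class([{"limits": {"cpu": "1", "memory": "1Gi"}}]))
```

A single container without limits is enough to move the whole pod out of the Guaranteed class.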

Each container resource is either compressible (CPU) or incompressible (memory and storage):

The kubelet eviction policy lets you specify hard and soft thresholds: when a soft threshold is reached a SIGTERM is sent to the container to allow a graceful termination, while in case of a hard threshold the container is immediately killed. If the kubelet is not fast enough to react, the kernel’s OOM killer will kill the container.

Known issues

Best practices on container resources

  1. If in doubt, start with Guaranteed QoS class
  2. Protect critical Pods (e.g. DaemonSets, controllers, master components, monitoring)
    • Apply Burstable or Guaranteed QoS class with sufficient memory requests
    • Reserve node resources with labels / selectors, or apply PriorityClasses (still alpha - hopefully beta in 1.11) if scheduler priority is enabled
  3. Test and/or enforce resources in your CI/CD pipeline
  4. Monitor container resources usage
  5. Fine tune the kubelet
    • --eviction-hard and --eviction-soft (and related values like grace periods)
    • --fail-swap-on (enabled by default in recent versions)
    • --kube-reserved and --system-reserved for critical System and Kubernetes Services
  6. Use Burstable QoS class without CPU limits for performance-critical workloads
  7. Disable swap (required by kubelet for proper QoS calculation)
  8. Use the latest version of language runtimes (because of cgroups awareness support improvements) and/or align GC and threads parameters based upon resource requests, using environment variables populated from resource fields, e.g.:
env:
- name: CPU_REQUEST
  valueFrom:
    resourceFieldRef:
      resource: requests.cpu
- name: MEM_LIMIT
  valueFrom:
    resourceFieldRef:
      resource: limits.memory
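
On the application side, the two variables can then drive pool sizes and memory budgets. This is a minimal sketch assuming the default divisor, with which (as far as I know) CPU is exposed as whole cores rounded up and memory in bytes; the function name and the 75% memory budget are arbitrary choices of this example.

```python
import os


def runtime_settings(env=os.environ):
    """Derive worker count and memory budget from the injected variables."""
    cpu_request = int(env.get("CPU_REQUEST", "1"))
    mem_limit = int(env.get("MEM_LIMIT", "0"))
    return {
        # One worker per requested core, never less than one.
        "workers": max(1, cpu_request),
        # Leave 25% headroom below the limit to stay clear of the OOM killer.
        "heap_bytes": mem_limit * 3 // 4 if mem_limit else None,
    }


print(runtime_settings({"CPU_REQUEST": "2", "MEM_LIMIT": "1073741824"}))
```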

Slides: Inside Kubernetes Resource Management (QoS)


Notes on monitoring

Have you ever picked a 3rd party Grafana dashboard, imported it into your Grafana installation and found all charts broken due to different labels? Well, to me this happens frequently and I’m glad Tom Wilkie and other people are trying to solve this problem by introducing Prometheus mixins.

The core idea behind Prometheus mixins is to provide a way to package together templates for Grafana dashboards and Prometheus alerts related to a specific piece of software (e.g. etcd), which get “compiled” into dashboard (JSON) and alert (YAML) files once the required input configuration (e.g. label names) is provided. These mixins will also make it possible to distribute dashboards and alerts along with the code, rather than separately from it.
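
To make the "template plus configuration" idea concrete, here is the same concept as a plain Python function. It is only an analogy and not how mixins are actually written: the function name, the config keys, the alert name and the dashboard structure are invented for illustration.

```python
def etcd_mixin(config):
    """Render alert rules and a dashboard from user-supplied label names."""
    selector = f'{config["job_label"]}="{config["job_name"]}"'
    alerts = {
        "groups": [{
            "name": "etcd",
            "rules": [{
                "alert": "EtcdNoLeader",
                "expr": f"etcd_server_has_leader{{{selector}}} == 0",
                "for": "1m",
            }],
        }],
    }
    dashboard = {
        "title": "etcd",
        "panels": [{
            "title": "Has leader",
            "targets": [{"expr": f"etcd_server_has_leader{{{selector}}}"}],
        }],
    }
    return {"alerts": alerts, "dashboard": dashboard}


rendered = etcd_mixin({"job_label": "job", "job_name": "etcd"})
print(rendered["alerts"]["groups"][0]["rules"][0]["expr"])
```

The alert and the dashboard share the same selector, so changing a label name in one place keeps both in sync, which is precisely the problem with copy-pasted dashboards.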

jsonnet (a simple yet powerful extension of JSON) has been picked as the templating language for Prometheus mixins, and a package manager for jsonnet - called jsonnet-bundler - has been built.

They also provide a mixin for Kubernetes monitoring based on ksonnet, which is basically a tool to ease config deployment on Kubernetes based on jsonnet templates. It looks complicated, but once you connect all these pieces together it makes a lot of sense (if you don’t have a trivial setup).

What about the future?


Notes on security

As an engineer, I’m about as far from being a security expert as you can find, so here is an attempt to sum up some of the things I’ve heard around the topic.

Kubernetes is a complex system with many layers of attack surfaces exposed to internal threats. What we mean by internal threats is any attack driven from inside the container, like when you run untrusted containers on your cluster or an attacker gets access to one of your containers via an external threat.

Internal threats attack surfaces:

[Figure: all code is vulnerable]

Attacks via kernel syscalls

Run as non root

Configure pods to run as non root:

spec:
  securityContext:
    runAsUser: 1234
    runAsNonRoot: true

And prevent setuid binaries from changing the effective user ID by setting the no_new_privs flag on the container process:

containers:
 - name: untrusted-container
   securityContext:
     allowPrivilegeEscalation: false

Drop capabilities

Historically Linux had two categories of processes: privileged (run by root) and unprivileged (run by non-root). Privileged processes bypass all kernel permission checks, while unprivileged processes are subject to permission checking based on the process’s credentials (UID, GID).

Starting from kernel 2.2, Linux introduced capabilities, which are a way to split the privileges traditionally associated with the root user into distinct units that can be independently enabled and disabled on non-root processes (technically they are a per-thread attribute).
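
Under the hood each capability is one bit in a mask, which you can read from the CapEff line of /proc/<pid>/status. The sketch below decodes such a mask; the bit table is a partial subset of the numbers defined in the kernel's capability.h, and the function names are mine. The mask in the usage line is the value commonly reported inside a default Docker container, so treat it as an assumption rather than a guarantee.

```python
# Partial table of capability bit numbers (subset, from linux/capability.h).
CAPABILITY_BITS = {
    0: "CAP_CHOWN", 1: "CAP_DAC_OVERRIDE", 3: "CAP_FOWNER", 4: "CAP_FSETID",
    5: "CAP_KILL", 6: "CAP_SETGID", 7: "CAP_SETUID", 8: "CAP_SETPCAP",
    10: "CAP_NET_BIND_SERVICE", 12: "CAP_NET_ADMIN", 13: "CAP_NET_RAW",
    18: "CAP_SYS_CHROOT", 21: "CAP_SYS_ADMIN", 27: "CAP_MKNOD",
    29: "CAP_AUDIT_WRITE", 31: "CAP_SETFCAP",
}


def decode_capabilities(mask):
    """Return the names of the capabilities whose bit is set in mask."""
    names = []
    bit = 0
    while mask >> bit:
        if (mask >> bit) & 1:
            names.append(CAPABILITY_BITS.get(bit, f"bit_{bit}"))
        bit += 1
    return names


def effective_capabilities(status_path="/proc/self/status"):
    """Decode the CapEff line of a /proc/<pid>/status file (Linux only)."""
    with open(status_path) as handle:
        for line in handle:
            if line.startswith("CapEff:"):
                return decode_capabilities(int(line.split()[1], 16))
    return []


print(decode_capabilities(0x00000000A80425FB))
```

Running `effective_capabilities()` inside a container before and after dropping capabilities is a quick way to verify that the configuration had the intended effect.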

Docker keeps a default list of capabilities, which should be dropped when unnecessary. Default capabilities include but are not limited to:

See:

Seccomp

Seccomp limits the system calls a process (container) can make. Vulnerabilities in syscalls that are not allowed can no longer be exploited. Seccomp profiles in pods can be controlled via annotations on the PodSecurityPolicy (alpha feature).

See: Seccomp on Kubernetes doc

AppArmor

AppArmor is a Linux kernel security module that supplements the standard Linux user and group based permissions to confine programs to a limited set of resources. It basically allows you to define a mandatory access control through profiles tuned to whitelist the access needed by a specific container, such as Linux capabilities, network access, file permissions, etc.

See: AppArmor on Kubernetes doc

Rootless containers

Even if we don’t run containers as root, the system used to run containers (e.g. the Docker daemon) still runs as root. The talk “The route to rootless containers” illustrates the issues and workarounds found while running rootless containers.

Sandboxed pods

Google has a policy that there should be two distinct defence layers for pods running untrusted code. Given syscalls get directly executed on the host’s kernel, adding a second layer of defence in front of syscalls means running each container on its own (lightweight) kernel, on top of the host’s kernel.

There are two main projects to run sandboxed pods:

gVisor is a user-space kernel - recently open-sourced by Google - that implements a substantial portion of the Linux system surface. It includes an OCI runtime called runsc that provides an isolation boundary between the application and the host kernel. The runsc runtime integrates with Docker and Kubernetes, making it simple to run sandboxed containers.

gVisor introduces some performance penalties (they’re working on it), it’s not 100% compatible with runc (e.g. some features are not supported for security reasons) and some work is still left to reach full hardening (e.g. networking).

[Figure: gVisor]

Kubernetes 1.12 will introduce Container Runtime Interface (CRI) API support (alpha), and basic Kata Containers and gVisor implementations - leveraging CRI - should be expected.

In the future, running sandboxed pods may be as simple as enabling sandbox in a pod’s security context (API still under discussion):

spec:
 securityContext:
   sandboxed: true

Slides: Secure Pods

Attacks via system logs

Each container logs to /dev/stdout and /dev/stderr. Such logs are collected by the kubelet and usually processed / parsed by a logging agent (e.g. fluentd or logstash) running on the host. Vulnerabilities in logging agents or their dependencies (e.g. JSON parser) are not uncommon, and they should be protected as well.

Running the logging agent in a pod (instead of directly on the host), keeping such software updated and applying the other security best practices are a good starting point to protect from attacks via system logs.

Building Docker images without Docker

The topic is not strictly related to security but is somewhat related when it comes to building Docker images in a secure way from a container running on top of Kubernetes (use cases not limited to this).

The goal of the work people are doing on building Docker images without Docker is ideally to split image building, push/pull and run into different tools.

There are two main approaches:

Daemon-less

Tools available to build Docker images without running the Docker daemon:

Runtime-less

The idea is to get more portable tools not depending on Linux namespaces or cgroups, so that they’re easier to nest inside existing containerized environments:

Slides: Building Docker Images without Docker


Notes on tools

kops

The overall feeling about kops is that it’s still the best tool to provision a Kubernetes cluster on AWS. Everyone I talked to uses kops in exactly the same way we do (Terraform, no kops cli operations, custom solution to manage rolling updates). I didn’t meet anyone using kops rolling updates in production.

The ability to decouple etcd nodes from K8S masters is a feature currently under discussion, but no progress has been made yet. Some people recommend not running etcd on the master nodes, others don’t see much of an issue in doing so (I personally see decoupling as the stronger setup).

People in favor of splitting etcd from masters recommend 3+ etcd nodes and 2+ Kubernetes master nodes (scale above two masters only if you need to scale up the API servers). Moreover, the way kops sets up the masters has each API server connecting to the local etcd node (for low latency reasons), which makes this setup potentially more fragile than having a balancer in front of an HA pool of etcd nodes.

k8s-spot-termination-handler by Pusher

A DaemonSet deployed on nodes running on AWS Spot instances, which continuously watches the AWS metadata service and gracefully drains the node itself when the spot instance receives a termination notice (precisely 2 minutes before its termination).

Source: k8s-spot-termination-handler
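
The mechanism is simple enough to sketch. This is not the Pusher implementation, just my reading of the idea: poll the spot termination-time path of the instance metadata service (which, as far as I know, returns 404 until a notice is issued), then drain the node with kubectl. The function names, the polling interval and the drain flags are choices of this sketch, and it ignores the session tokens that newer metadata service versions may require.

```python
import subprocess
import time
import urllib.error
import urllib.request

TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"


def termination_notice(url=TERMINATION_URL, timeout=2.0):
    """Return the termination timestamp, or None when no notice is pending."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode().strip() or None
    except urllib.error.HTTPError as exc:
        if exc.code == 404:
            return None
        raise


def drain_node(node_name):
    """Evict the pods so they are rescheduled before the instance goes away."""
    subprocess.run(
        ["kubectl", "drain", node_name, "--ignore-daemonsets", "--force"],
        check=True,
    )


def watch(node_name, check=termination_notice, drain=drain_node,
          interval=5.0, sleep=time.sleep):
    """Poll until a termination notice shows up, then drain the node once."""
    while True:
        notice = check()
        if notice:
            drain(node_name)
            return notice
        sleep(interval)
```

With a two minute warning, the polling interval plus the pods' termination grace period have to fit comfortably within that window.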

microscanner by Aqua Security

Free-to-use tool for scanning your container images for package vulnerabilities, based on the same vulnerability database as Aqua’s commercial solution. MicroScanner is a binary that runs during the build steps in a Dockerfile and returns a non-zero exit code (plus a JSON report) if any vulnerability has been found:

ADD https://get.aquasec.com/microscanner /
RUN chmod +x /microscanner
RUN /microscanner <TOKEN>

Apart from adding it directly to the Dockerfile (which is questionable in my opinion), I personally find it very easy to integrate in a CI/CD pipeline, where on-the-fly Docker images are built FROM the to-be-tested image in a dedicated CI/CD step, microscanner gets executed and its output is parsed.
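
For the CI/CD variant, the throwaway Dockerfile can be generated on the fly. This is a minimal sketch of that idea: the function name is made up, the image name and token in the usage line are placeholders, and it assumes the scanned image lets the build run chmod at the root of the filesystem (an image with a non-root default user would need extra care).

```python
def scan_dockerfile(image, token):
    """Render a throwaway Dockerfile that scans `image` at build time."""
    lines = [
        f"FROM {image}",
        "ADD https://get.aquasec.com/microscanner /",
        "RUN chmod +x /microscanner",
        # The build fails here if microscanner exits with a non-zero code.
        f"RUN /microscanner {token}",
    ]
    return "\n".join(lines) + "\n"


print(scan_dockerfile("registry.example.com/my-app:candidate", "<TOKEN>"))
```

The rendered text can be piped to `docker build -` in a dedicated pipeline step, so the image that actually gets shipped never contains the scanner binary.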

Source: Aqua’s New MicroScanner: Free Image Vulnerability Scanner for Developers

heptio-authenticator-aws by Heptio

A tool for using AWS IAM credentials to authenticate kubectl to a Kubernetes cluster. Requires kubectl 1.10+.

Source:

Kubervisor by AmadeusITGroup

Kubervisor is an operator that controls which pods should receive ingress traffic, based on anomaly detection.

The anomaly detection is based on an analysis done on custom metrics (supports Prometheus) and decisions are taken at the controller level (not on each single pod independently) so that the controller can avoid major disruptions in case the majority of pods are detected as unhealthy (the controller removes traffic from pods only if a minority of them are unhealthy). The query and threshold configuration is up to the user, and can be driven by technical and/or business requirements.

It’s worth noting that the only “write” action done by Kubervisor is relabelling pods, so that a suitably configured Service won’t match the unhealthy pods’ labels and thus will remove traffic from them.

For example, by default each pod has the kubervisor/traffic=yes label and this label is added to the Service selector; unhealthy pods will be relabeled to kubervisor/traffic=no, causing the Service to stop matching them and remove them from the pool of available pods.
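
The relabelling trick and the "minority only" safeguard can be modelled in a few lines. This is a toy model of the behaviour described above and not Kubervisor's code: the function names and data shapes are mine, and "minority" is interpreted here as strictly less than half of the pods.

```python
TRAFFIC_LABEL = "kubervisor/traffic"


def relabel(pods, unhealthy):
    """pods: {name: labels}; unhealthy: names flagged by anomaly detection."""
    flagged = [name for name in pods if name in unhealthy]
    # Safeguard: do nothing unless the unhealthy pods are a strict minority.
    if len(flagged) * 2 >= len(pods):
        return dict(pods)
    return {
        name: {**labels, TRAFFIC_LABEL: "no"} if name in unhealthy else labels
        for name, labels in pods.items()
    }


def service_endpoints(pods, selector):
    """Names of the pods whose labels match every key of the selector."""
    return sorted(
        name for name, labels in pods.items()
        if all(labels.get(key) == value for key, value in selector.items())
    )


pods = {name: {"app": "web", TRAFFIC_LABEL: "yes"} for name in ("a", "b", "c")}
selector = {"app": "web", TRAFFIC_LABEL: "yes"}

print(service_endpoints(relabel(pods, {"a"}), selector))
print(service_endpoints(relabel(pods, {"a", "b"}), selector))
```

In the second call two pods out of three are flagged, so nothing is relabelled and all three keep receiving traffic.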

It’s not used in production yet.

Source:

Operator Framework

CoreOS (now Red Hat) has recently announced the operator-framework, a Go framework to ease building operators (or controllers - they’re the same thing, just a different naming).


Keynotes really worth watching

Most keynotes were very interesting, but here I’ve shared a couple that - in some way - inspired me. You can watch all KubeCon / CloudNativeCon videos (keynotes and talks) in this YouTube playlist.

Switching Horses Midstream: The Challenges of Migrating 150+ Microservices to Kubernetes

Anatomy of a Production Kubernetes Outage

How the root cause of an hour and a half outage started two weeks before, and why having the Kubernetes cluster fully down for such a long period wasn’t a catastrophic failure for a bank. A good lesson on distributed system complexity, post mortem analysis and design for failure.
