December 12, 2019

OpenStack Superuser

Ethernet VPN Deployment Automation with OpenStack and ODL Controller

The cool thing about OpenStack is its tight integration with SDN solutions like OpenDaylight, which makes it possible to keep network traffic isolated, scale on demand and centrally control geographically distributed data centers. In this article, we look at a proposed SDN-based architecture in which OpenStack and OpenDaylight are used to automate the deployment of VPN instances (Ethernet VPN in this case), manage them centrally, push network policy updates, and improve VPN scalability and response time.

The Problem with Interconnecting Data Centers with L2VPN

A Virtual Private Network (VPN) is generally used to interconnect geographically distributed data centers, and several generations of VPN technologies have been introduced to address the connectivity needs between different sites. Layer 2 VPN (L2VPN) is widely used by organizations due to its flexibility and transparency, and it relies on Virtual Private LAN Service (VPLS) to connect data centers. The main advantage of VPLS is that it can extend a VLAN across data centers, but VPLS has its own barriers in terms of redundancy, scalability, flexibility, and limited forwarding policies. Internet Service Providers (ISPs), meanwhile, use Multiprotocol Label Switching (MPLS) for data center interconnection because of its flexibility and ease of deployment, which creates the need for a VPN technology designed for MPLS. This is where Ethernet VPN (EVPN) comes in: it addresses the concerns and challenges of running a VPN over MPLS by simply enabling an L2 VPN connection over MPLS.

The core problem with EVPN has been the manual configuration and management of EVPN instances (EVIs), which is time consuming, error prone and expensive in terms of OPEX.

An SDN-Based Solution

To address the problem, an SDN-based architecture was proposed by researchers and engineers from Karlstad University and Ericsson. It uses the OpenDaylight SDN controller and OpenStack for automated remote deployment and automation of EVPN-related tasks.

The solution offered in the paper mainly addresses two existing limitations: the lack of flexible, automated network management, and the control plane complexity of MPLS-based VPNs, while providing the flexibility to add new network changes.

Architecture

Before we dive into the architecture, let’s talk about why EVPN is the key technology that lets this solution run dynamically over MPLS. EVPN uses MP-BGP in its control plane as the signaling method to advertise addresses, which removes the need for traditional flood-and-learn in the data plane. In EVPN, the control and data planes are abstracted and separated, which allows MPLS and Provider Backbone Bridging to be used together with the EVPN control plane.

SDN Based EVPN Deployment Automation Architecture

The above architecture depicts model-driven network management and automation of EVPN instances. In this model, the YANG data modeling language is used to define services and configurations, represent state data and process notifications. Configuration data defined against the YANG models is transmitted to network devices using the NETCONF protocol, which handles the installation, deletion, and manipulation of device configuration; the messages themselves are encoded in XML. NETCONF lets the administrator push the data, validate the configuration and, after successful execution, commit the changes to the network devices. The SDN controller leverages NETCONF to automate the configuration of EVIs on provider edge routers.
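
The paper itself does not publish controller code, but the edit-config/validate/commit workflow described above maps onto any NETCONF client. Below is a minimal sketch using the Python ncclient library; the host, credentials and placeholder payload are illustrative, and a real EVI configuration would follow whatever YANG model the provider edge router exposes.

from ncclient import manager

# Placeholder payload: the real EVI configuration depends on the router's
# YANG modules, so only the overall shape is shown here.
EVI_CONFIG = """
<config>
  <!-- YANG-modeled EVPN instance definition goes here -->
</config>
"""

# Hypothetical provider edge router and credentials.
with manager.connect(host="192.0.2.1", port=830, username="admin",
                     password="admin", hostkey_verify=False) as nc:
    nc.edit_config(target="candidate", config=EVI_CONFIG)  # push the change
    nc.validate(source="candidate")                        # validate it
    nc.commit()                                            # commit to the device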

Let’s understand the role of the key components in the architecture:

OpenStack: Used as the central cloud platform to orchestrate the management of EVPNs through the SDN controller. The OpenStack Neutron API is used to communicate with the ODL SDN controller to manage the EVPN instances attached to the network (a request sketch follows the component list).

OpenDaylight SDN Controller: The core element of the architecture. It extends the Multiprotocol Border Gateway Protocol (MP-BGP) support inside the OpenDaylight controller with an MP-BGP control plane (EVPN instances on the provider edge/data center) and uses the VPNService inside the controller to automate EVPN configuration with YANG and NETCONF. This bypasses the slow and error-prone task of manual EVPN configuration.

Open vSwitch (OVS): This switch sits inside the OpenStack compute nodes. It is used to isolate traffic among different VMs and connects them to the physical network.

Provider Edge (PE) routers: The PE acts as the middleware between the data centers and supports the EVPN and MP-BGP extensions as well as NETCONF and YANG.
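
On the OpenStack side, a common way to drive EVPN/BGP VPN instances from Neutron is the networking-bgpvpn API extension, which can be backed by an OpenDaylight driver. The sketch below is only an assumption of how such a request could look, not the exact workflow from the paper; the Neutron endpoint, token and route target are placeholders.

import requests

NEUTRON = "http://controller:9696"   # assumed Neutron endpoint
TOKEN = "<keystone-token>"           # assumed pre-fetched Keystone token

body = {"bgpvpn": {"name": "evpn-tenant-1",
                   "type": "l2",                  # EVPN-style L2 VPN
                   "route_targets": ["64512:101"]}}

resp = requests.post(f"{NEUTRON}/v2.0/bgpvpn/bgpvpns",
                     json=body,
                     headers={"X-Auth-Token": TOKEN})
resp.raise_for_status()
print(resp.json()["bgpvpn"]["id"])   # id of the new BGP VPN instance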

The above architecture has been evaluated by the authors; you can refer to the paper for the test results here.

 

The post Ethernet VPN Deployment Automation with OpenStack and ODL Controller appeared first on Superuser.

by Sagar Nangare at December 12, 2019 02:00 PM

Adam Young

Reading keystone.conf in a container

Step 3 of the 12 Factor app is to store config in the environment. For Keystone, the set of configuration options is controlled by the keystone.conf file. In an earlier attempt at containerizing the scripts used to configure Keystone, I had passed an environment variable in to the script that would then be written to the configuration file. I realize now that I want the whole keystone.conf external to the application. This allows me to set any of the configuration options without changing the code in the container. More importantly, it allows me to make the configuration information immutable inside the container, so that the applications cannot be hacked to change their own configuration options.

I was running the pod and mounting the local copy I had of the keystone.conf file using this command line:

podman run --mount type=bind,source=/home/ayoung/devel/container-keystone/keystone-db-init/keystone.conf,destination=/etc/keystone/keystone.conf:Z --add-host keystone-mariadb:10.89.0.47   --network maria-bridge  -it localhost/keystone-db-init 

It was returning with no output. To diagnose, I added on /bin/bash to the end of the command so I could poke around inside the running container before it exited.

podman run --mount /home/ayoung/devel/container-keystone/keystone-db-init/keystone.conf:/etc/keystone/keystone.conf    --add-host keystone-mariadb:10.89.0.47   --network maria-bridge  -it localhost/keystone-db-init /bin/bash

Once inside, I was able to look at the Keystone log file. A stack trace made me realize that I was not able to actually read the file /etc/keystone/keystone.conf. Using ls, it would show up like this:

-?????????? ? ?        ?             ?            ? keystone.conf:

It took a lot of trial and error to rectify it, including:

  • adding a parallel entry for the keystone user and group to my host's /etc/passwd and /etc/group files
  • ensuring that the file was owned by keystone outside the container
  • switching to the -v option to create the bind mount, as that allowed me to use the :Z option as well
  • adding the -u keystone option to the command line

The end command looked like this:

podman run -v /home/ayoung/devel/container-keystone/keystone-db-init/keystone.conf:/etc/keystone/keystone.conf:Z  -u keystone         --add-host keystone-mariadb:10.89.0.47   --network maria-bridge  -it localhost/keystone-db-init 

Once I had it correct, I could use the /bin/bash executable to again poke around inside the container. From the inside, I could run:

$ keystone-manage db_version
109
$ mysql -h keystone-mariadb -ukeystone -pkeystone keystone  -e "show databases;"
+--------------------+
| Database           |
+--------------------+
| information_schema |
| keystone           |
+--------------------+

Next up is to try this with OpenShift.

by Adam Young at December 12, 2019 12:09 AM

December 10, 2019

OpenStack Superuser

Unleashing the OpenStack “Train”: Contribution from Intel and Inspur

The OpenStack community released the latest version, “Train”, on October 16th. As Platinum and Gold members of the OpenStack Foundation, the Intel and Inspur OpenStack teams are actively contributing to community projects such as Nova, Neutron, Cinder, Cyborg, and others. During the Train development cycle, both companies collaborated, contributed to and completed multiple achievements. This includes 4 blueprints and design specifications in Train, plus commits, reviews and more, and reflects their high level of contribution to the development of the OpenStack code base.

In early September 2019, Intel and Inspur worked together and used InCloud OpenStack 5.6 (ICOS 5.6) to validate a single-cluster deployment with 200 and 500 nodes. This created a solid foundational reference architecture for OpenStack in a large-scale single-cluster environment. Intel and Inspur closely monitor the latest development updates in the community and upgraded ICOS 5.6 to support new features of Train. For example, while validating the solution, a networking bottleneck (Neutron IPAM DLM and IP address allocation) was found in a large-scale, high-concurrency provisioning scenario (e.g. creating more than 800 VMs). After applying a distributed lock solution based on etcd, the network creation process was optimized and system performance improved significantly. The team also worked on the Nova project to provide a “delete on termination” feature for VM volumes, which greatly improves operational efficiency for cloud administrators. Another important new feature, Nova VPMEM, is also included in the OpenStack Train release. It guarantees persistent data storage across power cycles, at a lower cost and larger capacity than DRAM, which can significantly improve workload performance for applications such as Redis, RocksDB, SAP HANA, Aerospike, etc.
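
The article doesn't include the Neutron change itself, but the general shape of an etcd-backed distributed lock around IP allocation looks roughly like the sketch below, written with the python etcd3 client. The lock name and the allocation helpers are hypothetical stand-ins for Neutron's IPAM logic, not the actual patch.

import etcd3

client = etcd3.client(host="127.0.0.1", port=2379)   # assumed etcd endpoint

def allocate_ip(subnet_id: str) -> str:
    # Serialize IP allocation per subnet across concurrent API workers.
    with client.lock(f"neutron/ipam/{subnet_id}", ttl=30):
        ip = next_free_ip(subnet_id)       # hypothetical: read current allocations
        record_allocation(subnet_id, ip)   # hypothetical: persist the new allocation
        return ip

# Placeholder implementations so the sketch is self-contained.
def next_free_ip(subnet_id: str) -> str:
    return "10.0.0.15"

def record_allocation(subnet_id: str, ip: str) -> None:
    pass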

Intel and Inspur shared many of these engineering best practices at the recent Shanghai Open Infrastructure Summit, including resources for 500-node large-scale cluster deployment, in sessions such as “full stack security chain of trust and best practices in cloud”, “improving private cloud performance for big data analytics workloads”, and more.

Chief Architect of Intel Data Center Group, Enterprise & Government for China Division, Dr Yih Leong Sun said: Intel is actively contributing to the OpenStack upstream community and will continue to improve OpenStack architecture with Intel’s latest technology. We strive to build a software defined infrastructure, optimized at both the software and hardware layer, and to deliver an Open Cloud solution that meets the workload performance requirements of the industry.

Vice President of Inspur Group, Zhang Dong indicated: Inspur is increasingly investing more on upstream community and contributing our knowledge and experience with industry deployment and usage. We continue to strengthen our technical leadership and contribution in the community, to help users solve real-world challenges, and to promote the OpenStack adoption.

 

Photo // CC BY NC

The post Unleashing the OpenStack “Train”: Contribution from Intel and Inspur appeared first on Superuser.

by Brin Zhang and Lily Wu at December 10, 2019 08:00 AM

December 09, 2019

RDO

Community Blog Round Up 09 December 2019

As we sail down the Ussuri river, Ben and Colleen report on their experiences at the Shanghai Open Infrastructure Summit while Adam dives into Buildah.

Let’s Buildah Keystoneconfig by Adam Young

Buildah is a valuable tool in the container ecosystem. As an effort to get more familiar with it, and to finally get my hand-rolled version of Keystone to deploy on Kubernetes, I decided to work through building a couple of Keystone based containers with Buildah.

Read more at https://adam.younglogic.com/2019/12/buildah-keystoneconfig/

Oslo in Shanghai by Ben Nemec

Despite my trepidation about the trip (some of it well-founded!), I made it to Shanghai and back for the Open Infrastructure Summit and Project Teams Gathering. I even managed to get some work done while I was there. 🙂

Read more at http://blog.nemebean.com/content/oslo-shanghai

Shanghai Open Infrastructure Forum and PTG by Colleen Murphy

The Open Infrastructure Summit, Forum, and Project Teams Gathering was held last week in the beautiful city of Shanghai. The event was held in the spirit of cross-cultural collaboration and attendees arrived with the intention of bridging the gap with a usually faraway but significant part of the OpenStack community.

Read more at http://www.gazlene.net/shanghai-forum-ptg.html

by Rain Leander at December 09, 2019 12:24 PM

December 06, 2019

OpenStack Superuser

A Guide to Kubernetes Etcd: All You Need to Know to Set up Etcd Clusters

We all know Kubernetes is a distributed platform that orchestrates worker nodes and is controlled by central master nodes. There can be any number of worker nodes handling pods. To keep track of all changes and updates on these nodes and to pass on the desired actions, Kubernetes uses etcd.

What is etcd in Kubernetes?

Etcd is a distributed, reliable key-value store which is simple, fast and secure. It acts as the backend for service discovery and as a database, running on several servers in the Kubernetes cluster at the same time to monitor changes in the cluster and to store the state/configuration data that needs to be accessed by the Kubernetes master or other cluster components. Additionally, etcd lets the Kubernetes master support service discovery, so that deployed applications can declare their availability for inclusion in a service.

The API server component on the Kubernetes master nodes communicates with etcd on behalf of the components spread across the cluster. Etcd is also used to store the desired state of the system.

As the key-value store for Kubernetes, etcd stores all of the configuration for Kubernetes clusters. This is different from a traditional database, which stores data in tabular form. In a table, some records may need additional columns that other records in the same database do not, which creates redundancy. Etcd instead keeps each record as its own key-value entry, so updating one record does not affect the others, and it adds and manages all records in a reliable way for Kubernetes.
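
As a rough illustration of that flat key-value layout, here is a minimal sketch using the python etcd3 client against a local etcd; the key names are only illustrative, not the exact keys Kubernetes writes.

import etcd3

etcd = etcd3.client(host="127.0.0.1", port=2379)

# Each object is a single key/value pair rather than a row in a shared table.
etcd.put("/registry/config/default/feature-flag", "enabled")
etcd.put("/registry/config/default/max-replicas", "5")

value, meta = etcd.get("/registry/config/default/max-replicas")
print(value.decode(), meta.mod_revision)

# A prefix (range) read replaces a table scan.
for value, meta in etcd.get_prefix("/registry/config/default/"):
    print(meta.key.decode(), "=", value.decode())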

Distributed and Consistent

Etcd stores critical data for Kubernetes. Being distributed, it also maintains a copy of the data store on machines/servers across the cluster, and every copy holds the same data as all the other etcd data stores. If one copy gets destroyed, the other two still hold the same information.

Deployment Methods for etcd in Kubernetes Clusters

Etcd is architected to enable high availability in Kubernetes. It can be deployed as pods on the master nodes:

Figure – etcd in the same cluster

Image source: https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/ha-topology/

It can also be deployed externally to improve resiliency and security:

Figure – etcd deployed externally

Image source: https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/ha-topology/

How etcd Works

Etcd acts as the brain of the Kubernetes cluster. The sequence of changes is monitored using etcd's ‘Watch’ function. With this function, Kubernetes can subscribe to changes within the cluster and execute any state requests coming from the API server. Etcd coordinates with different components within the distributed cluster: it reacts to changes in the state of components, and other components can in turn react to those changes.
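
The same subscribe-to-changes pattern is available directly from the python etcd3 client, as in the sketch below; the prefix is illustrative rather than the actual key space used by the API server.

import etcd3

etcd = etcd3.client(host="127.0.0.1", port=2379)

# Subscribe to every change under a prefix, much like the API server watches state.
events, cancel = etcd.watch_prefix("/registry/config/")

for event in events:
    # Each event carries the key and its new value (empty on delete).
    print(event.key.decode(), event.value.decode())
    break   # stop after the first change, just for this demo

cancel()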

While maintaining identical copies of the state across the group of etcd members in a cluster, there are situations where the same data needs to be written on more than one etcd instance at once. However, etcd is not supposed to update the same record independently on different instances.

In such cases, etcd does not process writes on every cluster node. Instead, only one of the instances takes responsibility for processing writes internally. That node is called the leader. The other nodes in the cluster elect a leader using the Raft algorithm, and once the leader is elected, the other nodes become its followers.

When a write request comes to the leader node, the leader processes the write and broadcasts a copy of the data to the other nodes. If one of the follower nodes is inactive or offline at that moment, the write request is marked complete based on the majority of available nodes: normally, the write is considered complete once the leader gets consent from a majority of the members in the cluster.

This is how the nodes elect a leader among themselves and how they ensure a write is propagated across all instances. This distributed consensus is implemented in etcd using the Raft protocol.
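
The “majority” in that description is simple arithmetic. The short sketch below shows the quorum size and failure tolerance for common cluster sizes, which is also why etcd clusters are usually run with an odd number of members.

def quorum(cluster_size: int) -> int:
    # Smallest number of members whose acknowledgement commits a write.
    return cluster_size // 2 + 1

for n in (1, 3, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerated failures={n - quorum(n)}")
# 1 members: quorum=1, tolerated failures=0
# 3 members: quorum=2, tolerated failures=1
# 5 members: quorum=3, tolerated failures=2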

How Clusters Work in etcd 

Kubernetes is the main consumer of the etcd project, which was initiated by CoreOS. Etcd has become the norm for tracking the overall state of Kubernetes cluster pods. Kubernetes allows various cluster architectures, which may involve etcd as a component of the master nodes or as an isolated component alongside multiple master nodes.

The role of etcd changes with the system configuration of any particular architecture, and this flexible placement of etcd for managing clusters can be used to improve scaling. The result is workloads that are easier to support and manage.

Here are the steps for initiating etcd in Kubernetes.

Wget the etcd files:

wget -q --show-progress --https-only --timestamping \ "https://github.com/etcd-io/etcd/releases/download/v3.4.0/etcd-v3.4.0-linux-amd64.tar.gz"

Tar and install the etcd server and the etcdctl tools:

{
  tar -xvf etcd-v3.4.0-linux-amd64.tar.gz
  sudo mv etcd-v3.4.0-linux-amd64/etcd* /usr/local/bin/
}

{
  sudo mkdir -p /etc/etcd /var/lib/etcd
  sudo cp ca.pem kubernetes-key.pem kubernetes.pem /etc/etcd/
}

Get the internal IP address of the current compute instance. It will be used to handle client requests and data transmission with the etcd cluster peers:

INTERNAL_IP=$(curl -s -H "Metadata-Flavor: Google" \  http://metadata.google.internal/computeMetadata/v1/instance/network-interfaces/0/ip)

Set a unique name for etcd to match the hostname of the current compute instance:

ETCD_NAME=$(hostname -s)

Create the etcd.service systemd unit file:

cat <<EOF | sudo tee /etc/systemd/system/etcd.service
[Unit]
Description=etcd
Documentation=https://github.com/coreos

[Service]
Type=notify
ExecStart=/usr/local/bin/etcd \\
  --name ${ETCD_NAME} \\
  --cert-file=/etc/etcd/kubernetes.pem \\
  --key-file=/etc/etcd/kubernetes-key.pem \\
  --peer-cert-file=/etc/etcd/kubernetes.pem \\
  --peer-key-file=/etc/etcd/kubernetes-key.pem \\
  --trusted-ca-file=/etc/etcd/ca.pem \\
  --peer-trusted-ca-file=/etc/etcd/ca.pem \\
  --peer-client-cert-auth \\
  --client-cert-auth \\
  --initial-advertise-peer-urls https://${INTERNAL_IP}:2380 \\
  --listen-peer-urls https://${INTERNAL_IP}:2380 \\
  --listen-client-urls https://${INTERNAL_IP}:2379,https://127.0.0.1:2379 \\
  --advertise-client-urls https://${INTERNAL_IP}:2379 \\
  --initial-cluster-token etcd-cluster-0 \\
  --initial-cluster controller-0=https://10.240.0.10:2380,controller-1=https://10.240.0.11:2380,controller-2=https://10.240.0.12:2380 \\
  --initial-cluster-state new \\
  --data-dir=/var/lib/etcd
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

 

Start the etcd server:

{
  sudo systemctl daemon-reload
  sudo systemctl enable etcd
  sudo systemctl start etcd
}

Repeat the above commands on each of: controller-0, controller-1, and controller-2.

List the etcd cluster members:

sudo ETCDCTL_API=3 etcdctl member list \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/ca.pem \
  --cert=/etc/etcd/kubernetes.pem \
  --key=/etc/etcd/kubernetes-key.pem

Output:

3a57933972cb5131, started, controller-2, https://10.240.0.12:2380, https://10.240.0.12:2379
f98dc20bce6225a0, started, controller-0, https://10.240.0.10:2380, https://10.240.0.10:2379
ffed16798470cab5, started, controller-1, https://10.240.0.11:2380, https://10.240.0.11:2379

Conclusion

Etcd is an independent project at its core, but it has been used extensively by the Kubernetes community to manage cluster state and enable further automation for dynamic workloads. The key benefit of using etcd with Kubernetes is that etcd is itself a distributed database that aligns well with the distributed nature of Kubernetes clusters, so using etcd with Kubernetes is vital for the health of the clusters.

About the author

Sagar Nangare is a technology blogger, focusing on data center technologies (networking, telecom, cloud, storage) and emerging domains like edge computing, IoT, machine learning and AI. He works at Calsoft Inc. as a digital strategist.

The post A Guide to Kubernetes Etcd: All You Need to Know to Set up Etcd Clusters appeared first on Superuser.

by Sagar Nangare at December 06, 2019 01:00 PM

December 03, 2019

Adam Young

Let’s Buildah Keystoneconfig

Buildah is a valuable tool in the container ecosystem. As an effort to get more familiar with it, and to finally get my hand-rolled version of Keystone to deploy on Kubernetes, I decided to work through building a couple of Keystone based containers with Buildah.

First, I went with the simple approach of modifying my old Dockerfiles to a later release of OpenStack, and kick off the install using buildah. I went with Stein.

Why not Train? Because eventually I want to test zero-downtime upgrades. More on that later.

The buildah command was just:

 buildah bud -t keystone 

However, to make that work, I had to adjust the Dockerfile. Here is the diff:

diff --git a/keystoneconfig/Dockerfile b/keystoneconfig/Dockerfile
index 149e62f..cd5aa5c 100644
--- a/keystoneconfig/Dockerfile
+++ b/keystoneconfig/Dockerfile
@@ -1,11 +1,11 @@
-FROM index.docker.io/centos:7
+FROM docker.io/centos:7
 MAINTAINER Adam Young 
  
-RUN yum install -y centos-release-openstack-rocky &&\
+RUN yum install -y centos-release-openstack-stein &&\
     yum update -y &&\
     yum -y install openstack-keystone mariadb openstack-utils  &&\
     yum -y clean all
  
 COPY ./keystone-configure.sql /
 COPY ./configure_keystone.sh /
-CMD /configure_keystone.sh
\ No newline at end of file
+CMD /configure_keystone.sh

The biggest difference is that I had to specify the name of the base image without the “index.” prefix. Buildah is strictah (heh) in what it accepts.

I also updated the package to stein. When I was done, I had the following:

$ buildah images
REPOSITORY                 TAG      IMAGE ID       CREATED          SIZE
localhost/keystone         latest   e52d224fa8fe   13 minutes ago   509 MB
docker.io/library/centos   7        5e35e350aded   3 weeks ago      211 MB

What if I wanted to do these same things via manual steps? Following the advice from the community, I can translate from Dockerfile-ese to buildah. First, I can fetch the original image using the buildah from command:

container=$(buildah from docker.io/centos:7)
$ echo $container 
centos-working-container

Now add things to the container. We don’t build a new layer with each command, so the && approach is not required. So for the yum installs:

buildah run $container yum install -y centos-release-openstack-stein
buildah run $container yum update -y
buildah run $container  yum -y install openstack-keystone mariadb openstack-utils
buildah run $container  yum -y clean all

To Get the files into the container, use the copy commands:

buildah copy $container  ./keystone-configure.sql / 
buildah copy $container ./configure_keystone.sh / 

The final steps: tell the container what command to run and commit it to an image.

buildah config --cmd /configure_keystone.sh $container
buildah commit $container keystone

What do we end up with?

$ buildah images
REPOSITORY                 TAG      IMAGE ID       CREATED              SIZE
localhost/keystone         latest   09981bc1e95a   About a minute ago   509 MB
docker.io/library/centos   7        5e35e350aded   3 weeks ago          211 MB

Since I have an old, hard-coded IP address for the MySQL server, it is going to fail. But let’s see:

buildah run centos-working-container /configure_keystone.sh
2019-12-03T16:34:16.000691965Z: cannot configure rootless cgroup using the cgroupfs manager
Database

And there it hangs. We’ll work on that in a bit.

I committed the container before setting the author field. That should be a line like:
buildah config --author "ayoung@redhat.com"
to map line-to-line with the Dockerfile.

by Adam Young at December 03, 2019 04:43 PM

December 01, 2019

Thomas Goirand

Upgrading an OpenStack Rocky cluster from Stretch to Buster

Upgrading an OpenStack cluster from one version of OpenStack to another has become easier, thanks to the versioning of objects in the RabbitMQ message bus (if you want to know more, see what oslo.versionedobjects is). But upgrading from Stretch to Buster isn’t easy at all, even with the same version of OpenStack (it is easier to be running OpenStack Rocky backports on Stretch and upgrade to Rocky on Buster, rather than upgrading OpenStack at the same time as the system).

The reason it is difficult is that RabbitMQ and Corosync in Stretch can’t talk to the versions shipped in Buster. Also, in a normal OpenStack cluster deployment, services on all machines are constantly querying the OpenStack API and exchanging messages through the RabbitMQ message bus. One of the dangers, for example, would be a Neutron DHCP agent being unable to exchange messages with the neutron-rpc-server: the VM instances in the OpenStack cluster could then lose connectivity.

If a constantly online HA upgrade with no downtime isn’t possible, it is however possible to minimize the downtime to just a few seconds by following the correct procedure. It took me more than 10 tries to be able to do everything in a smooth way, understanding and working around all the issues. Ten tries means installing an OpenStack cluster on Stretch 10 times (which, even if fully automated, takes about 2 hours) and trying to upgrade it to Buster. All of this is very time consuming, and I haven’t seen any web site documenting this process.

This blog post intends to document such a process, to save the readers the pain of hours of experimentation.

Note that this blog post assumes your cluster has been deployed using OCI (see: https://salsa.debian.org/openstack-team/debian/openstack-cluster-installer); however, it should also apply to any generic OpenStack installation, or even to any cluster running RabbitMQ and Corosync.

The root cause of the problem in more detail: incompatible RabbitMQ and Corosync in Stretch and Buster

RabbitMQ in Stretch is version 3.6.6, and Buster has version 3.7.8. In theory, the RabbitMQ documentation says it is possible to smoothly upgrade a cluster across these versions. However, in practice, the problem is the Erlang version rather than Rabbit itself: RabbitMQ in Buster will refuse to talk to a cluster running Stretch (the daemon will even refuse to start).

In the same way, Corosync 3.0 in Buster will refuse to accept messages from Corosync 2.4 in Stretch.

Overview of the solution for RabbitMQ & Corosync

To minimize downtime, my method is to shut down RabbitMQ on node 1 and let all daemons (re-)connect to nodes 2 and 3. Then we upgrade node 1 fully and restart Rabbit on it. Then we shut down Rabbit on nodes 2 and 3, so that all daemons of the cluster reconnect to node 1. If done well, the only issue is if a message is still in the cluster of nodes 2 and 3 when the daemons fail over to node 1. In reality, this isn’t really a problem unless there’s a lot of activity on the OpenStack API. If that were the case (for example, if running a public cloud), then the advice would simply be to firewall the OpenStack API for the short upgrade period (which shouldn’t last more than a few minutes).

Then we upgrade nodes 2 and 3 and make them join the newly created RabbitMQ cluster on node 1.

For Corosync, node 1 will not start the VIP resource before node 2 is upgraded and both nodes can talk to each other. So we just upgrade node 2, and turn off the VIP resource on node 3 as soon as it is up on nodes 1 and 2 (which happens during the upgrade of node 2).

The above should be enough reading for most readers. If you’re not that much into OpenStack, it’s OK to stop reading this post. For those who are more involved users of OpenStack on Debian deployed with OCI, let’s go into more detail…

Before you start: upgrading OCI

In previous versions of OCI, the haproxy configuration was missing an “option httpchk” for the MariaDB backend, and therefore, if a MySQL server on one node went down, haproxy wouldn’t detect it, and the whole cluster could fail (re-)connecting to MySQL. As we’re going to bring some MySQL servers down, make sure the puppet-master is running with the latest version of puppet-module-oci, and that the changes have been applied on all OpenStack controller nodes.

Upgrading compute nodes

Before we upgrade the controllers, it’s best to start with the compute nodes, which are the easiest to do. The easiest way is to live-migrate all VMs away from the machine before proceeding. First, we disable the node, so no new VM can be spawned on it:

openstack compute service set --disable z-compute-1.example.com nova-compute

Then we list all VMs on that compute node:

openstack server list --all-projects --host z-compute-1.example.com

Finally we migrate all VMs away:

openstack server migrate --live hostname-compute-3.infomaniak.ch --block-migration 8dac2f33-d4fd-4c11-b814-5f6959fe9aac

Now we can do the upgrade. First disable puppet, then tweak the sources.list, upgrade and reboot:

puppet agent --disable "Upgrading to buster"
apt-get remove python3-rgw python3-rbd python3-rados python3-cephfs librgw2 librbd1 librados2 libcephfs2
rm /etc/apt/sources.list.d/ceph.list
sed -i s/stretch/buster/g /etc/apt/sources.list
mv /etc/apt/sources.list.d/stretch-rocky.list /etc/apt/sources.list.d/buster-rocky.list
echo "deb http://stretch-rocky.debian.net/debian buster-rocky-proposed-updates main
deb-src http://stretch-rocky.debian.net/debian buster-rocky-proposed-updates main" >/etc/apt/sources.list.d/buster-rocky.list
apt-get update
apt-get dist-upgrade
reboot

Then we simply re-apply puppet:

puppet agent --enable ; puppet agent -t
apt-get purge linux-image-4.19.0-0.bpo.5-amd64 linux-image-4.9.0-9-amd64

Then we can re-enable the compute service:

openstack compute service set --enable z-compute-1.example.com nova-compute

Repeat the operation for all compute nodes; then we’re ready for the upgrade of the controller nodes.

Removing Ceph dependencies from nodes

Most likely, if running OpenStack Rocky on Stretch, you’d be running with upstream packages for Ceph Luminous. When upgrading to Buster, there’s no upstream repository anymore, and packages will use Ceph Luminous directly from Buster. Unfortunately, the packages from Buster are at a lower version than the packages from upstream. So before upgrading, we must remove all Ceph packages from upstream. This is what was done just above for the compute nodes as well. Upstream Ceph packages are easily identifiable, because upstream uses “bpo90” instead of what we do in Debian (ie: bpo9), so the operation can be:

apt-get remove $(dpkg -l | grep bpo90 | awk '{print $2}' | tr '\n' ' ')

This will remove python3-nova, which is fine as it is also running on the other 2 controllers. After switching the /etc/apt/sources.list to buster, Nova can be installed again.

In a normal setup by OCI, here’s the sequence of commands that needs to be run:

rm /etc/apt/sources.list.d/ceph.list
sed -i s/stretch/buster/g /etc/apt/sources.list
mv /etc/apt/sources.list.d/stretch-rocky.list /etc/apt/sources.list.d/buster-rocky.list
echo "deb http://stretch-rocky.debian.net/debian buster-rocky-proposed-updates main
deb-src http://stretch-rocky.debian.net/debian buster-rocky-proposed-updates main" >/etc/apt/sources.list.d/buster-rocky.list
apt-get update
apt-get dist-upgrade
apt-get install nova-api nova-conductor nova-consoleauth nova-consoleproxy nova-placement-api nova-scheduler

You may notice that we’re replacing the Stretch Rocky backports repository with one for Buster. Indeed, even if all of Rocky is in Buster, there are a few packages still pending review by the Debian stable release team before they can be uploaded to Buster, and we need the fixes for a smooth upgrade. See release team bugs #942201, #942102, #944594, #941901 and #939036 for more details.

Also, since we only did an “apt-get remove”, the Nova configuration in nova.conf has stayed in place and Nova is already configured, so when we reinstall the services we removed along with the Ceph dependencies, they will be ready to go.

Upgrading the MariaDB galera cluster

In an HA OpenStack cluster, typically, a Galera MariaDB cluster is used. That isn’t a problem when upgrading from Stretch to Buster, because the on-the-wire format stays the same. However, the xtrabackup library in Stretch is shipped by the MariaDB packages themselves, while in Buster one must install the mariadb-backup package. As a consequence, it is best to simply turn off MariaDB on a node, do the Buster upgrade, install the mariadb-backup package, and restart MariaDB. To avoid the MariaDB package attempting to restart the mysqld daemon, it is best to mask the systemd unit:

systemctl stop mysql.service
systemctl disable mysql.service
systemctl mask mysql.service

Upgrading rabbitmq-server

Before doing anything, make sure all of your cluster is running with python3-oslo.messaging version >= 8.1.4. Indeed, version 8.1.3 suffers from a bug where daemons would constantly attempt to reconnect to the same server, instead of trying each of the servers described in the transport_url directive. Note that I’ve uploaded 8.1.4-1+deb10u1 to Buster, and that it is part of the 10.2 Buster point release. Upgrading oslo.messaging will not restart daemons automatically, though: this must be done manually.

The strategy for RabbitMQ is to completely upgrade one node and start Rabbit on it, without any clustering, then shut down the service on the other 2 nodes of the cluster. If this is performed fast enough, no message will be lost in the message bus. However, there are a few traps. Running “rabbitmqctl forget_cluster_node” only removes a node from the cluster for the nodes that will still be running; it doesn’t remove the other nodes from the one we want to upgrade. The way I’ve found to solve this is to simply remove the mnesia database of the first node, so that when it starts, RabbitMQ doesn’t attempt to cluster with the other 2, which are running a different version of Erlang. If it did, it would just fail and refuse to start.

However, there’s another issue to take care of. When upgrading the first node to Buster, we removed Nova because of the Ceph issue. Before we restart the RabbitMQ service on node 1, we need to install Nova, so that it will connect to either node 2 or 3. If we don’t do that, then Nova on node 1 may connect to the RabbitMQ service on node 1, which at this point is a different RabbitMQ cluster than the one on nodes 2 and 3.

rabbitmqctl stop_app
systemctl stop rabbitmq-server.service
systemctl disable rabbitmq-server.service
systemctl mask rabbitmq-server.service
[ ... do the Buster upgrade fully ...]
[ ... reinstall Nova services we removed when removing Ceph ...]
rm -rf /var/lib/rabbitmq/mnesia
systemctl unmask rabbitmq-server.service
systemctl enable rabbitmq-server.service
systemctl start rabbitmq-server.service

At this point, since the node 1 RabbitMQ service was down, all daemons are connected to the RabbitMQ service on node 2 or 3. Removing the mnesia database removes all the credentials previously added to RabbitMQ. If nothing is done, OpenStack daemons will not be able to connect to the RabbitMQ service on node 1. If, like I do, one is using a config management system to populate the access rights, it’s rather easy: simply re-apply the puppet manifests, which will re-add the credentials. However, that isn’t enough: the RabbitMQ message queues are created when the OpenStack daemons start. As I experienced, daemons will reconnect to the message bus, but will not recreate the queues unless they are restarted. Therefore, the sequence is as follows:

Do “rabbitmqctl start_app” on the first node. Add all credentials to it. If your cluster was set up with OCI and puppet, simply look at the output of “puppet agent -t --debug” to capture the list of commands to perform the credential setup.

Do a “rabbitmqctl stop_app” on both remaining nodes 2 and 3. At this point, all daemons will reconnect to the only remaining server. However, they won’t be able to exchange messages, as the queues aren’t declared. This is when we must restart all daemons on one of the controllers. The whole operation normally doesn’t take more than a few seconds, which is how long the message bus won’t be available. To make sure everything works, check the logs in /var/log/nova/nova-compute.log on one of your compute nodes to make sure Nova is able to report its configuration to the placement service.

Once all of this is done, there’s nothing to worry about anymore regarding RabbitMQ, as all daemons of the cluster are connected to the service on node 1. However, one must make sure that, when upgrading nodes 2 and 3, they don’t reconnect to the message service on nodes 2 and 3. So it is best to simply stop, disable and mask the service with systemd before continuing. Then, when restarting the Rabbit service on nodes 2 and 3, OCI’s shell script “oci-auto-join-rabbitmq-cluster” will make them join the new Rabbit cluster, and everything should be fine regarding the message bus.

Upgrading corosync

In an OpenStack cluster set up by OCI, 3 controllers are typically deployed, serving the OpenStack API through a VIP (a Virtual IP). What we call a virtual IP is simply an IP address which is able to move from one node to another automatically depending on the cluster state. For example, with 3 nodes, if one goes down, one of the other 2 nodes takes over hosting the IP address which serves the OpenStack API. This is typically done with corosync/pacemaker, which is what OCI sets up.

The way to upgrade Corosync is easier than the RabbitMQ case. The first node will refuse to start the corosync resource if it can’t talk to at least a second node. Therefore, upgrading the first node is transparent until we touch the second node: the openstack-api resource won’t be started on the first node, so we can finish the upgrade on it safely (ie: take care of RabbitMQ as per above). The first thing to do is probably to move the resource to the third node:

crm_resource --move --resource openstack-api-vip --node z-controller-3.example.com

Once the first node is completely upgraded, we upgrade the 2nd node. When it is up again, we can check the corosync status to make sure it is running on both nodes 1 and 2:

crm status

If we see the service is up on nodes 1 and 2, we must quickly shut down the corosync resource on node 3:

crm resource stop openstack-api-vip

If that’s not done, then node 3 may also reclaim the VIP, and therefore 2 nodes may have it. If running the VIP over L2, normally the switches will only direct traffic to one of the machines declaring the VIP, so even if we don’t take care of it immediately, the upgrade should be smooth anyway. If, like I do in production, you’re running with BGP (OCI allows one to use BGP for the VIP, or simply use an IP on a normal L2 network), then the situation is even better, as the peering router will continue to route to one of the controllers in the cluster. So no stress: this must be done, but there is no need to hurry as much as for the RabbitMQ service.

Finalizing the upgrade

Once nodes 1 and 2 are up, most of the work is done, and the 3rd node can be upgraded without any stress.

Recap of the procedure for controllers

  • Move all SNAT virtual routers running on node 1 to node 2 or 3 (note: this isn’t needed if the cluster has network nodes).
  • Disable puppet on node 1.
  • Remove all Ceph libraries from upstream on node 1, which also turns off some Nova services that depend on them at runtime.
  • shutdown rabbitmq on node 1, including masking the service with systemd.
  • upgrade node 1 to Buster, fully. Then reboot it. This probably will trigger MySQL re-connections to node 2 or 3.
  • install mariadb-backup, start the mysql service, and make sure MariaDB is in sync with the other 2 nodes (check the log files).
  • reinstall missing Nova services on node 1.
  • remove the mnesia db on node 1.
  • start rabbitmq on node 1 (which now, isn’t part of the RabbitMQ cluster on node 2 and 3).
  • Disable puppet on node 2.
  • populate RabbitMQ access rights on node 1. This can be done by simply applying puppet, but may be dangerous if puppet restarts the OpenStack daemons (which therefore may connect to the RabbitMQ on node 1), so best is to just re-apply the grant access commands only.
  • shutdown rabbitmq on node 2 and 3 using “rabbitmqctl stop_app”.
  • quickly restart all daemons on one controller (for example the daemons on node 1) to declare message queues. Now all daemons must be reconnected and working with the RabbitMQ cluster on node 1 alone.
  • Re-enable puppet, and re-apply puppet on node 1.
  • Move all Neutron virtual routers from node 2 to node 1.
  • Make sure the RabbitMQ services are completely stopped on node 2 and 3 (mask the service with systemd).
  • upgrade node 2 to Buster (shutting down RabbitMQ completely, masking the service to avoid it restarts during upgrade, removing the mnesia db for RabbitMQ, and finally making it rejoin the newly node 1 single node cluster using oci-auto-join-rabbitmq-cluster: normally, puppet does that for us).
  • Reboot node 2.
  • When corosync on node 2 is up again, check corosync status to make sure we are clustering between node 1 and 2 (maybe the resource on node 1 needs to be started), and shutdown the corosync “openstack-api-vip” resource on node 3 to avoid the VIP to be declared on both nodes.
  • Re-enable puppet and run puppet agent -t on node 2.
  • Make sure node 2’s rabbitmq-server has joined the new cluster declared on node 1 (do: rabbitmqctl cluster_status) so we have HA for Rabbit again.
  • Move all Neutron virtual routers of node 3 to node 1 or 2.
  • Upgrade node 3 fully, reboot it, and make sure Rabbit is connected to node 1 and 2, as well as corosync working too, then re-apply puppet again.

Note that we do need to re-apply puppet each time, because of some differences between Stretch and Buster. For example, Neutron in Rocky isn’t able to use iptables-nft, and puppet needs to run an update-alternatives command to select iptables-legacy instead (I’m writing this because it isn’t obvious: sometimes, Neutron fails to parse the output of iptables-nft…).

Last words as a conclusion

While OpenStack itself has made a lot of progress on upgrades, it is very disappointing that the components OpenStack relies on (like Corosync, which is typically used as the provider of high availability) aren’t designed with backward compatibility in mind. It is also disappointing that the Erlang versions in Stretch and Buster are incompatible in this way.

However, with the correct procedure, it’s still possible to keep services up and running, with a very small downtime, even to the point that a public cloud user wouldn’t even notice it.

As the procedure isn’t easy, I strongly suggest anyone attempting such an upgrade train before proceeding. With OCI, it is easy to run a PoC using the openstack-cluster-installer-poc package, which is the perfect environment to train on: it’s easy to reproduce, reinstall a cluster and restart the upgrade procedure.

by Goirand Thomas at December 01, 2019 04:45 PM

November 28, 2019

Aptira

Comparison of Software Defined Networking (SDN) Controllers. Part 8: Tungsten Fabric

Aptira Comparison of Software Defined Networking (SDN) Controllers. Tungsten Fabric

The previous Software Defined Networking (SDN) controllers in this series might help users and organisations choose the right SDN controller for their platform, one that matches their network infrastructure and requirements. Those controllers can be a suitable choice for Communication Service Providers (CSPs), data centers and research, or for integration with other platforms. However, in the current IT market, organisations are migrating their old infrastructure to the Cloud and cloudifying every part of their infrastructure. As such, we will now look at one SDN controller which has been designed to work in a cloud-grade network: Tungsten Fabric (TF).

TF can be a suitable choice for cloud builders and cloud-native platform engineers. It was first associated with Juniper but is now under the Linux Foundation umbrella.

Architecture

Tungsten Fabric’s architecture is composed of two major software components: the TF vRouter and the TF Controller.

Aptira Tungsten Fabric Architecture
TF vRouter is used for packet forwarding and applying network and security policies to the devices in the network.

  • A vRouter needs to run on each host or compute node in the network. It replaces the Linux bridge and the traditional routing stack (iptables), or Open vSwitch networking, on the compute hosts.
  • The TF Controller communicates with the vRouters via Extensible Messaging and Presence Protocol (XMPP) to apply the desired networking and security policies.

The TF Controller consists of the following software services:

  • Control and Configuration services for communicating with vRouters and maintaining the network topology and network policies.
  • Analytics services for telemetry and troubleshooting.
  • Web UI services for interacting with users.
  • And finally, services to provide integration with private and public clouds, CNI plugins, virtual machines and bare metal.

Tungsten Fabric version 5.0 and later uses a microservices architecture based on Docker containers, as shown in the figure below, to deploy the services mentioned above. This makes the controller resilient against failure and highly available, which improves the user experience.

Aptira Tungsten Fabric Architecture

Modularity and Extensibility

TF’s microservice-based architecture allows individual services to be developed and scaled based on performance requirements and increasing load. Also, microservices are modular by nature, which makes maintenance and extensibility of the platform easy while isolating failures of services from each other.

Scalability

Cluster Scalability

  • TF approaches cluster scalability in a modular fashion: each TF role can be scaled horizontally by adding more nodes for that role, and the number of pods per node is also scalable. ZooKeeper is used to choose the active node, so the number of pods deployed on the Controller and Analytics nodes must be an odd number, in line with the nature of the ZooKeeper algorithm.

Architectural Scalability

  • TF supports the BGP protocol, and each TF controller can be connected to other controllers via BGP. This means TF can be used to connect different SDN islands.

Interfaces

  • Southbound: TF uses the XMPP protocol for communicating with vRouters (data plane) to deliver the overlay SDN solution. BGP can also be used to communicate with legacy devices.
  • Northbound: TF supports a Web GUI and RESTful APIs. Plug-ins integrate with other platforms such as orchestrators, clouds and OSS/BSS (a request sketch follows below).
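
As an illustration of the northbound RESTful interface, the sketch below lists virtual networks through the configuration API. It assumes the conventional config-API port 8082 and an unauthenticated lab setup; in a secured deployment an auth token would be required, so treat the endpoint and headers as assumptions rather than a definitive recipe.

import requests

CONFIG_API = "http://tf-controller:8082"   # assumed Tungsten Fabric config-API endpoint

# List the virtual networks known to the controller.
resp = requests.get(f"{CONFIG_API}/virtual-networks")
resp.raise_for_status()

for vn in resp.json()["virtual-networks"]:
    print(vn["uuid"], ":".join(vn["fq_name"]))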

Telemetry

Analytics nodes extract usable telemetry information from the infrastructure. The data is normalised to a common format and the output is sent via the Kafka service into a Cassandra database. This data can be used operationally in a multitude of ways, from problem solving to capacity planning. Redis uses the data for generating graphs and running queries; the Redis pod is deployed between the analytics pod and the Web UI pod.

Resilience and Fault Tolerance

The modular architecture of Tungsten Fabric makes it resilient against failure, with typically several controllers/pods running on several servers for high availability. Also, the failure of a service is isolated, so it does not affect the whole system. The API and Web GUI services are accessed through a load balancer, which allows the pods to be in different subnets.

Programming Language

TF supports C++, Python, Go, Node.js.

Community

TF was first associated with Juniper but is now supported under the Linux Foundation Networking umbrella and boasts a large developer and user community.

Conclusion

Given this evaluation, TF is a suitable choice for cloud builders and cloud-native platform engineers. This is because it works flexibly with private and public clouds, CNI plugins, virtual machines and bare metal. Depending on the orchestrator it is integrated with, it exposes Heat APIs, Kubernetes APIs, etc. to instantiate network and security policies. The scalability of TF makes it highly available and resilient against failure, which improves the user experience. Finally, its modularity allows users to easily customise, read, test and maintain each module separately.

SDN Controller Comparisons:

Remove the complexity of networking at scale.
Learn more about our SDN & NFV solutions.

Learn More

The post Comparison of Software Defined Networking (SDN) Controllers. Part 8: Tungsten Fabric appeared first on Aptira.

by Farzaneh Pakzad at November 28, 2019 12:48 PM

November 26, 2019

OpenStack Superuser

Inside open infrastructure: The latest from the OpenStack Foundation

Welcome to the latest edition of the OpenStack Foundation Open Infrastructure newsletter, a digest of the latest developments and activities across open infrastructure projects, events and users. Sign up to receive the newsletter and email community@openstack.org to contribute.

Spotlight on the Open Infrastructure Summit Shanghai

Attendees from over 45 countries attended the Open Infrastructure Summit earlier this month that was hosted in Shanghai, followed by the Project Teams Gathering (PTG). Use cases, tutorials, and demos covering 40+ open source projects including Airship, Ceph, Hadoop, Kata Containers, Kubernetes, OpenStack, StarlingX, and Zuul were featured at the Summit.

With the support of the active Open Infrastructure community in China, the market share of OpenStack in the APAC region is expected to increase by 36% in the next four years (451 Research report: OpenStack Market Monitor, 451 Research, September 2019). Currently, China is the second largest market adopting OpenStack software, and it ranks second in code contributions to the latest OpenStack release, Train. As Jonathan Bryce said in the keynotes, “The Summits bring our community members together to meet face to face, advancing the software we build and use daily.”
Check out the highlights of the Open Infrastructure Summit Shanghai:

  • In the Monday morning keynotes, Guohua Xi, the President of the China Communications Standards Association (CCSA), kicked off the event by sharing a call to action for the Chinese community to encourage cross community collaboration to drive innovation. Open Infrastructure users including Baidu, China Mobile, China Telecom, China Unicom, Intel, and Tencent also gave a keynote and shared the key role of the open source projects, such as Kata Containers and OpenStack, in their 5G and container business strategies. Keynote videos are now available here
  • In breakout sessions, Alibaba, Baidu and Tencent presented their Open Infrastructure use cases, highlighting the integration of multiple technologies including Ceph, Kata Containers, Kubernetes, OpenStack, and more. China Railway, China Mobile, Walmart Labs, Line and China UnionPay are among additional Open Infrastructure users who shared their innovations and open source best practices at the Shanghai Summit. Breakout session videos are being added here
  • For its latest release Train, OpenStack received 25,500 code changes by 1,125 developers from 150 different companies. This pace of development makes OpenStack one of the top three most active open source projects in the world alongside Chromium and Linux. 
  • Selected by members of the OSF community, Baidu ABC Cloud Group and Edge Security Team won the Superuser Award for the unique nature of its Kata Containers and OpenStack use case as well as its integration and application of open infrastructure.
  • Combining OpenStack and Kubernetes to address users’ infrastructure needs at scale, Airship joined Kata Containers and Zuul as confirmed Open Infrastructure Projects supported by the OpenStack Foundation. SKT, Intel, Inspur and other companies presented their Airship use cases for developing infrastructure solutions.
  • Congratulations to Troila for being elected as a new Gold Member of the OpenStack Foundation! Learn more about it here

Summit keynote videos are already available, and breakout videos will be available on the Open Infrastructure videos page in the upcoming weeks. Thank you to our Shanghai Summit sponsors for supporting the event!

OpenStack Foundation (OSF)

  • The next OSF event will be a collaboration-centric event, happening in Vancouver, Canada June 8-11, 2020. Mark your calendars!
  • Troila was elected as a new Gold Member for the OpenStack Foundation at the Shanghai Board of Directors meeting.

Airship: Elevate your infrastructure

  • Last month, Airship was confirmed by OSF as a top level project — congratulations to the community!
  • The Airship community has made significant progress in Airship 2.0. 17% of planned work was completed, and another 18% is in progress and/or in review. The community is looking for more developers to contribute code. Interested in getting involved? Check out this page.

Kata Containers: The speed of containers, the security of VMs

OpenStack: Open source software for creating private and public clouds

  • Several OpenStack project teams, SIGs and working groups met during the Project Teams Gathering in Shanghai to prepare the Ussuri development cycle. Reports are starting to be posted to the openstack-discuss mailing-list.
  • Sławek Kapłoński, the Neutron PTL, recently reported that neutron-fwaas, neutron-vpnaas, neutron-bagpipe and neutron-bgpvpn are lacking interested maintainers. The Neutron team will drop those modules from future official OpenStack releases if nothing changes by the ussuri-2 milestone, February 14. If you are using those features and would like to step up to help, now is your chance!
  • We are looking for a name for the ‘V’ release of OpenStack, to follow the Ussuri release. Learn more about it in this post by Sean McGinnis
  • The next OpenStack Ops meetup will happen in London, UK on January 7-8. Stay tuned for registration information!

StarlingX: A fully featured cloud for the distributed edge

  • The StarlingX community met during the Project Teams Gathering in Shanghai to discuss topics like 4.0 release planning, documentation and how to improve the contribution process. You can check notes on their etherpad for the event.
  • The upcoming StarlingX 3.0 release will contain the Train version of OpenStack. The community is working on some last bits including testing and bug fixes before the release in December. You can find more information in StoryBoard about the release.

Zuul: Stop merging broken code

  • The Open Infrastructure Summit in Shanghai included a variety of talks, presentations, and discussions about Zuul; a quick project update from lead Zuul maintainer James Blair during keynotes set the tone for the days which followed.

Find the OSF at these upcoming Open Infrastructure community events

Questions / feedback / contribute

This newsletter is written and edited by the OSF staff to highlight open infrastructure communities. We want to hear from you! If you have feedback, news or stories that you want to share, reach us through community@openstack.org . To receive the newsletter, sign up here.

The post Inside open infrastructure: The latest from the OpenStack Foundation appeared first on Superuser.

by Allison Price at November 26, 2019 09:11 PM

November 25, 2019

Ghanshyam Mann

Recap of Open Infrastructure Summit & PTG, Shanghai 2019

Open Infrastructure Summit, Shanghai 2019

The Open Infrastructure Summit, followed by the OpenStack PTG, was held in Shanghai, China, from 4th Nov 2019 till 8th Nov 2019. The first 3 days were for the Summit, which is the market-facing event including Forum sessions, and the last 3 days were for the Project Team Gathering (PTG), with one day of overlap.

I arrived in Shanghai on 1st Nov to participate in pre-summit events like Upstream Training and Board of Directors meeting.

    Upstream Institute Training Shanghai:

Like other Summits, Upstream training was held in Shanghai over 1.5 days: the second half of 2nd Nov and a full day on 3rd Nov. Thanks to Lenovo and Jay for sponsoring the training this time too.

Etherpad

The first day had 9 mentors and ~20 students. It covered the introduction, registration and governance parts, including VM image setup etc. Students were from different countries, for example South Korea, India and of course China. Two developers from South Korea were interested in Swift contribution; they later joined the Swift PTG and interacted with the team. One developer from India is doing cloud testing of their baremetal nodes via QA tooling, and I had further discussion with him at the QA PTG. I am happy to see this kind of interaction at every training, and it is useful for getting attendees on board with upstream activities.

The second day had fewer mentors and more students. I and a few other mentors could not participate in the training due to the Joint Leadership meeting.

    Ussuri cycle community-wide goals discussion:

Three goals were discussed in detail, along with how to proceed with each of them. Etherpad.

    Drop Python 2.7 Support:

Ussuri is the time to drop Python 2 support from OpenStack. The plan and schedule were already discussed during a TC office hour and on the ML, and it was agreed to make this a community-wide goal. We discussed keeping the CI/CD support for Swift, which is the only project keeping py2 support. Swift needs devstack to keep installing it in a py2 env with the rest of the services on py3 (same as the old jobs when Swift was on py2 by default in devstack). There is no oslo dependency from Swift, and all the other dependencies will be capped for the py2 version. The requirements check job currently flags cases where openstack/requirements lists two entries for a requirement; smcginnis's patch to change the requirements check is already merged. Everything else will go as discussed on the ML. The work on this has already started and patches for all the services are up for review now.

    Project Specific New Contributor & PTL Docs

As per feedback in the Forum sessions, this is a good goal which will make documentation more consistent. All the projects should edit their contributor.rst to follow a more complete template and adjust/add PTL documentation. This is accepted as a pre-approved Ussuri goal. Kim Hindhart is working on getting EU funding for people to work on OpenStack, and they like consistent documentation.

    Switch remaining legacy jobs to Zuul v3 and drop legacy support

Many projects are still not ready for this goal. The Grenade job is not yet on zuulv3, which needs to be finished first. A few projects are waiting for the big projects to finish the zuulv3 migration first. This needs more work and can be a “pre-approved” goal for V, with the work split so that the Grenade part is the focus in U. We will continue to review the proposed goal and the pre-work etc.

Other than the above 3 goals, there were a few more ideas for goal candidates that are good to go into the goal backlogs etherpad:
– cdent: stop using paste, pastedeploy and WSME
Note from Chris: This does not need to be a community goal as such, but it requires a common solution from the TC. WSME is still used, has contributions, and at least a core or two.

– cmurphy: Consistent and secure default policies. As per the forum discussion this is going with a pop-up team first.

– support matrix documentation to be consistent across projects. This is going with a pop-up team first (fungi can propose the pop-up team in governance), with Richard Pioso (rpioso) helping fungi on this. Once a consistent framework is identified, the pop-up team can expire with the approval of a related cycle goal for implementing it across the remaining projects.

    OpenStack QA PTG & Forum Sessions Summary:

I wrote a separate blog to summarize the QA discussions that happened in Forum or PTG.

    Nova API Policies defaults:

Etherpad.

Nova planned to implement the default policy refresh by adopting the system scope and new default roles available in Keystone. This was planned for the Train cycle, when the spec was merged, but the implementation could not be started. The Nova spec is already merged for the Ussuri cycle. The main challenge with this work is how to complete it in a single cycle so that users' upgrades are not impacted more than once. We discussed various options, such as a flag to suppress the deprecation warnings from the new policy enforcement. The plan is to get all the reviews up, keep a procedural hold on the first patch, and later merge all of them together. Pushing the code up after the first set merges, and more active review, will be required for this. The Keystone team will help in reviewing the changes. I am very positive this can be completed in the Ussuri cycle.

    Technical Committee:

Friday was the full day for Technical Committee discussions. It started with fun when JP collected the number of TC members interested in each topic, so that the least interesting topics could be discussed first :). He did a good job of organizing the discussion with time-based checks. I am summarizing a few of the topics below:

    Select U release goals:

This session was to select the Ussuri goals. The Ussuri cycle has already started, so we have to finalize the goals ASAP. We agreed to proceed with the below two goals for the Ussuri cycle.

1. Drop Python 2.7 Support – Already Accepted.
2. Project Specific New Contributor & PTL Docs – Under Review

The goal “Switch remaining legacy jobs to Zuul v3 and drop legacy support” will be pre-selected for the V cycle; that does not mean stopping review of the proposed goal or any ongoing work. All ongoing efforts will continue on this.

    Structuring the goal selection schedule

Since the Summit and PTG events were merged, we have not done well on community-wide goal selection. It was late in the Train cycle and so it is for Ussuri as well. Rico and I discussed having a schedule for goal selection in the TC. The main aim is to finalize the cycle community-wide goals before the cycle starts. This etherpad has the drafted schedule, which all TC members agreed on. The first cycle to enforce this schedule will be challenging, but let's see how it goes. I will propose the schedule into the TC documents soon.

    Python 2 drop current progress and next steps: 

During the PTG, many projects discussed their plans to drop python2 support. Cinder decided to just remove py2 compatibility and then adjust backports in the cases where py27 support is needed. Glance has greater challenges around WSGI. Other projects are almost good to go on this. Devstack also defaults to py3 now.

    Release naming convention:

This was again a non-fun discussion. Fortunately, we concluded it. The agreement is that any community member can propose a name and the TC will select the final one.

A few more topics were discussed at the PTG, which can be found in this Etherpad or on the ML.

by Ghanshyam Mann at November 25, 2019 05:04 AM

November 24, 2019

Ghanshyam Mann

Open Infrastructure Summit, Shanghai 2019: QA Summit & PTG Summary

OpenStack Quality Assurance Summit & PTG Summary

Open Infrastructure Summit, Shanghai 2019

Open Infrastructure Summit followed by OpenStack PTG was held in Shanghai, China: 4th Nov 2019 till 8th Nov 2019.

The first 3 days were for the Summit, where we had Forum sessions about user feedback on QA tooling on Monday, and the last 3 days were for the Project Team Gathering (PTG), with one day of overlap.

QA Forum sessions

    OpenStack QA – Project Update:  Wednesday, November 6, 10:15am-10:30am

We gave updates on what we finished in Train and a draft plan for the Ussuri cycle. Due to fewer contributors in QA, Train cycle activity decreased compared to Stein. We tried to maintain the daily QA activity and finished a few important things.

Slides: QA Project Update

    Users / Operators adoption of QA tools / plugins: Monday, November 4, 1:20pm – 2:00pm

Etherpad. This is another useful session for QA to get feedback as well as information about downstream tooling.

A few tools we talked about:

  • Fault injection tests

One big concern shared by a few people was the long time it takes to get Tempest patches merged. One idea to solve this is to bring critical reviews to office hours.

 

  QA PTG: 6th – 8th Nov:

It was a small gathering this time, with one day for the PTG on Wednesday. Even with a small number of developers, we had good discussions on many topics. I am summarizing the discussions:

Etherpad.

  Train Retrospective  

The retrospective brought up a few key issues where we need improvement. We collected the below action items, including bug triage. Untriaged QA bugs are increasing day by day.

  • Action:
    • need to discuss blacklist plugins and how to notify and remove them if dead – gmann
    • start the process of community-goal work in QA – masayuki
    • sprint for bug triage with a number of volunteers –
      • (chandankumar) Include one bug in each sprint for the TripleO CI Tempest members
      • Triage the new bugs and then pick bugs based on priority
      • For the TripleO CI team we will track this here: https://tree.taiga.io/project/tripleo-ci-board/ – chandankumar

  How to deal with an aging testing stack. 

With testtools not being very active, we need to think about alternate or better-suited options to solve this issue. We discussed a few options, which need to be discussed further on the ML.

  • Can we fork the dependencies of testtools into Tempest or stestr?
  • As we are removing py2.7 support in Tempest, we can completely ignore/remove the unittest2 things, but that is not the case for testtools?
  • Remove the support for unittest2 from testtools? py2.7 is going away from everywhere and testtools can create a tag or something for py2.7 usage?
  • Since Python 2 is going EOL on 1st Jan 2020, let's create a tag and then replace unittest2 with unittest for the Python 3-only release

Action:

  • Document the officially supported test runner for Tempest. – Soniya Vyas/Chandan Kumar
  • ML to discuss the above options – gmann 

  Remove/migrate the .testr.conf to .stestr.conf

60 openstack/* repositories have both .stestr.conf and .testr.conf. We don't need to keep both files, at least. Let's take a look at some of them and make a plan to remove them where we can.

If both exist, remove the .testr.conf and then verify that the .stestr.conf has the correct test path. If only .testr.conf exists, migrate it to .stestr.conf.

We also need to figure out the purpose of the .testr.conf handling in pbr before removing anything. Is this just old code, or is it still necessary?
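
As a rough sketch of the mechanical part of that migration (the test path below is a placeholder, not any particular repository's layout), the per-repo change is small:

# In a repository that carries both files, drop the legacy one and make sure
# the stestr config points at the real test directory.
git rm .testr.conf

# Minimal .stestr.conf:
cat > .stestr.conf <<'EOF'
[DEFAULT]
test_path=./myproject/tests/unit
top_dir=./
EOF

git add .stestr.conf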

  Moving subunit2html script from os-testr

The os-testr runner piece of the os-testr project is deprecated, but the subunit2html script still lives there and is widely used across the OpenStack ecosystem. Can we move it somewhere else? I do not see any benefit in moving those scripts elsewhere. We asked Chandan to open an issue on stestr to discuss moving it to the stestr repo. mtreinish replied on this: os-testr is meant to be the place in OpenStack where we can host the ostestr runner wrapper/script, subunit2html, generate_subunit, etc. Just because ostestr is deprecated and being removed doesn't mean it's not the proper home for those other tools.
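
For reference, the script in question is normally pointed at a captured subunit stream and renders it as HTML, roughly like this (file names are just examples):

# Dump the subunit stream from the last stestr run and turn it into an HTML report.
stestr last --subunit > last_run.subunit
subunit2html last_run.subunit test_results.html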

  Separate integrated services tests can be used in TripleO CI

TripleO CI maintains a separate file to run dependent tests per service. Tempest has dependency-scoped tox environments and integrated jobs, and the same can be used in TripleO CI.

For example:

  • the tox environment for networking (see the sketch below)
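
Assuming the dependency-scoped tox environments that Tempest ships (the exact environment name is an assumption and may differ between releases), such a run looks roughly like:

# Run only the Tempest tests that exercise Neutron and the services it depends on.
cd /opt/stack/tempest
tox -e integrated-network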

  RBAC testing strategy

This was a cross-project strategy for positive/negative testing of the system scope and new defaults in Keystone. Keystone has implemented the new defaults and system scope in its policies and added unit tests to cover the new policies. Nova is implementing the same in the Ussuri cycle. As discussed at the Denver PTG as well, Tempest will implement new credentials for all 9 personas available in Keystone and slowly migrate existing tests to start using the new policies. That will be done via a flag switching Tempest to use the system scope and new defaults; the flag will default to false so that stable branch testing keeps using the old policies.

We can use Patrole tests or implement new tests in a Tempest plugin and verify the response. Both have the issue of performing the complete operation, which is not always required for policy verification. Running full functional tests is expensive and duplicates existing tests. One solution for that (we talked about it at the Denver PTG also) is a flag, similar to os-profiler, that makes the API do just the policy check and return a response with a specific return code.

AGREE:

  • Tempest to provide all 9 personas available from Keystone. Slowly migrate existing Tempest tests to run with the new policies.
  • We agreed to have two ways to test the policy:
    1. Tempest-like tests in Tempest plugins that perform the complete operation and verify things in the response, not just the policy return code. It depends on the project whether they want to implement such tests.
    2. Unit/functional tests on the project's side.
  • Document both ways so that projects can adopt the best-suited one.

  How to remove tempest plugin sanity BLACKLIST

We have a Tempest plugin sanity BLACKLIST. It should be removed in the future if possible. Some of the entries shouldn't be Tempest plugins at all because they're just Neutron stadium things which have already moved to neutron-tempest-plugin but still exist in the original repos as well. Some of them are simply less active. Remove the below plugins from the BLACKLIST:

  • openstack/networking-generic-switch needs to be checked (setup.py/cfg?)

Action: 

  • Add the start date in the blacklist doc so that we can know how long a plugin has been blacklisted.
  • After 60 days: we send an email notification to openstack-discuss, the PTL, the maintainer and the TC to either fix it or remove it from governance.

  Python 2.7 drop plan

We discussed the next steps to drop the py2 from Tempest and other QA tools.

AGREE:

  • This will be done before milestone 2
  • Create a new tag for python 2.7 saying it is the last one, and document that this Tempest tag needs Train u-c.
  • Test the Tempest tag with Train u-c; if it fails then we will discuss.
  • TripleO and OSA are going to use CentOS 8 for Train and master

  Adding New glance tests to Tempest

We discussed testing the new Glance v2 APIs and features. Below are the Glance features and the agreed points on how to test them.

  • Hide old images: A test can be added in Tempest. Hide the image and try to boot a server from the image in scenario tests.
  • Delete barbican secrets from glance images: This test belongs in barbican-tempest-plugin, which can be run as part of the barbican gate using an existing job. Running the barbican job on the glance gate is not required; we can add a new job (multi stores) on the glance gate which can run this plus other new feature tests.
  • Multiple stores: The DevStack patch is already up; add a new zuul job to set up multiple stores and run it on the glance gate with api and scenario tests. gmann to set up the zuulv3 job for that.

  Tempest volunteers for reviewing patches

We’ve noticed that the number of merged patches in October was less than in September and much less than it was during the summer. This has been brought up in feedback sessions also. There is no perfect solution for this; nowadays QA has fewer active core developers. We encourage people to bring up critical or stuck patches in office hours.

  Improving Tempest cleanup

Tempest cleanup is not so stable and not a perfect design. We have a spec up to redesign it but could not get consensus on it. I am OK with moving to a resource prefix with a UUID. We should extend the cleanup tool for plugins also.

  Ussuri Priority & Planning

This was the last session of the PTG, and it could not happen on Wednesday due to the strict closing-time policy of the conference venue, which I really liked. Time-based working is much needed for IT people :). We met on Thursday morning in the coffee area and discussed priorities for the Ussuri cycle. The QA Ussuri Priority Etherpad has the priority items with assignees.

See you in Vancouver!

by Ghanshyam Mann at November 24, 2019 11:51 PM

Open Infrastructure Summit: QA Project Updates, Shanghai 2019

Open Infrastructure Summit, Shanghai 2019

        OpenStack QA – Project Update:  Wednesday, November 6, 10:15am-10:30am

This time there was no video recording for the Project Updates. Here are the complete slides: QA Project Update.

Train cycle Stats:

by Ghanshyam Mann at November 24, 2019 12:31 AM

November 19, 2019

Mirantis

Create and manage an OpenStack-based KaaS child cluster

Deploying and managing Kubernetes clusters doesn't have to be complicated. Here's how to do it with Mirantis Kubernetes as a Service (KaaS).

by Nick Chase at November 19, 2019 05:02 PM

November 18, 2019

StackHPC Team Blog

High Performance Ethernet for HPC – Are we there yet?

Recently there has been a resurgence of interest around the use of Ethernet for HPC workloads, most notably following recent announcements from Cray around Slingshot. In this article I examine some of the history around Ethernet in HPC and look at some of its advantages within modern HPC clouds.

Of course Ethernet has been the mainstay of many organisations involved in High Throughput Computing large-scale cluster environments (e.g. Geophysics, Particle Physics, etc.), although it does not (generally) hold the mind-share for those organisations where conventional HPC workloads predominate, notwithstanding the fact that for many of these environments the operational workload for a particular application rarely goes above a small to moderate number of nodes. Here Infiniband has held sway for many years now. A recent look at the TOP500 gives some indication of the spread of Ethernet vs. Infiniband vs. custom or proprietary interconnects for both system and performance share, or, as I often refer to them, the price-performance and performance segments of the HPC market, respectively.

Ethernet share of the TOP500

My interest in Ethernet was piqued some 15-20 years ago as it is a standard, and very early on there were mechanisms to obviate kernel overheads which allowed some level of scalability even back in the days of 1Gbps. This meant that even then one could exploit LAN-on-Motherboard network technology instead of more expensive PCI add-in cards. Since then, as we moved to 10Gbps and beyond (and I coincidentally joined Gnodal, later acquired by Cray), RDMA enablement (through RoCE and iWARP) allowed standard MPI environments to be supported, and with the 25, 50 and 100Gbps implementations, bandwidth and latency promised to be on par with Infiniband. As a standard we would expect a healthy ecosystem of players within both the smart NIC and switch markets to flourish. For most switches such support is now standard (see next section). In terms of rNICs, Broadcom, Chelsio, Marvell and Mellanox currently offer products supporting either, or both, of the RDMA Ethernet protocols.

Pause for Thought (Pun Intended)

I think the answer to the question “are we there yet” is (isn’t it always) going to be “it depends”. That “depends” will largely be influenced by the market segmentation into the Performance, Price-Performance and Price regimes. The question is whether Ethernet can address the “Price” and “Price-Performance” areas, as opposed to the “Performance” region where some of the deficiencies of Ethernet RDMA may well be exposed, e.g. multi-switch congestion at large scale; for moderately sized clusters with nodes spanning only a single switch it may well be a better fit.

So for example, consider a cluster of 128 nodes (minus nodes for management, access and storage): if it were possible to assess that 25GbE was sufficient versus 100Gbps EDR, then I could build the system from a single 32-port 100GbE switch (using break-out cables) as opposed to multiple 36-port EDR switches, which, if I take the standard practice of over-subscription, would end up with similar cross-sectional bandwidth to the single Ethernet switch anyway. Of course, within the bounds of a single switch the bandwidth would be higher for IB. I guess down the line, with 400GbE devices coming to a data centre soon, this balance will change.

Recently I had the chance to revisit this when running test benchmarks on a bare-metal OpenStack system being used for prototyping of the SKA (I’ll come on to OpenStack a bit later on but just to remark here that this system runs OpenStack to prototype an operating environment for the Science Data Processing Platform of the SKA).

I wanted to stress-test the networks, compute nodes and to some extent the storage. StackHPC operate the system as a performance prototype platform on behalf of astronomers across the SKA community and so ensuring performance is maintained across the system is critical. The system, eponymously named ALaSKA, looks like this.

ALaSKA - A la SKA

ALaSKA is used to software-define various platforms of interest to various aspects of the SKA-Science Data Processor. The two predominant platforms of interest currently are a Container Orchestration environment (previously Docker-Swarm but now Kubernetes) and a Slurm-as-a-Service HPC platform.

Here we focus on the latter of these, which gives us a good opportunity to look at 100G IB vs 25G RoCE vs 25Gbps TCP vs 10G (a network not shown in the above diagram but used for provisioning) to compare performance. First let us look more closely at the Slurm PaaS. From the base compute, storage and network infrastructure we use OpenStack Kayobe to deploy the OpenStack control plane (based on Kolla-Ansible) and then marshal the creation of bare-metal compute nodes via the OpenStack Ironic service. The flow looks something like this, with the Ansible control host being used to configure OpenStack (via a Bifrost service running on the seed node) as well as the configuration of network switches. GitHub provides the source repositories.

ALaSKA - A la SKA
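
For readers unfamiliar with Kayobe, the flow above maps onto a handful of CLI steps, something like the following (the command list is indicative only and from memory of the Kayobe CLI, so treat the exact names as assumptions):

# From the Ansible control host: bootstrap it, configure and deploy the seed,
# then provision the bare-metal overcloud via Bifrost/Ironic and deploy the
# Kolla-Ansible based OpenStack control plane.
kayobe control host bootstrap
kayobe seed host configure
kayobe seed service deploy
kayobe overcloud provision
kayobe overcloud host configure
kayobe overcloud service deploy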

Further Ansible playbooks together with OpenStack Heat permit the deployment of the Slurm platform, based on the latest OpenHPC image and various high performance storage subsystems, in this case using BeeGFS Ansible playbooks. The graphic above depicts the resulting environment with the addition of OpenStack Monasca Monitoring and Logging Service (depicted by the lizard logo). As we will see later on, this provides valuable insight to system metrics (for both system administrators and the end user).

So let us assume that we first want to address the Price-Performance and Price driven markets. At scale we need to be concerned about East-West traffic congestion between switches, although this can be somewhat mitigated by the fact that with modern 100GbE switches we can break out to 25/50GbE, which increases the arity of a single switch and (likely) reduces congestion. Of course, this means we need to be able to justify the reduction in bandwidth at the NIC. And if the total system only spans a single switch then congestion may not be an issue, although further work may be required to understand end-point congestion.

To test the system's performance I used (my preference) HPCC and OpenFoam as two benchmark environments. All tests used gcc, MKL and openmpi3, and no attempt was made to further optimise the applications. After all, all I want to do is run comparative tests of the same binary, changing run-time variables to target the underlying fabric. For openmpi, this can be achieved with the following (see below). The system uses an OpenHPC image. At the BIOS level the system has hyperthreading enabled, so I was careful to ensure that process placement pinned only half the number of available slots (I'm using Slurm) and mapped by CPU. This is important to know when we come to examine the performance dashboards below. Here are the specific mca parameters for targeting the fabrics.

DEV=" roce ibx eth 10Geth"
for j in $DEV;
do

if [ $j == ibx ]; then
MCA_PARAMS="--bind-to core --mca btl openib,self,vader  --mca btl_openib_if_include mlx5_0:1 "
fi
if [ $j == roce ]; then
MCA_PARAMS="--bind-to core --mca btl openib,self,vader  --mca btl_openib_if_include mlx5_1:1
fi
if [ $j == eth ]; then
MCA_PARAMS="--bind-to core --mca btl tcp,self,vader  --mca btl_tcp_if_include p3p2"
fi
if [ $j == 10Geth ]; then
MCA_PARAMS="--bind-to core --mca btl tcp,self,vader  --mca btl_tcp_if_include em1"
fi
if [ $j == ipoib ]; then
MCA_PARAMS="--bind-to core --mca btl tcp,self,vader  --mca btl_tcp_if_include ib0"
fi
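
The launch itself is elided above; each pass of the loop then runs the benchmark with the selected flags, along these lines (binary name, host file and rank count are illustrative):

# Launch the same HPCC binary over 8 nodes / 256 MPI ranks,
# reusing the fabric-specific flags selected above.
mpirun -np 256 --hostfile ./hosts $MCA_PARAMS ./hpcc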

In the results below, I’m comparing the performance across each network using HPCC for a size of 8 nodes (up to 256 cores, albeit 512 virtual cores are available as described above). I think this would cover the vast majority of cases in Research Computing.

Results

HPCC Benchmark

The results for major operations of the HPCC suite are shown below together with a personal narrative of the performance. A more thorough description of the benchmarks can be found here.

8 nodes 256 cores

Benchmark                               10GbE (TCP)   25GbE (TCP)   100Gb IB   25GbE RoCE
HPL_Tflops                              3.584         4.186         5.476      5.233
PTRANS_GBs                              5.656         16.458        44.179     17.803
MPIRandomAccess_GUPs                    0.005         0.004         0.348      0.230
StarFFT_Gflops                          1.638         1.635         1.636      1.640
SingleFFT_Gflops                        2.279         2.232         2.343      2.322
MPIFFT_Gflops                           27.961        62.236        117.341    59.523
RandomlyOrderedRingLatency_usec         87.761        100.142       3.054      2.508
RandomlyOrderedRingBandwidth_GBytes     0.027         0.077         0.308      0.092
  • HPL – We can see here that it is evenly balanced between low latency and bandwidth, with RoCE and IB on a par even with the reduced bandwidth of RoCE. In one sense this performance underlies the graphic shown above in terms of HPL, where Ethernet occupies ~50% of the share of total clusters, which is not matched in terms of the performance share.
  • PTRANS – Performance pretty much in line with bandwidth.
  • GUPS – Latency dominated. IB wins by some margin.
  • STARFFT – Embarrassingly parallel (HTC use-case), no network effect.
  • SINGLEFFT – No effect, no comms.
  • MPIFFT – Heavily bandwidth dominated; see the effect of 100 vs 25 Gbps (no latency effect).
  • Random Ring Latency – See the effect of RDMA vs. TCP. Not sure why RoCE is better than IB, but it may be due to the random order?
  • Random Ring B/W – In line with the 100Gbps (IB) vs 25Gbps (RDMA) vs TCP networks.

OpenFoam

I took the standard Motorbike benchmark and ran this on 128 (4 nodes) and 256 (8 nodes) cores on the same networks as above. I did not change the mesh sizing between runs and thus on higher processor counts, comms will be imbalanced. The results are shown below, showing very little difference between the RDMA networks despite the bandwidth difference.

Nodes/(Processors)   100Gbps IB   25Gbps RoCE   25Gbps TCP   10Gbps TCP
8/(256)              87.64        93.35         560.37       591.23
4/(128)              99.83        101.49        347.19       379.32

Elapsed Time in Seconds. NB the increase in time for TCP when running on more processors!

Future Work

So at present I have only looked at MPI communication. The next big thing to look at is storage, where the advantages of Ethernet need to be assessed not only in terms of performance but also the natural advantage the Ethernet standard has in connectivity for many network-attached devices.

Why OpenStack

As was mentioned above, one of the prototypical aspects of the ALaSKA system is to model operational aspects of the Science Data Processor element of the SKA. The SDP and its operational scenarios are described well in the architectural description of the system. A description of the architecture and that prototyping can be found here.

Using Ethernet, and in particular High Performance Ethernet (“HPC Ethernet” in the parlance of Cray), holds a particular benefit in the case of on-premise cloud, as infrastructure may need to be isolated between multiple tenants. For the particular cases of IB and OPA this can be achieved using ad-hoc methods for the respective network. For Ethernet, however, multi-tenancy is native.

For many HPC scenarios, multi tenancy is not important, nor even a requirement. For others, it is key and mandatory, e.g. secure clouds for clinical research. One aspect of multi-tenancy is shown in the analysis of the results, where we use the aspects of OpenStack Monasca (multi-tenant monitoring and logging service) and Grafana dashboards. More information on the architecture of Monasca can be found in a previous blog article.

Appendix – OpenStack Monasca Monitoring O/P

HPCC

The plot below shows CPU usage and network bandwidth for the runs of HPCC, using a Grafana dashboard and OpenStack Monasca monitoring-as-a-service. The 4 epochs shown correspond to IB, RoCE, 25Gbps (TCP) and 10Gbps (TCP). The total CPU usage tops out at 50% as these are HT-enabled nodes mapped by core with 1 thread per core; thus we are only consuming 50% of the available resources. Network bandwidth is shown for 3 of the 4 epochs: “Inbound ROCE Network Traffic”, “Inbound Infiniband Network Traffic” and “Inbound Bulk Data Network Traffic” – Bulk Data Network refers to an erstwhile name for the ingest network for the SDP.

HPCC performance data in Monasca

For the case of CPU usage, a reduction in performance is observed for the TCP cases. This is further evidenced by a 2nd plot that shows the system CPU, showing heavy system overhead for the 4 separate epochs.

HPCC CPU performance data in Monasca

by John Taylor at November 18, 2019 02:00 AM

November 17, 2019

Christopher Smart

Use swap on NVMe to run more dev KVM guests, for when you run out of RAM

I often spin up a bunch of VMs for different reasons when doing dev work and unfortunately, as awesome as my little mini-itx Ryzen 9 dev box is, it only has 32GB RAM. Kernel Samepage Merging (KSM) definitely helps, however when I have half a dozen or so VMs running and chewing up RAM, the Kernel's Out Of Memory (OOM) killer will start executing them, like this.

[171242.719512] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/machine.slice/machine-qemu\x2d435\x2dtest\x2dvm\x2dcentos\x2d7\x2d00.scope,task=qemu-system-x86,pid=2785515,uid=107
[171242.719536] Out of memory: Killed process 2785515 (qemu-system-x86) total-vm:22450012kB, anon-rss:5177368kB, file-rss:0kB, shmem-rss:0kB
[171242.887700] oom_reaper: reaped process 2785515 (qemu-system-x86), now anon-rss:0kB, file-rss:68kB, shmem-rss:0kB

If I had more slots available (which I don’t) I could add more RAM, but that’s actually pretty expensive, plus I really like the little form factor. So, given it’s just dev work, a relatively cheap alternative is to buy an NVMe drive and add a swap file to it (or dedicate the whole drive). This is what I’ve done on my little dev box (actually I bought it with an NVMe drive so adding the swapfile came for free).

Of course the number of VMs you can run depends on the amount of RAM each VM actually needs for what you’re running on it. But whether I’m running 100 small VMs or 10 large ones, it doesn’t matter.

To demonstrate this, I spin up a bunch of CentOS 7 VMs at the same time and upgrade all packages. Without swap I could comfortably run half a dozen VMs, but more than that and they would start getting killed. With a 100GB swap file I am able to get about 40 going!
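
For what it's worth, setting up a swap file like that only takes a few commands (the path and size here are examples; adjust for wherever the NVMe drive is mounted):

# Create and enable a 100GB swap file on the NVMe drive.
sudo fallocate -l 100G /nvme/swapfile
sudo chmod 600 /nvme/swapfile
sudo mkswap /nvme/swapfile
sudo swapon /nvme/swapfile

# Make it persistent across reboots.
echo '/nvme/swapfile none swap defaults 0 0' | sudo tee -a /etc/fstab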

Even with pages swapping in and out, I haven’t really noticed any performance decrease and there is negligible CPU time wasted waiting on disk I/O when using the machines normally.

The main advantage for me is that I can keep lots of VMs around (or spin up dozens) in order to test things, without having to juggle active VMs or hoping they won’t actually use their memory and have the kernel start killing my VMs. It’s not as seamless as extra RAM would be, but that’s expensive and I don’t have the slots for it anyway, so this seems like a good compromise.

by Chris at November 17, 2019 07:26 AM

November 14, 2019

Nate Johnston

Shanghai PTG Summary - Remote

I attended the Neutron meetings for the OpenInfra PTG in Shanghai last week. I was not in Shanghai, so I participated entirely remotely over BlueJeans. Remote Participation: Typically I would work most of a day - 5-6 hours with a nap in the middle - and then be on the PTG for 3-5 hours in the evening. The timeshift was such that the scheduled block of meetings started at 8:00pm my time and ended at 3:30am.

November 14, 2019 07:55 PM

Ben Nemec

Oslo in Shanghai

Despite my trepidation about the trip (some of it well-founded!), I made it to Shanghai and back for the Open Infrastructure Summit and Project Teams Gathering. I even managed to get some work done while I was there. :-)

First, I recommend reading the opening of Colleen Murphy's blog post about the event (and the rest of it too, if you have any interest in what Keystone is up to). It does an excellent job of describing the week at a high level. To summarize in my own words, the energy of this event was a little off. Many regular contributors were not present because of the travel situation and there was less engagement from local contributors than I would have hoped for. However, that doesn't mean nothing good came out of it!

In fact, it was a surprisingly active week for Oslo, especially given that only myself and two other cores were there and we had limited discussion within the team. It turns out Oslo was a popular topic of conversation in various Forum sessions, particularly oslo.messaging. This led to some good conversation at the PTG and a proposal for a new Oslo library. Not only were both Oslo summit sessions well attended, but good questions were asked in both so people weren't just there waiting for the next talk. ;-) In fact, I went 10 minutes over time on the project update (oops!), in part because I hadn't really planned time for questions since I've never gotten any in the past. Not complaining though.

Read on for more detail about all of this.

oslo.messaging drivers

It should come as no surprise to anyone that one of major pain points for OpenStack operators is RabbitMQ administration. Rabbit is a frequent bottleneck that limits the scale of deployed clouds. While it should be noted that this is not always Rabbit's fault, scaling of the message queue is a problem almost everyone runs into at some point when deploying large clouds. If you don't believe me, ask someone how many people attended the How we used RabbitMQ in wrong way at a scale presentation during the summit (which I will talk more about in a bit). The room was packed. This is definitely a topic of interest to the OpenStack community.

A few different solutions to this problem have been suggested. First, I'll talk about a couple of new drivers that have been proposed.

NATS

This was actually submitted to oslo.messaging even before the summit started. It's a new driver that uses the NATS messaging system. NATS makes some very impressive performance claims on its site, notably that it has around an order of magnitude higher throughput than RabbitMQ. Anybody interested in being able to scale their cloud 10x just by switching their messaging driver? I thought so. :-)

Now, this is still in the early discussion phase and there are some outstanding questions surrounding it. For one, the primary Python driver is not compatible with Eventlet (sigh...) which makes it unusable for oslo.messaging. There does exist a driver that would work, but it doesn't seem to be very maintained and as a result we would likely be taking on not just a new oslo.messaging driver but also a new NATS library if we proceed with this. Given the issues we've had in the past with drivers becoming unmaintained and bitrotting, this is a non-trivial concern. We're hoping to work with the driver proposers to make sure that there will be sufficient staffing to maintain this driver in the long run. If you are interested in helping out with this work please contact us ASAP. Currently it is being driven by a single contributor, which is likely not sustainable.

We will also need to ensure that NATS can handle all of the messaging patterns that OpenStack uses. One of the issues with previous high performance drivers such as ZeroMQ or Kafka was that while they were great at some things, they were missing important functionality for oslo.messaging. As a result, that functionality either had to be bolted on (which reduces the performance benefits and increases the maintenance burden) or the driver had to be defined as notification-only, in which case operators end up having to deploy multiple messaging systems to provide both RPC and notifications. Even if the benefits are worth it, it's a hard sell to convince operators to deploy yet another messaging service when they're already struggling with the one they have. Fortunately, according to the spec the NATS driver is intended to be used for both so hopefully this won't be an issue.

gRPC

In one of the sessions, I believe "Bring your crazy idea", a suggestion was made to add a gRPC driver to oslo.messaging as well. Unfortunately, I think this is problematic because gRPC is also not compatible with Eventlet, and I'm not sure there's any way to make it work. It's also not clear to me that we need multiple alternatives to RabbitMQ. As I mentioned above, we've had problems in the past with alternative drivers not being maintained, and the more drivers we add the more maintenance burden we take on. Given that the oslo.messaging team is likely shrinking over the next cycle, I don't know that we have the bandwidth to take on yet another driver.

Obviously if someone can do a PoC of a gRPC driver and show that it has significant benefits over the other available drivers then we could revisit this, but until that happens I consider this a non-starter.

Out-of-tree Drivers

One interesting suggestion that someone made was to implement some of these proposed drivers outside of oslo.messaging. I believe this should be possible with no changes to oslo.messaging because it already makes use of generic entry points for defining drivers. This could be a good option for incubating new drivers or even as a longer term solution for drivers that don't have enough maintainers to be included in oslo.messaging itself. We'll need to keep this option in mind as we discuss the new driver proposals.
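
To make that concrete, here is a minimal sketch of what consuming an out-of-tree driver could look like; the package name, module path and URL scheme are hypothetical, only the entry point namespace comes from oslo.messaging:

# The external package would register its driver under oslo.messaging's entry
# point namespace in its own setup.cfg, for example:
#
#   [entry_points]
#   oslo.messaging.drivers =
#       nats = oslo_messaging_nats.impl_nats:NATSDriver
#
# Installing it alongside oslo.messaging is then enough for services to load it:
pip install oslo-messaging-nats-driver

# ...and a service would select it through its transport URL scheme, e.g. in nova.conf:
#   [DEFAULT]
#   transport_url = nats://nats-host:4222/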

Reduce the amount of RPC in OpenStack

This also came out of the crazy idea session, but I don't recall that there was much in the way of specifics (I was distracted chatting with tech support in a failed attempt to get my cell phone working during this session). In general, reducing the load on the messaging layer would be a good thing though. If anyone has suggestions on ways to do this please propose them on the openstack-discuss mailing list.

LINE

Now we get to some very concrete solutions to messaging scaling that have already been implemented. LINE gave the RabbitMQ talk I mentioned earlier and had some novel approaches to the scaling problems they encountered. I suggest watching the recording of their session when it is available because there was a lot of interesting stuff in it. For this post, I'm going to focus on some of the changes they made to oslo.messaging in their deployment that we're hoping to get integrated into upstream.

Separate Notification Targets

One important architecture decision that LINE made was to use a separate RabbitMQ cluster for each service. This obviously reduces the load on an individual cluster significantly, but it isn't necessarily the design that oslo.messaging assumes. As a result, we have only one configuration section for notifications, but in a split architecture such as LINE is using you may want service-specific notifications to go to the service-specific Rabbit cluster. The spec linked in the title for this section was proposed to provide that functionality. Please leave feedback on it if this is of interest to you.

oslo.messaging instrumentation and oslo.metrics

One of the ways LINE determined where their messaging bottlenecks were was some instrumentation that they added to oslo.messaging to provide message-level metrics. This allowed them to get very granular data about what messages were causing the most congestion on the messaging bus. In order to collect these metrics, they created a new library that they called oslo.metrics. In essence, the oslo.messaging instrumentation calls oslo.metrics when it wants to output a metric, oslo.metrics then takes that data, converts it to a format Prometheus can understand, and serves it on an HTTP endpoint that the oslo.metrics library creates. This allowed them to connect the oslo.messaging instrumentation to their existing telemetry infrastructure.
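
Since the library is currently downstream-only, none of this is fixed yet, but consuming it would presumably look like scraping any other Prometheus exporter (the port and metric name below are purely illustrative):

# Scrape the HTTP endpoint exposed by the oslo.metrics process.
curl http://localhost:8000/metrics

# ...which would return Prometheus text-format samples for the instrumented
# oslo.messaging calls, e.g. a counter per RPC method (name is made up):
# oslo_messaging_rpc_call_total{exchange="nova",method="build_and_run_instance"} 42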

Interestingly, this concept came up in other discussions throughout the week as well, so we're hoping that we can get oslo.metrics upstreamed (currently it is something they implemented downstream that is specific to their deployment) and used in more places. Another interesting related possibility was to add a new middleware to oslo.middleware that could do a similar thing for the API services and potentially provide useful performance metrics from them.

We had an extended discussion with the LINE team about this at the Oslo PTG table, and the next steps will be for them to fill out a spec for the new library and hopefully make their code changes available for review. Once that is done, we had commitments from a number of TC members to review and help shepherd this work along. All in all, this seems to be an area of great interest to the community and it will be exciting to see where it goes!

Policy Improvements

I'm going to once again refer you to Colleen's post, specifically the "Next Steps for Policy in OpenStack" section since this is being driven more by Keystone than Oslo. However, one interesting thing that was discussed with the Nova team that may affect Oslo was how to manage these changes if they end up taking more than one cycle. Because the oslo.policy deprecation mechanism is used to migrate services to the new-style policy rules, operators will start seeing quite a few deprecation messages in their logs once this work starts. If it takes more than one cycle then that means they may be seeing deprecations for multiple cycles, which is not ideal.

Currently Nova's plan is to queue up all of their policy changes in one big patch series of doom and once they are all done merge the whole thing at once. It remains to be seen how manageable such a patch series that touches code across the project will be though. If it proves untenable, we may need to implement some sort of switch in oslo.policy that would allow deprecations to be temporarily disabled while this work is ongoing, and then when all of the policy changes have been made the switch could be flipped so all of the deprecations take effect at once. As of now I have no plans to implement such a feature, but it's something to keep in mind as the other service projects get serious about doing their policy migrations.

oslo.limit

The news is somewhat mixed on this front. Unfortunately, the people (including me) who have been most involved in this work from the Keystone and Oslo sides are unlikely to be able to drive it to completion due to changing priorities. However, there is still interest from the Nova side, and I heard rumors at the PTG that there may be enough operator interest in the common quota work that they would be able to have someone help out too. It would be great if this is still able to be completed as it would be a shame to waste all of the design work and implementation of unified limits that has already been done. The majority of the initial API is available for review and just needs some massaging to be ready to merge. Once that happens, projects can start consuming it and provide feedback on whether it meets their needs.

Demo of Oslo Tools That Make Life Easier for Operators

A bit of shameless self-promotion, but this is a presentation I did in Shanghai. The recording isn't available yet, but I'll link it once it is. In essence, this was my attempt to evangelize some Oslo tools that have been added somewhat recently but people may not have been aware of. It covers what the tools are good for and how to actually use them.

Conclusion

As I tweeted on the last day of the PTG, this was a hard event for me to leave. Changes in my job responsibilities mean this was likely my last summit and my last opportunity to meet with the OpenStack family face-to-face. Overall it was a great week, albeit with some rough edges, which is a double-edged sword. If the week had gone terribly maybe I wouldn't have been so sad to leave, but on the other hand it was nice to go out on a high note.

If you made it this far, thanks! Please don't hesitate to contact me with any comments or questions.

by bnemec at November 14, 2019 06:35 PM

SUSE Conversations

The Brains Behind the Books – Part VII: Alexandra Settle

The content of this article has been contributed by Alexandra Settle, Technical Writer at the SUSE Documentation Team. It is part of a series of articles focusing on SUSE Documentation and the great minds that create the manuals, guides, quick starts, and many more helpful documents.       A Dream of  Ice Cream Shops and Lego […]

The post The Brains Behind the Books – Part VII: Alexandra Settle appeared first on SUSE Communities.

by chabowski at November 14, 2019 12:23 PM

November 13, 2019

Colleen Murphy

Shanghai Open Infrastructure Forum and PTG

The Open Infrastructure Summit, Forum, and Project Teams Gathering was held last week in the beautiful city of Shanghai. The event was held in the spirit of cross-cultural collaboration and attendees arrived with the intention of bridging the gap with a usually faraway but significant part of the OpenStack community …

by Colleen Murphy at November 13, 2019 01:00 AM

Sean McGinnis

November 2019 OpenStack Board Notes

The Open Infrastructure Summit was held in mainland China for the first time the week of November 4th, 2019, in Shanghai. As usual, we took advantage of the opportunity of having so many members in one place by having a Board of Directors meeting on Sunday, November 3.

Attendance was a little lighter due to visa challenges, travel budgets, and other issues. But we still had a quorum with a lot of folks in the room, and I’m sure it was a nice change for our Chinese board members and others from the APAC region.

The original meeting agenda is published on the wiki as usual.

OSF Updates

Following the usual pattern, Jonathan Bryce kicked things off with an update of Foundation and project activity.

One interesting thing that really stood out to me, which Jonathan also shared the next day in the opening keynotes, was an analyst report putting OpenStack's market at $7.7 billion in 2020. I am waiting for those slides to be published, but I think this really showed that despite the decrease in investment by companies in the development of OpenStack, its adoption is stable and still growing.

This was especially highlighted in China, with companies like China UnionPay, China Mobile, and other large companies from other industries increasing their use of OpenStack, and public clouds like Huawei and other local service providers basing their services on top of OpenStack.

I can definitely state from experience after that week that access to the typical big 3 public cloud providers in the US is a challenge through the Great Firewall. Being able to base your services on top of a borderless open source option like OpenStack is compelling given the current political pressures. A community-based solution, rather than a foreign tech company's offering, probably makes a lot of sense and is helping drive this adoption.

Of course, telecom adoption is still growing as well. I'm not as involved in that space, but it really seems like OpenStack is becoming the de facto standard for a programmable infrastructure to base dynamic NFV solutions on top of, both directly with VMs and bare metal, and as a locally controlled platform to serve as the underlying infrastructure for Kubernetes.

Updates and Community Reports

StarlingX Progress Report

The StarlingX project has made a lot of progress over the last several months. They are getting closer and closer to the latest OpenStack code. They have been actively working on getting their custom changes merged upstream so they do not need to continue maintaining a fork. So far, they have been able to get a lot of changes in to various projects. They hope to eventually be able to just deploy standard OpenStack services configured to meet their needs, focusing instead on the services on top of OpenStack that make StarlingX attractive and a great solution for edge infrastructure.

Indian Community Update

Prakash Ramchandran gave an update on the various meetups and events being organized across India. This is a large market for OpenStack. Recently approved government initiatives could make this an ideal time to help nurture the Indian OpenStack community.

I’m glad to see all of the activity that Prakash has been helping support there. This is another region where I expect to see a lot of growth in OpenStack adoption.

Interop Working Group

Egle gave an update of the Interop WG activity and the second 2019 set of changes were approved. Nothing too exciting there, with just minor updates to the interop requirements.

The larger discussion was about the need for and the health of the Interop WG. Chris Hoge was a very active contributor to this, but he recently left the OSF, and the OpenStack community, to pursue a different opportunity. Egle Sigler is really the only one left on the team, and she has shared that she would not be able to do much more with the group other than keeping the lights on.

This team is responsible for the guidelines that must be followed for someone to certify that their service or distribution of OpenStack meets the minimum functionality requirements to be consistent with other OpenStack deployments. This certification is needed to be able to use the OpenStack logo and be called “OpenStack Powered”.

I think there was pretty unanimous agreement that this kind of thing is still very important. Users need to be able to have a consistent user experience when moving between OpenStack-based clouds. Inconsistency would lead to unexpected behaviors or responses and a poor user experience.

For now it is a call for help and to raise awareness. It did make me think about how we've been able to decentralize some efforts within the community, like moving documentation into each team's repos rather than having a centralized docs team and docs repo. I wonder if we can put some of this work on the teams themselves to mark certain API calls as “core”, with some testing in place to ensure none of these APIs are changed or start producing different results. Something to think about at least.

First Contact SIG Update

The First Contact SIG works on things to make getting involved in the community easier. They’ve done a lot of work in the past on training and contributor documentation. They’ve recently added a Contributing Organization Guide that is targeted at the organization management level to help them understand how they can make an impact and help their employees to be involved and productive.

That’s an issue we’ve had to varying degrees in the past. Companies have had good intentions of getting involved, but they are not always sure where to start. Or they task a few employees to contribute without a good plan on how or where to do so. I think it will be good having a place to direct these companies to, to help them understand how to work with OpenStack and an open source community.

Troila Gold Member Application

Troila is an IT services company in China that provides a cloud product based on OpenStack to their customers. They have been using OpenStack for some time and saw the value in becoming an OSF Gold level sponsor.

As part of the Member Committee, Rob Esker and I met with them the week prior to go over their application and answer any questions and give feedback. That preview was pretty good, and Rob and I only had minor suggestions for them to help highlight what they have been doing with OpenStack and what their future plans were.

They had taken these suggestions and made updates to their presentation, and I think they did a very nice job explaining their goals. There was some discussion and additional questions from the board, but after a quick executive session, we voted and approved Troila as the latest Gold member of the OpenStack Foundation.

Combined Leadership Meeting

The second half of the day was a joint session with the Board and the Technical Committees or Technical Steering Committees of the OpenStack, StarlingX, Airship, Kata, and Zuul projects. Each team gave a community update for their respective areas.

My biggest takeaway from this was that although we are under-resourced in some areas, we really do have a large and very active community of people that really care about the things they are working on. Seeing growing adoption for things like Kata Containers and Zuul is really exciting.

Next Meeting

The next meeting will be a conference call on December 10th. No word yet on the agenda for that, but I wouldn't expect too much being so soon after Shanghai. I expect there will probably be some buzz about the annual elections coming up.

Once available, the agenda will be published to the usual spot.

I have only been able to finish out my term because the rest of the board voted to allow me to do so as an exception to the two-seats-per-company limit, since I had rejoined Dell halfway through the year. That won't apply for the next election, so if the three of us from Dell all hope to continue, one of us isn't going to be able to.

I’ve waffled on this a little, but at least right now, I do think I am going to run for election again. Prakash has been doing some great work with his participation in the India OpenStack community, so I will not feel too bad if I lose out to him. I do think I’ve been more integrated in the overall development community, so since an Individual Director is supposed to be a representative for the community, I do hope I can continue. That will be up to the broader community, so I am not going to worry about it. The community will be able to elect those they support, so no matter what it will be good.

by Sean McGinnis at November 13, 2019 12:00 AM

November 12, 2019

Sean McGinnis

Why is the Cinder mascot a horse?!

I have to admit, I have to laugh to myself every time I see the Cinder mascot in a keynote presentation.

Cinder horse mascot

History (or, why the hell is that the Cinder mascot!)

The reason at least a few of us find it so funny is that it’s a bit of an inside joke.

Way back in the early days of Cinder, someone from SolidFire came up with a great looking cinder block logo for the project. It was in the style of the OpenStack logo at the time and was nice and recognizable.

Cinder logo

Then around 2016, they decided it was time to refresh the OpenStack logo and make it look more modern and flat. Our old logo no longer matched the overall project, but we still loved it.

I did make an attempt to update it. I made a stylized version of the Cinder block logo using the new OpenStack logo as a basis for it. I really wish I could find it now, but I may have lost the image when I switched jobs. You may still see it on someone’s laptop - I had a very small batch of stickers made while I was still Cinder PTL.

It was soon after the OpenStack logo change that the Foundation decided to introduce mascots for each project. They asked each team to think of an animal that they could identify with. It was supposed to be a fun exercise for the teams to be able to pick their own kind of logo, with graphic designers coming up with very high quality images.

The Cinder team didn’t really have an obvious animal. At least not as obvious as a cinder block had been. It was during one of our midcycle meetups in Ft. Collins, CO, while we were brainstorming, that we came up with our horse.

Trying to think of something that would actually represent the team, we were talking over what Cinder actually was. We were mostly all from different storage vendors. We refer to the different storage devices that are used with Cinder as backends.

Backends are also what some call butts. Butts… asses. Donkeys are also called asses. Donkey!

One or two people on the team had cultural objections to having a donkey as a mascot. They didn't think it was a good representation of our project. So we compromised by going with a horse.

So we asked for a horse to be our mascot. The initial design they came up with was a Ferrari-looking stallion. Way too sporty and fierce for our team. Even though the OpenStack Foundation had actually published it and even created some stickers, we explained our, erm… thought process… behind coming up with the horse in the first place. The design team was great, and went back to the drawing board. The result is the back-end view of the horse that we have today. They even worked a little ‘C’ into the swish of the horse's tail.

So that’s the story behind the Cinder logo. It’s just because we’re all a bunch of backends.

by Sean McGinnis at November 12, 2019 12:00 AM

November 11, 2019

RDO

Community Blog Round Up 11 November 2019

As we dive into the Ussuri development cycle, I’m sad to report that there’s not a lot of writing happening upstream.

If you’re one of those people waiting for a call to action, THIS IS IT! We want to hear about your story, your problem, your accomplishment, your analogy, your fight, your win, your loss – all of it.

And, in the meantime, Adam Young says it’s not that cloud is difficult, it’s networking! Fierce words, Adam. And a super fierce article to boot.

Deleting Trunks in OpenStack before Deleting Ports by Adam Young

Cloud is easy. It is networking that is hard.

Read more at https://adam.younglogic.com/2019/11/deleting-trunks-before-ports/

by Rain Leander at November 11, 2019 01:46 PM

November 09, 2019

Aptira

OSN-Day

Aptira OSN Day

The Open Networking technology landscape has evolved quickly over the last two years. How can telcos keep up?

Our team of network experts has used Software Defined Networking techniques for many different use cases, including Traffic Engineering, Segment Routing, Integration and Automated Traffic Engineering, and many more, addressing many of the key challenges associated with networks - including security, volume and flexibility concerns - to provide customers with an uninterrupted user experience.

At OSN Day, we will be helping attendees to learn about the risks associated with 5G networks. Edge Compute is needed for 5G and 5G-enabled use cases, but currently 5G-enabled use cases are ill-defined and incremental revenue is uncertain. Therefore, it’s not clear what is actually required, and the Edge business case is risky. We’ll be on site explaining how to mitigate against these risks, ensuring successful network functionality through the implementation of a risk-optimised approach to 5G. You can download the full whitepaper here.

We will also have our amazingly talented Network Consultant Farzaneh Pakzad presenting in The Programmable Network breakout track. Farzaneh will be comparing, rating and evaluating each of the most popular Open Source SDN controllers in use today. This comparison will be useful for organisations to help them select the right SDN controller for their platform, one that matches their network design and requirements.

Farzaneh has a PhD in Software Defined Networks from the University of Queensland. Her research interests include Software Defined Networks, Cloud Computing and Network Security. During her career, Farzaneh has provided advisory services for transport SDN solutions and implemented Software Defined Networking Wide Area Network functionalities for some of Australia’s largest telcos.

We’ve got some great swag to give away and will also be running a demonstration of Tungsten Fabric as a Kubernetes CNI, so if you’re at OSN Day make sure you check out Farzaneh’s session in Breakout room 2 and also visit the team of Aptira Solutionauts in the expo room. They can help you create, design and deploy the network of tomorrow.

Ready to move your network into the software defined future?
Automate your network with ONAP.

Find Out How

The post OSN-Day appeared first on Aptira.

by Jessica Field at November 09, 2019 12:53 PM

November 07, 2019

Adam Young

Deleting Trunks in OpenStack before Deleting Ports

Cloud is easy. It is networking that is hard.

Red Hat supports installing OpenShift on OpenStack. As a Cloud SA, I need to be able to demonstrate this, and make it work for customers. As I was playing around with it, I found I could not tear down clusters due to a dependency issue with ports.


When building and tearing down network structures with Ansible, I had learned the hard way that there were dependencies. Routers came down before subnets, and so on. But the latest round had me scratching my head. I could not get ports to delete, and the error message was no help.

I was able to figure out that the ports were linked to security groups. In fact, I could unset almost all of the dependencies using the port set command line. For example:

openstack port set openshift-q5nqj-master-port-1  --no-security-group --no-allowed-address --no-tag --no-fixed-ip

However, I still could not delete the ports. I did notice that there was a trunk_details section at the bottom of the port show output:

trunk_details         | {'trunk_id': 'dd1609af-4a90-4a9e-9ea4-5f89c63fb9ce', 'sub_ports': []} 

But there is no way to “unset” that. It turns out I had it backwards. You need to delete the port first. A message from Kristi Nikolla:

the port is set as the parent for a “trunk” so you need to delete the trunk first

Kristi in IRC

curl -H "x-auth-token: $TOKEN" https://kaizen.massopen.cloud:13696/v2.0/trunks/

It turns out that you can do this with the CLI…at least I could.

$ openstack network trunk show 01a19e41-49c6-467c-a726-404ffedccfbb
+-----------------+----------------------------------------+
| Field           | Value                                  |
+-----------------+----------------------------------------+
| admin_state_up  | UP                                     |
| created_at      | 2019-11-04T02:58:08Z                   |
| description     |                                        |
| id              | 01a19e41-49c6-467c-a726-404ffedccfbb   |
| name            | openshift-zq7wj-master-trunk-1         |
| port_id         | 6f4d1ecc-934b-4d29-9fdd-077ffd48b7d8   |
| project_id      | b9f1401936314975974153d78b78b933       |
| revision_number | 3                                      |
| status          | DOWN                                   |
| sub_ports       |                                        |
| tags            | ['openshiftClusterID=openshift-zq7wj'] |
| tenant_id       | b9f1401936314975974153d78b78b933       |
| updated_at      | 2019-11-04T03:09:49Z                   |
+-----------------+----------------------------------------+

Here is the script I used to delete them. Notice that the status was DOWN for all of the ports I wanted gone.

for PORT in $( openstack port list | awk '/DOWN/ {print $2}' ); do
    TRUNK_ID=$( openstack port show $PORT -f json | jq -r '.trunk_details | .trunk_id' )
    echo port $PORT has trunk $TRUNK_ID
    openstack network trunk delete $TRUNK_ID
done

Kristi had used the curl command because he did not have the network trunk option in his CLI. Turns out he needed to install python-neutronclient first.
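
For anyone hitting the same gap, the fix is quick - a rough sketch, since how you install the client will depend on your distro and environment:

pip install python-neutronclient
openstack network trunk list

With that plugin installed, the openstack network trunk commands used above become available.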

by Adam Young at November 07, 2019 07:27 PM

November 06, 2019

StackHPC Team Blog

Worlds Collide: Virtual Machines & Bare Metal in OpenStack

Ironic's mascot, Pixie Boots

To virtualise or not to virtualise?

If performance is what you need, then there's no debate - bare metal still beats virtual machines; particularly for I/O intensive applications. However, unless you can guarantee to keep it fully utilised, iron comes at a price. In this article we describe how Nova can be used to provide access to both hypervisors and bare metal compute nodes in a unified manner.

Scheduling

When support for bare metal compute via Ironic was first introduced to Nova, it could not easily coexist with traditional hypervisor-based workloads. Reported workarounds typically involved the use of host aggregates and flavor properties.

Scheduling of bare metal is covered in detail in our bespoke bare metal blog article (see Recap: Scheduling in Nova).

Since the Placement service was introduced, scheduling has significantly changed for bare metal. The standard vCPU, memory and disk resources were replaced with a single unit of a custom resource class for each Ironic node. There are two key side-effects of this:

  • a bare metal node is either entirely allocated or not at all
  • the resource classes used by virtual machines and bare metal are disjoint, so we could not end up with a VM flavor being scheduled to a bare metal node

A flavor for a 'tiny' VM might look like this:

openstack flavor show vm-tiny -f json -c name -c vcpus -c ram -c disk -c properties
{
  "name": "vm-tiny",
  "vcpus": 1,
  "ram": 1024,
  "disk": 1,
  "properties": ""
}

A bare metal flavor for 'gold' nodes could look like this:

openstack flavor show bare-metal-gold -f json -c name -c vcpus -c ram -c disk -c properties
{
  "name": "bare-metal-gold",
  "vcpus": 64,
  "ram": 131072,
  "disk": 371,
  "properties": "resources:CUSTOM_GOLD='1',
                 resources:DISK_GB='0',
                 resources:MEMORY_MB='0',
                 resources:VCPU='0'"
}

Note that the vCPU/RAM/disk resources are informational only, and are zeroed out via properties for scheduling purposes. We will discuss this further later on.

With flavors in place, users choose between VMs and bare metal simply by picking the appropriate flavor.
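
For completeness, here is a rough sketch of how such a bare metal flavor might be put together, assuming the Ironic nodes report a resource class of GOLD (the flavor name and sizes are placeholders):

openstack flavor create --vcpus 64 --ram 131072 --disk 371 bare-metal-gold
openstack flavor set bare-metal-gold \
    --property resources:CUSTOM_GOLD=1 \
    --property resources:VCPU=0 \
    --property resources:MEMORY_MB=0 \
    --property resources:DISK_GB=0

The vCPU/RAM/disk values remain purely informational; scheduling is driven entirely by the single unit of CUSTOM_GOLD.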

What about networking?

In our mixed environment, we might want our VMs and bare metal instances to be able to communicate with each other, or we might want them to be isolated from each other. Both models are possible, and work in the same way as a typical cloud - Neutron networks are isolated from each other until connected via a Neutron router.

Bare metal compute nodes typically use VLAN or flat networking, although with the right combination of network hardware and Neutron plugins other models may be possible. With VLAN networking, assuming that hypervisors are connected to the same physical network as bare metal compute nodes, then attaching a VM to the same VLAN as a bare metal compute instance will provide L2 connectivity between them. Alternatively, it should be possible to use a Neutron router to join up bare metal instances on a VLAN with VMs on another network e.g. VXLAN.

What does this look like in practice? We need a combination of Neutron plugins/drivers that support both VM and bare metal networking. To connect bare metal servers to tenant networks, it is necessary for Neutron to configure physical network devices. We typically use the networking-generic-switch ML2 mechanism driver for this, although the networking-ansible driver is emerging as a promising vendor-neutral alternative. These drivers support bare metal ports, that is Neutron ports with a VNIC_TYPE of baremetal. Vendor-specific drivers are also available, and may support both VMs and bare metal.
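
As a rough illustration - the network names, subnet range and physical network label below are placeholders rather than anything prescribed by these drivers - a bare metal port on a tenant VLAN might be created along these lines:

openstack network create --provider-network-type vlan --provider-physical-network physnet1 bm-net
openstack subnet create --network bm-net --subnet-range 192.168.10.0/24 bm-subnet
openstack port create --network bm-net --vnic-type baremetal bm-port-0

The bare metal aware ML2 driver then takes care of configuring the physical switch port when an instance using such a port is deployed.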

Where's the catch?

One issue that more mature clouds may encounter is around the transition from scheduling based on standard resource classes (vCPU, RAM, disk), to scheduling based on custom resource classes. If old bare metal instances exist that were created in the Rocky release or earlier, they may have standard resource class inventory in Placement, in addition to their custom resource class. For example, here is the inventory reported to Placement for such a node:

$ openstack resource provider inventory list <node UUID>
+----------------+------------------+----------+----------+-----------+----------+--------+
| resource_class | allocation_ratio | max_unit | reserved | step_size | min_unit |  total |
+----------------+------------------+----------+----------+-----------+----------+--------+
| VCPU           |              1.0 |       64 |        0 |         1 |        1 |     64 |
| MEMORY_MB      |              1.0 |   131072 |        0 |         1 |        1 | 131072 |
| DISK_GB        |              1.0 |      371 |        0 |         1 |        1 |    371 |
| CUSTOM_GOLD    |              1.0 |        1 |        0 |         1 |        1 |      1 |
+----------------+------------------+----------+----------+-----------+----------+--------+

If this node is allocated to an instance whose flavor requested (or did not explicitly zero out) standard resource classes, we will have a usage like this:

$ openstack resource provider usage show <node UUID>
+----------------+--------+
| resource_class |  usage |
+----------------+--------+
| VCPU           |     64 |
| MEMORY_MB      | 131072 |
| DISK_GB        |    371 |
| CUSTOM_GOLD    |      1 |
+----------------+--------+

If this instance is deleted, the standard resource class inventory will become available, and may be selected by the scheduler for a VM. This is not likely to end well. What we must do is ensure that these resources are not reported to Placement. This is done by default in the Stein release of Nova, and Rocky may be configured to do the same by setting the following in nova.conf:

[workarounds]
report_ironic_standard_resource_class_inventory = False

However, if we do that, then Nova will attempt to remove inventory from Placement resource providers that is already consumed by our instance, and will receive an HTTP 409 Conflict. This will quickly fill our logs with unhelpful noise.

Flavor migration

Thankfully, there is a solution. We can modify the embedded flavor in our existing instances to remove the standard resource class inventory, which will result in the removal of the allocation of these resources from Placement. This will allow Nova to remove the inventory from the resource provider. There is a Nova patch started by Matt Riedemann which will remove our standard resource class inventory. The patch needs pushing over the line, but works well enough to be cherry-picked to Rocky.

The migration can be done offline or online. We chose to do it offline, to avoid the need to deploy this patch. For each node to be migrated:

nova-manage db ironic_flavor_migration --resource_class <node resource class> --host <host> --node <node UUID>

Alternatively, if all nodes have the same resource class:

nova-manage db ironic_flavor_migration --resource_class <node resource class> --all

You can check the instance embedded flavors have been updated correctly via the database:

sql> use nova
sql> select flavor from instance_extra;

Now (Rocky only), standard resource class inventory reporting can be disabled. After the nova compute service has been running for a while, Placement will be updated:

$ openstack resource provider inventory list <node UUID>
+----------------+------------------+----------+----------+-----------+----------+-------+
| resource_class | allocation_ratio | max_unit | reserved | step_size | min_unit | total |
+----------------+------------------+----------+----------+-----------+----------+-------+
| CUSTOM_GOLD    |              1.0 |        1 |        0 |         1 |        1 |     1 |
+----------------+------------------+----------+----------+-----------+----------+-------+

$ openstack resource provider usage show <node UUID>
+----------------+--------+
| resource_class |  usage |
+----------------+--------+
| CUSTOM_GOLD    |      1 |
+----------------+--------+

Summary

We hope this shows that OpenStack is now in a place where VMs and bare metal can coexist peacefully, and that even for those pesky pets, there is a path forward to this brave new world. Thanks to the Nova team for working hard to make Ironic a first class citizen.

by Mark Goddard at November 06, 2019 02:00 AM

November 04, 2019

Dan Smith

Start and Monitor Image Pre-cache Operations in Nova

When you boot an instance in Nova, you provide a reference to an image. In many cases, once Nova has selected a host, the virt driver on that node downloads the image from Glance and uses it as the basis for the root disk of your instance. If your nodes are using a virt driver that supports image caching, then that image only needs to be downloaded once per node, which means the first instance to use that image causes it to be downloaded (and thus has to wait). Subsequent instances based on that image will boot much faster as the image is already resident.

If you manage an application that involves booting a lot of instances from the same image, you know that the time-to-boot for those instances could be vastly reduced if the image is already resident on the compute nodes you will land on. If you are trying to avoid the latency of rolling out a new image, this becomes a critical calculation. For years, people have asked for or proposed solutions in Nova for allowing some sort of image pre-caching to solve this, but those discussions have always become stalled in detail hell. Some people have resorted to hacks like booting host-targeted tiny instances ahead of time, direct injection of image files to Nova’s cache directory, or local code modifications. Starting in the Ussuri release, such hacks will no longer be necessary.

Image pre-caching in Ussuri

Nova’s now-merged image caching feature includes a very lightweight and no-promises way to request that an image be cached on a group of hosts (defined by a host aggregate). In order to avoid some of the roadblocks to success that have plagued previous attempts, the new API does not attempt to provide a rich status result, nor a way to poll for or check on the status of a caching operation. There is also no scheduling, persistence, or reporting of which images are cached where. Asking Nova to cache one or more images on a group of hosts is similar to asking those hosts to boot an instance there, but without the overhead that goes along with it. That means that images cached as part of such a request will be subject to the same expiry timer as any other. If you want them to remain resident on the nodes permanently, you must re-request the images before the expiry timer would have purged them. Each time an image is pre-cached on a host, the timestamp for purge is updated if the image is already resident.

Obviously for a large cloud, status and monitoring of the cache process in some way is required, especially if you are waiting for it to complete before starting a rollout. The subject of this post is to demonstrate how this can be done with notifications.

Example setup

Before we can talk about how to kick off and monitor a caching operation, we need to set up the basic elements of a deployment. That means we need some compute nodes, and for those nodes to be in an aggregate that represents the group that will be the target of our pre-caching operation. In this example, I have a 100-node cloud with numbered nodes that look like this:

$ nova service-list --binary nova-compute
+--------------+--------------+
| Binary | Host |
+--------------+--------------+
| nova-compute | guaranine1 |
| nova-compute | guaranine2 |
| nova-compute | guaranine3 |
| nova-compute | guaranine4 |
| nova-compute | guaranine5 |
| nova-compute | guaranine6 |
| nova-compute | guaranine7 |
.... and so on ...
| nova-compute | guaranine100 |
+--------------+--------------+

In order to be able to request that an image be pre-cached on these nodes, I need to put some of them into an aggregate. I will do that programmatically since there are so many of them like this:

$ nova aggregate-create my-application
+----+-----------------+-------------------+-------+----------+--------------------------------------+
| Id | Name | Availability Zone | Hosts | Metadata | UUID |
+----+-----------------+-------------------+-------+----------+--------------------------------------+
| 2 | my-application | - | | | cf6aa111-cade-4477-a185-a5c869bc3954 |
+----+-----------------+-------------------+-------+----------+--------------------------------------+
$ for i in $(seq 1 95); do nova aggregate-add-host my-application guaranine$i; done
... lots of noise ...

Now that I have done that, I am able to request that an image be pre-cached on all the nodes within that aggregate by using the nova aggregate-cache-images command:

$ nova aggregate-cache-images my-application c3b84ecf-43e9-4c6c-adfd-ab6db0e2bca2

If all goes to plan, sometime in the future all of the hosts in that aggregate will have fetched the image into their local cache and will be able to use that for subsequent instance creation. Depending on your configuration, that happens largely sequentially to avoid storming Glance, and with so many hosts and a decently-sized image, it could take a while. If I am waiting to deploy my application until all the compute hosts have the image, I need some way of monitoring the process.
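
As an aside, the fan-out rate is configurable on the Nova side; the relevant knob is expected to live with the image cache options in nova.conf, along the lines below (the option name here should be treated as an assumption - check the configuration reference for your release):

[image_cache]
# Assumed option name: number of compute hosts asked to pre-cache an image in parallel
precache_concurrency = 4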

Monitoring progress

Many of the OpenStack services send notifications via the messaging bus (i.e. RabbitMQ) and Nova is no exception. That means that whenever things happen, Nova sends information about those things to a queue on that bus (if so configured) which you can use to receive asynchronous information about the system.

The image pre-cache operation sends start and end versioned notifications, as well as progress notifications for each host in the aggregate, which allows you to follow along. Ensure that you have set [notifications]/notification_format=versioned in your config file in order to receive these. A sample intermediate notification looks like this:

{
'index': 68,
'total': 95,
'images_failed': [],
'uuid': 'ccf82bd4-a15e-43c5-83ad-b23970338139',
'images_cached': ['c3b84ecf-43e9-4c6c-adfd-ab6db0e2bca2'],
'host': 'guaranine68',
'id': 1,
'name': 'my-application',
}

This tells us that host guaranine68 just completed its cache operation for one image in the my-application aggregate. It was host 68 of 95 total. Since the image ID we used is in the images_cached list, that means it was either successfully downloaded on that node, or was already present. If the image failed to download for some reason, it would be in the images_failed list.

In order to demonstrate what this might look like, I wrote some example code. This is not intended to be production-ready, but will provide a template for you to write something of your own to connect to the bus and monitor a cache operation. You would run this before kicking off the process; it waits for a cache operation to begin, prints information about progress, and then exits with a non-zero status code if any errors were detected. For the above example invocation, the output looks like this:

$ python image_cache_watcher.py
Image cache started on 95 hosts
Aggregate 'foo' host 95: 100% complete (8 errors)
Completed 94 hosts, 8 errors in 2m31s
Errors from hosts:
guaranine2
guaranine3
guaranine4
guaranine5
guaranine6
guaranine7
guaranine8
guaranine9
Image c3b84ecf-43e9-4c6c-adfd-ab6db0e2bca2 failed 8 times

In this case, I intentionally configured eight hosts so that the image download would fail for demonstration purposes.

Future

The image caching functionality in Nova may gain more features in the future, but for now, it is a best-effort sort of thing. With just a little bit of scripting, Ussuri operators should be able to kick off and monitor image pre-cache operations and substantially improve time-to-boot performance for their users.

by Dan at November 04, 2019 07:30 PM

Mirantis

How to build an edge cloud part 1: Building a simple facial recognition system

Learn about the basics of building an edge cloud -- and build a facial recognition system while you're at it.

by Nick Chase at November 04, 2019 07:06 PM

OpenStack Superuser

Baidu wins Superuser Award at Open Infrastructure Summit Shanghai

The Baidu ABC Cloud Group and Security Edge teams are the 11th organization to win the Superuser Award. The news was announced today at the Open Infrastructure Summit in Shanghai. Baidu ABC Cloud Group and Edge Security Team integrated Kata Containers into the platform for all of Baidu’s internal and external cloud services, including edge applications. Their cloud products, including both VMs and bare metal servers, cover 11 regions in China with over 5,000 physical machines, and 17 important online businesses have been migrated to the Kata Containers platform thus far.

Elected by members of the OSF community, the team that wins the Superuser Award is lauded for the unique nature of its use case as well as its integration and application of open infrastructure. Four out of five nominees for the Superuser Award presented today were from the APAC region: Baidu ABC Cloud Group and Edge Security Team, InCloud OpenStack Team of Inspur, Information Management Department of Wuxi Metro, and Rakuten Mobile Network Organization. Previous award winners from the APAC region include China Mobile, NTT Group, and the Tencent TStack Team.

Baidu Keynote at Open Infrastructure Summit

On the keynote stage in Shanghai, Baidu Cloud Senior Architect Zhang Yu explained that Kata Containers provides a virtual machine-like security mechanism at the container level, which gives their customers a great deal of confidence. When moving their business to a container environment, they have less concern. Kata Containers is compatible with the OCI standard and users can directly manage the new environment with popular management suites such as Kubernetes. Kata Containers is now an official project under the OpenStack Foundation, which gives the company confidence to invest in the project.

“Baidu is an amazing example of how open infrastructure starts with OpenStack,” said Mark Collier, COO of the OpenStack Foundation. “They’re running OpenStack at massive scale, combined with other open infrastructure technologies like Kata Containers and Kubernetes, and they’re doing it in production for business-critical workloads.”

*** Download the Baidu Kata Containers White Paper ***

The company has published a white paper titled, “The Application of Kata Containers in Baidu AI Cloud” available here.

The post Baidu wins Superuser Award at Open Infrastructure Summit Shanghai appeared first on Superuser.

by Allison Price at November 04, 2019 04:39 AM

November 02, 2019

StackHPC Team Blog

StackHPC joins the OpenStack Marketplace

In many areas, our participation in the OpenStack community is no secret.

One area we haven't focussed on is our commercial representation within the OpenStack Foundation. As described here, StackHPC works with clients to solve challenging problems with cloud infrastructure. Our business has been won through word of mouth.

Now our services can also be found in the OpenStack Marketplace.

John Taylor, StackHPC's co-founder and CEO, adds:

We are pleased to announce our OpenStack Foundation membership and inclusion in the OpenStack Marketplace. Our success in driving the HPC and Research Computing use-case in cloud has been in no small part coupled to working closely with the OpenStack Foundation and the open community it fosters. The era of hybrid cloud and the emergence of converged AI/HPC infrastructure and coupled workflows is now upon us, driving the need for architectures that seamlessly transition across these resources while not compromising on performance. We look forward to continuing our partnership with OpenStack through the Scientific SIG and to active participation within OpenStack projects.

Get in touch

If you would like to get in touch we would love to hear from you. Reach out to us via Twitter or directly via our contact page.

by Stig Telfer at November 02, 2019 09:00 AM

October 31, 2019

RDO

RDO Train Released

The RDO community is pleased to announce the general availability of the RDO build for OpenStack Train for RPM-based distributions, CentOS Linux and Red Hat Enterprise Linux. RDO is suitable for building private, public, and hybrid clouds. Train is the 20th release from the OpenStack project, which is the work of more than 1115 contributors from around the world.

The release is already available on the CentOS mirror network at http://mirror.centos.org/centos/7/cloud/x86_64/openstack-train/. While we normally also make the release available via http://mirror.centos.org/altarch/7/cloud/ppc64le/ and http://mirror.centos.org/altarch/7/cloud/aarch64/, there have been issues with the mirror network which are currently being addressed via https://bugs.centos.org/view.php?id=16590.

The RDO community project curates, packages, builds, tests and maintains a complete OpenStack component set for RHEL and CentOS Linux and is a member of the CentOS Cloud Infrastructure SIG. The Cloud Infrastructure SIG focuses on delivering a great user experience for CentOS Linux users looking to build and maintain their own on-premise, public or hybrid clouds.

All work on RDO and on the downstream release, Red Hat OpenStack Platform, is 100% open source, with all code changes going upstream first.

PLEASE NOTE: At this time, RDO Train provides packages for CentOS7 only. We plan to move RDO to use CentOS8 as soon as possible during Ussuri development cycle so Train will be the last release working on CentOS7.

Interesting things in the Train release include:

  • OpenStack-Ansible, which provides Ansible playbooks and roles for deployment, added Murano support and fully migrated to systemd-journald from rsyslog. This project makes it possible to deploy OpenStack from source in a way that is scalable while also being simple to operate, upgrade, and grow.
  • Ironic, the Bare Metal service, aims to produce an OpenStack service and associated libraries capable of managing and provisioning physical machines in a security-aware and fault-tolerant manner. Beyond providing basic support for building software RAID and a myriad of other highlights, this project now offers a new tool for building ramdisk images, ironic-python-agent-builder.

Other improvements include:

  • Tobiko is now available within RDO! This project is an OpenStack testing framework focusing on areas mostly complementary to Tempest. While Tempest’s main focus has been testing OpenStack REST APIs, Tobiko’s main focus is testing OpenStack system operations while “simulating” the use of the cloud as a final user would. Tobiko’s test cases populate the cloud with workloads such as instances, allow the CI workflow to perform an operation such as an update or upgrade, and then run test cases to validate that the cloud workloads are still functional.
  • Other highlights of the broader upstream OpenStack project may be read via https://releases.openstack.org/train/highlights.html.

Contributors
During the Train cycle, we saw the following new RDO contributors:

  • Joel Capitao
  • Zoltan Caplovic
  • Sorin Sbarnea
  • Sławek Kapłoński
  • Damien Ciabrini
  • Beagles
  • Soniya Vyas
  • Kevin Carter (cloudnull)
  • fpantano
  • Michał Dulko
  • Stephen Finucane
  • Sofer Athlan-Guyot
  • Gauvain Pocentek
  • John Fulton
  • Pete Zaitcev

Welcome to all of you and Thank You So Much for participating!

But we wouldn’t want to overlook anyone. A super massive Thank You to all 65 contributors who participated in producing this release. This list includes commits to rdo-packages and rdo-infra repositories:

  • Adam Kimball
  • Alan Bishop
  • Alex Schultz
  • Alfredo Moralejo
  • Arx Cruz
  • Beagles
  • Bernard Cafarelli
  • Bogdan Dobrelya
  • Brian Rosmaita
  • Carlos Goncalves
  • Cédric Jeanneret
  • Chandan Kumar
  • Damien Ciabrini
  • Daniel Alvarez
  • David Moreau Simard
  • Dmitry Tantsur
  • Emilien Macchi
  • Eric Harney
  • fpantano
  • Gael Chamoulaud
  • Gauvain Pocentek
  • Jakub Libosvar
  • James Slagle
  • Javier Peña
  • Joel Capitao
  • John Fulton
  • Jon Schlueter
  • Kashyap Chamarthy
  • Kevin Carter (cloudnull)
  • Lee Yarwood
  • Lon Hohberger
  • Luigi Toscano
  • Luka Peschke
  • marios
  • Martin Kopec
  • Martin Mágr
  • Matthias Runge
  • Michael Turek
  • Michał Dulko
  • Michele Baldessari
  • Natal Ngétal
  • Nicolas Hicher
  • Nir Magnezi
  • Otherwiseguy
  • Gabriele Cerami
  • Pete Zaitcev
  • Quique Llorente
  • Radomiropieralski
  • Rafael Folco
  • Rlandy
  • Sagi Shnaidman
  • shrjoshi
  • Sławek Kapłoński
  • Sofer Athlan-Guyot
  • Soniya Vyas
  • Sorin Sbarnea
  • Stephen Finucane
  • Steve Baker
  • Steve Linabery
  • Tobias Urdin
  • Tony Breeds
  • Tristan de Cacqueray
  • Victoria Martinez de la Cruz
  • Wes Hayutin
  • Yatin Karel
  • Zoltan Caplovic

The Next Release Cycle
At the end of one release, focus shifts immediately to the next, Ussuri, which has an estimated GA the week of 11-15 May 2020. The full schedule is available at https://releases.openstack.org/ussuri/schedule.html.

Twice during each release cycle, RDO hosts official Test Days shortly after the first and third milestones; therefore, the upcoming test days are 19-20 December 2019 for Milestone One and 16-17 April 2020 for Milestone Three.

Get Started
There are three ways to get started with RDO.

To spin up a proof of concept cloud, quickly, and on limited hardware, try an All-In-One Packstack installation. You can run RDO on a single node to get a feel for how it works.
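
On a fresh CentOS 7 machine, the Packstack route boils down to something like the following (a sketch only - see the Packstack quickstart documentation for host prerequisites and configuration options):

sudo yum install -y centos-release-openstack-train
sudo yum update -y
sudo yum install -y openstack-packstack
sudo packstack --allinone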

For a production deployment of RDO, use the TripleO Quickstart and you’ll be running a production cloud in short order.

Finally, for those that don’t have any hardware or physical resources, there’s the OpenStack Global Passport Program. This is a collaborative effort between OpenStack public cloud providers to let you experience the freedom, performance and interoperability of open source infrastructure. You can quickly and easily gain access to OpenStack infrastructure via trial programs from participating OpenStack public cloud providers around the world.

Get Help
The RDO Project participates in a Q&A service at https://ask.openstack.org. We also have the users@lists.rdoproject.org mailing list for RDO-specific users and operators. For more developer-oriented content we recommend joining the dev@lists.rdoproject.org mailing list. Remember to post a brief introduction about yourself and your RDO story. The mailing list archives are all available at https://mail.rdoproject.org. You can also find extensive documentation on RDOproject.org.

The #rdo channel on Freenode IRC is also an excellent place to find and give help.

We also welcome comments and requests on the CentOS devel mailing list and the CentOS and TripleO IRC channels (#centos, #centos-devel, and #tripleo on irc.freenode.net), however we have a more focused audience within the RDO venues.

Get Involved
To get involved in the OpenStack RPM packaging effort, check out the RDO contribute pages, peruse the CentOS Cloud SIG page, and inhale the RDO packaging documentation.

Join us in #rdo and #tripleo on the Freenode IRC network and follow us on Twitter @RDOCommunity. You can also find us on Facebook and YouTube.

by Rain Leander at October 31, 2019 04:18 PM

October 29, 2019

VEXXHOST Inc.

Join us at the Open Infrastructure Summit in Shanghai!

As active users and contributors to OpenStack and its projects, VEXXHOST is excited to be attending the Open Infrastructure Summit in Shanghai this year!

The post Join us at the Open Infrastructure Summit in Shanghai! appeared first on VEXXHOST.

by Samridhi Sharma at October 29, 2019 04:48 PM

Galera Cluster by Codership

Galera Cluster for MySQL 5.6.46 and MySQL 5.7.28 is GA

Codership is pleased to announce a new Generally Available (GA) release of Galera Cluster for MySQL 5.6 and 5.7, consisting of MySQL-wsrep 5.6.46 (release notes, download) and MySQL-wsrep 5.7.28 (release notes, download). There is no Galera replication library release this time, so please continue using the 3.28 version, implementing wsrep API version 25.

This release incorporates all changes to MySQL 5.6.46 and 5.7.28 respectively and can be considered an updated rebased version. It is worth noting that we will have some platforms reach end of life (EOL) status, notably OpenSUSE 13.2 and Ubuntu Trusty 14.04.

You can get the latest release of Galera Cluster from https://www.galeracluster.com. There are package repositories for Debian, Ubuntu, CentOS, RHEL, OpenSUSE and SLES. The latest versions are also available via the FreeBSD Ports Collection.

by Colin Charles at October 29, 2019 12:42 PM

October 28, 2019

Mirantis

53 Things to look for in OpenStack Train

Now that OpenStack Train has been released, here are some features to look for.

by Nick Chase at October 28, 2019 04:27 PM

October 24, 2019

OpenStack Superuser

Using GitHub and Gerrit with Zuul: A leboncoin case study

Described as an online flea market, leboncoin is a portal that allows individuals to buy and sell new and used goods online in their local communities.  Leboncoin is one of the top ten searched websites in France, following Google, Facebook, and YouTube to name a few.

We got talking with Guillaume Chenuet to get some answers to why Leboncoin chose Zuul, an open source CI tool, and how they use it with GitHub, Gerrit, and OpenStack.  

How did your organization get started with Zuul?

We started using Zuul for open source CI two years ago with Zuulv2 and Jenkins. At the beginning, we only used Gerrit and Jenkins, but as new developers joined leboncoin every day, this solution was not enough. After some research and a proof-of-concept, we gave Zuul a try, running between Gerrit and Jenkins. In less than a month (and without much official documentation) we had set up a completely new stack. We ran it for a year before moving to Zuulv3. Zuulv3 is more complex in terms of setup but brings us more features using up-to-date tools like Ansible or OpenStack.

Describe how you’re using it:

We’re using Zuulv3 with Gerrit. Our workflow is close to the OpenStack one. For each review, Zuul is triggered on three “check” pipelines: quality, integration and build. Once the results are correct, we use the gate system to merge the code into the repositories and build artifacts.

We are using two small OpenStack clusters (3 CTRL / 3 STRG / 5 COMPUTE) in each datacenter. Zuul is currently set up on all Gerrit projects and some GitHub projects too. Below is our Zuulv3 infrastructure in production and in the case of datacenter loss.

 

Zuulv3 infrastructure in production.

 

Zuulv3 infrastructure in the case of DC loss.

What is your current scale?

In terms of compute resources, we currently have 480 cores, 1.3 TB of RAM and 80 TB available in our Ceph clusters. In terms of jobs, we run around 60,000 jobs per month, which means around 2,500 jobs per day. Average job time is less than five minutes.

 

What benefits has your organization seen from using Zuul?

As leboncoin is growing very fast (and microservices too 🙂 ), Zuul allows us to ensure everything can be tested and at scale. Zuul is also able to work with Gerrit and GitHub which permits us to open our CI to more teams and workflows.

What have the challenges been (and how have you solved them)?

Our big challenge was to migrate from Zuulv2 to Zuulv3. Even though everything is using Ansible, it was very tiresome to migrate all our CI jobs (around 500 Jenkins jobs). With the help of the Zuul folks on IRC, we used some Ansible roles and playbooks used by OpenStack, but the migration took about a year.

What are your future plans with Zuul?

Our next steps are to use Kubernetes backend for small jobs like linters and improve Zuul with GitHub.

How can organizations who are interested in Zuul learn more and get involved?

Coming from OpenStack, I think meeting the community at Summits or on IRC is a good start. But Zuul needs better visibility. It is a powerful tool but the information online is limited.

Are there specific features that drew you to Zuul?

Scalability! And also ensuring that every commit merged into the repository is clean and can’t break it.

What would you request from the Zuul upstream community?

Better integration with Gerrit 3, new Nodepool features and providers, full HA and more visibility on the Internet.

 

Cover image courtesy of Guillaume Chenuet.

The post Using GitHub and Gerrit with Zuul: A leboncoin case study appeared first on Superuser.

by Ashleigh Gregory at October 24, 2019 02:00 PM

October 23, 2019

Corey Bryant

OpenStack Train for Ubuntu 18.04 LTS

The Ubuntu OpenStack team at Canonical is pleased to announce the general availability of OpenStack Train on Ubuntu 18.04 LTS via the Ubuntu Cloud Archive. Details of the Train release can be found at:  https://www.openstack.org/software/train

To get access to the Ubuntu Train packages:

Ubuntu 18.04 LTS

You can enable the Ubuntu Cloud Archive pocket for OpenStack Train on Ubuntu 18.04 installations by running the following commands:

    sudo add-apt-repository cloud-archive:train
    sudo apt update
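
Once the pocket is enabled, a quick way to confirm that packages will be pulled from the Train cloud archive is to check a package’s candidate version and source; nova-common is used here purely as an example:

    apt policy nova-common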

The Ubuntu Cloud Archive for Train includes updates for:

aodh, barbican, ceilometer, ceph (14.2.2), cinder, designate, designate-dashboard, dpdk (18.11.2), glance, gnocchi, heat, heat-dashboard, horizon, ironic, keystone, libvirt (5.4.0), magnum, manila, manila-ui, mistral, murano, murano-dashboard, networking-arista, networking-bagpipe, networking-bgpvpn, networking-hyperv, networking-l2gw, networking-mlnx, networking-odl, networking-ovn, networking-sfc, neutron, neutron-dynamic-routing, neutron-fwaas, neutron-lbaas, neutron-lbaas-dashboard, neutron-vpnaas, nova, octavia, openstack-trove, openvswitch (2.12.0), panko, placement, qemu (4.0), sahara, sahara-dashboard, senlin, swift, trove-dashboard, vmware-nsx, watcher, and zaqar.

For a full list of packages and versions, please refer to:

http://reqorts.qa.ubuntu.com/reports/ubuntu-server/cloud-archive/train_versions.html

Python support

The Train release of Ubuntu OpenStack is Python 3 only; all Python 2 packages have been dropped in Train.

Branch package builds

If you would like to try out the latest updates to branches, we deliver continuously integrated packages on each upstream commit via the following PPA’s:

    sudo add-apt-repository ppa:openstack-ubuntu-testing/mitaka
    sudo add-apt-repository ppa:openstack-ubuntu-testing/ocata
    sudo add-apt-repository ppa:openstack-ubuntu-testing/queens
    sudo add-apt-repository ppa:openstack-ubuntu-testing/rocky
    sudo add-apt-repository ppa:openstack-ubuntu-testing/train

Reporting bugs

If you have any issues please report bugs using the ‘ubuntu-bug’ tool to ensure that bugs get logged in the right place in Launchpad:

    sudo ubuntu-bug nova-conductor

Thanks to everyone who has contributed to OpenStack Train, both upstream and downstream. Special thanks to the Puppet OpenStack modules team and the OpenStack Charms team for their continued early testing of the Ubuntu Cloud Archive, as well as the Ubuntu and Debian OpenStack teams for all of their contributions.

Enjoy and see you in Ussuri!

Corey
(on behalf of the Ubuntu OpenStack team)

by coreycb at October 23, 2019 12:56 PM

October 22, 2019

RDO

Cycle Trailing Projects and RDO’s Latest Release Train

The RDO community is pleased to announce the general availability of the RDO build for OpenStack Train for RPM-based distributions, CentOS Linux and Red Hat Enterprise Linux. RDO is suitable for building private, public, and hybrid clouds. Train is the 20th release from the OpenStack project, which is the work of more than 1115 contributors from around the world.

The release is already available on the CentOS mirror network at http://mirror.centos.org/centos/7/cloud/x86_64/openstack-train/.

BUT!

This is not the official announcement you’re looking for.

We’re doing something a little different this cycle – we’re waiting for some of the “cycle-trailing” projects that we’re particularly keen about, like TripleO and Kolla, to finish their push BEFORE we make the official announcement.

Photo by Denis Chick on Unsplash

Deployment and lifecycle-management tools generally want to follow the release cycle, but because they rely on the other projects being completed, they may not always publish their final release at the same time as those projects. To that effect, they may choose the cycle-trailing release model.

Cycle-trailing projects are given an extra three months after the final release date to request publication of their release. They may otherwise use intermediary releases or development milestones.

While we’re super hopeful that these cycle trailing projects will be uploaded to the CentOS mirror before OpenInfrastructure Summit Shanghai, we’re going to do the official announcement just before the Summit with or without the packages.

We’ve got a lot of people to thank!

Do you like that we’re waiting a bit for our cycle trailing projects or would you prefer the official announcement as soon as the main projects are available? Let us know in the comments and we may adjust the process for future releases!

In the meantime, keep an eye here or on the mailing lists for the official announcement COMING SOON!

by Rain Leander at October 22, 2019 02:34 PM

Sean McGinnis

October 2019 OpenStack Board Notes

Another OpenStack Foundation Board of Directors meeting was held October 22, 2019. This meeting was added primarily to discuss Airship’s request for confirmation as an official project.

The meeting agenda is published on the wiki.

OSF Updates

Jonathan Bryce gave a quick update on the OpenStack Train release that went out last week. The number of contributors, variety of companies, and overall commit numbers were pretty impressive. There were over 25,500 merged commits in Train, with 1,125 unique developers from 165 different organizations. Based on commits over the last cycle, OpenStack is still one of the top three most active open source projects out there, after the Linux kernel and Chromium.

Jonathan also reiterated that the event structure will be different starting in 2020. The first major event planned is in Vancouver, June 8. This will be more of a collaborative event, so expect the format to be different than past Summits. I’m thinking more Project Teams Gathering than Summit.

Airship Confirmation

Matt McEuen, Alex Hughes, Kaspar Skels, and Jay Ahn went through the Airship Confirmation presentation and answered questions about the project and their roadmap. Overall, really pretty impressive what the Airship community has been able to accomplish so far.

The Airship mission statement is:

Openly collaborate across a diverse, global community to provide and integrate a collection of loosely coupled but interoperable, open source tools that declaratively automate cloud lifecycle management.

They started work inside of openstack-helm and kept to the OpenStack community’s Four Opens right from the start.

Project Diversity

The project was started by AT&T, so there is still a lot of work being done (code reviews, commits, etc.) from the one company, but the trend over the last couple of years has been really good, trending towards more and more contributor diversity.

They also have good policies in place to make sure the Technical Committee and Working Committee have no more than two members from the same company. Great to see this policy in place to really encourage more diversity in the spots where overall project decisions are made. Kudos to the AT&T folks for not only getting things started, but driving a lot of change while still actively encouraging others so it is not a one company show. It can be hard for some companies to realize that giving up absolute control is a good thing, especially when it comes to an open source community.

Community Feedback

Part of the confirmation process is to make sure the existing OSF projects are comfortable with the new project. There was feedback from the Zuul project and from the OpenStack TC. Rico Lin went through the TC feedback in the meeting. Only minor questions or concerns were raised there, and Matt was able to respond to most of them in the meeting. He did state he would respond to the mailing list so there was a record there of the responses.

Licensing

Really the only point of concern was raised at the end. One difference between Airship and other OpenStack projects is that it is written in Go. Go has a great system built in to be able to easily use modules written by others. But that led to the question of licensing.

The OSF bylaws state:

The Board of Directors may approve a license for an Open Infrastructure Project other than Apache License 2.0, but such license must be a license approved by the Open Source Initiative at the date of adoption of such license.

The Airship code itself is Apache 2.0. But there isn’t anything done today to vet the licenses of the dependencies that are pulled in to actually compile the project. The concern is that copyleft licenses usually have provisions such that if copyleft code is pulled in and linked with non-copyleft code, the combined work falls under the copyleft requirements. So the only concern was that it just isn’t known what the effective license of the project is today, based on what is being pulled in.

It can be a very tricky area that definitely requires involvement of lawyers that understand copyright law and open source licensing. Luckily it wasn’t a show stopper. We moved to add the project and have them work with OSF legal to better understand the licensing impacts and resolve any concerns by using different dependencies if any are found to be licensed with something that would impose copyleft into Airship. The board unanimously voted in favor of Airship becoming a fully official Open Infrastructure Project.

Next Meeting

The next OSF board meeting will take place November 3rd, in Shanghai, the day before the Open Infrastructure Summit.

by Sean McGinnis at October 22, 2019 12:00 AM

October 21, 2019

OpenStack Superuser

Collaboration across Boundaries: Open Culture Drives Evolving Tech

This past summer marked a pinnacle in OpenStack’s history — the project’s ninth birthday — a project that epitomizes collaboration without boundaries. Communities comprised of diverse individuals and companies united around the globe to celebrate this exciting milestone, from Silicon Valley to the Pacific Northwest to the Far East. Participants from communities that spanned OpenStack, Kubernetes, Kata Containers, Akraino, Project ACRN and Clear Linux — and represented nearly 60 organizations — shared stories about their collective journey, and looked towards the future.

An Amazing Journey

The Shanghai event brought together several organizations, including 99Cloud, China Mobile, Intel, ZStack, East China Normal University, Shanghai Jiaotong University and Tongji University, as well as the OpenStack Foundation.

Individual OpenStack Foundation board director, Shane Wang, talked about OpenStack’s history. What began as an endeavor to bring greater choice in cloud solutions to users, combining Nova for compute from NASA with Swift for object storage from Rackspace, has since grown into a strong foundation for open infrastructure. The project is supported by one of the largest, global open source communities of 105,000 members in 187 countries from 675 organizations, backed by over 100 member companies.

“After nine years of development, OpenStack and current Open Infrastructure have attracted a large number of enterprises, users, developers and enthusiasts to join the community, and together we’ve initiated new projects and technologies to address emerging scenarios and use cases,” said Shane. “Through OpenStack and Open Infrastructure, businesses can realize healthy profits, users can satisfy their needs, innovations can be incubated through a thriving community, and individuals can grow their skills and talents. These are the reasons that the community stays strong and popular.”

Truly representative of cross-project collaboration, this Open Infrastructure umbrella now encompasses components that can be used to address existing and emerging use cases across data center and edge. Today’s applications span enterprise, artificial intelligence, machine learning, 5G networking and more. Adoption ranges from retail, financial services, academia and telecom to manufacturing, public cloud and transportation.

 

 

Junwei Liu, OpenStack board member from China Mobile, winner of the 2016 SuperUser award, joined the birthday celebration. He reflected on OpenStack’s capability to address existing and emerging business needs: “Since 2015, China Mobile, a leading company in the cloud industry, has built a public cloud, private cloud and networking cloud for internal and external customers based on OpenStack. OpenStack has been proven mature enough to meet the needs of core business and has become the de facto standard of IaaS resource management. The orchestration systems integrating Kubernetes into OpenStack as the core will be the most controllable and the most suitable cloud computing platform which meets enterprises’ own business needs.”

Ruoyu Ying, Cloud Software Engineer at Intel, reflected on the various release names, and summits, over the years. There have been several exciting milestones along the way: the inaugural release and summit, both bearing the name, Austin, to commemorate OpenStack’s birthplace in Austin, Texas; the fourth release, Diablo, which established a bi-annual release frequency and expanded the summit outside of Texas to Santa Clara, California; the ninth release, Icehouse, which heralded a move of the summit outside North America to Hong Kong and invited more developers from Asia to contribute; the eleventh release, Kilo, which expanded the summit into Europe, specifically Paris, France; the 17th release, Queens, that saw the summit move into the southern hemisphere in Sydney, Australia; and ultimately, the 20th release, Train, with the vital change in summit name to OpenInfra to accurately reflect the evolution in the project and community.


In November, the summit will be held on mainland China for the first time, and the team there is looking forward to welcoming the global community with open arms!

Collaboration across Boundaries

Meetups across Silicon Valley and the Pacific Northwest, which were sponsored by Intel, Portworx, Rancher Labs and Red Hat, personified collaboration across projects and communities. Individuals from the OpenStack, Kubernetes, Kata Containers, Akraino, Clear Linux and Project ACRN communities — representing over 50 organizations — came together to celebrate this special milestone with commemorative birthday cupcakes and a strong lineup of presentations focused on emerging technologies and use cases.

Containers and container orchestration technologies were highlights, as Jonathan Gershater, Senior Principal Product Marketing Manager at Red Hat, talked about how to deploy, orchestrate and manage enterprise Kubernetes on OpenStack, while Gunjan Patel, Cloud Architect at Palo Alto Networks, talked about the full lifecycle of a Kubernetes pod. Rajashree Mandaogane, Software Engineer at Rancher Labs, and Oksana Chuiko, Software Engineer at Portworx, delivered lightning talks focused on Kubernetes. Eric Ernst, Kata Containers’ Technical Lead and Architecture Committee Member, and Senior Software Engineer at Intel, talked about running container solutions with the extra isolation provided by Kata Containers, while Manohar Castelino and Ganesh Mahalingam, Software Engineers at Intel, gave demos of many of Kata Containers’ newest features.

Edge computing and IoT were also hot topics. Zhaorong Hou, Virtualization Software Manager at Intel, talked about how Project ACRN addresses the need for lightweight hypervisors in booming IoT development, while Srinivasa Addepalli, Senior Principal Software Architect at Intel, dove into one of the blueprints set forth by the Akraino project—the Integrated Cloud Native Stack—and how it addresses edge deployments for both network functions and application containers.

Beatriz Palmeiro, Community and Developer Advocate at Intel, engaged attendees in a discussion about how to collaborate and contribute to the Clear Linux project, while Kateryna Ivashchenko, Marketing Specialist at Portworx, provided us all with an important reminder about how not to burn out in tech.

Open Culture Drives Evolving Tech

There is incredible strength in the OpenStack community. As noted at the Shanghai event, OpenStack powers open infrastructure across data centers and edge, enabling private and hybrid cloud models to flourish. This strength is due, in part, to the amazing diversity within the OpenStack community.

Throughout its history, OpenStack has been committed to creating an open culture that invites diverse contributions. This truth is evident in many forms: diversity research, representation of women on the keynote stage and as speakers across the summits, speed mentoring workshops, diversity luncheons and more. The breadth of allies and advocates for underrepresented minorities abounds in our community, from Joseph Sandoval, who keynoted at the Berlin summit to talk about the importance of projects like OpenStack in enabling diversity, to Tim Berners-Lee, who participated in the speed mentoring workshop in Berlin, to Lisa-Marie Namphy, who organized and hosted the event in the Silicon Valley and made sure that over 50% of her presenters were women, among many others.

“OpenStack is a strategic platform that I believe will enable diversity.” — Joseph Sandoval, OpenStack User Committee Member and SRE Manager, Infrastructure Platform, Adobe

As OpenStack evolves as the foundation for the open infrastructure, and new projects and technologies emerge to tackle the challenges of IoT, edge and other exciting use cases, diversity — in gender, race, perspective, experience, expertise, skill set, and more — becomes increasingly important to the health of our communities. From developers and coders to community and program managers, ambassadors, event and meetup organizers, and more, it truly takes a village to sustain a community and ensure the health of a project!

Early OpenStack contributor, community architect, and OpenStack Ambassador Lisa-Marie Namphy reflected on OpenStack’s evolution and what she’s most excited about looking forward. As organizer of the original San Francisco Bay Area User Group, which has now expanded beyond just OpenStack to reflect the broader ecosystem of Cloud Native Containers, she has established one of the largest OpenStack & Cloud Native user groups in the world. “Our user group has always committed to showcasing the latest trends in cloud native computing, whether that was OpenStack, microservices, serverless, open networking, or our most exciting recent trend: containers! In response to our passionate and vocal community members, we’ve added more programming around Kubernetes, Istio, Kata Containers and other projects representing the diversity of the open infrastructure ecosystem. It’s as exciting as ever to be a part of this growing open cloud community!” Lisa now works as Director of Marketing at Portworx, contributing to the OpenStack, Kubernetes, and Istio communities.

Looking Forward

As we blow out the birthday candles, we’d like to thank the organizers, sponsors, contributors and participants of these meetups — with a special thank you to Kari Fredheim, Liz Aberg, Liz Warner, Sujata Tibrewala, Lisa-Marie Namphy, Maggie Liang, Shane Wang and Ruoyu Ying.

As we look forward, the OpenStack Foundation has just revealed the name of the project’s next release — Ussuri, a river in China — commemorating the Summit’s next location in Shanghai. “The river teems with different kinds of fish: grayling, sturgeon, humpback salmon (gorbusha), chum salmon (keta), and others.”1 A fitting name to embody diverse projects, communities and technologies working in unison to further innovation!

***

1 Source: https://en.wikipedia.org/wiki/Ussuri_River


The post Collaboration across Boundaries: Open Culture Drives Evolving Tech appeared first on Superuser.

by Nicole Huesman at October 21, 2019 02:00 PM

RDO

Community Blog Round Up 21 October 2019

Just in time for Halloween, Andrew Beekhof has a ghost story about the texture of hounds.

But first!

Where have all the blog round ups gone?!?

Well, there’s the rub, right?

We don’t usually post when there’s only one post (or none) from our community to round up, but this has been the only post for WEEKS now, so here it is.

Thanks, Andrew!

But that brings us to another point.

We want to hear from YOU!

RDO has a database of bloggers who write about OpenStack / RDO / TripleO / Packstack things and while we’re encouraging those people to write, we’re also wondering if we’re missing some people. Do you know of a writer who is not included in our database? Let us know in the comments below.

Photo by Jessica Furtney on Unsplash

Savaged by Softdog, a Cautionary Tale by Andrew Beekhof

Hardware is imperfect, and software contains bugs. When node level failures occur, the work required from the cluster does not decrease – affected workloads need to be restarted, putting additional stress on surviving peers and making it important to recover the lost capacity.

Read more at http://blog.clusterlabs.org/blog/2019/savaged-by-softdog

by Rain Leander at October 21, 2019 09:17 AM

October 18, 2019

OpenStack Superuser

OpenStack Ops Meetup Features Ceph, OpenStack Architectures and Operator Pain Points

Bloomberg recently hosted an OpenStack Ops Meetup in one of its New York engineering offices on September 3 and 4. The event was well attended with between 40 and 50 attendees, primarily from North America, with a few people even traveling from Japan!

The OpenStack Ops Meetups team was represented by Chris Morgan (Bloomberg), Erik McCormick (Cirrus Seven) and Shintaro Mizuno (NTT). In addition to this core group, other volunteer moderators who led sessions included Matthew Leonard (Bloomberg), Martin Gehrke (Two Sigma), David Medberry (Red Hat), Elaine Wong-Perry (Verizon), Assaf Muller (Red Hat), David Desrosiers (Canonical), and Conrad Bennett (Verizon), with many others contributing. The official meetups team is rather small, so volunteer moderators make such events come alive and we couldn’t make them happen without all of you; thanks to everyone who helped.

An interesting topic that Bloomberg brought up at this meetup was the concept of expanding the Ceph content. Ceph is a very popular storage choice in production-quality OpenStack deployments, which is shown by the OpenStack user survey and by the fact that Ceph sessions at previous meetups have always been very popular. Bloomberg’s Matthew Leonard suggested to those attending the first Ceph session that we build upon this with more Ceph sessions, and perhaps even launch a separate Ceph operators meetup series in the future. Some of this discussion was captured here. Matthew also led a group discussion around a deeper technical dive into challenging use cases for Ceph, such as gigantic (multi-petabyte) object stores using Ceph’s RadosGW interface. It’s a relief that we are not the only ones hitting certain technical issues at this scale.

Response from the Ceph users at the meetup was positive and we will seek to expand Ceph content at the next event.

Other evergreen topics for OpenStack operators include deployment/upgrades, upgrades/long-term support, monitoring, testing and billing. These all saw some spirited debate and exchanging of experience. The meetups team also shared some things that the ops community can point to as positive changes we have achieved, such as the policy changes allowing longer retention of older OpenStack documentation and maintenance branches.

To make the event a bit more fun, the meetups team always includes lightning talks at the end of each day. Day 1 saw an “arch show and tell” where those who were willing grabbed a microphone and talked about the specific architecture of their cloud. The variety of OpenStack architectures, use cases, and market segments is astonishing.

On day 2, many of the most noteworthy sessions were again moderated by volunteers. Assaf Muller from Red Hat led an OpenStack networking state of the union discussion, with a certain amount of RDO focus, although not exclusively. Later on, Martin Gehrke from Two Sigma ran a double session: one on choosing appropriate workloads for your cloud, and one on reducing OpenStack toil.

As a slight change of pace, David Desrosiers demonstrated a lightning fast developer build of OpenStack using Canonical’s nifty “microstack” snap install of an all-in-one OpenStack instance, although our guest wifi picked this exact moment to pitch a fit – sorry David!

The final technical session of the event was another lightning talk, this time asking the guests to recount their best “ops war stories”. The organizers strongly encouraged everyone to participate, and later on revealed why – we arranged for a lighthearted scoring system and eventually awarded a winner (chosen by the attendees). There were even some nominal prizes! David Medberry moderated this session and it was a fun way to finish off the event.

The overall winner was Julia Kreger from Red Hat, who shared with us a story about “it must be a volume knob?” – it seems letting visitors near the power controls in the data center isn’t a great idea? Well, let’s just say it’s probably best if you try and hear Julia tell it in person!

The above gives just a brief flavor of the event; apologies to the sessions and moderators I didn’t mention. The next OpenStack Ops Meetup is expected to be somewhere in Europe in the first quarter of 2020.

Cover Photo courtesy of David Medberry

The post OpenStack Ops Meetup Features Ceph, OpenStack Architectures and Operator Pain Points appeared first on Superuser.

by Chris Morgan at October 18, 2019 02:00 PM

October 17, 2019

Mirantis

Tips for taking the new OpenStack COA (Certified OpenStack Administrator) exam – October 2019

Mirantis will be providing resources to the OpenStack Foundation, including becoming the new administrators of the upgraded Certified OpenStack Administrator (COA) exam.

by Nick Chase at October 17, 2019 01:36 AM

Sean McGinnis

September 2019 OpenStack Board Notes

There was another OpenStack Foundation Board of Directors conference call on September 10, 2019. There were a couple of significant updates during this call. Well, at least one significant for the community, and one that was significant to me (more details below).

In case this is your first time reading my BoD updates, just a reminder that upcoming and past OSF board meeting information is published on the wiki and the meetings are open to everyone. Occasionally there is a need to have a private, board member only portion of the call to go over any legal affairs that can’t be discussed publicly, but that should be a rare occasion.

September 10, 2019 OpenStack Foundation Board Meeting

The original agenda can be found here. Usually there are official and unofficial notes sent out, but at least at this time, it doesn’t appear Jonathan has been able to get to that. Watch for that to show up on the wiki page referenced in the previous section.

Director Changes

There were a couple of changes in the assigned Platinum Director seats. Platinum-level sponsors hold the only seats on the board that are guaranteed to the sponsor, which allows them to assign a Director. So no change in sponsorships at this point, just a couple of internal personnel changes that led to these updates.

With all the churn and resulting separation of Futurewei in the US from the rest of Huawei, their chair seat was moved over to Fred Li. I worked with Fred quite a bit during my time with the company. He’s a great guy and has put in a lot of work, mostly behind the scenes, to support OpenStack. Really happy to be able to work with him again. Anni has also done a lot over the years, so sad to see her go. I’m sure she will be quite busy on new things though.

On the Red Hat side, Mark McLoughlin has transitioned out, handing things over to Daniel Becker. It sounds like with the internal structure at Red Hat, Daniel is now the better representative for the OpenStack Foundation. I personally didn’t get a lot of opportunity to work with Mark, but I know he has been around for a long time and has done a lot of great things, so I’m a little sad to see him go. But also looking forward to working with Daniel.

Director Diversity Waiver

This was the significant topic to me, because, well… it was about me.

In June I switched employers, going back to Dell EMC. So far, I’ve been very happy, and it feels like I’ve gone back home with the 14+ years between Compellent and Dell that I had prior to joining Huawei. Not that my time with Huawei wasn’t great. I think I learned a lot and had some opportunities to do things that I hadn’t done before, so no regrets.

But the catch with my going back to Dell was that they already have a Gold sponsor seat with Arkady Kanevsky and a community spot with Prakash Ramchandran.

The OpenStack Foundation Bylaws have a section (4.17) on Director Diversity. This clause limits the number of directors that can be affiliated with the same corporate entity to two. So even though Prakash and I are Individual Members (which means we are there as representatives of the community, not as representatives of our company), my move to Dell now violated that clause.

I think this was added to the bylaws back in the days when there were a few large corporate sponsors that had large teams of people dedicated to working on OpenStack. It was a safeguard to ensure no one company could overrun the Foundation based solely on their sheer number of people involved. That’s not quite as big of an issue today, but I do still think it makes sense. It is a very good thing to make sure any group like this has a diversity of people and viewpoints.

The bylaws actually explicitly state what should happen in my situation too - Article 4.17(d) states:

If a director who is an individual becomes Affiliated during his or her term and such Affiliation violates the Director Diversity Requirement, such individual shall resign as a director.

As such, I really should have stepped down on moving to Dell.

But luckily for me, there is also a provision called out in 4.17(e):

A violation of the Director Diversity Requirement may be waived by a vote of two thirds of the Board of Directors (not including the directors who are Affiliated)

This meant that 2/3 of the Board members present, not including any of us from Dell, would have to vote in favor of allowing me to continue out my term. If less than that were in favor, then I would need to step down. And presumably there would just be an open board seat for the rest of the term until elections are held again.

There was brief discussion, but I was very happy that everyone present did vote in favor of allowing me to continue out my term. I kind of feel like I should have stepped out during this portion of the call to make sure no one felt pressure from not wanting to say no in my presence, but hopefully that wasn’t the case for anyone. It was really nice to get these votes, and some really good back-channel support from non-board attendees listening in on the call.

What can I say - compliments and positive reinforcement go far with me. :)

So I’m happy to say I will at least be able to finish out my term for the rest of 2019. I will have to see about 2020. I don’t believe Arkady or Prakash is planning on going anywhere, so we may need to have some internal discussions about the next election. Or, probably better, leave it up to the community to decide who they would like representing them for the Individual Director seats. Prakash has been doing a lot of great work for the India community, so if it came down to it and I lost to him, I would be just fine with that.

OW2 Associate Membership

Thierry then presented a proposal to join OW2 as an Associate Member. OW2 is “an independent, global, open-source software community”. So what does that mean? Basically, like the Open Source Initiative and others, they are a group of like-minded individuals, companies, and foundations that work together to support and further open source.

We (OpenStack) have actually worked with them for some time, but we had never officially joined as an Associate Member. There is no fee to join at this level, and it is really just formalizing that we are supportive of OW2’s efforts and willing to work with them and the members to help support their goals.

They have been in support of OpenStack and open infrastructure for years, so it was great to approve this effort. We are now listed as one of their Associate Members.

Interop WG Guidelines 2019.06

Egle moved to have the board approve the 2019.06 guidelines. We had held an email vote for this approval, but since we did not get responses from every Director, we performed an in-meeting vote to record the result. All present were in favor.

The interop guidelines are a way to make sure all OpenStack deployments conform to a base set of requirements. This makes sure that an end user of an OpenStack cloud has at least some level of assurance that they can move from one cloud to another without getting a wildly different user experience. The work of the Interop Working Group has been very important to ensuring this stability and helping the ecosystem around OpenStack grow.

Miscellaneous

Prakash gave a quick update on the meetups and mini-Summits being organized in India. Sounds like a lot of really good activity happening in various regions. It’s great to see this community being supported and growing.

Alan also made a call for volunteers for the Finance and Membership committees. I had tried to get involved earlier in the year, but I think due to timing there really just wasn’t much going on at the time. With the next election coming up, and some changes in sponsors, now is actually a good time for the Membership Committee to have some more attention. I’ve joined Rob Esker to help review any new Platinum and Gold memberships. Sounds like we will have at least one new one of those coming up soon.

Summit Events

It wasn’t really an agenda topic for this Board Meeting, but I do think it’s worth pointing out here that the proposed changes to the structure of our yearly events have gone through, and 2020 will start to diverge from the typical pattern we have had so far of holding two major Summits per year.

Erin Disney sent out a post about these changes to the mailing list. We will have a smaller event focused on collaboration in the spring, then a larger Summit (or Summit-like) event later in the year.

With the maturity of OpenStack and where we are today, I really think this makes a lot more sense. There simply isn’t enough big new functionality and news coming out of the community today to justify two large marketing-focused events like the Summit per year. What we really need now is to foster the environment to make sure the developers, operators, and others that are working on implementing new functionality and fixing bugs have the time and venue they need to work together and get things done. Having these smaller events and supporting more things like the regional Open Infrastructure Days will hopefully help keep that collaboration going and allow us to focus on the things that we need to do.

And the next event will be in beautiful Vancouver again, so that’s a plus!

by Sean McGinnis at October 17, 2019 12:00 AM

October 16, 2019

OpenStack Superuser

Zuul Community Answers Questions around Open Source CI at AnsibleFest

The Zuul team recently attended AnsibleFest in Atlanta in September. We had the opportunity to meet loads of people who were excited to learn more about Zuul. With that in mind, we compiled some of the most common questions we received, to help educate the public on Zuul, open source CI, and project gating.

If you’re interested in learning more about Zuul, check out this presentation that Ansible gave about how they put Zuul open source CI to use.

Now, let’s look at the questions we heard at the Zuul booth and throughout AnsibleFest.

How does Zuul compare to…

Jenkins

Zuul is purpose-built to be a gating continuous integration and deployment system. Jenkins is a generic automation tool that can be used to perform continuous integration (CI) and continuous delivery (CD). Major differences that come out of this include (see the sketch after this list):

  • Zuul expects all configuration to be managed as code
  • Zuul provides test environment provisioning via Nodepool
  • Zuul includes out of the box support for gating commits
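
To make the configuration-as-code and gating points concrete, here is a minimal, hypothetical .zuul.yaml sketch. It is not taken from any real repository: the job name, playbook path and node label are placeholders, and it assumes a base job and check/gate pipelines already defined by the Zuul operator (as in the OpenStack tenant configuration).

- job:
    name: unit-tests
    parent: base
    description: Run the project's unit test suite.
    run: playbooks/unit-tests.yaml      # job logic lives in the repo as an Ansible playbook
    nodeset:
      nodes:
        - name: test-node
          label: ubuntu-bionic          # test node provisioned on demand by Nodepool

- project:
    check:                              # runs against every proposed change
      jobs:
        - unit-tests
    gate:                               # must pass again, in merge order, before the change merges
      jobs:
        - unit-tests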

Molecule

Molecule is a test runner for Ansible. It is used to test your Ansible playbooks. Zuul is an entire CI system that leverages Ansible to run its workloads. One of the workloads that Zuul can run for you is Molecule tests.
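
As an illustration only (the playbook path is hypothetical and it assumes Molecule is already installed on the test node), the Ansible playbook that such a Zuul job runs could simply invoke Molecule inside the repository checkout that Zuul prepares:

# playbooks/run-molecule.yaml -- hypothetical run playbook for a Zuul job
- hosts: all
  tasks:
    - name: Run the Molecule test sequence for this role
      command: molecule test
      args:
        chdir: "{{ zuul.project.src_dir }}"   # checkout of the change under test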

Tower

Zuul is designed to trigger continuous integration and deployment jobs based on events in your code review system. This means Zuul is coupled to actions like new commits showing up, reviewer approval, commits merging, git tags, and so on. Historically, Tower has primarily been used to trigger Ansible playbooks via human inputs. Recently, Tower has gained HTTP API support for GitHub and GitLab webhooks. While these new features overlap with Zuul, you will get a better CI experience through Zuul, as it has more robust support for source control events and supports gating out of the box.

Is Zuul a replacement for Jenkins?

For some users, Zuul has been a replacement for Jenkins. Others use Zuul and Jenkins together. Zuul intends to be a fully featured CI and CD system that does not need Jenkins to operate. Additionally, you do not need Jenkins to get any features of Zuul itself.

Does it work with GitLab or Bitbucket?

Since we received so many requests for Gitlab support at this year’s Ansiblefest, we put a call for help out on Twitter. We are happy to say that a Zuul driver to support GitLab as a code review input is in the early stages of development. Earlier interest in Bitbucket support, including at Ansiblefest 2018, has already led to a proof-of-concept driver some folks are using, which we hope to have in a release very soon.

Can I self host my Zuul?

Absolutely. Zuul is open source software which you are free to deploy yourself.

Can I run Zuul air gapped?

To function properly, Zuul needs to talk to your code review system and to some sort of test compute resources. As long as Zuul can talk to those (perhaps they are similarly air gapped), running Zuul without external connectivity should be fine.

Is there a hosted version of Zuul I can pay for?

Yes, Vexxhost has recently announced their hosted Zuul product.

Can I pay someone for Zuul support?

You can use Vexxhost’s managed Zuul service. The Zuul community is also helpful and responsive and can be reached via IRC (#zuul on Freenode) or their mailing list (zuul-discuss@lists.zuul-ci.org).

What’s the catch? What’s your business model? How do you expect to make money at this? Is this project venture-capital backed? Are you planning an IPO any time soon?

Zuul is a community-developed free/libre open-source software collaboration between contributors from a variety of organizations and backgrounds. Contributors have a personal and professional interest in seeing Zuul succeed because they, their colleagues, and their employers want to use it themselves to improve their own workflows.

There is no single company backing the project; it’s openly governed by a diverse group of maintainers, and anyone who’s interested in helping improve Zuul is welcome to join the effort. Some companies do have business models which include running Zuul as a service or selling technical support for it, and so have an incentive to assist with writing and promoting the software, but they don’t enjoy any particular position of privilege or exert special decision-making power within the project.


If you’re interested in learning more about Zuul, check out our FAQ, read through our documentation, or test it yourself on a local install.

The post Zuul Community Answers Questions around Open Source CI at AnsibleFest appeared first on Superuser.

by Clark Boylan, Jeremy Stanley, Paul Belanger and Jimmy McArthur at October 16, 2019 03:25 PM

October 15, 2019

Galera Cluster by Codership

Planning for Disaster Recovery (DR) with Galera Cluster (EMEA and USA webinar)

We talk a lot about Galera Cluster being great for High Availability, but what about Disaster Recovery (DR)? Database outages can occur when you lose a data centre to a power outage or a natural disaster, so why not plan appropriately in advance?

In this webinar, we will discuss the business considerations, including achieving the highest possible uptime and analysing business impact and risk, focus on disaster recovery itself, and discuss various scenarios, from having no offsite data to having synchronous replication to another data centre.

This webinar will cover MySQL with Galera Cluster, as well as the MariaDB Galera Cluster and Percona XtraDB Cluster (PXC) branches. We will focus on architecture solutions and DR scenarios, and have you on your way to success at the end of it.

EMEA webinar: 29th of October, 1-2 PM CEST (Central European Time) – JOIN THE EMEA WEBINAR

USA webinar: 29th of October, 9-10 AM PDT (Pacific Daylight Time) – JOIN THE USA WEBINAR

Presenter: Colin Charles, Codership

by Sakari Keskitalo at October 15, 2019 10:10 AM

StackHPC Team Blog

Kubeflow on Baremetal OpenStack

Kubeflow logo

DISCLAIMER: No GANs were harmed in the writing of the blog.

Kubeflow is a machine learning toolkit for Kubernetes. It aims to bring popular tools and libraries under a single umbrella to allow users to:

  • Spawn Jupyter notebooks with persistent volume for exploratory work.
  • Build, deploy and manage machine learning pipelines, with initial support for the TensorFlow ecosystem that has since expanded to include other libraries that have recently gained popularity in the research community, like PyTorch.
  • Tune hyperparameters, serve models, etc.

In our ongoing effort to demonstrate that OpenStack managed baremetal infrastructure is a suitable platform for performing cutting-edge science, we set out to deploy this popular machine learning framework on top of an underlying Kubernetes container orchestration layer deployed via OpenStack Magnum. The control plane for the baremetal OpenStack cloud consists of Kolla containers deployed using Kayobe, which provides containerised OpenStack to baremetal and is how we manage the vast majority of our deployments to customer sites. The justification for running baremetal instances is to minimise the performance overhead of virtualisation.

Apparatus

  • Baremetal OpenStack cloud (minimum Rocky) except for OpenStack Magnum (which must be at least Stein 8.1.0 for various reasons detailed later, but critically in order to support Fedora Atomic 29, which addresses a CVE present in earlier Docker versions).
  • A few spare baremetal instances (minimum 2 for 1 master and 1 worker).

Deployment Steps

  • Provision a Kubernetes cluster using OpenStack Magnum. For this step, we recommend using Terraform or Ansible. Since Ansible 2.8, the os_coe_cluster_template and os_coe_cluster modules are available to support Magnum cluster template and cluster creation (a simplified Ansible sketch of these modules is shown after this list). However, in our case, we opted for Terraform, which has a nicer user experience because it understands the interdependency between the cluster template and the cluster and therefore automatically determines the order in which they need to be created and updated. To be exact, we create our cluster using a Terraform template defined in this repo, where the README.md details how to set up Terraform, upload the image and bootstrap Ansible in order to deploy Kubeflow. The key labels we pass to the cluster template are as follows:
cgroup_driver="cgroupfs"
ingress_controller="traefik"
tiller_enabled="true"
tiller_tag="v2.14.3"
monitoring_enabled="true"
kube_tag="v1.14.6"
cloud_provider_tag="v1.14.0"
heat_container_agent_tag="train-dev"
  • Run ./terraform init && ./terraform apply to create the cluster.
  • Once the cluster is ready, source magnum-tiller.sh to use tiller enabled by Magnum and run our Ansible playbook to deploy Kubeflow along with ingress to all the services (edit variables/example.yml to suit your OpenStack environment):
ansible-playbook k8s.yml -e @variables/example.yml
  • At this point, we should see a list of ingresses which use *-minion-0 as the ingress node by default when we run kubectl get ingress -A. We are using a nip.io based wildcard DNS service so that traffic originating from different subdomains maps to the various services we have deployed. For example, the Kubeflow dashboard is deployed as ambassador-ingress and the Tensorboard dashboard is deployed as tensorboard-ingress. Similarly, the Grafana dashboard deployed by setting the monitoring_enabled=True label is exposed as monitoring-ingress. The mnist-ingress ingress is currently functioning as a placeholder for the next part, where we train and serve a model using the Kubeflow ML pipeline (see the commands below).
$ kubectl get ingress -A
NAMESPACE    NAME                  HOSTS                           ADDRESS   PORTS   AGE
kubeflow     ambassador-ingress    kubeflow.10.145.0.8.nip.io                80      35h
kubeflow     mnist-ingress         mnist.10.145.0.8.nip.io                   80      35h
kubeflow     tensorboard-ingress   tensorboard.10.145.0.8.nip.io             80      35h
monitoring   monitoring-ingress    grafana.10.145.0.8.nip.io                 80      35h
  • To train and serve the MNIST model referenced above, clone the examples repository and deploy the kustomizations:
git clone https://github.com/stackhpc/kubeflow-examples examples -b dell
cd examples/mnist && bash deploy-kustomizations.sh
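
For readers who prefer the Ansible route mentioned in the first step, the following is a simplified, illustrative sketch of the os_coe_cluster_template and os_coe_cluster modules using the labels above; the cloud name, image, keypair and node counts are placeholders rather than the values we actually used.

# Hypothetical Ansible (>= 2.8) equivalent of the Terraform template above.
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Create a Magnum cluster template for Kubernetes
      os_coe_cluster_template:
        cloud: mycloud                        # clouds.yaml entry (placeholder)
        name: k8s-kubeflow-template
        coe: kubernetes
        image_id: fedora-atomic-29            # placeholder image name
        external_network_id: public
        network_driver: calico
        labels:
          cgroup_driver: cgroupfs
          ingress_controller: traefik
          tiller_enabled: "true"
          tiller_tag: v2.14.3
          monitoring_enabled: "true"
          kube_tag: v1.14.6
          cloud_provider_tag: v1.14.0
          heat_container_agent_tag: train-dev

    - name: Create the cluster from the template
      os_coe_cluster:
        cloud: mycloud
        name: kubeflow
        cluster_template_id: k8s-kubeflow-template   # placeholder; the template created above
        keypair: mykey                               # placeholder keypair
        master_count: 1
        node_count: 2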

Notes on Monitoring

Kubeflow comes with a Tensorboard service which allows users to visualise machine learning model training logs, model architecture and also the efficacy of the model itself, by reducing the dimensionality of the latent space of the weights in the final layer before the model makes a classification.

The extensibility of the OpenStack Monasca service also lends itself well to integration into machine learning model training loops, provided that the agent is configured to accept non-local traffic on the workers. This can be done by setting the following values inside /etc/monasca/agent/agent.yaml and restarting the monasca-agent.target service:

monasca_statsd_port: 8125
non_local_traffic: true

On the client side, where the machine learning model example is running, metrics of interest can now be posted to the monasca agent. For example, we can provide a callback function to FastAI, a machine learning wrapper library which uses PyTorch primitives underneath with an emphasis on transfer learning (and can be launched as a GPU-flavored notebook container on Kubeflow) for tasks such as image and natural language processing. The training loop of the library hooks into the callback functions encapsulated within the PostMetrics class defined below at the end of every batch and at the end of every epoch of the model training process:

# Import the modules. fastai.vision provides cnn_learner, models and error_rate;
# `data` (an ImageDataBunch) is assumed to be defined earlier in the notebook.
from fastai.vision import *
from fastai.callbacks.loss_metrics import *
import monascastatsd as mstatsd

conn = mstatsd.Connection(host='openhpc-login-0', port=8125)

# Create the client with optional dimensions
client = mstatsd.Client(connection=conn, dimensions={'env': 'fastai'})

# Create a gauge called fastai
gauge = client.get_gauge('fastai', dimensions={'env': 'fastai'})

class PostMetrics(LearnerCallback):

    def __init__(self):
        self.stop = False

    def on_batch_end(self, last_loss, **kwargs:Any)->None:
        if self.stop: return True #to skip validation after stopping during training
        # Send the training loss gauge after every batch (sample_rate=1.0 sends every sample).
        gauge.send('trn_loss', float(last_loss), sample_rate=1.0)

    def on_epoch_end(self, last_loss, epoch, smooth_loss, last_metrics, **kwargs:Any):
        val_loss, error_rate = last_metrics
        gauge.send('val_loss', float(val_loss), sample_rate=1.0)
        gauge.send('error_rate', float(error_rate), sample_rate=1.0)
        gauge.send('smooth_loss', float(smooth_loss), sample_rate=1.0)
        gauge.send('trn_loss', float(last_loss), sample_rate=1.0)
        gauge.send('epoch', int(epoch), sample_rate=1.0)

# Pass the PostMetrics() callback to cnn_learner's training loop
learn = cnn_learner(data, models.resnet34, metrics=error_rate, bn_final=True, callbacks=[PostMetrics()])

These metrics are sent to the OpenStack Monasca API and can then be visualised on a Grafana dashboard against GPU power consumption, allowing a user to determine the tradeoff between power draw and model accuracy, as shown in the following figure:

(Figure: Grafana dashboard plotting the training metrics against GPU power consumption.)

In addition, general resource usage monitoring may also be of interest. There are two Prometheus based monitoring options available on Magnum:

  • First, the non-helm based method uses the prometheus_monitoring label which, when set to True, deploys a monitoring stack consisting of a Prometheus service, a Grafana service and a DaemonSet (Kubernetes terminology which translates to a service per node in the cluster) of node exporters. However, the deployed Grafana service does not provide any useful dashboards that act as an interface to the collected metrics, due to a change in how default dashboards are loaded in recent versions of Grafana. A dashboard can be installed manually, but it does not allow the user to drill down into the visible metrics further and presents the information in a flat way.
  • Second, the helm based method (recommended) requires the monitoring_enabled and tiller_enabled labels to be set to True. It deploys a similar monitoring stack as above but, because it is helm based, it is also upgradable. In this case, the Grafana service comes preloaded with several dashboards that present the metrics collected by the node exporters in a meaningful way, allowing users to drill down to various levels of detail and types of groupings, e.g. by cluster, namespace, pod, node, etc.

Of course, it is also possible to deploy a Prometheus based monitoring stack without having it managed by Magnum. Additionally, we have demonstrated that it is also an option to deploy the Monasca agent inside a container to post metrics to the Monasca API, which may be available if it is configured as the way to monitor the control plane metrics.

Why we recommend upgrading Magnum to Stein (8.1.0 release)

  • OpenStack Magnum (Rocky) supports up to Fedora Atomic 27 which is EOL. Support for Fedora Atomic 29 (with the fixes for the CVE mentioned earlier) requires a backport of various fixes from the master branch that reinstate support for the two network plugin types supported by Magnum (namely Calico and Flannel).
  • Additionally, there have been changes to the Kubernetes API which are outside of the Magnum project's control. Rocky only supports versions of Kubernetes up to v1.13.x, and the Kubernetes project maintainers only actively maintain a development branch and 3 stable releases. The current development release is v1.17.x, which means v1.16.x, v1.15.x and v1.14.x can expect updates and backports of critical fixes. Support for v1.15.x and v1.16.x is coming in the Train release, but upgrading to Stein will enable us to support up to v1.14.x.
  • The traefik ingress controller deployed by Magnum no longer works in the Rocky release because the former behaviour was to always deploy the latest tag, and a new major version (2.0.0) has been released with breaking changes to the API, which inevitably fails. Stein 8.1.0 has the necessary fixes and additionally supports the more popular nginx based ingress controller.

Get in touch

If you would like to get in touch we would love to hear from you. Reach out to us via Twitter or directly via our contact page.

by Bharat Kunwar at October 15, 2019 10:00 AM

October 10, 2019

Aptira

10% off + FREE Consulting – FINAL DAYS!

Aptira 10 year birthday 10% off sale

Final days to claim our Birthday Special!

In case you missed it, on the 9th of the 9th 2019, we turned 10! So until the 10th of the 10th, we’re offering 10% off all our services. That’s 10% off managed services, 10% off training, 10% off everything except hardware. This 10% discount also applies to pre-paid services, so you can pre-pay for the next 12 months to really maximise your savings!

We’re also offering a free 2 hour consulting session to help get you started with transforming your Cloud solution.

This offer is ending soon, so chat with a Solutionaut today to take advantage of this once in a decade discount and let us turn your business capabilities into a competitive advantage.

Let us make your job easier.
Find out how Aptira's managed services can work for you.

Find Out Here

The post 10% off + FREE Consulting – FINAL DAYS! appeared first on Aptira.

by Jessica Field at October 10, 2019 12:59 PM

OpenStack Superuser

From Containers to Edge Computing, Research Organizations Rely on Open Infrastructure

A lot separates Milan and Rome—including three hours by train—but one thing that connects these two cities is the open infrastructure community. 

The Italian community organizers—Binario Etico and Irideos—made two big changes to the local event this year. First, they renamed the OpenStack Day to OpenInfra Days to broaden the scope of the content at the event. They also planned two events this year in order to put the latest trends and user stories in front of as many local community members as possible. The events would not have been possible without the support of the event sponsors: D2iQ, GCI, Linux Professional Institute, OpenStack Foundation, and Mellanox.

A combined crowd of over 300 attendees gathered in Milan and Rome last week at the OpenInfra Days Italy to hear how organizations are building and operating open infrastructure. 

Mariano Cunietti and Davide Lamanna kicked off both events explaining how important it is for European organizations to embrace open source components and cross community collaboration.

“It’s the way we collaborate and the way we shape communication flow that works,” Cunietti said. “Collaborative open source is a way to shift technicians from being consumers to participants and citizens of the community. This is a very important shift.” 

From a regional perspective, Lamanna explained how European standards and privacy laws create requirements that have given local, open source organizations a competitive advantage around interoperability and flexibility features.

To exemplify the power of open infrastructure and community collaboration in Europe, several users shared their production stories. An industry that is very pervasive in Europe—particularly Italy—is research. 

  • GARR: Saying that no infrastructure is open until you open it, GARR harmonizes and implements infrastructure for the benefit of the scientific community in Italy—amounting to around 4 million users.  Alex Barchiesi shared some stats around GARR’s current OpenStack deployment—8,500 cores with 10 PB of storage in five data centers across three regions—as well as their approach to identity federation. GARR’s concept of federation: the simpler, the better; the fewer requirements, the more inclusive. With their multi-region, multi-domain model, Barchiesi explained how they have architected a shared identity service. To give back to the community, the GARR team contributes upstream to OpenStack Horizon, k8s-keystone auth, and juju charms. 
  • The Istituto Nazionale di Fisica Nucleare (INFN)—an Italian public research institute for high energy physics (who also collaborates with CERN!)—has a private cloud infrastructure that is OpenStack-based and geographically distributed in three major INFN data centers in Italy. The adoption of Ceph as distributed object storage solution enables INFN to provide both local block storage in each of the interested sites and a ready-to-use disaster recovery solution implemented among the same sites. Collectively, the main data centers have around 50,000 CPU cores, 50 PB of enterprise-level disk space, and 60 PB of tape storage.  
  • While CERN is not based in Italy, their OpenStack and Kubernetes use case provides learnings around the world. Jan van Eldik shared updated stats around CERN’s open infrastructure environment with focuses on OpenStack Magnum, Ironic and Kubernetes. CERN by the numbers: more than 300,000 OpenStack cores, 500 Kubernetes clusters, and 3,300 servers managed by OpenStack Ironic (expected to be 15,000 in the next year). 

Outside of the research sector, other users who shared their open infrastructure story include the city government of Rome’s OpenStack use case, Sky Italia’s creation of a Kubernetes blueprint and network setup that empowers their brand new Sky Q Fibra service, and the SmartME project that is deploying OpenStack at the edge for smart city projects in four cities across Italy. 

What’s next for the open infrastructure community in Italy? Stay tuned on the OpenStack community events page for deadlines and event dates. 

Can’t wait until 2020? Join the global open infrastructure community at the Open Infrastructure Summit Shanghai from November 4-6.

Cover photo courtesy of Frederico Minzoni.

The post From Containers to Edge Computing, Research Organizations Rely on Open Infrastructure appeared first on Superuser.

by Allison Price at October 10, 2019 12:00 PM

October 09, 2019

Mirantis

SUSE OpenStack is no more — but Don’t Panic

SUSE has announced they're discontinuing their OpenStack distro, but it's not the end of the line for their customers.

by Nick Chase at October 09, 2019 08:27 PM

Aptira

Open Source Networking Days Australia

Open Source Networking Days Australia

Coming Soon: Open Source Networking Days Australia

Open Source Networking Day Australia is a one-day mini-summit hosted by Telstra and co-organized by LF Networking (LFN) and Aptira.

This is the first time that LFN has brought an open source networking event to Australia and it will be a unique opportunity to connect and collaborate with like-minded community members that are passionate about open source networking. The event will bring together service providers, the developer community, industry partners and academia for a day of collaboration and idea exchange on all things related to open-source networking, including LF Networking (LFN) projects like ONAP, OpenDaylight, Tungsten Fabric and Open Networking Foundation (ONF) projects like COMAC, Stratum, ONOS and P4, as well as home-grown innovation such as OpenKilda and many more.

To make open source networking viable in Australia, we need to collectively grow awareness, skills and investment. By attending this event, attendees will learn about the state of open source networking adoption globally and locally, how open source is applied in network automation, evolution of software defined networking and how open source enables exciting use cases in edge computing. Attendees will have plenty of opportunities to interact with global experts, industry peers and developers via keynote sessions, panel Q&A, technical deep-dives and business discussions, and more importantly learn how to get involved in open source networking communities going forward. Registration is free, so register today and we hope to see you in Melbourne!

Melbourne, Australia | November 11, 2019
8:30 am – 5:00 pm
Telstra Customer Insight Centre (CIC)
Tickets

In addition to this, there will also be a Next-Gen SDN Tutorial hosted on the 12th of November.

Next-Gen SDN is delivering fine-grained network programmability with zero touch configuration and management, enabling operators’ complete control of their networks. Leveraging P4, P4Runtime, OpenConfig/gNMI and gNOI, NG-SDN is now truly delivering on the ‘software defined’ promise of SDN for future transformation, new applications and unprecedented levels of new value creation.

This tutorial is an opportunity for architects and engineers to learn the basics and to practically experiment with some of the building blocks of the NG-SDN architecture, such as:

  • P4 language
  • Stratum (P4Runtime, OpenConfig over gNMI, gNOI)
  • ONOS

The goal of the tutorial is to answer questions such as:

  • What is P4 and how do I use it?
  • How do I go from a P4 program to a complete network solution?
  • What is Stratum and how can I use its interfaces to control packet forwarding, configure ports, or push software upgrades to my network devices?
  • How can I use ONOS to write control-plane apps for my P4 program?

It is organized around a sequence of introductory presentations, as well as hands-on exercises that show how to build a leaf-spine data center fabric from scratch based on IPv6 using P4 and ONOS.

The tutorial will include an introduction to the P4 language, Stratum, and ONOS. Participants will be provided with starter P4 code and an ONOS app implementation, along with instructions to run a Mininet-emulated leaf-spine topology of Stratum-enabled software switches. Only basic programming and networking knowledge is required to complete the hands-on exercises. Knowledge of Java and Python will be helpful to understand some of the starter code.

Registrations for the tutorial are limited to 50 people, so to secure your place register now.

Open Source Networking Days Australia Sponsors

Ready to move your network into the software defined future?
Automate your network with ONAP.

Find Out How

The post Open Source Networking Days Australia appeared first on Aptira.

by Jessica Field at October 09, 2019 12:12 PM

October 08, 2019

OpenStack Superuser

Open Infrastructure in Germany: Hitting the Road with New and Growing OpenStack Use Cases

A year after we held the OpenStack Summit Berlin, it was great to return to Berlin to see what has changed—hear how OpenStack users had grown their deployments since we last saw them, finding new users sharing their stories, and hearing how companies are integrating open infrastructure projects in innovative ways.

Europe’s first music hotel with photos on the wall of the musicians who have visited in years past welcomed a new audience: 300 Stackers for the 2019 OpenStack Day DOST in Berlin. Community members gathered for two days of breakout sessions, sponsor demos, and waterfront views. Sessions and an evening event cruise along the Spree River were made possible by event organizers and sponsors: B1 Systems, Canonical, Netways Web Services, Noris Network, the OpenStack Foundation, Open Telekom Cloud, Rancher, and SUSE.

In addition to being home to a diverse set of ecosystem vendors, German roads are also home to automakers who rely on OpenStack including Audi and BMW who shared their use cases with conference attendees.

BMW first shared its OpenStack story at the Paris Summit in 2014 and since then, has continued to grow its OpenStack footprint rapidly. Currently sitting at 700 servers, they are expecting their environment to grow by an additional 300 by the end of the year. As of today, almost 400 projects and platforms (rising steadily) rely on their dynamic, flexible and tailor-made instance of OpenStack at the BMW Group, including autonomous driving.

Andreas Poëschl showing how BMW has grown its OpenStack environment over the years.

Audi was the second automaker of the conference to share its open infrastructure use case, powered by OpenStack and Ceph. Audi AG’s shop floor IT environment is designed for uninterrupted, highly available 24/7 operation, and these requirements make it difficult to test new, not yet evaluated technologies close to production. To quickly bring these technologies into production and make them available, the Audi Production Lab was founded. There, it is possible to incorporate the latest concepts and develop them to the point where they meet the requirements of production.

Through the construction of a self-sufficient, decoupled, independently usable, flexible, and adaptable server infrastructure based on Ceph and OpenStack in the Production Lab, it is now possible to evaluate innovative technologies such as Kubernetes and bring them to production in a timely manner.

Auto makers were not the only ones sharing their open infrastructure integration story.

  • SAP shared its Converged Cloud where the basis is OpenStack orchestrated in a Kubernetes cluster. With the newly developed Kubernikus module, the Converged Cloud enables SAP to offer its customers Kubernetes-as-a-Service, which is provided as a one-button self-service. Kubernikus creates a Kubernetes cluster that operates as a managed service and can be offered for API support. Kubernikus works with the OpenStack API and remains 100% Kubernetes and Open Source. The structure allows the separate operation of Kubernetes API and project-specific nodes.  
  • The Open Telekom Cloud, the public cloud service of Deutsche Telekom, is one of the local members of the OpenStack 100k core club. With over a quarter of a million managed CPU cores, it’s one of the largest fully managed clouds in Europe. Their team presented the DevOps model that enables their OpenStack-powered public cloud to continue to grow.

What’s next for the open infrastructure community in Germany? The event organizers say the planning for the 2020 event in Hamburg is underway. Stay tuned on the OpenStack community events page for deadlines and event dates. 

Can’t wait until 2020? Join the global open infrastructure community at the Open Infrastructure Summit Shanghai November 4-6. 

Cover photo courtesy of NETWAYS Web Services.

The post Open Infrastructure in Germany: Hitting the Road with New and Growing OpenStack Use Cases appeared first on Superuser.

by Allison Price at October 08, 2019 01:00 PM

October 07, 2019

Mirantis

How to deploy Airship in a Bottle: A quick and dirty guide

Airship in a Bottle is a simple way to create an Airship deployment that includes a compact OpenStack cluster.

by Nick Chase at October 07, 2019 12:55 PM

October 05, 2019

Aptira

Real-World Open Networking. Part 5: Dissonance between Networks and Software Domains

Real-World Open Networking. Part 5: Dissonance between Networks and Software Domains

In our last post we finished up a detailed examination of different aspects of Interoperability. In this post, we will analyse the different mindsets of traditional networking domains and software development domains, and explain why there is often built-in dissonance.

Background

Whilst Open Network solutions require the integration of network and software components and practices, at the current time (and historically) these two domains are largely incompatible. Unless carefully managed, this incompatibility will cause project stress and impairment.

Given that many (if not most) Open Network solutions originate in the Network Engineering department within a user organisation, this is an important consideration for the entire lifecycle of the solution; especially so if the Network Engineering team does not have established software skills and experience.

The Problem

There are many aspects of the types of dissonance that can be experienced in an Open Networking project due to different paradigms or mindsets. Below we cover the top four aspects of the problem:

  • Design & Production paradigm conflicts
  • Ability to Iterate
  • End user Engagement
  • Expectations of Interoperability

Expectation on Development

We described in Software Interlude Part 6 – Development Paradigms that traditional network engineering aligns more with the production model of work, i.e. that the design and production processes are largely serialised and separate.

Software development on the other hand operates on a different paradigm, in which design and production are largely intermingled: not just parallel but intertwined within the same team and the same resources.

Networks (in general) are designed using discrete components and can be designed and built along fairly pre-determined and predictable steps guided by engineering principles. Networks are highly mechanical and mathematical in nature, following a well-established set of rules. Even the software components of traditional network equipment (configuration) followed rules backed up by years of mathematical research. Network designs can be validated in advance using the same techniques.

Practically, we see the implications of this in the way network projects are executed. Formally, network projects follow far more of a plan-based (aka Waterfall) lifecycle model. There are many logical reasons why the plan-based approach is better for this type of project.

Informally, we also see this: it’s typical that a senior, more experienced, person will do the network design and create a specification for how the network is to be built. This network design is typically handed off to other technical personnel for the build.

Expectations on the ability to iterate

Flexibility is a key aspect of software development projects: it underpins everything that a software developer does and thinks. Networks appear to value other things: integrity, security, etc. The difference comes down to the relative size of increments, prototypes and/or MVPs. Note: the MVP (Minimum Viable Product) is the smallest component that can be deployed to production and which enables at least one valuable use case.

Small increments in functionality, prototypes and MVPs are important parts of the solution development process. These all support the agile principles of inspect and adapt.

For software, these increments can be very small and be produced very rapidly. Traditionally, in the network domain, creating a small instance of some aspect of a solution has a much higher hurdle. Model labs or test environments may exist, but these are typically insufficient for the dynamic changes required by the need to iterate; that is, if they are available at all, and/or have the right or sufficient quantities of hardware.

Expectations on End User Engagement

It is not uncommon for network projects to be built to very general requirements and not to specific end-user use cases. The logical flow-on from this is that end-users are not actively engaged in the development lifecycle.

Software projects, and in particular Agile software projects, are built on engagement with end-users: the expectation is that end-users will interact with developers on a daily basis. This requires certain skillsets that are well-developed in software engineers (well, to varying degrees), but few Network engineers have this experience.

Expectations of Interoperability

In general, network developers have a much higher expectation of out-of-the-box interoperability than software developers, notwithstanding the softwareisation of networks.

Experienced software developers typically have a high level of scepticism when it comes to claims of interoperability and will naturally plan in a validation process to ensure they understand how the product will actually work. Network engineers and architects appear to be more ready to accept claims of interoperability or standards compliance and don’t necessarily prepare for validation processes, except for first-time onboarding of equipment into a network.

But given the different natures of the products, an initial validation for a software product can have a relatively short life (as new updates can break this tested functionality), whereas initial validation of a hardware product has a much longer life.

Conclusion

The existence of these sources of dissonance, and more, can easily lead to project impairment if not anticipated and managed carefully.

In both project planning and execution, problems arise when one party wants to invest time into something (e.g. risk reserves or validation testing) that the other party doesn’t see the need for (and consequently believes is unjustified padding of the estimates) or just doesn’t get, leading to misunderstanding and miscommunication.

How do we manage this effectively? We treat everything as a software project.

Let us make your job easier.
Find out how Aptira's managed services can work for you.

Find Out Here

The post Real-World Open Networking. Part 5: Dissonance between Networks and Software Domains appeared first on Aptira.

by Adam Russell at October 05, 2019 01:20 PM

OpenStack Superuser

OpenStack Ironic Bare Metal Program case study: VEXXHOST

The OpenStack Foundation announced in April 2019 that its Ironic software is powering millions of cores of compute all over the world, turning bare metal into automated infrastructure ready for today’s mix of virtualized and containerized workloads.

Some 30 organizations joined for the initial launch of the OpenStack Ironic Bare Metal Program, and Superuser is running a series of case studies to explore how people are using it.

VEXXHOST provides high performance, cloud computing solutions that are cost conscious, complete, and widely flexible. In 2011, VEXXHOST adopted OpenStack software for its infrastructure. Since then, VEXXHOST has been an active contributor and an avid user of OpenStack. Currently, VEXXHOST provides infrastructure-as-a-service OpenStack public cloud, private cloud, and hybrid cloud solutions to customers, from small businesses to enterprises across the world.

Why did you select OpenStack Ironic for your bare metal provisioning in your product?

VEXXHOST has a long history of involvement with OpenStack technology, dating back to the Bexar release. We have since been powering all of our infrastructures using OpenStack. Taking advantage of Ironic for our bare metal provisioning seemed a natural next step in the continuous building out of our system and Ironic fit right in with each of our components, integrating easily with all of our existing OpenStack services.

As we offer multiple architectures, enterprise-grade GPUs, and various hardware options, the actual process of testing software deployments can pose a real challenge when it comes to speed and efficiency. However, we knew that choosing Ironic would resolve these difficulties, with the benefits being passed on to our users, in addition to enabling us to provide them with the option of deploying their private cloud on high-performing bare metal.

What was your solution before implementing Ironic?

Before VEXXHOST implemented OpenStack Ironic, we were using a system that we had built internally. For the most part, this system provided an offering of services that Ironic was already delivering on so it made sense to adopt it as opposed to maintaining our smaller version.

What benefits does Ironic provide your users?

Through Ironic, VEXXHOST’s users have access to fully dedicated and secure physical machines that can live in our data centres or theirs. Due to its physical and dedicated nature, the security provided by bare metal relieves VEXXHOST’s users of any risks associated with environment neighbours, and thanks to the isolation factor, users are assured that their data is never exposed to others. Ironic can also act as an automation tool for the centralized housing and management of all their machines and even enables our users to access certain features that aren’t available in virtual machines, like having multiple levels of virtual machines.

Additionally, VEXXHOST’s users benefit from Ironic’s notably simpler configuration and less complex set-up when compared to virtual machines. Where use cases require it, Ironic can also deliver to our users a higher level of performance than virtual machines. Through the region controller, our users benefit from high availability starting at the data center level, and users are able to create and assign physical availability zones to better control critical availability areas. Through the use of Ironic, VEXXHOST can easily run any other OpenStack projects and configure our users’ bare metal specifically for their use cases. Ironic is also easily scaled from a few servers to multiple racks within a data centre and, through its distributed gateways, makes it possible to process large parallel deployments. By using OpenStack technology, like Ironic, VEXXHOST ensures that users are never faced with the risks associated with vendor lock-in.

What feedback do you have to provide to the upstream OpenStack Ironic team?

Through our long-standing involvement with the OpenStack community, based on VEXXHOST’s contributions and our CEO Mohammed Naser’s role as OpenStack-Ansible PTL and member of the Technical Committee, we regularly connect with the Ironic team and have access to their conversations. Currently, there isn’t any feedback that we haven’t already shared with them.

Learn more

You’ll find an overview of Ironic on the project Wiki.
Discussion of the project takes place in #openstack-ironic on irc.freenode.net. This is a great place to jump in and start your Ironic adventure. The channel is very welcoming to new users – no question is a wrong question!

The team also holds one-hour weekly meetings at 1500 UTC on Mondays in the #openstack-ironic room on irc.freenode.net, chaired by Julia Kreger (TheJulia) or Dmitry Tantsur (dtantsur).

Stay tuned for more case studies from organizations using Ironic.

Photo // CC BY NC

The post OpenStack Ironic Bare Metal Program case study: VEXXHOST appeared first on Superuser.

by Superuser at October 05, 2019 01:00 PM

October 04, 2019

Chris Dent

Fix Your Debt: Placement Performance Summary

There's a thread on the openstack-discuss mailing list, started in September and then continuing in October, about limiting planned scope for Nova in the Ussuri cycle so that stakeholders' expectations are properly managed. Although Nova gets a vast amount done per cycle there is always some stuff left undone and some people surprised by that. In the midst of the thread, Kashyap points out:

I welcome scope reduction, focusing on fewer features, stability, and bug fixes than "more gadgetries and gongs". Which also means: less frenzy, less split attention, fewer mistakes, more retained concentration, and more serenity. [...] If we end up with bags of "spare time", there's loads of tech-debt items, performance (it's a feature, let's recall) issues, and meaningful clean-ups waiting to be tackled.

Yes, there are.

When Placement was extracted from Nova, one of the agreements the new project team made was to pay greater attention to tech-debt items, performance, and meaningful clean-ups. One of the reasons this was possible was that by being extracted, Placement vastly limited its scope and feature drive. Focused attention is easier and the system is contained enough that unintended consequences from changes are less frequent.

Another reason was that for several months my employer allowed me to devote effectively 100% of my time to upstream work. That meant that there was long term continuity of attention in my work. Minimal feature work combined with maximal attention leads to some good results.

In August I wrote up an analysis of some of that work in Placement Performance Analysis, explaining some of the things that were learned and changed. However that analysis was comparing Placement code from the start of Train to Train in August. I've since repeated some of the measurement, comparing:

  1. Running Placement from the Nova codebase, using the stable/stein branch.
  2. Running Placement from the Placement codebase, using the stable/stein branch.
  3. Running Placement from the Placement codebase, using master, which at the moment is the same as what will become stable/train and be released as 2.0.0.

The same database (PostgreSQL) and web server (uwsgi using four processes of ten threads each) are used with each version of the code. The database is pre-populated with 7000 resource providers representing a suite of 1000 compute hosts with a moderately complex nested provider topology that is similar to what might be used for a virtualized network function.

The same query is used, whatever the latest microversion is for that version:

http://ds1:8000/allocation_candidates? \
                resources=DISK_GB:10& \
                required=COMPUTE_VOLUME_MULTI_ATTACH& \
                resources1=VCPU:1,MEMORY_MB:256& \
                required1=CUSTOM_FOO& \
                resources2=FPGA:1& \
                group_policy=none

(This is similar to what is used in the nested-perfload performance job in the testing gate, modified to work with all available microversions.)
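
The figures below are in the format reported by a load-testing tool such as Apache Bench (ab). For readers who want to reproduce a comparable (though not identical) measurement without such a tool, here is a minimal Python sketch; the token and microversion headers are placeholders for whatever a particular test deployment expects, and this is not the harness used to produce the numbers that follow.

    # Minimal load-driver sketch (illustrative only).
    import time
    from concurrent.futures import ThreadPoolExecutor

    import requests

    PLACEMENT_URL = (
        "http://ds1:8000/allocation_candidates"
        "?resources=DISK_GB:10"
        "&required=COMPUTE_VOLUME_MULTI_ATTACH"
        "&resources1=VCPU:1,MEMORY_MB:256"
        "&required1=CUSTOM_FOO"
        "&resources2=FPGA:1"
        "&group_policy=none"
    )
    HEADERS = {
        "x-auth-token": "admin",                    # placeholder token
        "openstack-api-version": "placement 1.36",  # placeholder microversion
    }

    def one_request(_):
        start = time.monotonic()
        resp = requests.get(PLACEMENT_URL, headers=HEADERS)
        return time.monotonic() - start, resp.status_code

    def run(total=100, concurrency=10):
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            wall_start = time.monotonic()
            results = list(pool.map(one_request, range(total)))
            wall = time.monotonic() - wall_start
        failures = sum(1 for _, code in results if code != 200)
        mean_ms = sum(t for t, _ in results) / total * 1000
        print(f"Requests per second: {total / wall:.2f} (mean)")
        print(f"Time per request:    {mean_ms:.3f} ms (mean)")
        print(f"Failed requests:     {failures}")

    if __name__ == "__main__":
        run()

Note that ab computes its "Time per request (mean)" slightly differently under concurrency, so expect the exact numbers to differ even against the same deployment.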

Here are some results, with some discussion after.

10 Serial Requests

Placement in Nova (stein)

Requests per second:    0.06 [#/sec] (mean)
Time per request:       16918.522 [ms] (mean)

Extracted Placement (stein)

Requests per second:    0.34 [#/sec] (mean)
Time per request:       2956.959 [ms] (mean)

Extracted Placement (train)

Requests per second:    1.37 [#/sec] (mean)
Time per request:       730.566 [ms] (mean)

100 Requests, 10 at a time

Placement in Nova (stein)

This one failed. The numbers say:

Requests per second:    0.18 [#/sec] (mean)
Time per request:       56567.575 [ms] (mean)

But of the 100 requests, 76 failed.

Extracted Placement (stein)

Requests per second:    0.41 [#/sec] (mean)
Time per request:       24620.759 [ms] (mean)

Extracted Placement (train)

Requests per second:    2.65 [#/sec] (mean)
Time per request:       3774.854 [ms] (mean)

The improvements between the versions in Stein (16.9s to 2.9s per request) were mostly made through fairly obvious architecture and code improvements found by inspection (or simply knowing something was not ideal when first made, and finally getting around to fixing it): things like removing the use of oslo versioned objects and changing cache management to avoid redundant locks.

From Stein to Train (2.9s to 0.7s per request) the improvements were made by doing detailed profiling and benchmarking and pursuing a very active process of iteration (some of which is described in Placement Performance Analysis).
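
As a generic illustration of that kind of profiling (this is not Placement's actual code; handle_request below is a stand-in for whatever code path is under study), cProfile can be pointed at a request path to surface the expensive call sites:

    # Hypothetical per-request profiling sketch.
    import cProfile
    import io
    import pstats

    def handle_request():
        # Stand-in workload; replace with the code path being examined.
        return sum(i * i for i in range(200_000))

    profiler = cProfile.Profile()
    profiler.enable()
    handle_request()
    profiler.disable()

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out).sort_stats("cumulative")
    stats.print_stats(10)  # top ten call sites by cumulative time
    print(out.getvalue())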

In both cases this was possible because people (especially me) had the "retained concentration" desired above by Kashyap. As a community OpenStack needs to figure out how it can enshrine and protect that attention and the associated experimentation and consideration for long term health. I was able to do it in part because I was able to get my employer to let me and in part because I overcommitted myself.

Neither of these things is true any more. My employer has called me inside, and my upstream time will henceforth drop to "not much". I'm optimistic that we've established a precedent and culture for doing the right things in Placement, but it will be a challenge, and I don't think it is there in general for the whole community.

I've written about some of these things before. If the companies making money off OpenStack are primarily focused on features (and being disappointed when they can't get those features into Nova) who will be focused on tech-debt, performance, and meaningful clean-ups? Who will be aware of the systems well enough to effectively and efficiently review all these proposed features? Who will clear up tech-debt enough that the systems are easier to extend without unintended consequences or risks?

Let's hit that Placement performance improvement some more, just to make it clear:

In the tests above, "Placement in Nova (stein)" failed with a concurrency of 10. I wanted to see at what concurrency "Extracted Placement (train)" would fail: at a concurrency of 150 (with 1000 total requests) some requests fail, while at 140 all requests succeed, albeit slowly (33s per request). Based on the error messages seen, the failures at 150 are tied to the sizing and configuration of the web server and have nothing to do with the placement code itself. The way to handle higher concurrency is to have more or larger web servers.

Remember that the nova version fails at concurrency of 10 with the exact same web server setup. Find the time to fix your debt. It will be worth it.

by Chris Dent at October 04, 2019 01:32 PM

OpenStack Superuser

OpenStack Ironic Bare Metal Program case study: China Mobile

The OpenStack Foundation announced in April 2019 that its Ironic software is powering millions of cores of compute all over the world, turning bare metal into automated infrastructure ready for today’s mix of virtualized and containerized workloads.

Over 30 organizations joined for the initial launch of the OpenStack Ironic Bare Metal Program, and Superuser is running a series of case studies to explore how people are using it.

China Mobile is a leading telecommunications services provider in mainland China. The Group provides full communications services in all 31 provinces, autonomous regions and directly-administered municipalities throughout Mainland China and in Hong Kong Special Administrative Region.

In 2018, the company was again selected as one of “The World’s 2,000 Biggest Public Companies” by Forbes magazine and listed in the Fortune Global 500 (top 100) by Fortune magazine, and it was recognized for the third consecutive year in the global carbon disclosure project CDP’s 2018 Climate A List, as the first and only company from mainland China.

Why did you select OpenStack Ironic for bare metal provisioning in your product?

China Mobile has a large number of businesses running on various types of architectures, such as x86 and POWER servers, which provide high quality services to our business and customers. That number continues to increase by more than 100,000 servers every year. As a Gold Member of the OpenStack Foundation, we have recently built several cloud solutions based on OpenStack, so our public cloud and private cloud solutions are compatible with OpenStack. Ironic manages compute, storage and network resources in a way that matches OpenStack, which is the core requirement of China Mobile’s bare metal cloud.

In addition, China Mobile’s physical IaaS solution includes multiple types of vendor hardware and solutions. Thanks to OpenStack Ironic’s improved architecture design and rich set of plug-in contributions, we can draw on reliable experience from the community while building our service.

What was your solution before implementing Ironic?

Before adopting Ironic, the best automation method we used was PXE + ISO + kickstart. Due to its limitations in networking, storage and even operating system compatibility, we had to handle many of the processes manually. At the same time, due to the lack of relevant service data at the management level, workflow data could not be recorded well or transferred in the course of work, which greatly reduced delivery efficiency.

What benefits does Ironic provide your users?

The biggest benefit of Ironic for us and our users is that it increases the efficiency of server delivery. What originally took a day or even weeks now takes half an hour to one hour. With Ironic, users can choose from more operating systems, even on ARM Linux. Network resources such as Virtual Private Cloud (VPC), Load Balancer (LB) and Firewall (FW) can be freely configured through the combination of Ironic and Neutron. Likewise, the combination of Ironic and Cinder provides users with Boot From Volume (BFV) and other disk array management and configuration capabilities. In short, through Ironic we can deliver a complete compute, network and storage server through a top-down process, without requiring operations staff or users to synchronize information and configure things manually.
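
As a purely illustrative sketch of the user-facing side of that combination, the following uses openstacksdk's cloud layer to request a bare metal server booted from a volume on a tenant network. The cloud, flavor, image and network names are placeholders, and the exact parameters depend on how a given deployment wires Ironic, Neutron and Cinder together.

    # Hypothetical sketch; "mycloud", "bm.large", "centos7" and
    # "tenant-vpc" are placeholder names, not real resources.
    import openstack

    conn = openstack.connect(cloud="mycloud")  # credentials from clouds.yaml

    server = conn.create_server(
        name="bare-metal-01",
        flavor="bm.large",        # a flavor that maps to bare metal nodes
        image="centos7",          # image to copy into the boot volume
        network="tenant-vpc",     # Neutron tenant network
        boot_from_volume=True,    # ask Nova/Cinder for a boot volume (BFV)
        volume_size=100,          # boot volume size in GB
        wait=True,
    )
    print(server.id, server.status)

In a Nova-with-Ironic deployment it is the flavor's resource class, rather than the flavor name itself, that steers the request to a bare metal node, so the flavor above is only a stand-in.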

With Ironic, we built a platform for data center administrators to redefine the access standards for the different hardware they manage. Under this model, all hardware vendors must comply with the management and data transmission protocols or contribute their plug-ins to OpenStack. Administrators can then focus on management and serving users.

For China Mobile, hardware management or server OS delivery is sometimes not enough. We are extending our bare metal cloud to support applications integrated through OpenStack Mistral and Ansible. All in all, we are continuously improving the ecosystem around Ironic to save our users time.

What feedback do you have to provide to the upstream OpenStack Ironic team?

We hope that Ironic will provide an in-operating-system agent solution, similar to the QEMU guest agent.

Learn more

You’ll find an overview of Ironic on the project Wiki. Discussion of the project takes place in #openstack-ironic on irc.freenode.net. This is a great place to jump in and start your Ironic adventure. The channel is very welcoming to new users – no question is a wrong question!

The team also holds one-hour weekly meetings at 1500 UTC on Mondays in the #openstack-ironic room on irc.freenode.net chaired by Julia Kreger (TheJulia) or Dmitry Tantsur (dtantsur).

Stay tuned for more case studies from organizations using OpenStack Ironic.

 

Photo // CC BY NC

The post OpenStack Ironic Bare Metal Program case study: China Mobile appeared first on Superuser.

by Superuser at October 04, 2019 01:00 PM

Aptira

Real-World Open Networking. Part 4 – Interoperability: Problems with API’s

In our last post we looked at different general patterns of standards compliance in Open Network solutions. In this post we drill down another layer to look at interoperability at the Application Program Interface (API) level, which creates issues at a level beyond standards.

Background

As we’ve mentioned previously, network equipment has been focused on interface compatibility and interoperability for many decades and has a history of real interoperability success. Traditional networks exposed communications interfaces and most of the standards for network equipment focus on these interfaces.

But with the advent of network software equivalents to hardware devices, we open up new areas for problems.

Software components may implement the same types of communications interfaces, but they will also provide Application Program Interfaces (API’s) for interaction with other software components. These API’s may be the subject of standards, and thus the issues raised in the previous article may apply. Or they may simply be proprietary API’s, unique to the vendor.

So we need to take a look at how API’s can support interoperability and also the problems that occur in API implementation that make interoperability more challenging.

API Interoperability

There are a number of levels at which API’s are open and potentially interoperable, or not.

  • Availability of the specification and support by the vendor of third-party implementation (standard or proprietary)
  • Level of compliance with any documentation (standardised or not)
  • Ability of the underlying components to satisfy the exposed API

Previously, we covered the different degrees of compliance and the obstacles that this put in the way of successful Open Network solutions. In this post we’ll elaborate on the other two only.

Availability of the Interface Specification

Open Standards specifications are generally available, but often not freely available. Some organisations restrict specifications to varying levels of membership of their organisation. Sometimes only paid members can access the specifications.

Proprietary interfaces may be available under certain limited conditions or they may not be available at all. Availability is usually higher for de facto standards, because it enables the standards owner to exert some influence over the marketplace. Highly proprietary interfaces often have higher hurdles to obtain access, typically only if an actual customer requests the specification for itself or on behalf of a solution integrator.

Practical Accessibility in a Project

It’s one thing to get access to an API specification document, but it’s very much another to gain practical access to the information necessary to implement an interface to that API.

An Open Network solution may have hundreds of API’s in its inventory of components, or more. These API’s must be available for use by the solution designers. A typical solution is to publish these API’s in a searchable catalog. This might be ‘open’ in one sense, but not necessarily Interoperable.

Solution integrators must also have access to a support resource to help with issues arising from the implementation (bugs, etc). It is far too common for the API document to be of limited detail, inaccurate, and even out-of-date. The richness of this support resource and the availability of live support specialists will directly translate to implementation productivity.

Ability of the Underlying Components to Satisfy the API

Software has a number of successes at implementing syntactic and representational openness but not semantic openness. Using the REST standard as an example, I can post a correctly formatted and encoded payload to a REST endpoint, but unless the receiving application understands the semantic content then the interface doesn’t work.

And if the underlying components cannot service the request in a common (let alone standard) way, theoretical interoperability becomes difficult and/or constrained.

An NFV example may help.

Consider an NFV Orchestration use case that performs auto-scaling of NFV instances based on some measure of throughput against capacity. Most NFV components make it easy to obtain the required measures of the relevant metric via telemetry.

But it is the range of available metrics and the algorithms used to generate the metrics that introduces complexity and potentially impacts Interoperability.

One NFV vendor might provide this measure in terms of CPU utilisation at a total NFV level. Another might provide the CPU utilisation at a VM level. Or vendors may use different algorithms for calculating the metric that they call “CPU Utilisation” or may vary considerably in the timing of updates. Another vendor might not provide CPU utilisation at all but may provide a metric of packets per second.
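
A small, purely hypothetical sketch makes the gap concrete: all three payloads below are syntactically valid and could be delivered over perfectly well-formed REST interfaces, yet the integrator still has to supply the semantic mapping.

    # Hypothetical vendor payloads: syntactically fine, semantically different.
    vendor_a = {"metric": "cpu_util", "scope": "vnf", "value": 73.5}           # % over whole VNF
    vendor_b = {"metric": "cpu_util", "scope": "vm", "values": [61.0, 88.2]}   # % per VM
    vendor_c = {"metric": "packets_per_second", "value": 910_000}              # no CPU figure at all

    def normalise(payload):
        """Map differing vendor telemetry onto one 'load' figure (0-100).

        The mapping rules here are the integrator's own assumptions; nothing
        in the REST interface itself says these values are comparable.
        """
        if payload["metric"] == "cpu_util" and payload.get("scope") == "vnf":
            return payload["value"]
        if payload["metric"] == "cpu_util" and payload.get("scope") == "vm":
            return sum(payload["values"]) / len(payload["values"])
        if payload["metric"] == "packets_per_second":
            # Assumes 1M pps equals 100% load: a pure guess by the integrator.
            return min(payload["value"] / 1_000_000 * 100, 100.0)
        raise ValueError("semantics unknown: %r" % payload)

    for payload in (vendor_a, vendor_b, vendor_c):
        print(normalise(payload))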

Conclusion

API’s play a significant role in the implementation of Open Network solutions and the achievement of interoperability. However, they are not a “silver bullet” and there can be many challenges. As with standards compliance, API availability (and potentially compliance with a standard) cannot be assumed.

In the last few posts we’ve focused on software-related topics, but it’s time to bring back the Networking side of Open Networking for our last two posts. Leaving technology aside for the moment, how does a solution integrator deal with the different paradigms for solution implementation that can exist in an Open Networking project? We’ll cover that in the next post.

Stay tuned.

Ready to move your network into the software defined future?
Automate your network with ONAP.

Find Out How

The post Real-World Open Networking. Part 4 – Interoperability: Problems with API’s appeared first on Aptira.

by Adam Russell at October 04, 2019 04:41 AM

October 03, 2019

RDO

RDO is ready to ride the wave of CentOS Stream

The announcement and availability of CentOS Stream has the potential to improve RDO’s feedback loop to Red Hat Enterprise Linux (RHEL) development and smooth out transitions between minor and major releases. Let’s take a look at where RDO interacts with the CentOS Project and how this may improve our work and releases.

RDO and the CentOS Project

Because of its tight coupling with the operating system, the RDO project joined the CentOS SIGs initiative from the beginning. CentOS SIGs are smaller groups within the CentOS Project community focusing on a specific area or software type. RDO was a founding member of the CentOS Cloud SIG, which focuses on cloud infrastructure software stacks and uses the CentOS Community Build System (CBS) to build final releases.

In addition to the Cloud SIG OpenStack repositories, during release development the RDO Trunk repositories provide packages for new commits in OpenStack projects soon after they are merged upstream. After a commit is merged, a new package is created and a YUM repository is published on the RDO Trunk server, including this new package build and the latest builds for the rest of the packages in the same release. This enables packagers to identify packaging issues almost immediately after they are introduced, shortening the feedback loop to the upstream projects.

How CentOS Stream can help

A stable base operating system, on which continuously changing upstream code is built and tested, is a prerequisite. While CentOS Linux did come close to this ideal, there were still occasional changes in the base OS that were breaking OpenStack CI, especially after a minor CentOS Linux release where it was not possible to catch those changes before they were published.

The availability of rolling-release CentOS Stream, announced alongside CentOS Linux 8, will help enable our developers to provide earlier feedback to the CentOS and RHEL development cycles before breaking changes are published. When breaking changes are necessary, it will help us adjust for them ahead of time.

A major release like CentOS Linux 8 is even more of a challenge. RDO managed the transition from EL6 to EL7 during the OpenStack Icehouse cycle by building two distributions in parallel, but that was five years ago, with a much smaller package set than we have now.

For the current OpenStack Train release in development, the RDO project started preparing for the Python 3 transition using Fedora 28, which helped get this huge migration effort going; at the same time, it was only a rough approximation of RHEL 8/CentOS Linux 8 and required complete re-testing on RHEL.

Since CentOS Linux 8 was released very close to the OpenStack Train release, the RDO project will initially provide RDO Train only on the EL7 platform and will add CentOS Linux 8 support in RDO Train soon after.

For future releases, the RDO project is looking forward to being able to test and develop against CentOS Stream updates as they are developed, to provide feedback and help stabilize the base OS platform for everyone!

About The RDO Project

The RDO project provides a freely available, community-supported distribution of OpenStack that runs on Red Hat Enterprise Linux (RHEL) and its derivatives, such as CentOS Linux. RDO also makes the latest OpenStack code available for continuous testing while the release is under development.

In addition to providing a set of software packages, RDO is also a community of users of cloud computing platforms on Red Hat-based operating systems where you can go to get help and compare notes on running OpenStack.

by apevec at October 03, 2019 08:26 PM

Aptira

Real-world Open Networking. Part 3 – Interoperability: Problems with Standards

In our last post we unpacked Interoperability, including Open Standards. Continuing this theme, we will look at how solution developers implement standards compliance and the problems that arise.

Introduction

Mandating that vendors (and internal systems) comply with Open Standards is a strategy used by organisations to drive interoperability. The assumption is that Open Standards compliant components will be interoperable.

In this post we examine the many reasons why that assumption does not always hold in real-world situations. This analysis takes the software perspective, since network equipment generally does a better job of component interoperability than software does. This post covers the general aspects of standards compliance; the next post covers the specific aspects of API’s.

Software Implementation & Interoperability based on Standards

Whether the standard is “de jure” or “de facto”, there are three basic approaches to implementing software compliance with the standards:

  • Reference implementation compatible
  • Reference document compatible
  • Architecture pattern or guideline compatible

Reference Implementation Compatible

This approach consists of two parts:

  • The standard is a controlling design input: i.e. compliance overrides other design inputs; and
  • Validation against a “reference implementation” of the standard.

A “reference implementation” is a software component that is warranted to comply with the standard and is a known reference against which to validate a developed component. It should also include a set of standard test cases that verify compliance and/or highlight issues.

Vendors often provide the test results as evidence and characterisation of the level of compliance. 
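
As a rough sketch of what that validation can look like (the endpoints and test cases below are hypothetical, not any particular standard's suite), the same standard test case is driven against both the reference implementation and the component under test, and the responses are compared:

    # Hypothetical compliance check against a reference implementation.
    import requests

    REFERENCE_URL = "http://reference-impl.example.com/api"
    CANDIDATE_URL = "http://vendor-component.example.com/api"

    TEST_CASES = [
        {"path": "/widgets", "payload": {"name": "w1", "size": 3}},
        {"path": "/widgets/w1", "payload": None},
    ]

    def run_case(base_url, case):
        if case["payload"] is None:
            resp = requests.get(base_url + case["path"])
        else:
            resp = requests.post(base_url + case["path"], json=case["payload"])
        return resp.status_code, resp.json()

    def check_compliance():
        results = []
        for case in TEST_CASES:
            reference = run_case(REFERENCE_URL, case)
            candidate = run_case(CANDIDATE_URL, case)
            results.append((case["path"], reference == candidate))
        return results

    if __name__ == "__main__":
        for path, ok in check_compliance():
            print(f"{path}: {'PASS' if ok else 'DIVERGES'}")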

Benefits of this approach

This is the highest level of compliance possible against a standard. Two components that have been validated against the standard will be interoperable at the lowest common level to which they have both passed the test.

Problems with this approach

A reference implementation must exist and be available; however, this is not always the case. The reference implementation must be independently developed and certified, often by the standards body itself.

Reference Document Compatible

This approach is similar to the Reference Implementation approach. Firstly, the documented standard is a controlling design input. However, the second part (validation against the standard) is both optional and highly variable. At the most basic level, compliance could simply be the vendor asserting that the component complies with the standard. Alternatively, compliance may be validated by comparing the developed component against the documentation, and there are many ways to do this at varying levels of accuracy and cost.

Benefits of this approach

The main benefit of this approach is that the design is driven by the standard, and at this level it is equivalent to the Reference Implementation approach.

Problems with this approach

Validation without a reference implementation is highly manual and potentially subject to interpretation. This type of validation is very expensive, which creates cost pressure for vendors to validate only partially, especially on repeat version upgrades and enhancements.

Architecture Pattern Compatible

In this case the standard is used as one input to the design, but not as the controlling input. The intent is not compliance but alignment. The product may use the same or similar underlying technologies as defined in the standards (e.g. REST interfaces, or the same underlying data representation standards such as XML or JSON). The vendor may adopt a component architecture (e.g. microservices) similar to the standard’s.

Benefits of this approach

At best, this approach may provide a foundation for future compliance.

Problems with this approach

In general, the vendor is designing their product to be “not incompatible” with the standard, without taking on the cost of full compliance.

Rationale for Vendors to Implement Standards

Standards compliance is expensive to implement, regardless of the approach taken. So each vendor will take its own approach, based on its own situation and context. A vendor may:

  • Completely ignore the standards issue, either:
      • deliberately, e.g. a start-up whose early target customers don’t care; or
      • accidentally, if they are unaware of the standards.
  • Not see a competitive advantage in their marketplace: not so much as to justify the cost of standards implementation.
  • Adopt a customisation approach: in other words, implement standardisation when required.
  • Have full compliance in their roadmap for future implementation and simply want a foundation to build on.

Problems with compliance

There is a wide range of implementations and the results are highly variable. The important thing to remember is that a claim of “standards compliance” can mean many things.

From a starting point of the intent to comply (or at least claim compliance), and using any of the strategies above, a vendor can be non-compliant in many ways:

  • Partial implementation, e.g. a custom solution for one customer that is “productised”;
  • Defects in implementation, including misinterpretation of the standard;
  • Deliberate forking of the standard, including the implementation of superset functionality (“our solution is better than the standard”);
  • The incompatibility of underlying or related components;
  • Compliance with limited subsets of the standard, e.g. the most often used functions;
  • Some vendors may misrepresent compliance based on tenuous connections: e.g., a vendor might claim compatibility on the basis that their API’s are REST-based and nothing more.

Conclusion

Nothing can be assumed about standards compliance, other than that each vendor’s claims must be validated. The other part of this issue is Application Program Interfaces (API) interoperability. We will cover this in the next post. Stay tuned.

Become more agile.
Get a tailored solution built just for you.

Find Out More

The post Real-world Open Networking. Part 3 – Interoperability: Problems with Standards appeared first on Aptira.

by Adam Russell at October 03, 2019 03:55 AM

October 02, 2019

OpenStack Superuser

Meet the Shanghai Open Infrastructure Superuser Award nominees

Who do you think should win the Superuser Award for the Open Infrastructure Summit Shanghai?

When evaluating the nominees for the Superuser Award, take into account the unique nature of use case(s), as well as integrations and applications of open infrastructure by each particular team. Rate the nominees before October 8 at 11:59 p.m. Pacific Daylight Time.

Check out highlights from the five nominees and click on the links for the full applications:

  • Baidu ABC Cloud Group and Edge Security Team, who integrated Kata Containers into the fundamental platform for the entire Baidu internal and external cloud services, and who built a secured environment upon Kata Containers for the cloud edge scenario, respectively. Their cloud products (including VMs and bare metal servers) cover 11 regions including North and South China, 18 zones and 15 clusters (with over 5000 physical machines per cluster).
  • FortNebula Cloud, a one-man cloud show and true passion project run by Donny Davis, whose primary purpose is to give back something useful to the community, and secondary purpose is to learn how rapid fire workloads can be optimized on OpenStack. FortNebula has been contributing OpenDev CI resources since mid 2019, and currently provides 100 test VM instances which are used to test OpenStack, Zuul, Airship, StarlingX and much more. The current infrastructure sits in a single rack with one controller, two Swift, one Cinder and 9 compute nodes; total cores are 512 and total memory is just north of 1TB.
  • InCloud OpenStack Team, of Inspur, who has used OpenStack to build a mixed cloud environment that currently provides service to over 100,000 users, including over 80 government units in mainland China. Currently, the government cloud has provided 60,000+ virtual machines, 400,000+ vcpu, 30P+ storage for users, and hosts 11,000+ online applications.
  • Information Management Department of Wuxi Metro, whose Phase II of the Wuxi Metro Cloud Platform project involved the evolution from IaaS to PaaS on their private cloud platform based on OpenStack. In order to acquire IT resources on demand and improve overall business efficiency, Wuxi Metro adopted the Huayun Rail Traffic Cloud Solution, which features high reliability, high efficiency, ease of management and low cost.
  • Rakuten Mobile Network Organization, of Rakuten Inc., Japan, launched a new initiative to enter the mobile market space in Japan last year as the 4th Mobile Network Operator (MNO), with a cloud-based architecture built on OpenStack and Kubernetes. They chose to run their entire cloud infrastructure on commercial, off-the-shelf (COTS) x86 servers, powered by Cisco Virtualized Infrastructure Manager (CVIM), an OpenStack-based NFV platform. The overall plan is to deploy several thousand clouds running vRAN workloads spread across all of Japan to serve a target of 5M mobile phone users. Their current deployment includes 135K cores, with a target of one million cores when complete.

Each community member can rate the nominees once by October 8 at 11:59 p.m. Pacific Daylight Time.

Previous winners include AT&T, City Network, CERN, China Mobile, Comcast, NTT Group, the Tencent TStack Team, and VEXXHOST.

The post Meet the Shanghai Open Infrastructure Superuser Award nominees appeared first on Superuser.

by Superuser at October 02, 2019 06:11 AM

About

Planet OpenStack is a collection of thoughts from the developers and other key players of the OpenStack projects. If you are working on OpenStack technology you should add your OpenStack blog.
